Web Scraping and Data Analysis of Concert Ticket Resales
The skills the author demonstrated here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Data shows the resale of concert tickets on secondary market exchanges (an estimated $8 billion market) such as SeatGeek and StubHub is a problem for artists. Third-party brokers earn a profit from artists' live shows and in turn make tickets less affordable for real fans. The problem is exacerbated by bots, which quickly buy up tickets to popular shows for the purpose of resale.
Artists fight reselling in various ways. Bruce Springsteen has used Ticketmaster's Verified Fan program, which uses an algorithm to predict whether a potential buyer is a real fan. Eric Church had his team sift through sold tickets to identify scalpers and returned 25,000 tickets on a 2017 tour. Taylor Swift has been raising her ticket prices, which squeezes reseller margins and generates more profit but results in fewer sold out shows. Raising prices is not always a solution, as artists have many reasons to underprice tickets, such as maximizing merchandise sales, the prestige of selling out a show and maintaining goodwill with fans.
The purpose of this project was to analyze primary and secondary market concert ticket data in order to better equip artists to make decisions on ticket pricing, concert promotion, venue selection and the number of shows to play in a city.
Using Scrapy, I captured concert data from the website of Bowery Presents, a concert promotion and venue management organization, for use as primary market data (240 observations over 6-7 months). I then scraped SeatGeek, an event ticket marketplace and aggregator, to capture resale prices for these shows (17 venues across 11 cities). Captured data included artist and opener names, city, venue, date, ticket price, and whether or not a show was sold out. Finally, I used the Spotify API to get each artist's popularity, follower count and genre.
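Matching each primary market show to its resale listing amounts to a join on artist, venue and date. The snippet below is a minimal sketch of that step, not the code used for this project: the field names (`artist`, `venue`, `date`, `primary_price`, `resale_price`) and the sample rows are hypothetical.

```python
def show_key(record):
    """Build a join key; names are lowercased and stripped so the same
    show matches even if the two sites format them differently."""
    return (
        record["artist"].strip().lower(),
        record["venue"].strip().lower(),
        record["date"],
    )


def join_markets(primary_rows, resale_rows):
    """Attach the resale price to every primary market show that has one."""
    resale_by_key = {show_key(row): row for row in resale_rows}
    joined = []
    for row in primary_rows:
        match = resale_by_key.get(show_key(row))
        if match is not None:
            joined.append({**row, "resale_price": match["resale_price"]})
    return joined


# Hypothetical rows standing in for the scraped data.
primary_rows = [
    {"artist": "Example Band", "venue": "Example Hall", "date": "2018-03-01",
     "primary_price": 20.0, "sold_out": True},
    {"artist": "Another Act", "venue": "Example Hall", "date": "2018-03-02",
     "primary_price": 35.0, "sold_out": False},
]
resale_rows = [
    {"artist": " example band ", "venue": "EXAMPLE HALL", "date": "2018-03-01",
     "resale_price": 54.0},
]

joined = join_markets(primary_rows, resale_rows)
```

Shows without a resale listing are dropped here; keeping them with a missing resale price would be an equally reasonable choice.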
To review my dataset and code: GitHub
After cleaning my data and eliminating outlier concerts and venues, I began exploratory data analysis. Below I will discuss a few key findings.
In order to visualize the distribution of my data, I created a density plot of ticket price "mark up", or the percent increase in secondary market ticket prices vs. primary market. When I log transformed the data to make a clearer visualization, it showed a bimodal distribution of mark ups, with peaks that equated to about a 170% increase and approximately a 0% increase. Based on this I decided to segment the two groups and explore differences between them.
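The mark up calculation, the log transform and the split into two groups can be sketched as follows. This is an illustration under assumptions rather than the project's code: the prices are made up, the log is taken of the resale-to-primary price ratio (so 0 corresponds to no mark up), and the 50% cutoff between the two groups is hypothetical.

```python
import math


def markup_percent(primary_price, resale_price):
    """Percent increase of the resale price over the primary price."""
    return (resale_price - primary_price) / primary_price * 100.0


def log_price_ratio(primary_price, resale_price):
    """Natural log of resale / primary; 0.0 means no mark up."""
    return math.log(resale_price / primary_price)


# Hypothetical (primary, resale) price pairs in dollars.
shows = [(25.0, 67.5), (40.0, 40.0), (30.0, 81.0), (50.0, 52.0)]

# Hypothetical cutoff between "little or no mark up" and "meaningful mark up".
CUTOFF_PERCENT = 50.0

low_group = [s for s in shows if markup_percent(*s) < CUTOFF_PERCENT]
high_group = [s for s in shows if markup_percent(*s) >= CUTOFF_PERCENT]
```

On this scale a 170% mark up corresponds to a price ratio of 2.7, whose natural log is just under 1, while a 0% mark up maps to exactly 0.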
I used a one-sided t-test to test whether mean popularity and mean primary ticket price were lower in one group, and McNemar's test to test for a difference in the portion of sold out shows. The group with little or no mark up had a lower mean in each test. With p-values below .05, I was able to reject my null hypotheses of no difference between the two groups at the 5% significance level. This may mean that some artists with little or no mark up are overpricing their tickets (or accurately pricing them). This is an area for further exploration, as artists in this group could potentially sell out more shows by lowering ticket prices.
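For reference, the statistic behind a two-sample t-test can be computed with the standard library alone. The sketch below assumes Welch's unequal-variance variant, which is an assumption since the variant is not stated above, and the popularity scores are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error contributed by each sample.
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t_stat, df


# Hypothetical popularity scores for the two mark up groups.
low_markup_popularity = [40, 45, 50, 55, 60]
high_markup_popularity = [60, 65, 70, 75, 80]

t_stat, df = welch_t(low_markup_popularity, high_markup_popularity)
```

A negative statistic means the first group has the lower mean. Turning the statistic into a p-value needs the Student t distribution, which the standard library does not provide; `scipy.stats.ttest_ind` with `equal_var=False` and `alternative="less"` performs the whole one-sided test in one call.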
It seemed intuitive that concerts that have higher mark ups on tickets would feature artists that rank higher in popularity, charge a higher price and would be sold out a larger portion of the time. However, the relationship between mark up and primary ticket price among concerts with a meaningful mark up (eliminating concerts with little or no mark up) seemed less straightforward. Mark up appeared to loosely decrease as primary price rose.
Based on this I decided to test the impact of primary price on the portion of sold out shows. I divided all shows with some mark up into quartiles based on primary price, then compared the portion of sold out shows in the bottom two quartiles (0-50th percentile, lower priced shows) to the portion in the top two quartiles (50-100th percentile, higher priced shows). The lower priced shows sold out 8% of the time vs. 5% for the higher priced shows.
Using McNemar's test, I was able to reject my null hypothesis of no difference between the two groups at the 5% significance level, with a p-value below .05. This may mean that Taylor Swift's method of increasing ticket prices to squeeze reseller margins could be effective for more artists in reducing mark up, but at the cost of selling out fewer shows.
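The price split and the comparison of sold out portions can be sketched as below. Note the swap: instead of McNemar's test, this sketch uses a two-proportion z-test, a common choice for comparing proportions between two separate groups, because it needs only the standard library. The show data is hypothetical and far too small for the normal approximation to be reliable; it only illustrates the mechanics.

```python
import math
from statistics import NormalDist, median


def two_proportion_z(sold_a, n_a, sold_b, n_b):
    """Two-sided z-test for a difference between two proportions."""
    p_a, p_b = sold_a / n_a, sold_b / n_b
    # Pooled proportion under the null hypothesis of no difference.
    pooled = (sold_a + sold_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical (primary_price, sold_out) pairs.
shows = [
    (15, True), (20, True), (25, False), (30, True),
    (35, False), (40, False), (45, True), (50, False),
]

# Split at the median primary price (the 50th percentile).
cut = median(price for price, _ in shows)
lower = [sold for price, sold in shows if price <= cut]
upper = [sold for price, sold in shows if price > cut]

z, p_value = two_proportion_z(sum(lower), len(lower), sum(upper), len(upper))
```

With real data, the same function would take the sold out counts and group sizes from the lower priced and higher priced halves.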
Finally, I decided to examine the impact on the mark up of an artist's tickets if they played two shows in the same city within a short time period. Artists that had two shows in the same city had a mean mark up of 123% vs. 170% for other concerts. I used a t-test to test for a difference in the mean between these two groups and was able to reject the null hypothesis of no difference at the 10% significance level (90% confidence). This suggests that artists could reduce mark up by playing more than one show in certain cities while touring.
This analysis could be expanded using the SeatGeek and Ticketmaster APIs and by regularly scraping Bowery Presents. With more data, machine learning could be used to predict the mark up on artists' ticket prices in various cities and at various times of year. This would help artists more accurately set prices based on their preferences. Additionally, with knowledge of an artist's ticket sales (how many at which price, sold by which method, etc.) and merchandise sales, artists could further optimize ticket prices to maximize revenue.