Web Scraping and Analysis of Concert Ticket Resales

Posted on Jun 10, 2018


The resale of concert tickets on secondary market exchanges such as SeatGeek and StubHub (an estimated $8 billion market) is a problem for artists. Third-party brokers profit from artists' live shows and, in turn, make tickets less affordable for real fans. The problem is exacerbated by bots, which quickly buy up tickets to popular shows for the purpose of resale. Artists fight reselling in various ways. Bruce Springsteen has used Ticketmaster's Verified Fan program, which uses an algorithm to predict whether a potential buyer is a real fan. Eric Church had his team sift through sold tickets to identify scalpers, then cancelled and re-released 25,000 tickets on a 2017 tour. Taylor Swift has been raising her ticket prices, which squeezes reseller margins and generates more profit but results in fewer sold-out shows. Raising prices is not always a solution, as artists have many reasons to underprice tickets, such as maximizing merchandise sales, the prestige of selling out a show, and maintaining goodwill with fans.


The purpose of this project was to analyze primary and secondary market concert ticket data in order to better equip artists to make decisions on ticket pricing, concert promotion, venue selection and the number of shows to play in a city.


Using Scrapy, I captured concert data from the website of Bowery Presents, a concert promotion and venue management organization, for use as primary market data (240 observations over 6-7 months). I then scraped SeatGeek, an event ticket marketplace and aggregator, to capture resale prices for these shows (17 venues across 11 cities). Captured data included artist and opener names, city, venue, date, ticket price, and whether or not a show was sold out. Finally, I looked up each artist through the Spotify API to get metrics on popularity, followers, and genre.
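To make the matching step concrete, here is a minimal sketch of how primary-market shows could be paired with scraped resale prices. It is not the code from the project: the record types, field names, and the choice to match on artist, venue, and date are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PrimaryShow:
    """One primary-market show (hypothetical field names)."""
    artist: str
    venue: str
    city: str
    show_date: date
    primary_price: float
    sold_out: bool


@dataclass(frozen=True)
class ResaleListing:
    """One resale price observation for a show (hypothetical field names)."""
    artist: str
    venue: str
    show_date: date
    resale_price: float


def show_key(artist, venue, show_date):
    """Normalize the fields used to match a show across the two sources."""
    return (artist.strip().lower(), venue.strip().lower(), show_date)


def join_markets(primary, resale):
    """Pair each primary-market show with its resale price, or None if unmatched."""
    resale_by_key = {
        show_key(r.artist, r.venue, r.show_date): r.resale_price for r in resale
    }
    joined = []
    for show in primary:
        key = show_key(show.artist, show.venue, show.show_date)
        joined.append((show, resale_by_key.get(key)))
    return joined
```

Normalizing case and whitespace before matching is a guard against small formatting differences between the two sites. Real data would likely need fuzzier matching on artist names.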

To review my dataset and code: GitHub


After cleaning my data and eliminating outlier concerts and venues, I began exploratory data analysis. Below I will discuss a few key findings.

To visualize the distribution of my data, I created a density plot of ticket price "mark up", the percent increase of the secondary market ticket price over the primary market price. When I log-transformed the data to make the visualization clearer, it showed a bimodal distribution of mark ups, with one peak at about a 170% increase and another at approximately a 0% increase. Based on this, I decided to segment the shows into two groups and explore the differences between them.
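As a small illustration of the mark up calculation, the sketch below computes the percent increase and a log-transformed version. The post does not say exactly which quantity was log-transformed, so taking the log of the resale-to-primary price ratio is an assumption here, chosen because the log of a 0% mark up is otherwise undefined.

```python
import math


def markup_pct(primary_price, resale_price):
    """Percent increase of the resale price over the primary price."""
    return (resale_price - primary_price) / primary_price * 100.0


def log_price_ratio(primary_price, resale_price):
    """Natural log of resale/primary; 0.0 means no mark up (assumed transform)."""
    return math.log(resale_price / primary_price)
```

Under this transform, a 0% mark up maps to 0 and a 170% mark up (a ratio of 2.7) maps to about 0.99, so the two peaks stay well separated on the log scale.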

I used a one-sided t-test to compare mean artist popularity and mean primary ticket price between the two groups, and McNemar's test to compare the proportion of sold-out shows. The group with little or no mark up had the lower value in each comparison. With p-values below .05, I was able to reject the null hypotheses of no difference between the two groups at the 5% significance level. This may mean that some artists with little or no mark up are pricing their tickets accurately, or are overpricing them. This is an area for further exploration, as artists in this group could potentially sell out more shows by lowering ticket prices.
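For readers who want to see what such a comparison of means involves, here is a minimal sketch of a two-sample t statistic. The post does not say which variant of the t-test was used, so Welch's unequal-variance form is an assumption, and the function name is hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for mean(a) - mean(b)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, df
```

The one-sided p-value is the upper tail of the t distribution with `df` degrees of freedom. If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False, alternative="greater")` computes the statistic and p-value in one call.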

It seemed intuitive that concerts with higher mark ups would feature artists who rank higher in popularity, charge a higher price, and sell out a larger proportion of the time. However, among concerts with a meaningful mark up (excluding those with little or no mark up), the relationship between mark up and primary ticket price seemed less straightforward: mark up appeared to decrease loosely as primary price rose.

Based on this, I decided to test the impact of primary price on the proportion of sold-out shows. I divided all shows with some mark up into quartiles based on primary price, then compared the proportion of sold-out shows in the bottom two quartiles (0-50th percentile, lower-priced shows) with the proportion in the top two quartiles (50-100th percentile, higher-priced shows). The 0-50th percentile shows sold out a mean 8% of the time vs. 5% for the 50-100th percentile. Using McNemar's test, I was able to reject the null hypothesis of no difference between the two groups at the 5% significance level, with a p-value below .05. This may mean that Taylor Swift's method of increasing ticket prices to squeeze reseller margins could be effective for more artists in reducing mark up, but at the cost of selling out fewer shows.
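The split-and-compare step can be sketched as follows. Note the substitution: the post used McNemar's test, while this sketch uses a two-proportion z-test, a common choice when the two groups are independent rather than paired. The function names and the input shape (pairs of primary price and a sold-out flag) are assumptions for illustration.

```python
import math
from statistics import NormalDist, median


def split_at_median(shows):
    """Split (primary_price, sold_out) pairs into lower- and higher-priced halves."""
    cutoff = median(price for price, _ in shows)
    lower = [sold for price, sold in shows if price <= cutoff]
    higher = [sold for price, sold in shows if price > cutoff]
    return lower, higher


def two_proportion_z(sold_a, sold_b):
    """Two-sided z-test for a difference in sold-out proportions.

    Assumes the pooled proportion is strictly between 0 and 1; otherwise
    the standard error is zero and the statistic is undefined.
    """
    n_a, n_b = len(sold_a), len(sold_b)
    p_a, p_b = sum(sold_a) / n_a, sum(sold_b) / n_b
    pooled = (sum(sold_a) + sum(sold_b)) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value
```

A positive `z` means the first group sold out more often. The normal approximation behind this test is weak for small groups or very low sell-out rates, in which case an exact test would be a safer choice.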

Finally, I examined how playing two shows in the same city within a short time period affected the mark up of an artist's tickets. Artists that had two shows in the same city had a mean mark up of 123% vs. 170% for other concerts. I used a t-test to compare the means of these two groups and was able to reject the null hypothesis of no difference at the 10% significance level. This suggests that artists could reduce mark up by playing more than one show in certain cities while touring.
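Flagging the repeat-show group could look something like the sketch below. The post does not define "a short time period", so the `max_gap_days` parameter and its default of 7 are purely hypothetical, as are the function name and the tuple layout.

```python
from collections import defaultdict
from datetime import date


def repeat_show_artists(shows, max_gap_days=7):
    """Return (artist, city) pairs with two shows at most max_gap_days apart.

    `shows` is an iterable of (artist, city, show_date) tuples.
    """
    dates_by_pair = defaultdict(list)
    for artist, city, show_date in shows:
        dates_by_pair[(artist, city)].append(show_date)

    repeats = set()
    for pair, dates in dates_by_pair.items():
        dates.sort()
        # Adjacent dates in sorted order are the closest candidates.
        for earlier, later in zip(dates, dates[1:]):
            if (later - earlier).days <= max_gap_days:
                repeats.add(pair)
                break
    return repeats
```

The returned pairs can then be used to label each concert as a repeat or non-repeat show before comparing mean mark up between the two groups.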

Next Steps

This analysis could be expanded by using the SeatGeek and Ticketmaster APIs and by regularly scraping Bowery Presents. With more data, machine learning could be used to predict the mark up on artists' ticket prices in various cities and at various times of year. This would help artists set prices more accurately based on their preferences. Additionally, with knowledge of an artist's ticket sales (how many at which price, sold by which method, etc.) and merchandise sales, artists could further optimize ticket prices to maximize revenue.

About Author

Bennett Gelly

Bennett is a data science fellow at NYCDSA and an MBA candidate at Columbia Business School. He is interested in machine learning-driven business strategy. Bennett brings substantial financial modeling and analytics skills from prior employment in equity research...