Web Scraping and Data Analysis of Concert Ticket Resales

Posted on Jun 10, 2018


Data shows that the resale of concert tickets on secondary market exchanges such as SeatGeek and StubHub (an estimated $8 billion market) is a problem for artists. Third-party brokers profit from artists' live shows and in turn make tickets less affordable for real fans. The problem is exacerbated by bots, which quickly buy up tickets to popular shows for the purpose of resale.

Artists fight reselling in various ways. Bruce Springsteen has used Ticketmaster's Verified Fan program, which uses an algorithm to predict whether a potential buyer is a real fan. Eric Church had his team sift through sold tickets to identify scalpers and returned 25,000 tickets on a 2017 tour. Taylor Swift has been raising her ticket prices, which squeezes reseller margins and generates more profit but results in fewer sold-out shows. Raising prices is not always a solution, as artists have many reasons to underprice tickets, such as maximizing merchandise sales, the prestige of selling out a show, and maintaining goodwill with fans.


The purpose of this project was to analyze primary and secondary market concert ticket data in order to better equip artists to make decisions on ticket pricing, concert promotion, venue selection and the number of shows to play in a city.


Using Scrapy, I captured concert data from the website of Bowery Presents, a concert promotion and venue management organization, for use as primary market data (240 observations over 6-7 months). I then scraped SeatGeek, an event ticket marketplace and aggregator, to capture resale prices for these shows (17 venues across 11 cities). Captured data included artist and opener names, city, venue, date, ticket price, and whether or not a show was sold out. I then used the Spotify API to get popularity, follower, and genre metrics for each of these artists.
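As a rough illustration of how the primary and resale records described above might be combined, here is a minimal sketch in plain Python. It is not the project's code: the `Show` fields mirror the captured data listed above, but the field names, the `attach_resale_prices` helper, and the choice of (artist, venue, date) as the join key are all assumptions made for this example.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional, Tuple

# Hypothetical join key: (lowercased artist, lowercased venue, ISO date).
ShowKey = Tuple[str, str, str]


@dataclass(frozen=True)
class Show:
    """One primary-market concert record (field names are assumptions)."""
    artist: str
    city: str
    venue: str
    date: str                 # ISO date string, e.g. "2018-05-01"
    primary_price: float      # face-value ticket price
    sold_out: bool
    resale_price: Optional[float] = None  # filled in after the join


def show_key(show: Show) -> ShowKey:
    """Normalize the fields used to match a show across the two sources."""
    return (show.artist.strip().lower(), show.venue.strip().lower(), show.date)


def attach_resale_prices(shows: List[Show],
                         resale_prices: Dict[ShowKey, float]) -> List[Show]:
    """Return copies of the shows that have a matching resale price.

    Shows without a resale listing are dropped, since no mark up can be
    computed for them.
    """
    matched = []
    for show in shows:
        price = resale_prices.get(show_key(show))
        if price is not None:
            matched.append(replace(show, resale_price=price))
    return matched
```

Matching on normalized names is fragile in practice (spelling variants, multiple shows per day), so a real pipeline would likely need extra cleaning before this step.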

To review my dataset and code: GitHub

Data Analysis

After cleaning my data and eliminating outlier concerts and venues, I began exploratory data analysis. Below I will discuss a few key findings.

In order to visualize the distribution of my data, I created a density plot of ticket price "mark up", or the percent increase in secondary market ticket prices vs. primary market. When I log transformed the data to make a clearer visualization, it showed a bimodal distribution of mark ups, with peaks that equated to about a 170% increase and approximately a 0% increase. Based on this I decided to segment the two groups and explore differences between them.
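The mark up calculation and the log transform can be sketched as follows. The post does not say exactly what was log-transformed; this sketch assumes it was the ratio of resale price to primary price, which would place a 0% increase at 0 and an increase of roughly 170% near 1 on the log scale. The function names are illustrative only.

```python
import math


def markup_pct(primary_price: float, resale_price: float) -> float:
    """Percent increase of the resale price over the primary price."""
    return (resale_price - primary_price) / primary_price * 100.0


def log_price_ratio(primary_price: float, resale_price: float) -> float:
    """Natural log of resale/primary (assumed form of the log transform)."""
    return math.log(resale_price / primary_price)


def pct_from_log_ratio(log_ratio: float) -> float:
    """Convert a log price ratio back to a percent increase."""
    return (math.exp(log_ratio) - 1.0) * 100.0
```

Under this assumption, a density peak at a log ratio of 1 corresponds to `pct_from_log_ratio(1.0)`, about a 172% increase, and a peak at 0 corresponds to no mark up.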


I used a one-sided t-test to test for a difference in mean popularity and mean primary ticket price, and McNemar's test to test for a difference in the proportion of sold-out shows. The group with little or no mark up had a lower value in each test. With p-values below .05, I was able to reject my null hypotheses that there were no differences between the two groups at a 95% confidence level. This may mean that some artists with little or no mark up are overpricing their tickets (or accurately pricing them). This is an area for further exploration, as artists in this group could potentially sell out more shows by lowering ticket prices.
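For readers who want to see the mechanics of a two-sample t-test, the sketch below computes Welch's t statistic and its approximate degrees of freedom using only the standard library. Whether the original analysis used Welch's or Student's version is not stated, so the unequal-variance form here is an assumption, and `welch_t` is a name made up for this example.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, Welch-Satterthwaite degrees of freedom).

    A negative t means sample `a` has the lower mean. Each sample needs
    at least two values so that the sample variance is defined.
    """
    var_a = variance(a) / len(a)  # squared standard error of mean(a)
    var_b = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(a) - 1) + var_b ** 2 / (len(b) - 1)
    )
    return t, df
```

For a one-sided test that group `a` has the lower mean, the statistic is compared against the lower tail of the t distribution with `df` degrees of freedom, for example with a t-table or a statistics library.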

It seemed intuitive that concerts that have higher mark ups on tickets would feature artists that rank higher in popularity, charge a higher price and would be sold out a larger portion of the time. However, the relationship between mark up and primary ticket price among concerts with a meaningful mark up (eliminating concerts with little or no mark up) seemed less straightforward. Mark up appeared to loosely decrease as primary price rose.


Data Findings

Based on this, I decided to test the impact of primary price on the proportion of sold-out shows. I divided all shows with some mark up into quartiles based on primary price. I then compared the proportion of sold-out shows in the 0-50th percentile (lower priced shows) to the proportion in the 50-100th percentile (higher priced shows). On average, the 0-50th percentile shows sold out 8% of the time vs. 5% for the 50-100th percentile.
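The comparison of lower priced and higher priced shows can be sketched as a split at the median primary price. This is an illustrative reconstruction, not the project's code: the input format (a list of `(primary_price, sold_out)` pairs) and the choice to put shows priced exactly at the median in the lower half are assumptions.

```python
from statistics import median
from typing import List, Tuple


def sold_out_share_by_price_half(
    shows: List[Tuple[float, bool]]
) -> Tuple[float, float]:
    """Return (share sold out in lower half, share sold out in upper half).

    `shows` holds (primary_price, sold_out) pairs. Shows priced at or
    below the median go to the lower half (an assumption for this sketch).
    """
    cutoff = median(price for price, _ in shows)
    lower = [sold for price, sold in shows if price <= cutoff]
    upper = [sold for price, sold in shows if price > cutoff]

    def share(flags: List[bool]) -> float:
        # True counts as 1; an empty group has no defined share.
        return sum(flags) / len(flags) if flags else float("nan")

    return share(lower), share(upper)
```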

Using McNemar's test, I was able to reject my null hypothesis that there was no difference between the two groups at a 95% confidence level, with a p-value below .05. This may mean that Taylor Swift's method of increasing ticket prices to squeeze reseller margins could be effective for more artists in reducing mark up, but at the cost of selling out fewer shows.
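As a reference for how McNemar's test is computed, here is a minimal standard-library sketch. McNemar's test works on paired binary outcomes and uses only the discordant pairs; the post does not describe how shows were paired, so the function below simply takes the two discordant counts as inputs. This version omits the continuity correction, and the name `mcnemar_test` is made up for this example.

```python
import math
from typing import Tuple


def mcnemar_test(b: int, c: int) -> Tuple[float, float]:
    """McNemar's chi-square test without continuity correction.

    b: pairs where only the first member had the outcome (e.g. sold out)
    c: pairs where only the second member had the outcome
    Returns (chi-square statistic, two-sided p-value).
    """
    if b + c == 0:
        raise ValueError("McNemar's test needs at least one discordant pair")
    statistic = (b - c) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom:
    # P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value
```

With small discordant counts, an exact binomial version of the test is usually preferred over this chi-square approximation.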

Finally, I decided to examine the impact on the mark up of an artist's tickets if they played two shows in the same city within a short time period. Artists that had two shows in the same city had a mean mark up of 123% vs. 170% for other concerts. I used a t-test to test for a difference in the mean between these two groups and was able to reject the null hypothesis of no difference at a 90% confidence level. This suggests that artists could reduce mark up by playing more than one show in certain cities while touring.

Next Steps

This analysis could be expanded using the SeatGeek and Ticketmaster APIs and by regularly scraping Bowery Presents. With more data, machine learning could be used to predict the mark up on artists' ticket prices in various cities and at various times of year. This would help artists more accurately set prices based on their preferences. Additionally, with knowledge of an artist's ticket sales (how many at which price, sold by which method, etc.) and merchandise sales, artists could further optimize ticket prices to maximize revenue.

About Author

Bennett Gelly

Bennett is a data science fellow at NYCDSA and an MBA candidate at Columbia Business School. He is interested in machine learning-driven business strategy. Bennett brings substantial financial modeling and analytics skills from prior employment in equity research...