Take Me Out to the Ballgame: Predicting MLB Attendance (with Web Scraping & Shiny)

Michael Todisco
Posted on Mar 22, 2016

Contributed by Michael Todisco. He is currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program, taking place from January 11th to April 1st, 2016. This post is based on his third class project, Web Scraping and Shiny (due in the 6th week of the program).

Overview

Is there anything better than going to a baseball game? Provided it's during the day, sunny, above seventy degrees, on a Saturday, and there's a free hat giveaway for the first 30k fans. The point is that many factors go into a fan's decision to show up at the ballpark. My goal for this project was to explore these variables using historical attendance data and to predict game-by-game attendance for the New York Yankees' upcoming season.

Getting the Data

The majority of the data was very easy to obtain using Baseball-Reference's Play Index tool. The site has robust data that dates back to the 1800s. However, one aspect of the project that I insisted on including was promotional data. This refers to a team's pre-game promotions, such as a hat or bobblehead giveaway. This data is not tracked on Baseball-Reference. The only place I could find it was the New York Yankees' official website, and it only dated back to 2009. Furthermore, the data could not be downloaded; I would have to scrape the calendars instead.

Web Scraping

The promotional data I needed was in the form of calendars. I decided to use Python and Beautiful Soup for the scraping, and after inspecting the elements of the website, it didn't seem like it would be too much trouble to gather the information.

However, I immediately hit a wall when I realized the calendars were interactive and written with JavaScript. After a lot of searching for solutions, I found Dryscrape, which allowed me to access the text. Once I had that in place, it only took a few simple 'for' loops that appended the values to lists. Finally, I wrote the scraped data to a .csv file.
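The loop-and-append step can be sketched roughly as follows. This is a minimal illustration, not the code used for this project: it assumes the rendered HTML is already available as a string (the part Dryscrape handled), it uses Python's standard-library html.parser in place of Beautiful Soup so that it runs without extra packages, and the markup, class names, dates and promotion names are all hypothetical.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical markup: the real calendar's tags and class names are not shown here.
SAMPLE_HTML = """
<table class="calendar">
  <tr><td class="game" data-date="2015-04-06"><span class="promo">Magnetic Schedule</span></td>
      <td class="game" data-date="2015-04-08"></td></tr>
  <tr><td class="game" data-date="2015-04-11"><span class="promo">Cap Night</span></td></tr>
</table>
"""


class PromoParser(HTMLParser):
    """Collects one [date, promotion] pair per game cell."""

    def __init__(self):
        super().__init__()
        self.rows = []         # grows as game cells are encountered
        self.in_promo = False  # True while inside a promotion <span>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td" and attrs.get("class") == "game":
            # New game cell: start with an empty promotion string.
            self.rows.append([attrs.get("data-date"), ""])
        elif tag == "span" and attrs.get("class") == "promo":
            self.in_promo = True

    def handle_data(self, data):
        if self.in_promo and self.rows:
            self.rows[-1][1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_promo = False


def scrape_promos(html):
    """Return a list of [date, promotion] rows found in the HTML string."""
    parser = PromoParser()
    parser.feed(html)
    return parser.rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Date", "Promotion"])
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(scrape_promos(SAMPLE_HTML)))
```

Games without a promotion are kept with an empty promotion field, so every home date still appears in the output and can later be matched against the game data by date.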

Preparing the Data

Once I had my scraped promotional data and the main dataset from Baseball-Reference, I decided to load them into R.

nyy_scraped = read.csv('nyy.csv')
yankees = read.csv('Yankees.csv')

There were a few column manipulations and additions that I made.  Below is one example, which adds in the opponent's league column.
###Adding Column For Opponent's League###
AL = c('Baltimore Orioles', 'Boston Red Sox', 'Chicago White Sox', 
       'Cleveland Indians', 'Detroit Tigers', 'Houston Astros', 
       'Kansas City Royals', 'Los Angeles Angels', 'Minnesota Twins', 
       'New York Yankees', 'Oakland Athletics', 'Seattle Mariners', 
       'Tampa Bay Rays', 'Texas Rangers', 'Toronto Blue Jays')

# Label each opponent's league; any team not listed in AL is treated as National League
yankees$Opp_league = ifelse(yankees$Opp %in% AL, 'American League', 'National League')

The main data-preparation step was to merge the scraped dataset with the data I downloaded from Baseball-Reference. I did this using 'Date' as the key, keeping every game from the Baseball-Reference data (all.x = TRUE) whether or not it had a matching promotion.

yankees = merge(yankees, nyy_scraped, by = 'Date', all.x = TRUE)
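For readers less familiar with R's merge, the effect of all.x = TRUE (a left join) can be sketched in Python, the language used for the scraping step. This is only an illustration under assumptions: the rows, attendance figures and the 'Promotion' column name are made up, each date is assumed to appear at most once in the scraped data, and deriving a yes/no 'Promo' flag from the missing values is one possible approach rather than necessarily what was done here.

```python
def left_join(left, right, key):
    """Keep every row of `left`; add the columns of `right` where `key` matches."""
    right_by_key = {row[key]: row for row in right}  # assumes unique keys in `right`
    extra_cols = sorted({col for row in right for col in row if col != key})
    merged = []
    for row in left:
        match = right_by_key.get(row[key], {})
        out = dict(row)
        for col in extra_cols:
            out[col] = match.get(col)  # None plays the role of R's NA
        merged.append(out)
    return merged


# Made-up rows for illustration only.
games = [
    {"Date": "2015-04-06", "Attendance": 40000},
    {"Date": "2015-04-08", "Attendance": 35000},
]
promos = [
    {"Date": "2015-04-06", "Promotion": "Magnetic Schedule"},
]

merged = left_join(games, promos, "Date")

# Hypothetical follow-up: turn missing promotions into a yes/no flag.
for row in merged:
    row["Promo"] = "Yes" if row["Promotion"] else "No"

if __name__ == "__main__":
    for row in merged:
        print(row)
```

The join only works if both files write dates in exactly the same format, so it is worth checking how many games end up with a missing promotion after the merge.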

Here's a shot of some of the data in the app.

[Screenshot: sample of the data as displayed in the app]

Building the Shiny App

With the data collected and prepared, my next step was to build the interactive tool using Shiny. For this app I created a front-end UI.R file along with a back-end Server.R file. The two work together to provide the interactive functionality. Below is a small snapshot of each; the full code can be found here.

[Screenshots: excerpts of UI.R (left) and Server.R (right)]

Here is a shot of the main tab/graph of the Shiny App.

[Screenshot: main tab and graph of the Shiny App]

The functionality and features that I built into the Shiny App were the following:

  • Attendance Visualizations by:
    • Season
    • Opponent
    • Day of the Week
    • Month
    • Particular game
  • Visualizations can be filtered by:
    • Game time (day v. night)
    • Temperature
    • Weather
    • Promotion (yes v. no)

The graphs and value boxes - average, maximum and minimum attendance - update as each filter is changed.

Attendance Analysis

The best thing about Shiny and its interactive functionality is that the tool offers an almost endless number of filter combinations. In addition, a user can freely gather insights on overarching trends or get information as granular as a single game. The average attendance for games against the Orioles, on Tuesday nights, with a promotion running and temperatures in the mid-50s, is only a few clicks away.

Here are a few of the high-level insights that I was able to take away. Some are fairly straightforward and common sense, but others are quite interesting.

  • Day games have, on average, higher attendance than night games. For example, in 2013 average attendance was 42k for day games and 39k for night games. This was surprising to me, but I figured out that it was largely due to the next point.
  • Weekend games bring in more fans than weekday games. This makes sense, since people are typically off work. And, to the first point, most of the Saturday and Sunday games take place during the day.
  • Temperature and weather have little effect on attendance, especially when the Yankees have a good record and are in the playoff hunt towards the later part of the season, when the weather tends to be less friendly to fans.
  • The Yankees ran promotions for roughly 30-40% of their home games. Last year, games with promotions drew about 2k more attendees than games without.
  • Total attendance over the entire season has been steadily declining, from 3.8 million in 2009 to 3.2 million in 2015.

Prediction Model

The MLB 2016 schedule is of course already out, and the Yankees have also released their promotional calendar for the upcoming season. Using this information and the historical data behind the Shiny App, I was able to fit a model to predict attendance for the 2016 season.

I used the multiple linear regression model below to regress attendance onto six predictors. The yankees.train data frame is the training subset created in the train/test code further down.

train.model = lm(Attendance ~ Opening_Day + Month + DOW + DayNight + Opp + Promo, data = yankees.train)

Because this is a linear regression model, I graphically checked that it does not violate the standard assumptions: linearity, normality of the residuals, constant variance, and independent errors. I also checked for outliers and influential points.

[Figures: regression diagnostic plots for the model]

The graphs aren't perfect and there are definitely some areas of concern, but for the most part I can accept this model as adhering to the assumptions of linear regression.

The model was validated with a single random split into a training set (two-thirds of the games) and a test set (the remaining third), created with the following code:

# Train and test sets: a random 2/3 vs. 1/3 split of the games
set.seed(0)  # make the random split reproducible
n_train = trunc((2/3) * nrow(yankees))
training_test = c(rep(1, n_train), rep(2, nrow(yankees) - n_train))
yankees$training_test = sample(training_test)  # random permutation of the labels
yankees$training_test = factor(yankees$training_test,
                               levels = c(1, 2), labels = c("TRAIN", "TEST"))
yankees.train = subset(yankees, training_test == 'TRAIN')
# March games are excluded from the test set
yankees.test = subset(yankees, training_test == 'TEST' & Month != 'March')

Here is a visual of the accuracy of the model on the training and test sets:

[Figure: model accuracy on the training and test sets]
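A visual check like this is often paired with a single error number per set. One common choice, which is not a result reported in this post, is the root mean squared error (RMSE). The sketch below shows the calculation only; the attendance figures are made up and the function name is hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("need at least one observation")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up attendance figures for illustration only.
actual_attendance = [40000, 35000, 45000]
predicted_attendance = [41000, 37000, 43000]

if __name__ == "__main__":
    print(round(rmse(actual_attendance, predicted_attendance)))
```

Computing the same number on the training and test sets gives a quick read on overfitting: a test error much larger than the training error suggests the model does not generalize well.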

I graphed the model's y-variable (Attendance) on a season-long plot on its own tab in the Shiny App. A user can select a specific game or a range of games to view their attendance.

[Screenshot: season-long attendance plot in the Shiny App]

There is a lot about the model and predicted attendance that looks promising. The games with the highest predicted attendance for the Yankees are Saturday day games against the rival Boston Red Sox and Baltimore Orioles. Games with low estimated attendance occur on Tuesday nights and do not involve a promotion. However, there will undoubtedly be error in the regression, and it will need to be tuned moving forward.

Conclusion

I was very pleased with how this project turned out. I combined many aspects of data science that I have learned in this bootcamp: R, Python, web scraping, machine learning and Shiny. Shiny is a really useful and well-designed tool that I will continue to use in my projects and work moving forward.

Next steps for this project include scraping the rest of the MLB promotional calendars so that I can include every team and not just the Yankees. I also want to train more accurate and more complex models to predict attendance for the 2016 season. While multiple linear regression is easy to understand and interpret, I believe a more sophisticated supervised learning technique will yield better results.

About Author

Michael Todisco

Michael has a B.A. in economics from Johns Hopkins University. Over the last four years he has been in the professional world, working at two NYC start-up companies. First, at JackThreads where he was a data analyst for...