
Take Me Out to the Ballgame: Predicting MLB Attendance (with Web Scraping & Shiny)

Michael Todisco
Posted on Mar 22, 2016

Contributed by Michael Todisco. He is currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program taking place from January 11th to April 1st, 2016. This post is based on his third class project, Web Scraping and Shiny (due in the 6th week of the program).

Overview

Is there anything better than going to a baseball game? Provided it's during the day, sunny, above seventy degrees, on a Saturday, and there's a free hat giveaway for the first 30k fans. The point is that many factors go into a fan's decision to show up to the ballpark. My goal for this project was to explore these variables with historical attendance data and predict the game-to-game attendance totals for the New York Yankees' upcoming season.

Getting the Data

The majority of the data was very easy to obtain using Baseball-Reference's Play Index tool. The site has robust data that dates back to the 1800s. However, one aspect of the project that I was intent on including was promotional data. This refers to a team's pre-game promotions, such as a hat or bobble-head giveaway. This data is not tracked on Baseball-Reference. The only place I could find it was on the New York Yankees' official website, and it only dated back to 2009. Furthermore, the data could not be downloaded, so I had to scrape the calendars instead.

Web Scraping

The promotional data I needed was in the form of calendars. I decided to use Python and Beautiful Soup to do the scraping, and after inspecting the elements of the website, it didn't seem like it would be too much trouble to gather the information.

However, I ran into an immediate wall when I realized the calendars were interactive and rendered with JavaScript. After a lot of searching for solutions, I found Dryscrape, which allowed me to access the rendered text. Once I had that in place, it only took a few simple 'for' loops that appended the results to lists. Finally, I wrote the scraped data to a .csv file.

Preparing the Data

Once I had my scraped promotional data and the main dataset from Baseball-Reference, I decided to load them into R.

nyy_scraped = read.csv('nyy.csv')
yankees = read.csv('Yankees.csv')

There were a few column manipulations and additions that I made.  Below is one example, which adds in the opponent's league column.
### Adding a column for the opponent's league ###
# Current American League teams; any opponent not listed is treated as National League.
# Note: the Houston Astros moved to the AL in 2013, so their earlier games are labeled AL here as well.
AL = c('Baltimore Orioles', 'Boston Red Sox', 'Chicago White Sox', 
       'Cleveland Indians', 'Detroit Tigers', 'Houston Astros', 
       'Kansas City Royals', 'Los Angeles Angels', 'Minnesota Twins', 
       'New York Yankees', 'Oakland Athletics', 'Seattle Mariners', 
       'Tampa Bay Rays', 'Texas Rangers', 'Toronto Blue Jays')

# Vectorized lookup: one league label per opponent
yankees$Opp_league = ifelse(yankees$Opp %in% AL, 'American League', 'National League')

The main step was to merge the scraped dataset with the data I downloaded from Baseball-Reference. I did this using 'Date' as the key.

yankees = merge(yankees, nyy_scraped, by = 'Date', all.x = TRUE)
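To illustrate what `all.x = TRUE` does in a merge like this, here is a minimal sketch on made-up data. The dates, attendance figures, and the `Promotion` and `Promo` column names are assumptions for illustration, not values from the actual files. Games with no matching promotion row are kept and receive `NA`, which can then be turned into a yes/no flag.

```r
# Toy stand-ins for the two datasets (all values are made up for illustration)
games = data.frame(
  Date = c('2015-04-06', '2015-04-08', '2015-04-09'),
  Attendance = c(48000, 31000, 32000),
  stringsAsFactors = FALSE
)
promos = data.frame(
  Date = c('2015-04-06'),
  Promotion = c('Magnetic Schedule Night'),
  stringsAsFactors = FALSE
)

# Left join: every game is kept; games with no promotion get NA
merged = merge(games, promos, by = 'Date', all.x = TRUE)

# Derive a simple yes/no promotion flag from the missing values
merged$Promo = ifelse(is.na(merged$Promotion), 'No', 'Yes')

print(merged)
```

A join like this only matches when the key is written identically in both files, so it is worth confirming that both 'Date' columns use the same format before merging.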

Here's a shot of some of the data in the app.

[Screenshot: sample of the data shown in the app]

Building the Shiny App

With the data collected and prepared, my next step was to build the interactive tool using Shiny. With Shiny, it is necessary to create a front-end UI.R file, along with a back-end Server.R file. The two working together provide the interactive functionality. Below is a small snapshot of each, but the full code can be found here.

[Screenshots: excerpts of UI.R and Server.R]

Here is a shot of the main tab/graph of the Shiny App.

[Screenshot: main tab and graph of the Shiny App]

The functionality and features that I built into the Shiny App were the following:

  • Attendance Visualizations by:
    • Season
    • Opponent
    • Day of the Week
    • Month
    • Particular game
  • Visualizations can be filtered by:
    • Game time (day v. night)
    • Temperature
    • Weather
    • Promotion (yes v. no)

The graphs and value boxes - average, maximum and minimum attendance - update as each filter is changed.
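The filtering behind the value boxes can be sketched in base R, outside of Shiny. This is not the app's actual code: the function name `attendance_summary`, the column names, and the toy values are all hypothetical.

```r
# Toy data with hypothetical column names and made-up values
games = data.frame(
  Opp = c('Boston Red Sox', 'Baltimore Orioles', 'Boston Red Sox', 'Tampa Bay Rays'),
  DayNight = c('D', 'N', 'N', 'D'),
  Promo = c('Yes', 'No', 'Yes', 'No'),
  Temp = c(72, 55, 64, 80),
  Attendance = c(47000, 33000, 41000, 36000),
  stringsAsFactors = FALSE
)

# Filter the games and compute the three value-box statistics.
# A NULL argument means "do not filter on this field".
attendance_summary = function(df, day_night = NULL, promo = NULL,
                              temp_range = c(-Inf, Inf)) {
  keep = df$Temp >= temp_range[1] & df$Temp <= temp_range[2]
  if (!is.null(day_night)) keep = keep & df$DayNight == day_night
  if (!is.null(promo)) keep = keep & df$Promo == promo
  sub = df[keep, ]
  # No games match the filters: return NA for all three statistics
  if (nrow(sub) == 0) return(c(average = NA, maximum = NA, minimum = NA))
  c(average = mean(sub$Attendance),
    maximum = max(sub$Attendance),
    minimum = min(sub$Attendance))
}

# Day games only
day_stats = attendance_summary(games, day_night = 'D')
# Promotion games played at 60 degrees or warmer
promo_stats = attendance_summary(games, promo = 'Yes', temp_range = c(60, Inf))
print(day_stats)
print(promo_stats)
```

In a Shiny server, a function like this would typically be called inside a reactive expression with the current input values as arguments, so the statistics recompute whenever a filter changes.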

Attendance Analysis

The best thing about Shiny and its interactive functionality is that there is an almost endless number of filter combinations available within the tool. In addition, a user can freely gather insights on overarching trends or get information as granular as a single game. The average attendance for games against the Orioles, on Tuesday nights, with a promotion running and temperatures in the mid-50s, is only a few clicks away.

Here are a few of the high-level insights that I was able to take away. Some are fairly straightforward and common sense, but others are quite interesting.

  • Day games have, on average, higher attendance numbers than night games. For example, in 2013 average attendance was 42k for day games and 39k for night games. This was surprising to me, but I figured out that it was largely due to the next point.
  • Weekend games bring in more fans than weekday games. This makes sense since people are typically off work on weekends, and, to the first point, most of the Saturday and Sunday games take place during the day.
  • Temperature and weather have little effect on the attendance numbers, especially when the Yankees have a good record and are in the playoff hunt towards the later part of the season, when weather tends to be less friendly to fans.
  • The Yankees ran promotions for roughly 30-40% of their home games. Last year, games with promotions drew about 2k more attendees than games without them.
  • Total attendance over the entire season has been steadily declining, from 3.8 million in 2009 to 3.2 million in 2015.

Prediction Model

The MLB 2016 schedule is of course already out, and the Yankees have also released their promotional calendar for the upcoming season. Using this information and the historical data behind the Shiny App, I was able to fit a model to predict attendance numbers for the 2016 season.

I used the multiple linear regression model below to regress attendance onto six predictors.

train.model = lm(Attendance ~ Opening_Day + Month + DOW + DayNight + Opp + Promo, data = yankees.train)

Because this is a linear regression model, it was necessary to check graphically that the model does not violate the assumptions of linearity, normality, constant variance, and independent errors. I also checked for outliers and influential points.

[Figures: regression diagnostic plots]

The graphs aren't perfect and there are definitely some areas of concern, but for the most part I can accept this model as adhering to the assumptions of linear regression.
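For readers who want to reproduce this kind of check, here is a minimal sketch using base R's built-in diagnostics on simulated data. It does not use the Yankees data, and the variable names, effect sizes, and the 4/n cutoff for Cook's distance are illustrative assumptions rather than the choices made in this project.

```r
set.seed(1)
# Simulated data (not the Yankees data): attendance driven by
# day/night, promotion, and temperature, plus random noise
n = 200
toy = data.frame(
  DayNight = factor(sample(c('D', 'N'), n, replace = TRUE)),
  Promo = factor(sample(c('No', 'Yes'), n, replace = TRUE)),
  Temp = runif(n, 45, 90)
)
toy$Attendance = 38000 + 3000 * (toy$DayNight == 'D') +
  2000 * (toy$Promo == 'Yes') + 20 * toy$Temp + rnorm(n, sd = 2500)

fit = lm(Attendance ~ DayNight + Promo + Temp, data = toy)

# Four standard diagnostic plots: residuals vs fitted (linearity),
# normal Q-Q (normality), scale-location (constant variance),
# and residuals vs leverage (influential points)
par(mfrow = c(2, 2))
plot(fit)

# Flag potentially influential observations with a common rule of thumb
cooks = cooks.distance(fit)
influential = which(cooks > 4 / n)
print(influential)
```

These four plots do not directly test independence of errors; for game-by-game data, plotting the residuals in date order is one simple way to look for patterns over time.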

The model was validated with a single random split of the data into a training set (about two-thirds of the games) and a test set (the remaining third), using the following code:

#Train and test set
set.seed(0)
training_test = c(rep(1, length = trunc((2/3) * nrow(yankees))),
                  rep(2, length = (nrow(yankees) - trunc((2/3) * nrow(yankees)))))
yankees$training_test = sample(training_test) #random permutation
yankees$training_test = factor(yankees$training_test,
                               levels = c(1,2), labels = c("TRAIN", "TEST"))
yankees.train = subset(yankees, training_test == 'TRAIN')
yankees.test = subset(yankees, training_test == 'TEST' & Month != 'March')
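One way to put a number on how well a model fit on the training set carries over to the test set is to compare the root mean squared error (RMSE) on each. The sketch below uses simulated data with hypothetical columns and made-up effect sizes, and RMSE is my choice of metric here, not necessarily the one used for this project.

```r
set.seed(0)
# Simulated stand-in for the game data (values are made up)
n = 300
toy = data.frame(
  DOW = factor(sample(c('Tue', 'Sat'), n, replace = TRUE)),
  Promo = factor(sample(c('No', 'Yes'), n, replace = TRUE))
)
toy$Attendance = 36000 + 6000 * (toy$DOW == 'Sat') +
  2000 * (toy$Promo == 'Yes') + rnorm(n, sd = 3000)

# Same idea as the split above: about two-thirds of the rows go to training
train_rows = sample(n, size = trunc((2/3) * n))
toy.train = toy[train_rows, ]
toy.test = toy[-train_rows, ]

fit = lm(Attendance ~ DOW + Promo, data = toy.train)

# Root mean squared error: typical size of a prediction miss, in fans
rmse = function(actual, predicted) sqrt(mean((actual - predicted)^2))

train_rmse = rmse(toy.train$Attendance, predict(fit, newdata = toy.train))
test_rmse = rmse(toy.test$Attendance, predict(fit, newdata = toy.test))
print(c(train = train_rmse, test = test_rmse))
```

A test RMSE close to the training RMSE suggests the model is not badly overfit. One practical caveat: `predict` raises an error when the test set contains a factor level that never appeared in the training set, so rare categories (a month with only a handful of games, for example) may need to be handled before scoring.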

Here is a visual of the accuracy of the model on the training and test sets:

[Figure: model accuracy on the training and test sets]

I graphed the model's response variable (Attendance) on a season-long plot in the Shiny App on its own tab. A user can select a specific game or a range of games to view their attendance.

[Screenshot: season-long attendance plot in the Shiny App]

There is a lot about the model and predicted attendance that looks promising. The games with the highest predicted attendance for the Yankees are Saturday day games against the rival Boston Red Sox and Baltimore Orioles. Games with low estimated attendance occur on Tuesday nights and do not involve a promotion. However, there will undoubtedly be error in the regression, and it will need to be tuned moving forward.

Conclusion

I was very pleased with how this project turned out. I combined many aspects of data science that I have learned in this bootcamp: R, Python, web scraping, machine learning and Shiny. Shiny is a really useful and well-designed tool that I will continue to use in my projects and work moving forward.

Next steps for this project include scraping the rest of the MLB promotional calendars so that I can include every team and not just the Yankees. I also want to train more accurate and complex models to predict the attendance numbers for the 2016 season. While multiple linear regression is easy to understand and interpret, I believe a more sophisticated supervised learning technique will yield better results.

About Author

Michael Todisco

Michael has a B.A. in economics from Johns Hopkins University. Over the last four years he has been in the professional world, working at two NYC start-up companies. First, at JackThreads where he was a data analyst for...