Data Study on Baseball Strategies

Posted on Feb 1, 2017
Contributed by Emil Parikh. He is currently in the NYC Data Science Academy 12-week, full-time Data Science Bootcamp program, which runs from January 9th to March 31st, 2017. This post is based on his first class project, Shiny (due in the 4th week of the program).

Links:   GitHub   |   App


Why does it matter?

Baseball is a game flooded with data and statistics, and in that mountain of data it is easy to forget what matters. At one end of the spectrum are irrelevant statistics; I have heard commentators say of hitters something along the lines of, “Did you know that he is the youngest player ever to hit a home run on the second day of the first week of August!?”

At the other end of the spectrum are Paul DePodesta and his Moneyball strategy. If you have not seen the movie, the strategy is essentially to identify key statistics that define value, then buy undervalued players and sell overvalued ones. This can be a good short-run strategy, but once the market catches up to that definition of value, a new definition is needed.

The goal

What I initially set out to do was simply compare the statistics of playoff teams with those of non-playoff teams, to get an idea of what a long-term strategy for making the playoffs might consist of. While I found interesting differences between playoff and non-playoff teams, I was more fascinated by league-wide trends and by how the playoff-status findings combine with those trends.


[Demo of the three sections of my dashboard]

Data Set

Sean Lahman’s Baseball Database contains a wide range of MLB data, including data on batting, pitching, and fielding. There are team-wide and individual player-level data for the regular season and playoffs, and the latest database as of this writing has data ranging from 1871 to 2015.

For the dashboard, I used the Teams.csv file, which has team summary data by year, including summarized batting and pitching statistics. I used only the most recent seasons available (2005 to 2015, 11 seasons in all) because, for the purposes of the dashboard, I did not think it was necessary to look further back unless I could not find trends in that window.
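As a rough illustration of this filtering step, here is a minimal Python sketch. The dashboard itself was built with Shiny and its code is not reproduced in this post, so this is not the original implementation. It assumes the Lahman Teams.csv column names yearID, teamID, DivWin, WCWin, and R, and it treats a team as a playoff team when it won its division or a wild card; that rule is an assumption for illustration. The function name load_team_seasons is hypothetical, and the sample rows are made-up placeholders, not real records.

```python
import csv
import io

# Made-up placeholder rows in the assumed Teams.csv layout (not real records).
SAMPLE_CSV = """yearID,teamID,DivWin,WCWin,R
2004,AAA,Y,N,800
2005,AAA,Y,N,750
2005,BBB,N,Y,720
2005,CCC,N,N,690
2016,AAA,N,N,700
"""


def load_team_seasons(handle, first_year=2005, last_year=2015):
    """Keep seasons in [first_year, last_year] and add a boolean 'playoff' flag."""
    rows = []
    for row in csv.DictReader(handle):
        year = int(row["yearID"])
        if not first_year <= year <= last_year:
            continue
        # Assumption: division winners and wild-card winners are playoff teams.
        row["playoff"] = row["DivWin"] == "Y" or row["WCWin"] == "Y"
        rows.append(row)
    return rows


seasons = load_team_seasons(io.StringIO(SAMPLE_CSV))
for season in seasons:
    print(season["yearID"], season["teamID"], season["playoff"])
```

With the real file, the same function could be called on an open file handle instead of the in-memory sample.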

Dataset sample

Preview of columns from a sample of the data:

Columns added to dataset

I calculated new columns based on existing columns:

Calculated column definitions:

* In baseball, walks, hit-by-pitches, and sacrifice flies and bunts do not count toward batting average because they are not hits. Because they are nonetheless positive outcomes, they are also excluded from the count of at-bats so that they do not drag the batting average down. On-base percentage (OBP) fills the gap by giving credit for reaching base on a walk or hit-by-pitch.
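To make the distinction concrete, here is a small Python sketch of the two rates using their standard definitions: batting average is H / AB, and OBP is (H + BB + HBP) / (AB + BB + HBP + SF). Whether the dashboard's calculated columns used exactly these formulas is an assumption, since the column definitions are not reproduced here. The function names are hypothetical and the totals are made-up placeholders.

```python
def batting_average(hits, at_bats):
    """Hits per at-bat; walks, hit-by-pitches, and sacrifices are not at-bats."""
    return hits / at_bats


def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sac_flies):
    """Standard OBP: times on base over at-bats plus walks, HBP, and sacrifice flies."""
    return (hits + walks + hit_by_pitch) / (at_bats + walks + hit_by_pitch + sac_flies)


# Made-up team-season totals, for illustration only.
totals = {"H": 1400, "AB": 5500, "BB": 500, "HBP": 50, "SF": 40}

avg = batting_average(totals["H"], totals["AB"])
obp = on_base_percentage(
    totals["H"], totals["BB"], totals["HBP"], totals["AB"], totals["SF"]
)
print(f"AVG={avg:.3f} OBP={obp:.3f}")
```

Because walks and hit-by-pitches appear in both the numerator and the denominator of OBP, a team that walks often can post a much higher OBP than its batting average alone would suggest.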

Data Results

Playoff vs. non-playoff


A key statistic I used to compare playoff and non-playoff teams is the mean, across years, of the yearly difference between the two groups' averages for a given variable. Here is the function I wrote to calculate this:

Let’s use the number of Runs Scored to go step-by-step into how the mean difference is calculated.

Here are the first six rows of Runs Scored means per year by playoff status. For example, the first row is saying that in 2005, teams that did not make the playoffs scored an average of 730.7 runs:

The differences of means (playoff minus non-playoff) per year from the previous table:

The mean of the differences of Runs Scored from 2005-2015:


What this tells us is that, on average from 2005-2015, playoff teams scored 69.338 (roughly 69) more runs per season than non-playoff teams. Not only did they score more on average, they scored more in every single year, as you can see in the gap between the blue and red bars:

Runs Scored - Playoff vs. Non-playoff Teams
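The three steps walked through above (yearly means by playoff status, yearly differences, then the mean of those differences) can be sketched in Python as follows. This is not the original function, which is not reproduced in this text; the name mean_yearly_difference, the row layout (dicts with yearID, a boolean playoff flag, and the statistic), and the sample values are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_yearly_difference(rows, column):
    """Return (mean of yearly differences, {year: playoff mean - non-playoff mean})."""
    # Step 1: collect the statistic per year, split by playoff status.
    by_year = defaultdict(lambda: {True: [], False: []})
    for row in rows:
        by_year[row["yearID"]][row["playoff"]].append(float(row[column]))

    # Step 2: difference of group means for each year that has both groups.
    diffs = {}
    for year in sorted(by_year):
        groups = by_year[year]
        if groups[True] and groups[False]:
            diffs[year] = mean(groups[True]) - mean(groups[False])

    # Step 3: mean of the yearly differences (raises StatisticsError if empty).
    return mean(list(diffs.values())), diffs


# Made-up rows, for illustration only.
rows = [
    {"yearID": 2005, "playoff": True, "R": 800},
    {"yearID": 2005, "playoff": True, "R": 760},
    {"yearID": 2005, "playoff": False, "R": 700},
    {"yearID": 2005, "playoff": False, "R": 720},
    {"yearID": 2006, "playoff": True, "R": 790},
    {"yearID": 2006, "playoff": False, "R": 730},
    {"yearID": 2006, "playoff": False, "R": 750},
]

overall, diffs = mean_yearly_difference(rows, "R")
print(diffs, overall)
```

Averaging within each year first keeps a year with unusually high or low scoring from dominating the comparison, since each season contributes exactly one difference.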


As with Runs Scored, Hits and Walks showed the largest batting differences in favor of playoff teams, while Hits Allowed and Walks Allowed showed the largest pitching differences in their favor.

Okay, but this is expected, right? Shouldn’t the teams that can score more runs and get more hits win? Not so fast!

Examining these correlation matrices of hitting statistics (all insignificant correlations are set to 0)…

Playoff teams:

Playoff Team Hitting Variable Correlations

Non-playoff teams:

Non-Playoff Team Hitting Variable Correlations

…it appears that:

  • Runs (R) to Win Percent: Playoff teams have a weak positive correlation of Runs to Win Percent (0.25), while non-playoff teams have a moderate positive correlation (0.40). We have seen previously that playoff teams score more runs overall. The low correlation of Runs to Win Percent could be a sign that it is not the number of runs but the efficiency of runs that sets playoff teams apart.
  • Home runs (HR) to Runs and Walks (BB) to Runs: Playoff teams have a moderate positive correlation of HR to Runs (0.55), while non-playoff teams have a strong positive correlation (0.67). The reverse is true of Walks to Runs. Playoff teams walk more and hit more home runs, as seen previously, but they may be spreading out their sources of runs more effectively. This matters because if a few of their players go cold, others can pick them up. I’ve seen teams rely on their home run hitters, and when those hitters went cold, the team went cold.
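For readers who want to see how a correlation matrix with insignificant entries zeroed out might be built, here is a minimal Python sketch. The significance test used for the matrices above is not stated in this post, so the sketch swaps in one simple option: a two-sided test based on the Fisher z-transform with a normal approximation, at an assumed threshold of 0.05. All function names are hypothetical and the columns are made-up placeholders.

```python
import math
from statistics import NormalDist


def pearson_r(xs, ys):
    """Pearson correlation; assumes equal lengths and non-constant columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value via the Fisher z-transform (normal approximation, n > 3)."""
    if abs(r) >= 1.0:
        return 0.0
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def masked_correlations(columns, alpha=0.05):
    """Map each pair of column names to r, or to 0.0 when p >= alpha."""
    names = list(columns)
    result = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson_r(columns[a], columns[b])
            p = approx_p_value(r, len(columns[a]))
            result[(a, b)] = r if p < alpha else 0.0
    return result


# Made-up columns, for illustration only.
columns = {
    "R": [650, 660, 670, 680, 690, 700, 710, 720],
    "HR": [150, 155, 158, 166, 170, 171, 180, 185],
    "BB": [510, 490, 490, 510, 510, 490, 490, 510],
}

masked = masked_correlations(columns)
for pair, value in masked.items():
    print(pair, round(value, 2))
```

With small samples the normal approximation is rough, so a t-based test from a statistics library would be a reasonable replacement.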

League-wide trends

Not only can the previous correlation observations be useful to keep in mind when acquiring players, they can also be useful in setting rosters and lineups. However, the league-wide trends in the MLB caught my eye more than the differences between teams of varying playoff status.

I saw that runs and hits were down:

Runs Scored - Playoff vs. Non-playoff Teams

Hits - Playoff vs. Non-playoff Teams

…so I thought, “Okay, it probably has a lot to do with the performance-enhancing drug crackdown.” Then I saw the home run numbers, and home runs do not appear to be trending clearly one way or the other. They have their ups and downs; 2014 just happened to be a low year.

Home Runs - Playoff vs. Non-playoff Teams


Finally, I saw the strikeouts:

Strikeouts - Playoff vs. Non-playoff Teams

and the walk numbers:

Walks - Playoff vs. Non-playoff Teams



Closing remarks

Especially at a time in baseball when offensive numbers are down and the defensive side has the upper hand, every scoring opportunity must be taken and sources of runs must be varied. Every opportunity missed is another potential loss, and the opportunities are only getting fewer. The teams that make the playoffs may simply be the best at this. However, efficiency and varied sources of runs might not be the only solution. The defense has adjusted; can the offense adjust back?

Next steps

I made some remarks based on eyeballing plots, but data science is more than that. After further reading and given more time, I would:

  • run significance tests to compare the difference in mean differences and variable correlations
  • further divide the playoff and non-playoff teams. I would compare the bottom 3 to 5 playoff teams to the top 3 to 5 non-playoff teams
  • do a deeper player-level analysis on the league-wide trends to see what I can uncover
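As one possible shape for the first item, here is a hedged Python sketch of an exact sign-flip (paired permutation) test of whether the mean yearly difference is distinguishable from zero. This analysis was not carried out in this post, and the choice of test is an assumption; the function name sign_flip_p_value is hypothetical and the yearly differences are made-up placeholders.

```python
from itertools import product


def sign_flip_p_value(yearly_diffs):
    """Exact two-sided p-value under the null that each difference's sign is arbitrary."""
    n = len(yearly_diffs)
    observed = abs(sum(yearly_diffs) / n)
    at_least_as_extreme = 0
    total = 0
    # Enumerate all 2**n sign assignments; cheap for about a dozen seasons.
    for signs in product((1, -1), repeat=n):
        flipped = sum(s * d for s, d in zip(signs, yearly_diffs)) / n
        if abs(flipped) >= observed - 1e-12:
            at_least_as_extreme += 1
        total += 1
    return at_least_as_extreme / total


# Made-up yearly differences (playoff minus non-playoff), for illustration only.
example_diffs = [70.0, 50.0, 65.0, 80.0, 55.0, 60.0]

p_value = sign_flip_p_value(example_diffs)
print(p_value)
```

When every yearly difference has the same sign, only the all-positive and all-negative assignments are as extreme as the observed mean, so the p-value is 2 divided by 2 to the power of the number of seasons.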

About Author

Emil Parikh

Data Scientist with professional experience in web scraping, predictive modeling, data visualization, and big data with intensive software development experience. Strength in interpreting and converting business needs into solutions. Quick learner and thorough planner with a passion for...