Data Analysis on MVP Voting

Posted on Feb 23, 2021
The skills I demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

Background on Our Data Analysis

With the rise of analytically focused websites such as FanGraphs and Baseball Prospectus, as well as the popularity of the book and movie Moneyball, the use of advanced statistics has become much more mainstream. These metrics have begun to play a vital role in the roster decisions teams make and in the debates fans have across the world.

As the way teams and fans view and evaluate players has changed, I wanted to look into whether the way writers view players has changed as well when it comes time to vote for awards. At the end of each season, writers vote for an MVP in both the American and National League to recognize who was the best player. While the focus historically had been on the triple crown stats (batting average, home runs, and RBI), I wanted to see if advanced stats have begun to play a larger role in these decisions.


For my analysis, I chose to scrape the data. For each year there is an awards page that contains data on the MVP voting for each league, and I collected these pages for 1950 to 2019. I looked only at hitters, using the aforementioned triple crown stats as the traditional stats and WAR as the modern stat. Additionally, I factored in how a player's team performed in a given year to see what impact that had.


League Leaders


I first wanted to look at how the league leader in the different metrics fared in the voting. To do this, I looked at the distribution of where the league leader in WAR, home runs, RBI, and batting average finished in the voting. The first thing that stood out is that the league leader in WAR historically did well in the voting, typically finishing around the top 5. Surprisingly, though, he wasn’t winning the award all that frequently and would occasionally fall outside of the top 10.

In the last 20 years though, particularly the 2010s, the WAR leader has finished very high in the voting, generally in the top 3, and has frequently won the award. It has also been very unlikely for these players to fall out of the top 10 or even the top 5 in voting. The reverse trend appeared to be true for RBI. The RBI leader typically finished towards the top of the voting and had a decent chance of winning the award through the 1990s, but has not fared as well since the new millennium began.
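
As a rough illustration of the league-leader tally described above, here is a minimal sketch. The rows, field names, and numbers are all made up for the example and are not the data or schema I actually used. It also only sees players who received MVP votes, so a league leader who got no votes would be missed.

```python
from collections import Counter

# Made-up rows: one per player who received MVP votes in a league-season.
# Field names and values are illustrative assumptions only.
rows = [
    {"year": 2018, "league": "AL", "player": "A", "war": 10.6, "rbi": 80, "vote_rank": 1},
    {"year": 2018, "league": "AL", "player": "B", "war": 9.9, "rbi": 79, "vote_rank": 2},
    {"year": 2018, "league": "AL", "player": "C", "war": 6.4, "rbi": 130, "vote_rank": 4},
    {"year": 1985, "league": "NL", "player": "D", "war": 5.0, "rbi": 125, "vote_rank": 1},
    {"year": 1985, "league": "NL", "player": "E", "war": 8.1, "rbi": 82, "vote_rank": 6},
]

def leader_finishes(rows, stat):
    """Map each (year, league) to the MVP vote rank of the leader in `stat`.

    On a tie in `stat`, the first row encountered is kept.
    """
    best = {}
    for row in rows:
        key = (row["year"], row["league"])
        if key not in best or row[stat] > best[key][stat]:
            best[key] = row
    return {key: row["vote_rank"] for key, row in best.items()}

# Distribution of where the WAR leader finished across league-seasons.
war_distribution = Counter(leader_finishes(rows, "war").values())
```

Swapping `"war"` for `"rbi"` (or any other numeric column) gives the same distribution for a different stat, which is what makes the decade-by-decade comparison possible.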

Player Stats

Next, I looked at how direct the correlation was between where a player ranked in the triple crown stats and where he finished in the voting, while also accounting for team quality. The team quality benchmark used is a 90-win pace, as that is generally the standard for a very good team. I used a pace rather than a raw win total because the number of games teams have played has varied over the years.

What is noticeable is that there was a clear trend where the higher a player ranked in these stats, the higher he finished in the MVP voting. Team quality was also a large factor in the voting results. Players on better teams have fared better in the voting, though the importance of team quality has shrunk over the years. These trends were also noticeable when performing the same analysis but focusing on a player’s rank in WAR instead.
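
To make the win-pace idea concrete, here is a small sketch. It assumes the pace is scaled to a 162-game season, which is my assumption for the example; the function names are hypothetical as well.

```python
def win_pace(wins, games_played, season_length=162):
    """Scale a team's record to a full-season win total.

    Assumption: pace is normalized to a 162-game schedule.
    """
    return wins / games_played * season_length

def is_very_good_team(wins, games_played, threshold=90):
    """True if the team's record projects to at least `threshold` wins."""
    return win_pace(wins, games_played) >= threshold
```

For example, a team that went 86-68 over a 154-game schedule is on roughly a 90.5-win pace and clears the benchmark, while an 85-69 team falls just short.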

Player Performance 

Lastly, I wanted to see if there were any types of players that typically over or underperformed in MVP voting.

I determined a player's expected MVP finish to be where he ranked in the league in WAR. Each player's score is the gap between that WAR rank and his actual finish, so high numbers are players who finished higher in the voting than their WAR rank, and low numbers are players who finished lower. What I noticed is that the top 10 overperformers were players who typically compiled very high home run and RBI totals, while the underperformers were typically very well-rounded players who provided a lot of value with baserunning and defense in addition to being strong hitters.
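
A minimal sketch of this over/underperformance score for a single league-season is below. The player labels and numbers are invented, and the sketch ranks WAR only among the players passed in, whereas the analysis ranked players within the whole league.

```python
def war_ranks(players):
    """Rank players by WAR within one league-season (1 = highest WAR).

    Ties keep their input order because sorted() is stable.
    """
    ordered = sorted(players, key=lambda p: p["war"], reverse=True)
    return {p["player"]: rank for rank, p in enumerate(ordered, start=1)}

def performance_vs_war(players):
    """WAR rank minus MVP vote rank; positive means the voters
    placed the player higher than his WAR rank would suggest."""
    ranks = war_ranks(players)
    return {p["player"]: ranks[p["player"]] - p["vote_rank"] for p in players}

# Invented example season (illustrative values only).
season = [
    {"player": "slugger", "war": 4.0, "vote_rank": 1},
    {"player": "middle", "war": 6.0, "vote_rank": 2},
    {"player": "all_rounder", "war": 8.0, "vote_rank": 3},
]
scores = performance_vs_war(season)
```

Here the hypothetical slugger scores +2 (third in WAR, first in the voting) and the all-around player scores -2, which mirrors the pattern described above.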

Data on Home Runs

To look further into this, I examined extreme home run and RBI seasons. To determine an extreme season, I looked for players whose home run or RBI totals were more than 1.5 standard deviations above the average among those who received MVP votes in that league that year, and who were not in the top 10 in WAR.

It was very noticeable that these players did in fact perform much better in MVP voting than their WAR rank would indicate, particularly those on very good teams. While this group of players has not fared as well in recent years, they are still finishing within the top 10 in MVP voting, which would indicate there is a bias towards very high home run and/or RBI totals regardless of the player's overall value.
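
The filter for extreme seasons could be sketched roughly as follows. The field names and the choice of the sample standard deviation are assumptions made for the example, and the data is invented.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standard scores using the sample standard deviation.

    Note: raises ZeroDivisionError if every value is identical.
    """
    mu = mean(values)
    sigma = stdev(values)
    return [(v - mu) / sigma for v in values]

def extreme_power_seasons(players, cutoff=1.5, war_top_n=10):
    """Players in one league-season whose HR or RBI total is more than
    `cutoff` standard deviations above the mean of the MVP vote
    receivers, excluding anyone in the top `war_top_n` by WAR."""
    hr_z = z_scores([p["hr"] for p in players])
    rbi_z = z_scores([p["rbi"] for p in players])
    by_war = sorted(players, key=lambda p: p["war"], reverse=True)
    top_war = {p["player"] for p in by_war[:war_top_n]}
    return [
        p["player"]
        for p, hz, rz in zip(players, hr_z, rbi_z)
        if (hz > cutoff or rz > cutoff) and p["player"] not in top_war
    ]

# Invented league-season: eleven similar hitters plus one low-WAR slugger.
vote_receivers = [
    {"player": f"p{i}", "hr": 20, "rbi": 70, "war": 8.0 - 0.5 * i}
    for i in range(11)
]
vote_receivers.append({"player": "slugger", "hr": 55, "rbi": 150, "war": 2.0})

flagged = extreme_power_seasons(vote_receivers)
```

Only the hypothetical slugger is flagged: his totals are far above the group mean, and his WAR ranks last among the twelve players.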

Data on the Underperformers

I also looked into extreme WAR seasons that weren’t paired with an elite offensive season. For this I looked at players whose WAR was more than 1 standard deviation above the league mean among MVP vote receivers, and who did not rank in the top 5 in any of the triple crown stats. It became very noticeable that these players were being undervalued in MVP voting, finishing outside of the top 10 in many cases. You can also see that in recent years these players were much more likely to finish higher in the voting, indicating that the voters are weighing WAR more heavily in their decisions.


The conclusion I was able to draw from this is that while the traditional triple crown stats are still heavily valued when it comes to voting for awards, there is also a clear trend towards relying more on modern, advanced metrics. Going forward, it appears there will be more focus on a player's all-around contribution, as opposed to just hitting. The overall value a player provides is key, as is making sure players get the credit due to them even when their teammates don’t perform as well.

About Author

Ethan Zien

Data Analyst with a background in Social Media Advertising and a strong interest in sports analytics