Predicting the Baseball Hall of Fame

Andrew Rubino
Posted on May 15, 2017


The Great Bambino. The Big Unit. Joltin' Joe. Henry Rowengartner. If you're familiar with baseball, you might recognize some of these names from real life or the movies. Since baseball has been ingrained in the fabric of America for almost 200 years, and since it is my favorite sport, I decided it would be fun to look back at some of the best players ever to play the game and see how modern-day players stack up against them.

The Motivation

Sports analytics has progressed dramatically in recent years. With the wealth of data available for Major League Baseball, many teams now employ analytics departments to extract value from their statistics. I decided to scrape the statistics of Hall of Fame players to investigate them and determine how good a player has to be in order to be inducted. Additionally, I took a sample of data from players who have played since 1989 in order to predict whether or not they might make the Hall of Fame.

Extracting the Data

The parent URL I used to extract Hall of Fame player statistics was on Baseball Reference, a baseball database that has all the baseball statistics one could ever want. I had to write two separate spiders, because a batter's statistical output and a pitcher's statistical output are measured with different statistics. All in all, there are 163 batters in the Baseball Hall of Fame, which translates to a file of roughly 3500 rows (including all their seasons played). There are 77 pitchers in the Hall of Fame, which translates to a file of about 1600 rows (again including all their seasons played). The data for more recent players was downloaded and filtered to include only batters who had over 500 plate appearances per year and pitchers who pitched over 150 innings per year, in order to normalize the numbers.
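The filtering step for recent players can be sketched roughly as follows. This is a minimal illustration, not the code used for this post: the rows, field names ("PA", "IP"), and player labels are made up, and only the two thresholds come from the description above.

```python
# Hypothetical season rows; field names and values are illustrative only.
batter_seasons = [
    {"player": "Batter A", "year": 1995, "PA": 612},
    {"player": "Batter B", "year": 1995, "PA": 340},
]
pitcher_seasons = [
    {"player": "Pitcher A", "year": 1995, "IP": 201.0},
    {"player": "Pitcher B", "year": 1995, "IP": 88.0},
]

MIN_PLATE_APPEARANCES = 500  # threshold stated in the post
MIN_INNINGS_PITCHED = 150    # threshold stated in the post


def qualifying_seasons(rows, stat, minimum):
    """Keep only the seasons where `stat` is strictly greater than `minimum`."""
    return [row for row in rows if row[stat] > minimum]


qualified_batters = qualifying_seasons(batter_seasons, "PA", MIN_PLATE_APPEARANCES)
qualified_pitchers = qualifying_seasons(pitcher_seasons, "IP", MIN_INNINGS_PITCHED)
```

Filtering on playing time like this keeps part-time seasons from dragging down per-season totals such as hits or strikeouts.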


First, I took a look at batters. I wanted to get a sense of the distribution of home runs hit per season by Hall of Fame batters, as well as the distribution of hits per season. The histograms look like this:



Hitting a high number of home runs doesn't appear to be a strong indicator of making it to the Hall of Fame, although it probably doesn't hurt. We do see that the most frequent season total is roughly 160 hits, which translates to about one hit per game. A batter who bats .300 for a season, that is, one who fails to get a hit in 70% of their at bats, is considered a great hitter. I also plotted the total number of strikeouts against the number of walks, and noticed that many Hall of Fame hitters had tremendous plate discipline, indicated by walks being greater than strikeouts.


Finally, I plotted the batters by OPS, or on-base percentage plus slugging percentage, a general statistic that measures the overall offensive value of a player. Suffice it to say, I was not surprised to see Babe Ruth at the top of that list.
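For readers who want to see how OPS is put together, here is a small sketch using the standard definitions of on-base percentage and slugging percentage. The season line is invented for illustration and does not belong to any real player.

```python
def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sac_flies):
    """OBP = (H + BB + HBP) / (AB + BB + HBP + SF)."""
    denominator = at_bats + walks + hit_by_pitch + sac_flies
    return (hits + walks + hit_by_pitch) / denominator if denominator else 0.0


def slugging_percentage(singles, doubles, triples, home_runs, at_bats):
    """SLG = total bases / AB."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats if at_bats else 0.0


# Made-up season: 150 hits in 500 at bats (a .300 batting average).
singles, doubles, triples, home_runs = 90, 30, 5, 25
hits = singles + doubles + triples + home_runs
at_bats, walks, hit_by_pitch, sac_flies = 500, 70, 5, 5

batting_average = hits / at_bats
obp = on_base_percentage(hits, walks, hit_by_pitch, at_bats, sac_flies)
slg = slugging_percentage(singles, doubles, triples, home_runs, at_bats)
ops = obp + slg
```

Because OPS adds a rate of reaching base to a rate of total bases per at bat, it rewards both patience and power in one number.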

For pitchers, I mostly wanted to see their performance in terms of the outcomes they could control: home runs allowed, strikeouts, and walks. I plotted strikeouts per nine innings against walks per nine innings. We see an interesting linear trend in which the higher the strikeout rate, the higher the walk rate. This could say something about a pitcher's ability to spin and curve the baseball in order to deceive hitters, which might lead to a lot of strikeouts but also less control.
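The per-nine-innings rates are simple to compute: scale the raw count by nine and divide by innings pitched. A quick sketch, with made-up season totals:

```python
def per_nine(count, innings_pitched):
    """Scale a counting stat (strikeouts, walks, ...) to a nine-inning rate."""
    if innings_pitched <= 0:
        raise ValueError("innings_pitched must be positive")
    return 9 * count / innings_pitched


# Invented season totals, for illustration only.
strikeouts, walks, innings = 200, 60, 180.0

k_per_9 = per_nine(strikeouts, innings)
bb_per_9 = per_nine(walks, innings)
```

Using rates rather than raw totals lets pitchers with very different workloads share the same plot.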


In order to measure overall value of a pitcher, I measured their FIP, or fielding independent pitching. This takes into account number of home runs allowed, number of strikeouts, number of walks, and a constant number to normalize the statistic. The lower the FIP, the better.
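As a rough sketch, the commonly published form of FIP looks like the function below. Two things here are assumptions rather than details from this post: the usual formula also counts hit batters alongside walks (set to zero by default here), and the constant varies by league and season, so the 3.10 used below is only a placeholder.

```python
def fip(home_runs, walks, strikeouts, innings_pitched,
        hit_by_pitch=0, constant=3.10):
    """Fielding independent pitching.

    FIP = (13*HR + 3*(BB + HBP) - 2*K) / IP + constant

    `constant` is a placeholder; the real value is recomputed per
    league and season so that FIP sits on the same scale as ERA.
    """
    if innings_pitched <= 0:
        raise ValueError("innings_pitched must be positive")
    numerator = 13 * home_runs + 3 * (walks + hit_by_pitch) - 2 * strikeouts
    return numerator / innings_pitched + constant


# Invented season line, for illustration only.
example_fip = fip(home_runs=15, walks=50, strikeouts=200, innings_pitched=200.0)
```

The weights show why home runs hurt so much: one home run costs a pitcher more than four walks or six and a half strikeouts' worth of credit.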

According to FIP, Ed Walsh is the best pitcher in the Hall of Fame, followed by the more contemporary Pedro Martinez.


Predicting Hall of Famers

Since I had the Hall of Fame statistics, I figured that I could use them as a baseline and try to fit a logistic regression model that would take data for more recent players and predict whether or not they would be included among the players immortalized there. I combined my Hall of Fame data with the separate subset of more recent players and then used cross-validation to train the model. Then I generated predictions and evaluated them on the test set.
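To make the idea of logistic regression concrete, here is a toy sketch that fits a one-feature model with plain batch gradient descent. It is not the fitting routine used for this post (a statistics package would normally do this for you), and the feature values and labels are invented.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(xs, ys, learning_rate=0.5, n_iter=2000):
    """Fit a weight and intercept for one feature by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        preds = [sigmoid(w * x + b) for x in xs]
        # Gradients of the average log loss with respect to w and b.
        grad_w = sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        grad_b = sum(p - y for p, y in zip(preds, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def predict(w, b, x, threshold=0.5):
    """Return 1 (Hall of Fame) if the predicted probability meets the threshold."""
    return 1 if sigmoid(w * x + b) >= threshold else 0


# Made-up data: one scaled feature per player, 1 = in the Hall of Fame.
feature = [0.0, 1.0, 2.0, 3.0]
in_hall = [0, 0, 1, 1]

w, b = fit_logistic(feature, in_hall)
predictions = [predict(w, b, x) for x in feature]
```

The model outputs a probability between 0 and 1, which is what makes it a natural fit for a yes-or-no question like Hall of Fame induction.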

For predicting players, I used backwards selection and tested the correlation between each pair of variables to make sure there was as little multicollinearity as possible. I then created a binary variable to categorize players who are in the Hall of Fame versus those who aren't. For batters, my best model had hits per season as the most significant predictor (the higher, the better), followed by overall strikeout rate (the lower, the better). The accuracy of this model was 97%, which beat the baseline of 83% by a wide margin. The accuracy was very high and the variance was very low, though I definitely could have used more data to obtain a better model.
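Two of the checks mentioned here are easy to sketch: a pairwise correlation between candidate predictors, and a baseline accuracy to compare the model against. The sketch assumes the baseline is the "always predict the most common class" rule, which the post does not state explicitly, and every number in it is made up.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label (assumed baseline)."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels: 83 non-Hall of Famers and 17 Hall of Famers.
labels = [0] * 83 + [1] * 17
baseline = majority_baseline_accuracy(labels)

# Made-up predictors that move together perfectly, a red flag for multicollinearity.
r = pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```

A model is only interesting if its accuracy clears that baseline, and a correlation near 1 or -1 between two predictors suggests one of them can be dropped.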

For pitchers, my best predictors were walks and hits per inning (the fewer, the better) and home runs per 9 innings (the fewer, the better). FIP was also included in the model, though it might not have been necessary, as it is essentially a combined measure of home runs, walks, and strikeouts. The accuracy of this model was 95%, which beat the baseline of 90%. Again, this result is likely biased by the small amount of data and by how little cross-validation I did.

Further Improvements

There are a few things worth noting for the model:

  • I only used one training set and one test set, and could have used a higher K value to cross-validate the model.
  • The data is highly biased: I could have grabbed more data for players who are not in the Hall of Fame, but did not due to time constraints.
  • We could use more advanced statistics to predict more accurately, but I used only the statistics I scraped.
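On the first point, a K-fold split could be sketched along these lines. This is only an illustration of the idea, not code from this project: the function name and the seed are arbitrary, and a library routine would normally handle this.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for K-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


# Example: 10 players split into 5 folds; each player is tested exactly once.
splits = list(k_fold_indices(10, 5))
```

Averaging the accuracy over all K test folds gives a steadier estimate than a single train/test split, which matters most when the data set is small.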


With more statistics, we should be able to create a model that will knock it out of the park (I'm sorry).

About Author

Andrew Rubino

Andrew graduated from UC Santa Barbara with a degree in English and Statistics. He previously worked at an adtech company where he learned the ins and outs of cleaning, transforming, and reporting on big data using a variety...