Visualizing Data In College Basketball
The skills I demonstrate here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Background and Motivation
Every year, 68 college basketball teams get a chance to compete in a tournament known by many as March Madness. One of the exciting things about this tournament, aside from the action of good basketball games, is filling in your bracket and predicting how far various teams in the tournament will advance. Some people take this part more seriously than others. Many just select their alma mater as the winner. Others choose a team according to its colors or mascot. But for those who take it seriously, there is a lot more data involved in picking a winner.
Sportsbooks, for example, have to accurately predict not only the winners and losers of certain matchups, but also how much a certain team will win by, along with the total score of both teams, in order to make a profit. Serious gamblers, who are risking their own money, also have to make smart and educated predictions in order to come out ahead.
How can we make accurate predictions? What are the characteristics of teams that tend to advance far into the tournament? How helpful can results from games before the tournament be in making predictions? How helpful can results from previous tournaments be in making predictions?
For this project, I took numerous datasets from Kaggle.com containing basic box score statistics of regular season and tournament games from as early as 2003. I did several kinds of data preprocessing, from reshaping the data to merging multiple datasets together. All of the code for these data manipulations can be found on my GitHub.
I made various plots analyzing the effectiveness of certain statistics in measuring how good a team is, as well as how well these statistics can predict tournament success. All of these data manipulations and visualizations culminate in a very basic predictive model that provides a predicted game score for a user's choice of two teams.
Data on The Interactive Application
The application is broken down into six different sections. Each offers some level of interactivity to enable users to discover their own trends and insights.
This section of the app allows the user to select a certain game statistic. The two displayed boxplots represent the distribution of that game statistic for the winning teams and the losing teams for all regular season games dating back to 2010. For example, below we have the distribution of the number of assists for the winning teams and the losing teams. This plot tells us that assists tend to be higher, on average, for winning teams.
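The comparison behind these boxplots can be sketched in a few lines of Python. This is only an illustration on made-up numbers, not the app's code: the `winner_ast` and `loser_ast` keys and the `summarize` helper are hypothetical names, and the real data would come from the Kaggle box scores.

```python
from statistics import mean, quantiles

# Toy data: one row per game, with assists for the winning and losing team.
# The key names are hypothetical, not the actual dataset's columns.
games = [
    {"winner_ast": 18, "loser_ast": 11},
    {"winner_ast": 15, "loser_ast": 13},
    {"winner_ast": 12, "loser_ast": 14},
    {"winner_ast": 20, "loser_ast": 9},
    {"winner_ast": 16, "loser_ast": 12},
]


def summarize(values):
    """Return the quartiles and mean that a boxplot would summarize."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {"q1": q1, "median": q2, "q3": q3, "mean": mean(values)}


winners = summarize([g["winner_ast"] for g in games])
losers = summarize([g["loser_ast"] for g in games])

print("winners:", winners)
print("losers: ", losers)
```

Comparing the two summaries side by side is the numeric counterpart of reading the two boxes in the plot.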
Strength of Schedule
This section of the app shows the strength of schedule, along with the strength of each conference, for the 2021 season. A team's strength of schedule was calculated by taking the average adjusted efficiency margin of all of its opponents for the 2021 season. The strength of a conference was calculated by taking the adjusted efficiency margin of that conference's out-of-conference games. More information on adjusted stats is given below.
Looking at the plot above, we see the teams with the top 25 most difficult schedules. Kentucky appears to have had the most difficult schedule of all teams for the 2021 season.
Looking at the plot above, we see the strongest conferences for the 2021 season. The Big Ten and the Big 12 appear to be the strongest.
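As a rough sketch of the strength of schedule calculation described in this section, the snippet below averages the adjusted efficiency margin of each team's opponents. The team names, margins, and schedule are invented, the function name is hypothetical, and counting a repeat opponent once per game played is my assumption rather than something the write-up specifies.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical adjusted efficiency margins (points per 100 possessions).
adj_margin = {"A": 20.0, "B": 10.0, "C": -5.0, "D": 0.0}

# Hypothetical schedule: each tuple is one game between two teams.
schedule = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("A", "B")]


def strength_of_schedule(schedule, adj_margin):
    """Average the adjusted efficiency margin of every opponent a team faced.

    Assumption: an opponent played twice is counted twice.
    """
    opponents = defaultdict(list)
    for team1, team2 in schedule:
        opponents[team1].append(adj_margin[team2])
        opponents[team2].append(adj_margin[team1])
    return {team: mean(margins) for team, margins in opponents.items()}


sos = strength_of_schedule(schedule, adj_margin)
# Teams ordered from the hardest schedule to the easiest.
ranked = sorted(sos, key=sos.get, reverse=True)
print(ranked, sos)
```

Sorting the result in descending order gives the kind of ranking shown in the top 25 plot.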
Adjusted Data Stats
This section of the app shows that adjusting a team's stats based on the competition level of its opponents gives us a much better indication of how good a team actually is. To adjust a specific team statistic in the game box score, I take that statistic and subtract the difference between the opposing team's season average and the national average in the opposing stat.
For example, consider the box score of a game between Team A and Team B. Team A had an offensive efficiency (Points / Possessions * 100) of 60, and Team B had a season average defensive efficiency (Points Allowed / Possessions * 100) of 50. If the national season average for defensive efficiency is 55, Team A's adjusted offensive efficiency would be 60 - (50 - 55) = 65. Team A is essentially rewarded because it played a defense that is better than average. In other words, the statistic is adjusted upwards.
The plot above shows us the teams with the highest offensive efficiency for the 2020 college basketball season. Anyone who followed the 2020 college basketball season would know that schools like St. Mary's CA, North Florida, and S Dakota St. were probably not among the top ten offenses in college basketball. Once we adjust for the strength of a team's opponents, we get the top ten offenses for the 2020 season shown below. All of those previously mentioned schools fall out of the top ten.
This section also contains an interactive plot that shows the relationship between a team's season average stat and the corresponding adjusted stat for the 2020 regular season. Each point in the plot represents a team in the 2020 season.
Teams below the black line represent values where the adjusted stat is less than the non-adjusted stat. In the example for season offensive efficiency above, these points would represent teams that overall played against defenses that were below average during the regular season.
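The adjustment used throughout this section can be written as a one-line function. The sketch below reproduces the Team A and Team B arithmetic from the example earlier in the section and adds one invented case where the stat is adjusted downwards; the function names are hypothetical and this is not the app's actual code.

```python
def efficiency(points, possessions):
    """Points scored (or allowed) per 100 possessions."""
    return 100 * points / possessions


def adjust_stat(game_value, opp_season_avg, national_avg):
    """Subtract the opponent's deviation from the national average.

    game_value:     the team's stat in one game (e.g. offensive efficiency)
    opp_season_avg: the opponent's season average in the opposing stat
                    (e.g. defensive efficiency)
    national_avg:   the national season average of that opposing stat
    """
    return game_value - (opp_season_avg - national_avg)


# Example from the text: 60 - (50 - 55) = 65, adjusted upwards
# because the opposing defense is better than average.
adjusted_up = adjust_stat(60, 50, 55)

# Invented opposite case: the opposing defense allows 58, worse than the
# national average of 55, so the stat is adjusted downwards to 57.
adjusted_down = adjust_stat(60, 58, 55)

print(adjusted_up, adjusted_down)
```

A team whose season average of adjusted values comes out lower than its raw average, like the second case, is one that would fall below the black line in the interactive plot.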
Home Court Advantage Data
This section of the app allows the user to select a specific season, along with a certain team statistic, and get back the home and away court advantage for that stat. Home court advantage was calculated by taking the average of a given statistic for all home teams and subtracting from it the average of that statistic for all teams, whether home, away, or neutral. A positive value therefore means home teams do better than the overall average in that stat.
From the plot above, we can see that the away teams scored two points lower on average, whereas the home teams scored two points higher on average.
The above plot shows us the trend for the score advantage dating back to the 2010 season. For the 2021 season, we see a slight jump in the away team (red line) scoring disadvantage and a slight dip in the home team (green line) scoring advantage. These slight deviations from previous seasons may be an effect of playing without fans in the arenas because of COVID.
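A minimal sketch of the court advantage calculation follows, on invented numbers. The sign convention here (location average minus overall average) is chosen so that a positive number means an advantage, matching the plots described in this section; the row layout and function name are assumptions for illustration.

```python
from statistics import mean

# Hypothetical rows: one per team per game, as (location, points scored).
rows = [
    ("home", 74), ("away", 70),
    ("home", 76), ("away", 68),
    ("neutral", 72), ("neutral", 72),
]


def court_advantage(rows, location):
    """Average of the stat at one location minus the average over all rows."""
    overall = mean(value for _, value in rows)
    at_location = mean(value for loc, value in rows if loc == location)
    return at_location - overall


home_adv = court_advantage(rows, "home")
away_adv = court_advantage(rows, "away")
print(home_adv, away_adv)
```

Running the same function once per season would give the two lines in the trend plot.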
Traits of Successful Tournament Teams
This section of the app allows the user to select a regular season team statistic. Displayed are multiple violin plots, each showing the distribution of that particular regular season statistic for teams that made it to a particular round in the tournament. For example, in the plot below, we see that teams who made it far in the tournament tended to have lower regular season adjusted defensive efficiencies than teams who lost early.
Making Data Predictions
This section of the app allows a user to input two teams and get back a predicted score if these two teams were to play each other. I calculated the predicted score for both teams by calculating the expected offensive efficiency for Team 1 and Team 2, dividing those values by 100 and multiplying by the expected tempo. The expected offensive efficiency is calculated by:
Expected_OE_Team1 = National_Season_Avg_Adjusted_OE + (Season_Avg_Adjusted_OE_Team1 - National_Season_Avg_Adjusted_OE) + (Season_Avg_Adjusted_DE_Team2 - National_Season_Avg_Adjusted_OE)
Expected_OE_Team2 = National_Season_Avg_Adjusted_OE + (Season_Avg_Adjusted_OE_Team2 - National_Season_Avg_Adjusted_OE) + (Season_Avg_Adjusted_DE_Team1 - National_Season_Avg_Adjusted_OE)
This predictive model is built on data only from the 2021 regular season and is meant to simulate games for the 2021 season only.
If the user wanted to see what the predicted score would be for a game between Loyola-Chicago and Notre Dame, we would get the results shown above.
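The score calculation described in this section can be sketched as below. The code follows the two formulas exactly as written, including subtracting the national adjusted offensive efficiency in the defensive term. The team numbers are invented, the dictionary keys and function names are hypothetical, and the expected tempo is passed in as a plain input because the write-up does not say how it is derived.

```python
def expected_oe(team_adj_oe, opp_adj_de, national_adj_oe):
    """Expected offensive efficiency, following the formula in the text."""
    return (
        national_adj_oe
        + (team_adj_oe - national_adj_oe)
        + (opp_adj_de - national_adj_oe)
    )


def predict_score(team1, team2, national_adj_oe, expected_tempo):
    """Return (team1_points, team2_points) for one simulated game.

    expected_tempo is the expected number of possessions; how it is
    estimated is an assumption left to the caller.
    """
    oe1 = expected_oe(team1["adj_oe"], team2["adj_de"], national_adj_oe)
    oe2 = expected_oe(team2["adj_oe"], team1["adj_de"], national_adj_oe)
    return oe1 * expected_tempo / 100, oe2 * expected_tempo / 100


# Invented season averages, in points per 100 possessions.
team1 = {"adj_oe": 110.0, "adj_de": 95.0}
team2 = {"adj_oe": 105.0, "adj_de": 98.0}

score1, score2 = predict_score(team1, team2, national_adj_oe=100.0, expected_tempo=70.0)
print(round(score1, 1), round(score2, 1))
```

In this toy matchup the first team's offense is better than average but faces a slightly better than average defense, so its expected efficiency lands at 108 rather than 110.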
Data Insights and Trends
Based on my analysis, it seems that two of the strongest predictors of success in the tournament are a team's win percentage during the regular season and a team's adjusted efficiency margin (adjusted offensive efficiency - adjusted defensive efficiency) during the regular season. Since 2010, the average regular season win percentage for teams that make it to the National Championship Game is greater than 80 percent. Teams that made it to the Elite Eight or farther since 2010 have had adjusted efficiency margins greater than 20 points per 100 possessions.
When it comes to adjusted offensive efficiency and adjusted defensive efficiency, it appears that adjusted offensive efficiency during the regular season is a better predictor of tournament success than adjusted defensive efficiency. The disparity between adjusted offensive efficiency for teams that make it to later rounds vs teams that lose early is much greater than the disparity for adjusted defensive efficiency.
Adjusted stats also appear to show the true skill level of a team much better than non-adjusted stats. In college sports, where the disparity between the best teams and the worst teams is much larger than in professional sports, adjusting stats is essential to get meaningful results.
I also noticed that my rankings for adjusted offensive efficiency, adjusted defensive efficiency, and adjusted efficiency margin were pretty peculiar for the 2021 season. Colgate, for example, appeared in the top ten for adjusted efficiency margin, and people who followed the season would know that is likely not accurate. These oddities in the 2021 season output are likely due to the effect that COVID had on a lot of teams during this season. Players were out for long periods of time and many games were cancelled, so teams played widely varying numbers of games.
In future revisions of this project, I would like to adjust stats based on whether or not a game was played at a team's home location. Perhaps adjusted offensive efficiencies go up during home games and down during away games. I believe accounting for this could lead to better predictions for tournament games as tournament games are played on a neutral court.
I would also like to update the predictive model for the 2021 season and try to account for some of the irregularities that resulted from COVID.
Predictions were calculated from the expected adjusted offensive efficiencies and tempo for both teams. I would like to try running all of the feature variables I obtained through a machine learning model to get weight coefficients for each statistic.