Data Analysis on College Football with R
Contributed by Wann-Jiun Ma. He is currently in the NYC Data Science Academy Online Data Science Bootcamp program. This post is based on his first class project - Data Analysis and Visualization with R.
As college football enters the second half of the 2016-2017 season, it is time to examine the performance of each team and predict which team has the best chance of contending for the national championship game on January 9, 2017. In this project, we use the past 10 years (2006-2015) of NCAA college football data, downloaded from http://www.drwagpicks.com/p/blog-page.html, to perform exploratory data analysis (EDA) for college football.
The college football dataset has many different types of features to play with. There are two main categories in the dataset: yardage related and non-yardage related. The yardage related data consist of the passing and rushing yards achieved in each game. The non-yardage related data include passing/rushing attempts, the number of fumbles lost, the number of interceptions thrown, etc. We would like to explore the football dataset and find insights in the data. The results obtained by EDA can then be used to design our machine learning models.
Since the main goal is to predict the result of each football game, the first question we should ask is which features are likely to have a direct impact on the result of a game. Does a winning team throw fewer interceptions than a losing team? Is the number of fourth down conversions a major factor in deciding the result of a game? Does a team that completes more passes have a higher chance of winning? The list can go on and on. So let us start exploring the dataset.
Data Analysis and Visualization
In the dataset as downloaded from the website, each row of the table represents one game played from 2005 to 2015, and the columns are features such as the visitor team, the home team, and the different statistics recorded for that game.
There are more than 8,200 games played from 2005 to 2015. The data table becomes easier to work with if we split each row into two rows: one for the visitor team and one for the home team. We also add a column to store the result of each game. Then, we can use "Group By" to examine the correlation between different features (e.g., passing/rushing yards, attempts, etc.) and the results of the games. We first perform these data table transformations to get the table into the form we need. All code can be found at https://github.com/Wann-Jiun/nycdsa_project_1_eda
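The reshaping step described above can be sketched in base R. This is only a minimal illustration, not the code from the repository: the tiny `games` table, its column names (`visitor_score`, `home_rush_yds`, and so on), and all values are made up, and it assumes no game ends in a tie.

```r
# Toy stand-in for the downloaded table: one row per game.
# All column names and values are hypothetical.
games <- data.frame(
  visitor          = c("Team A", "Team C", "Team B"),
  home             = c("Team B", "Team A", "Team C"),
  visitor_score    = c(21, 10, 35),
  home_score       = c(14, 24, 28),
  visitor_rush_yds = c(180, 95, 210),
  home_rush_yds    = c(120, 160, 150),
  stringsAsFactors = FALSE
)

# Build one row per team per game: visitors first, then home teams.
# The new `result` column stores whether that team won or lost.
visitor_rows <- data.frame(
  team     = games$visitor,
  side     = "visitor",
  rush_yds = games$visitor_rush_yds,
  result   = ifelse(games$visitor_score > games$home_score, "win", "loss"),
  stringsAsFactors = FALSE
)
home_rows <- data.frame(
  team     = games$home,
  side     = "home",
  rush_yds = games$home_rush_yds,
  result   = ifelse(games$home_score > games$visitor_score, "win", "loss"),
  stringsAsFactors = FALSE
)
team_games <- rbind(visitor_rows, home_rows)

# "Group By" the result and average a feature within each group.
mean_rush_by_result <- aggregate(rush_yds ~ result, data = team_games, FUN = mean)
print(mean_rush_by_result)
```

With the table in this long form, every game contributes exactly one "win" row and one "loss" row, so any feature can be compared between the two groups with a single grouped summary.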
Winning vs Losing Team
After the transformations, we are ready to generate our first plot. Let's have an overview of the features.
The bar chart compares winning teams and losing teams across all non-yardage related features. Winning teams appear to have more rushing attempts than losing teams, while losing teams have more passing attempts than winning teams. One possible explanation is that passing the football more often carries a higher risk of interception than running it. Now, let's perform a similar analysis for the yardage related features.
Indeed, winning teams also gain more rushing yards than losing teams: on average, over 50% more! Although winning teams have more passing yards as well, that difference is much smaller than the difference in rushing yards.
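One way to quantify a gap like this is the percent difference between the two group means. The sketch below uses made-up numbers and a hypothetical `team_games` table with one row per team per game; it only shows the arithmetic, not the actual figures from the dataset.

```r
# Hypothetical rushing yards for three wins and three losses.
team_games <- data.frame(
  result   = c("win", "win", "win", "loss", "loss", "loss"),
  rush_yds = c(200, 180, 220, 120, 140, 100)
)

# Mean rushing yards within each outcome group.
means <- tapply(team_games$rush_yds, team_games$result, mean)

# How many percent more rushing yards winners gain, relative to losers.
pct_more <- 100 * (means[["win"]] - means[["loss"]]) / means[["loss"]]
print(round(pct_more, 1))
```

Using the losing teams' mean as the baseline matches the phrasing "winning teams have X% more rushing yards than losing teams".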
In addition to the mean values, we also compare other statistics of the rushing yards for winning and losing teams. We plot the entire distributions of rushing yards for both the winning and losing teams. The distribution for winning teams is clearly shifted toward more rushing yards than the distribution for losing teams.
Again, we can do the same thing for rushing attempts. We plot the entire distributions of rushing attempts for both the winning and losing teams. The distribution for winning teams is shifted toward more rushing attempts than the distribution for losing teams.
From our previous analysis, it looks like rushing information, such as rushing yards and attempts, plays an important role in determining the outcome of a football game. This motivates us to use rushing yards and attempts to visualize the results of the football games. We plot the winning and losing results against rushing yards and attempts. The figure shows two groups: the group with winning records tends to have more rushing yards and attempts than the group with losing records. Insights of this kind can help us build our machine learning models.
In addition to labeling winning and losing teams with two different colors, we can also consider the total wins of each team from 2005 to 2015. We plot the total wins of each team against rushing yards and attempts. The figure shows that teams with more total wins tend to have more rushing yards and rushing attempts.
Next, we group by team and plot rushing yards, rushing attempts, passing yards, and passing attempts as functions of the total wins. We use linear regression to find the relations between total wins and the four features. The figure shows a linear correlation between rushing yards and total wins, which supports our guess that winning teams have more rushing yards than losing teams.
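Beyond the plot, the two distributions can also be compared numerically through their quartiles. The following is a rough sketch on simulated data: the `team_games` table, its column names, and the normal distributions used to generate it are all assumptions for illustration and say nothing about the real dataset.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated rushing yards: 200 wins and 200 losses (made-up parameters).
team_games <- data.frame(
  result   = rep(c("win", "loss"), each = 200),
  rush_yds = c(rnorm(200, mean = 200, sd = 60),
               rnorm(200, mean = 130, sd = 60))
)

# Quartiles of rushing yards within each outcome group.
# Rows are the 25%, 50% and 75% quantiles; columns are "loss" and "win".
quartiles <- sapply(
  split(team_games$rush_yds, team_games$result),
  quantile,
  probs = c(0.25, 0.5, 0.75)
)
print(round(quartiles, 1))
```

If every quartile for the winning group sits above the corresponding quartile for the losing group, the whole distribution is shifted, not just the mean. The same split of `rush_yds` by `result` could be passed to `density()` to draw the overlaid curves.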
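A regression of this kind can be sketched with base R's `lm()`. The per-team table below is invented for illustration (`team_totals`, `total_wins`, and `avg_rush_yds` are hypothetical names, and the values are not from the dataset); the sketch fits one of the four features against total wins.

```r
# Hypothetical per-team summary: total wins and average rushing yards per game.
team_totals <- data.frame(
  team         = paste("Team", LETTERS[1:8]),
  total_wins   = c(30, 45, 52, 60, 71, 80, 95, 110),
  avg_rush_yds = c(110, 125, 140, 150, 165, 170, 195, 210),
  stringsAsFactors = FALSE
)

# Fit the feature as a linear function of total wins.
fit <- lm(avg_rush_yds ~ total_wins, data = team_totals)

# The slope gives the direction of the relation; R-squared gives its strength.
slope     <- coef(fit)[["total_wins"]]
r_squared <- summary(fit)$r.squared
print(c(slope = slope, r_squared = r_squared))
```

Repeating the fit with rushing attempts, passing yards, and passing attempts as the response would allow the four slopes and R-squared values to be compared side by side.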
Through exploratory data analysis, we have found that a winning team tends to run the football more than a losing team. There are many other features and relations that we can explore. For example, it would be interesting to see how the game has evolved. Do teams run the football more than they did 10 years ago? Are fewer interceptions thrown in today's game because teams prefer to run the football more?
We have not yet explored the relations between each individual team and the results of its games. Certain teams dominate today's football. What are the unique characteristics of these teams? Do they run the football more than other teams? All of this information can help us build machine learning models to predict the results of football games.