College Football EDA with R

Wann-Jiun Ma
Posted on Nov 13, 2016

Contributed by Wann-Jiun Ma. He is currently in the NYC Data Science Academy Online Data Science Bootcamp program. This post is based on his first class project - Data Analysis and Visualization with R.


As college football enters the second half of the 2016-2017 season, it is time to examine the performance of each team and predict which teams have the best chance of contending in the playoff for the national championship game on January 9, 2017. In this project, we use NCAA college football data from the past 10 years (2006-2015), downloaded from the web, to perform exploratory data analysis (EDA). The dataset has many different types of features to work with, in two main categories: yardage related and non-yardage related. The yardage-related data consist of the passing and rushing yards gained in each game. The non-yardage-related data include passing and rushing attempts, the number of fumbles lost, interceptions thrown, and so on. We would like to explore the dataset and draw insights from it. The results of the EDA can then be used to design our machine learning models.

Since the main goal is to predict the result of each football game, the first question to ask is which features are likely to have a direct impact on the result. Does a winning team throw fewer interceptions than a losing team? Is the number of fourth-down conversions a major factor in deciding a game? Does a team that completes more passes have a higher chance of winning? The list goes on and on. So let us start exploring the dataset.

Data Analysis and Visualization

In the dataset as downloaded from the website, each row represents one game played from 2005 to 2015, and the columns are features such as the visitor team, the home team, and various statistics for that game. There are more than 8,200 games played from 2005 to 2015. The table becomes easier to work with if we split each row into two rows: one for the visitor team and one for the home team. We also add a column to store the result of the game. Then we can use "Group By" to examine the relationship between different features (e.g., passing/rushing yards, attempts, etc.) and the results of the games. We first perform these data table transformations to get the table into the form we need. All code can be found at
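The reshaping step described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the `games` table, its column names, and all of its values are made up here and are not taken from the actual dataset, and the sketch assumes no game ends in a tie.

```r
# Toy stand-in for the downloaded table: one row per game.
# Column names and values are hypothetical, for illustration only.
games <- data.frame(
  visitor          = c("Team A", "Team C"),
  home             = c("Team B", "Team A"),
  visitor_score    = c(21, 10),
  home_score       = c(14, 31),
  visitor_rush_yds = c(180, 95),
  home_rush_yds    = c(120, 210),
  stringsAsFactors = FALSE
)

# Split every game into two rows: the visitor's view and the home team's view.
# The new `result` column stores the outcome from that team's perspective.
visitor_rows <- data.frame(
  team     = games$visitor,
  side     = "visitor",
  rush_yds = games$visitor_rush_yds,
  result   = ifelse(games$visitor_score > games$home_score, "win", "loss")
)
home_rows <- data.frame(
  team     = games$home,
  side     = "home",
  rush_yds = games$home_rush_yds,
  result   = ifelse(games$home_score > games$visitor_score, "win", "loss")
)
team_games <- rbind(visitor_rows, home_rows)

# "Group by" the result and average a feature within each group.
mean_rush <- aggregate(rush_yds ~ result, data = team_games, FUN = mean)
print(mean_rush)
```

With one row per team per game, every later comparison between winners and losers reduces to a grouped summary on the `result` column.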

After the transformations, we are ready to generate our first plot. Let's have an overview of the features.


The bar chart compares winning teams and losing teams across all non-yardage-related features. It looks like the winning teams have more rushing attempts than the losing teams. Note that losing teams have more passing attempts than winning teams. Perhaps this is because the more a team passes the football, the higher the chance of an interception compared with running it. Now, let's perform a similar analysis for the yardage-related features.
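A grouped bar chart of this kind can be produced from per-group means. The sketch below uses base R and a tiny hand-made table; the column names (`rush_att`, `pass_att`, `interceptions`) and the numbers are assumptions for illustration, not values from the dataset.

```r
# Toy team-level table (one row per team per game); values are made up.
team_games <- data.frame(
  result        = c("win", "win", "loss", "loss"),
  rush_att      = c(44, 40, 30, 34),
  pass_att      = c(28, 30, 38, 40),
  interceptions = c(0, 1, 2, 1)
)

# Mean of every non-yardage feature within each result group.
means <- aggregate(cbind(rush_att, pass_att, interceptions) ~ result,
                   data = team_games, FUN = mean)

# Rows = result groups, columns = features, as barplot() expects.
mean_matrix <- as.matrix(means[, -1])
rownames(mean_matrix) <- means$result
print(mean_matrix)

# Draw the chart only in an interactive session.
if (interactive()) {
  barplot(mean_matrix, beside = TRUE,
          legend.text = rownames(mean_matrix),
          ylab = "Mean per game")
}
```

Passing `beside = TRUE` places the win and loss bars next to each other for each feature, which makes the group difference easy to read.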


Indeed, the winning teams gain more rushing yards than the losing teams: over 50% more! Although the winning teams also have more passing yards than the losing teams, the difference is much smaller than for rushing yards.
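To make the comparison concrete, the relative difference between two group means can be computed as (win - loss) / loss. The numbers below are hypothetical placeholders chosen only to show the arithmetic; they are not the means observed in the dataset.

```r
# Hypothetical group means per game (not values from the dataset).
group_means <- data.frame(
  feature = c("rush_yds", "pass_yds"),
  win     = c(200, 240),
  loss    = c(125, 220)
)

# Percentage by which the winners' mean exceeds the losers' mean.
group_means$pct_more <- (group_means$win - group_means$loss) /
  group_means$loss * 100
print(group_means)
```

In this made-up example the rushing gap is 60% while the passing gap is about 9%, which is the kind of contrast the bar chart is meant to show.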


In addition to the mean values, we also compare other statistics of the rushing yards for winning and losing teams. We plot the entire distributions of rushing yards for both groups. The distribution for winning teams is clearly shifted toward more rushing yards than that for losing teams.
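One way to compare whole distributions rather than just means is to look at quartiles per group and overlay density curves. The sketch below uses synthetic data drawn from normal distributions; the means, spreads, and column names are assumptions made purely so the example runs, and say nothing about the real data.

```r
# Synthetic rushing yards for 200 wins and 200 losses (illustration only).
set.seed(42)
team_games <- data.frame(
  result   = rep(c("win", "loss"), each = 200),
  rush_yds = c(rnorm(200, mean = 200, sd = 60),
               rnorm(200, mean = 140, sd = 60))
)

# Quartiles of rushing yards within each result group.
by_result <- split(team_games$rush_yds, team_games$result)
quartiles <- sapply(by_result, quantile, probs = c(0.25, 0.5, 0.75))
print(round(quartiles))

# Overlay the two density curves (only in an interactive session).
if (interactive()) {
  plot(density(by_result$win), xlim = range(team_games$rush_yds),
       main = "Rushing yards by result (synthetic)", xlab = "Rushing yards")
  lines(density(by_result$loss), lty = 2)
  legend("topright", legend = c("win", "loss"), lty = c(1, 2))
}
```

The same code applies to rushing attempts by swapping the column.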


Again, we can do the same thing for rushing attempts. We plot the entire distributions of rushing attempts for both the winning and losing teams. The distribution for winning teams is shifted toward more rushing attempts than that for losing teams.


From our previous analysis, it looks like rushing information such as rushing yards and attempts plays an important role in determining the outcome of a football game. This motivates us to use rushing yards and attempts to visualize the results of the games. We plot the winning and losing results against rushing yards and attempts. The figure shows two groups: the group with winning records tends to have more rushing yards and attempts than the group with losing records. Insights of this kind can help us build our machine learning models.
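A scatter plot of this kind, with points colored by result, can be sketched in base R as below. The six rows are hand-made toy values and the column names are hypothetical. The group centroids are computed as a simple numeric summary of how far apart the two clouds sit.

```r
# Toy team-level rows (values are made up for illustration).
team_games <- data.frame(
  result   = c("win", "win", "win", "loss", "loss", "loss"),
  rush_att = c(45, 50, 40, 30, 35, 28),
  rush_yds = c(220, 260, 180, 110, 150, 100)
)

# Centroid (mean attempts, mean yards) of each result group.
centroids <- aggregate(cbind(rush_att, rush_yds) ~ result,
                       data = team_games, FUN = mean)
print(centroids)

# Scatter plot colored by result, with centroids marked by crosses.
if (interactive()) {
  plot(team_games$rush_att, team_games$rush_yds,
       col = ifelse(team_games$result == "win", "blue", "red"),
       pch = 19, xlab = "Rushing attempts", ylab = "Rushing yards")
  points(centroids$rush_att, centroids$rush_yds, pch = 4, cex = 2)
  legend("topleft", legend = c("win", "loss"),
         col = c("blue", "red"), pch = 19)
}
```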


In addition to labeling winning and losing teams with two different colors, we can also consider the total wins of each team from 2005 to 2015. We plot the total wins of each team against rushing yards and attempts. The figure shows that teams with more total wins tend to have more rushing yards and rushing attempts.


Next, we group by team and plot rushing yards, rushing attempts, passing yards, and passing attempts as functions of total wins. We use linear regression to find the relationship between total wins and each of the four features. The figure shows a linear correlation between rushing yards and total wins, which supports our guess that winning teams have more rushing yards than losing teams.
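The per-team grouping and the linear fit can be sketched as follows for one of the four features. The teams, game counts, and yardage values are invented for illustration, and the column names (`won`, `rush_yds`, `total_wins`, `mean_rush_yds`) are assumptions rather than the names used in the original code.

```r
# Toy team-level rows: 4 teams, 3 games each (values are made up).
team_games <- data.frame(
  team     = rep(c("A", "B", "C", "D"), each = 3),
  won      = c(0, 0, 0,  1, 0, 0,  1, 1, 0,  1, 1, 1),
  rush_yds = c(90, 100, 110,  160, 140, 150,
               180, 170, 160,  210, 220, 230)
)

# Group by team: total wins and mean rushing yards per game.
total_wins <- tapply(team_games$won, team_games$team, sum)
mean_rush  <- tapply(team_games$rush_yds, team_games$team, mean)
per_team <- data.frame(
  team          = names(total_wins),
  total_wins    = as.vector(total_wins),
  mean_rush_yds = as.vector(mean_rush)
)

# Linear regression of the feature as a function of total wins.
fit <- lm(mean_rush_yds ~ total_wins, data = per_team)
print(coef(fit))

# Scatter plot with the fitted line (only in an interactive session).
if (interactive()) {
  plot(per_team$total_wins, per_team$mean_rush_yds,
       xlab = "Total wins", ylab = "Mean rushing yards per game")
  abline(fit)
}
```

A positive slope on `total_wins` corresponds to the pattern described in the text. Repeating the fit with rushing attempts, passing yards, and passing attempts gives the other three panels.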


Through exploratory data analysis, we have found that a winning team tends to run the football more than a losing team. There are many other features and relationships that we can explore. For example, it would be interesting to see how the game has evolved. Do teams run the football more than they did 10 years ago? Are fewer interceptions thrown in today's game because teams prefer to run the football more? We have not yet explored the relationship between individual teams and game results. Certain teams dominate today's football. What are the unique characteristics of these teams? Do they run the football more than other teams? All of this information can help us build machine learning models to predict the results of football games.

About Author

Wann-Jiun Ma

Wann-Jiun Ma (PhD Electrical Engineering) is a Postdoctoral Associate at Duke University. His research is focused on mathematical modeling, algorithm design, and software/experiment implementation for large-scale systems such as wireless sensor networks and energy analytics. After having exposed...