Data Analysis on College Football with R

Posted on Nov 13, 2016
Contributed by Wann-Jiun Ma. He is currently in the NYC Data Science Academy Online Data Science Bootcamp program. This post is based on his first class project - Data Analysis and Visualization with R.


As college football enters the second half of the 2016-2017 season, it is time to examine the performance of each team and predict which teams have the best chance of contending for the national championship game on January 9, 2017. In this project, we use the past 10 years (2006-2015) of NCAA college football data to perform exploratory data analysis (EDA).

The college football dataset has many different types of features to work with. They fall into two main categories: yardage-related and non-yardage-related. The yardage-related data consist of the passing and rushing yards gained in each game. The non-yardage-related data include passing and rushing attempts, the number of fumbles lost, interceptions thrown, and so on. We would like to explore the dataset and draw insights from it. The results of the EDA can then be used to design our machine learning models.

Since the main goal is to predict the result of each football game, the first question to ask is which features are likely to have a direct impact on that result. Does a winning team throw fewer interceptions than a losing team? Is the number of fourth-down conversions a major factor in deciding a game? Does a team that completes more passes have a higher chance of winning? The list goes on. So let us start exploring the dataset.

Data Analysis and Visualization

In the dataset downloaded from the website, each row represents one game played from 2005 to 2015, and the columns are features such as the visitor team, the home team, and the various statistics recorded for that game.

More than 8,200 games were played from 2005 to 2015. The table becomes easier to work with if we split each row into two rows: one for the visitor team and one for the home team. We also add a column that stores the result of the game for that team. Then we can use "Group By" to examine how different features (e.g., passing/rushing yards, attempts, etc.) relate to the results of the games. We first perform these data table transformations to put the table into the form we need. All code can be found at
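The transformation described above can be sketched in a few lines of base R. This is only an illustration under assumed names: the toy table, its column names (visitor, home, visitor_score, and so on), and its values are made up and do not come from the real dataset or from the project code.

```r
# Toy game table: one row per game. Column names and values are hypothetical.
games <- data.frame(
  visitor          = c("Team A", "Team C", "Team B"),
  home             = c("Team B", "Team A", "Team C"),
  visitor_score    = c(21, 10, 35),
  home_score       = c(14, 28, 31),
  visitor_rush_yds = c(180, 95, 210),
  home_rush_yds    = c(120, 205, 150),
  stringsAsFactors = FALSE
)

# Split each game into two rows: one for the visitor, one for the home team.
# The result column is "win" when that team outscored its opponent
# (a tied score would be labeled "loss" for both sides in this sketch).
visitor_rows <- data.frame(
  game_id  = seq_len(nrow(games)),
  team     = games$visitor,
  side     = "visitor",
  rush_yds = games$visitor_rush_yds,
  result   = ifelse(games$visitor_score > games$home_score, "win", "loss"),
  stringsAsFactors = FALSE
)
home_rows <- data.frame(
  game_id  = seq_len(nrow(games)),
  team     = games$home,
  side     = "home",
  rush_yds = games$home_rush_yds,
  result   = ifelse(games$home_score > games$visitor_score, "win", "loss"),
  stringsAsFactors = FALSE
)
team_games <- rbind(visitor_rows, home_rows)

# "Group by" the result and average a feature within each group.
mean_rush <- aggregate(rush_yds ~ result, data = team_games, FUN = mean)
print(mean_rush)
```

The same reshaping could also be written with a grouping package; base R is used here only to keep the sketch free of dependencies.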

Winning vs Losing Team

After the transformations, we are ready to generate our first plot. Let's have an overview of the features.


The bar chart compares winning and losing teams on all non-yardage-related features. Winning teams have more rushing attempts than losing teams, while losing teams have more passing attempts. One possible explanation is that the more a team passes, the more likely the ball is to be intercepted, compared with running the football. Now, let's perform a similar analysis for the yardage-related features.
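A comparison like this one can be computed by averaging each feature within the winning and losing groups. The sketch below uses base R on a made-up table; the column names (rush_att, pass_att, interceptions) and all values are assumptions for illustration, not the real data.

```r
# Toy per-team-per-game table. Names and values are hypothetical.
team_games <- data.frame(
  result        = c("win", "loss", "win", "loss", "win", "loss"),
  rush_att      = c(45, 30, 42, 33, 48, 28),
  pass_att      = c(25, 40, 28, 38, 22, 44),
  interceptions = c(0, 2, 1, 1, 0, 3)
)

# Mean of every non-yardage feature, grouped by game result.
feature_means <- aggregate(
  cbind(rush_att, pass_att, interceptions) ~ result,
  data = team_games, FUN = mean
)
print(feature_means)

# Grouped bar chart: one cluster per feature, one bar per result.
bar_heights <- as.matrix(feature_means[, -1])
rownames(bar_heights) <- feature_means$result
barplot(bar_heights, beside = TRUE, legend.text = TRUE,
        ylab = "Mean per game")
```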

Rushing Yards

Indeed, winning teams gain more rushing yards than losing teams: over 50% more! Although the winning teams also have more passing yards than the losing teams, that difference is much smaller than the difference in rushing yards.


In addition to the mean values, we also compare other statistics of the rushing yards for winning and losing teams. We plot the full distributions of rushing yards for both groups. The distribution for winning teams is clearly shifted toward more rushing yards than the distribution for losing teams.
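To make the mean and distribution comparisons concrete, here is a small base R sketch. The numbers are invented purely to show the calculation and are not taken from the dataset; the column names are likewise assumptions.

```r
# Toy rushing yards per team per game. Values are hypothetical.
rush <- data.frame(
  result   = rep(c("win", "loss"), each = 5),
  rush_yds = c(150, 180, 210, 240, 270,
               80, 110, 130, 160, 190)
)

# How much more do winners rush, in percent of the losers' mean?
mean_win  <- mean(rush$rush_yds[rush$result == "win"])
mean_loss <- mean(rush$rush_yds[rush$result == "loss"])
pct_more  <- (mean_win / mean_loss - 1) * 100

# Quartiles of rushing yards for each group (columns: loss, win).
quartiles <- sapply(split(rush$rush_yds, rush$result),
                    quantile, probs = c(0.25, 0.5, 0.75))
print(quartiles)

# Side-by-side box plots of the two distributions.
boxplot(rush_yds ~ result, data = rush, ylab = "Rushing yards")
```

Looking at quartiles rather than only the mean shows whether the whole distribution shifts or whether a few extreme games drive the difference.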


Rushing Attempts

We can do the same for rushing attempts. We plot the full distributions of rushing attempts for both the winning and losing teams. Winning teams have a distribution with more rushing attempts than losing teams.


The analysis so far suggests that rushing statistics, such as rushing yards and attempts, play an important role in determining the outcome of a football game. This motivates us to use rushing yards and attempts to visualize the game results. We plot each result, win or loss, against rushing yards and rushing attempts. The figure shows two groups: the group with wins tends to have more rushing yards and attempts than the group with losses. Insights like this can help us build our machine learning models.
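A plot of this kind can be sketched with base R graphics by coloring each point according to the game result. The data, column names, and colors below are assumptions chosen for illustration only.

```r
# Toy per-team-per-game table. Names and values are hypothetical.
team_games <- data.frame(
  result   = c("win", "loss", "win", "loss", "win", "loss"),
  rush_att = c(45, 30, 42, 33, 48, 28),
  rush_yds = c(210, 110, 190, 140, 250, 90)
)

# One color per result: blue for wins, red for losses.
point_col <- ifelse(team_games$result == "win", "blue", "red")

plot(team_games$rush_att, team_games$rush_yds,
     col = point_col, pch = 19,
     xlab = "Rushing attempts", ylab = "Rushing yards")
legend("topleft", legend = c("win", "loss"),
       col = c("blue", "red"), pch = 19)
```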

Total Wins


In addition to labeling winning and losing teams with two different colors, we can also consider the total wins of each team from 2005 to 2015. We plot the total wins of each team against rushing yards and attempts. The figure shows that teams with more total wins tend to have more rushing yards and rushing attempts.


Next, we group by team and plot rushing yards, rushing attempts, passing yards, and passing attempts as functions of total wins. We use linear regression to find the relationship between total wins and each of the four features. The figure shows a linear correlation between rushing yards and total wins, which supports our guess that winning teams have more rushing yards than losing teams.
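The per-team aggregation and the regression can be sketched as follows in base R. The three teams and their numbers are made up, and the column names are assumptions; the sketch fits only one of the four features (mean rushing yards) as a function of total wins.

```r
# Toy per-team-per-game table. Names and values are hypothetical.
team_games <- data.frame(
  team     = rep(c("A", "B", "C"), each = 4),
  result   = c("win", "win", "win", "loss",
               "win", "win", "loss", "loss",
               "win", "loss", "loss", "loss"),
  rush_yds = c(220, 240, 200, 180,
               190, 170, 150, 130,
               160, 120, 110, 90)
)
team_games$win <- as.integer(team_games$result == "win")

# Group by team: total wins and mean rushing yards per game.
total_wins <- aggregate(win ~ team, data = team_games, FUN = sum)
mean_rush  <- aggregate(rush_yds ~ team, data = team_games, FUN = mean)
per_team   <- merge(total_wins, mean_rush, by = "team")
names(per_team) <- c("team", "total_wins", "mean_rush_yds")

# Simple linear regression of the feature on total wins.
fit <- lm(mean_rush_yds ~ total_wins, data = per_team)
print(coef(fit))

# Scatter plot with the fitted line.
plot(per_team$total_wins, per_team$mean_rush_yds,
     xlab = "Total wins", ylab = "Mean rushing yards per game")
abline(fit)
```

With only three toy teams the fit is purely illustrative; on the real data each point would be one team's totals over the full period.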


Through exploratory data analysis, we have found that a winning team tends to run the football more than a losing team. There are many other features and relationships that we can explore. For example, it would be interesting to see how the game has evolved. Do teams run the football more than they did 10 years ago? Are fewer interceptions thrown in today's game because teams prefer to run the football more?

We have not yet explored the relationship between individual teams and their game results. Certain teams dominate today's football. What are the unique characteristics of these teams? Do they run the football more than other teams? All of this information can help us build machine learning models to predict the results of football games.

About Author

Wann-Jiun Ma

Wann-Jiun Ma (PhD Electrical Engineering) is a Postdoctoral Associate at Duke University. His research is focused on mathematical modeling, algorithm design, and software/experiment implementation for large-scale systems such as wireless sensor networks and energy analytics. After having exposed...
