An Exploration of Perkins Loan Default Rate Data

Gordon Fleetwood
Posted on Nov 3, 2015

Contributed by Gordon Fleetwood. Gordon took the NYC Data Science Academy 12-week full-time Data Science Bootcamp program from Sept 23 to Dec 18, 2015. This post is based on his first class project, due in the second week of the program.


In the news-cycle-driven churn of modern society, we often get caught up in whatever is being discussed by the talking heads on whichever screen we're looking at. When these avatars stop speaking about an issue, it often disappears from our consciousness as well, especially if it doesn't particularly affect us. Out of sight, out of mind, as it were.

The student debt crisis is one of the issues caught in this revolving door, and it also happens to be one of the biggest issues facing a large portion of young Americans. Unofficial counters place total student debt at well over one trillion dollars. One way to subdivide this burden is into private and federal loans.

Unsurprisingly, federal loan programs are generally more lenient than those provided by the private sector. Before this project, the only federal aid programs I knew about were TAP and FAFSA. It was through the data used in this project that I learned about the Perkins Loan.



Armed with data from 2011-2014, I began my analysis.

The End Goal

My goal in this first project was to explore the default rates associated with this loan through various visualizations.


The data source has nine years of data, but only the three most recent years were available in a format other than PDF. Still, the xlsx files provided were cluttered with superfluous trappings like conditional formatting and colors. All of these had to be removed before I could load the data into R. Once that manual labor was done, the real work began.

My first round of data cleaning mostly involved merging the three yearly datasets into one. That required some sensible renaming of columns and the addition of a column to tag each data point with its school year.

rename.columns = c('Serial', ...)  # remaining column names elided in the original

names(perkins1112) = names(perkins1213) = names(perkins1314) = rename.columns

# Tag each data frame with its school year, then stack them together.
perkins1112$year = '11-12'; perkins1213$year = '12-13'; perkins1314$year = '13-14'
perkins = rbind(perkins1112, perkins1213, perkins1314)

My second round of cleaning involved introducing state-level granularity. It was important here to apply per capita scaling, since the data would involve many comparisons between states of varying populations. A quick visit to the US Census Bureau's website provided the necessary CSV files. Some error was knowingly introduced at this point: the census data covers calendar years, while the Perkins Loan data follows the school year. I averaged the populations of each pair of consecutive calendar years to match the two datasets as closely as possible, and then merged them. My data was ready.

data(state.regions)  # state name/abbreviation lookup table
perkins = merge(perkins, state.regions, by.x = 'ST', by.y = 'abb')
perkins = tbl_df(perkins)
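The census-to-school-year averaging described above can be sketched as follows; the column names and values here are illustrative stand-ins, not the Census Bureau CSV's actual headers:

```r
# Sketch of aligning calendar-year census figures with school years.
# Column names and populations are illustrative, not the real CSV headers.
census = data.frame(
  ST      = c('NY', 'CA'),
  pop2011 = c(19501616, 37638369),
  pop2012 = c(19572932, 37948800)
)

# A school year spans two calendar years, so average the adjacent populations.
census$pop1112 = (census$pop2011 + census$pop2012) / 2
```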


In terms of structuring the flow of my visualizations, I decided to go from least to most granular. I started by looking at the yearly trend of money owed by those in severe default, which, unsurprisingly, increased year over year.
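As a rough sketch, the yearly totals behind such a chart could be computed and plotted with dplyr and ggplot2; the data frame and column names below are made-up stand-ins, not the actual Perkins data:

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the merged Perkins data; real column names may differ.
perkins = data.frame(
  year      = c('11-12', '11-12', '12-13', '13-14'),
  principal = c(100, 250, 400, 600)
)

# Total principal owed by severe defaulters, per school year.
yearly = perkins %>%
  group_by(year) %>%
  summarise(total.owed = sum(principal))

ggplot(yearly, aes(x = year, y = total.owed)) +
  geom_bar(stat = 'identity') +
  labs(x = 'School Year', y = 'Total Owed by Severe Defaulters')
```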


A similar temporal visualization based on the number of borrowers in severe default showed a similar trend.


Next I looked at state-level data. Using the choroplethr package, I made a series of choropleth maps from this state-level data for the three years. The first, shown in the gif below, depicts the default rate in each state over the three years.
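For reference, `state_choropleth()` from the choroplethr package expects a data frame with a `region` column of lowercase state names and a `value` column; a minimal sketch with made-up default rates:

```r
library(choroplethr)

# Made-up values for illustration; 'region' must hold lowercase state names.
map.data = data.frame(
  region = c('new york', 'california', 'north dakota'),
  value  = c(5.2, 3.1, 9.8)
)

state_choropleth(map.data,
                 title  = 'Perkins Loan Default Rate (illustrative)',
                 legend = 'Default Rate (%)')
```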


And the second map looked at the average amount of money owed by those in default for more than 240 days.


To end my exploration, I moved to the finest level of granularity and looked at individual colleges across all three years as a whole. The highlighted colleges are ones I found interesting, with special emphasis on those with a low number of borrowers but a high principal owed.
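A ggplot2 sketch of that kind of labeled scatter plot, with made-up colleges standing in for the real data:

```r
library(ggplot2)

# Made-up colleges; 'College C' mimics the interesting case of
# few borrowers but a high principal owed.
colleges = data.frame(
  institution = c('College A', 'College B', 'College C'),
  borrowers   = c(1200, 4500, 90),
  principal   = c(950000, 2000000, 800000)
)

ggplot(colleges, aes(x = borrowers, y = principal, label = institution)) +
  geom_point() +
  geom_text(vjust = -0.8, size = 3) +
  labs(x = 'Number of Borrowers in Severe Default',
       y = 'Principal Outstanding ($)')
```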


The CUNY system in New York and DeVry in Chicago stood out, as did Johnson & Wales University in Pennsylvania. In fact, the Philadelphia-based institution had the distinction of a high volume of loans for a comparatively low number of borrowers.

Looking at individual states, North Dakota had the most money owed scaled by population, so I took a look at its colleges.


California was at the other extreme, with the least money owed per one million people.


New York ranks forty-eighth by the same metric.


Comparing New York and California brings up an interesting observation. The data references New York's city college system as a whole, while California's equivalent system has its colleges listed individually. This calls into question the way the data was reported by the colleges, and makes one wonder whether the government should establish a reporting standard across the board.


My analysis showed that the Northwest of the United States is the most severely indebted to the Perkins Loan program, with a few other states like Maine and Delaware in a similar position. For the most part, though, states seem to have their loans under control when viewed from a wide perspective.

This only provides a snapshot of the loan crisis. For the analysis to be hard-hitting, I would need more data. Economic data for each state would be useful, as would tuition costs and estimates of the cost of living. I hope to expand this analysis in the future.


Here are the slides from my presentation:

And the link to my code:

About Author

Gordon Fleetwood


Gordon has a B.A. in Pure Mathematics and an M.A. in Applied Mathematics from CUNY Queens College. He briefly worked for an early-stage startup where he was involved in building an algorithm to analyze financial data. However,...