A Data Analysis of the College Scorecard

Posted on Jul 23, 2017


As of 2017, Americans carry more student loan debt than at any point in history: over $1.4 trillion among 44 million borrowers. That number is likely to grow as more people attend college and tuition continues to rise. At the same time, a college degree has become more of a necessity in today's economy, as the average salary of a worker with a college education is more than twice that of someone with only a high school diploma.

Given the advantages of a college degree paired with rising costs, it is important for prospective students to be able to compare costs across schools and to assess the post-graduation outcomes of each school's students.

The Data 

I used the College Scorecard data released by the Department of Education for my analysis. The Department began releasing the College Scorecard in 2015 to improve transparency in higher education and hold colleges accountable for measures like value and quality. The full data set has information on almost 8,000 institutions in the United States, including community colleges, undergraduate schools, and post-graduate institutions like law and medical schools. It also contains over 1,500 variables, including:

  • School Type: Whether the institution’s governance structure is public, private nonprofit, or private for-profit.
  • Net Tuition Revenue: Tuition revenue minus discounts and allowances, divided by the number of full-time students.
  • Average Cost: Average annual cost of attendance, including tuition and fees, books and supplies, and living expenses for all students who receive federal aid.
  • Median Earnings: Median earnings for all federally aided students. Data is available for each year from six to 10 years after a student enrolls in college (for this analysis, I used 10-year earnings data).
  • Median Graduate Debt: The median loan debt accumulated at the institution by all student borrowers of federal loans (debt for students who left the institution before graduating is tracked separately).
  • Default Rate: The three-year cohort default rate percentage at the institution.

It is important to mention that many of these variables, including median earnings and graduate debt, only apply to student borrowers of federal loans and may not be representative of students who have private loans or no student debt.
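
As a minimal sketch of how the variables above might be pulled out of the raw file, the Python snippet below parses a few Scorecard-style rows using only the standard library. The column names (INSTNM, CONTROL, COSTT4_A, MD_EARN_WNE_P10, GRAD_DEBT_MDN, CDR3), the 1/2/3 coding of CONTROL, and the "NULL" and "PrivacySuppressed" markers are assumptions that should be checked against the College Scorecard data dictionary. The sample rows are invented for illustration, and this is not the code behind the analysis itself.

```python
import csv
import io

# Invented sample rows; the real file has thousands of rows and many more columns.
SAMPLE_CSV = """INSTNM,CONTROL,COSTT4_A,MD_EARN_WNE_P10,GRAD_DEBT_MDN,CDR3
Example State University,1,18000,45000,21000,0.05
Example Private College,2,52000,41000,27000,0.04
Example Career Institute,3,31000,PrivacySuppressed,NULL,0.18
"""

# Assumed coding of the CONTROL column (verify against the data dictionary).
SCHOOL_TYPES = {"1": "Public", "2": "Private nonprofit", "3": "Private for-profit"}
# Assumed markers for missing or privacy-suppressed values.
MISSING = {"", "NULL", "PrivacySuppressed"}


def to_number(value):
    """Return value as a float, or None when it is missing or suppressed."""
    value = value.strip()
    return None if value in MISSING else float(value)


def load_schools(handle):
    """Parse Scorecard-style CSV rows into dictionaries with numeric fields."""
    schools = []
    for row in csv.DictReader(handle):
        schools.append({
            "name": row["INSTNM"],
            "school_type": SCHOOL_TYPES.get(row["CONTROL"], "Unknown"),
            "avg_cost": to_number(row["COSTT4_A"]),
            "median_earnings": to_number(row["MD_EARN_WNE_P10"]),
            "median_debt": to_number(row["GRAD_DEBT_MDN"]),
            "default_rate": to_number(row["CDR3"]),
        })
    return schools


schools = load_schools(io.StringIO(SAMPLE_CSV))
for school in schools:
    print(school["name"], school["school_type"], school["median_earnings"])
```

Keeping suppressed values as None, rather than zero, stops them from dragging down any averages or medians computed later.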

Research Questions

For my analysis, I specifically looked at institutions that offer four-year undergraduate degrees and focused on variables related to cost and post-graduation outcomes. I attempted to answer the following questions:

  1. Best Value Schools: Which schools both cost less and provide students with higher earning potential? Which schools have high overall costs but poor outcomes?
  2. State Variation: Do college costs and outcomes vary by state? What states have students with the highest and lowest earnings?
  3. Outcomes by School Type: Does the data validate recent coverage of private for-profit schools? Specifically, do they target low-income students and result in worse outcomes?

Data Insights

Best Value Schools

First, I wanted to get a sense of earnings and employment prospects of former students and compare that against the average cost of each school. For prospective students considering loans to pay for college, it might be valuable to understand where they can get the best "bang for their buck" - schools with low average costs but relatively high earnings among former students. I graphed average costs for each institution against 10-year median earnings, separated by school type.


Some insights from the graph included:

  • Unsurprisingly, public institutions have lower average costs overall than private nonprofit or private for-profit schools. Some of the schools with the lowest costs and highest median earnings are selective public schools such as the University of Virginia, the University of California schools, and the University of Michigan.
  • Program emphasis likely has an impact on student outcomes. Schools with high proportions of STEM majors (Georgia Institute of Technology, New Jersey Institute of Technology, Massachusetts Institute of Technology) have relatively high median earnings.
  • Less selective (60%+ admission rate) private colleges are most likely to have the highest annual costs and relatively low median earnings. While there are many factors that influence college selection, this is something students should keep in mind, especially if they plan on borrowing money to attend.
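
One way to put a number on "bang for the buck" is to divide 10-year median earnings by average annual cost and rank schools by the result. This ratio is a hypothetical metric introduced here for illustration; the comparison above was made visually from a scatter plot, not from this calculation. The school names and figures below are invented.

```python
# Invented figures for illustration:
# (name, average annual cost, 10-year median earnings).
schools = [
    ("Low-cost public", 15000, 48000),
    ("STEM-focused institute", 25000, 75000),
    ("High-cost private", 55000, 42000),
    ("Missing-earnings school", 30000, None),
]


def earnings_per_cost_dollar(cost, earnings):
    """Hypothetical value ratio: median earnings divided by average annual cost."""
    if cost is None or earnings is None or cost <= 0:
        return None
    return earnings / cost


# Score every school, drop those that cannot be scored, then sort best-first.
scored = [
    (name, earnings_per_cost_dollar(cost, earnings))
    for name, cost, earnings in schools
]
ranked = sorted(
    ((name, ratio) for name, ratio in scored if ratio is not None),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, ratio in ranked:
    print(f"{name}: {ratio:.2f}")
```

A single ratio hides a lot (program mix, selectivity, regional wages), so it is best read alongside the plot, not as a replacement for it.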

State Variation

Next, I was curious to see what average costs, debt, and earnings look like across the United States. I used the College Scorecard data, grouped by state, to create a heat map of each value with leaflet.


From mapping the data, I found that:

  • While Massachusetts has the highest overall cost for college, Delaware is the state with the highest student debt after graduation ($27,546).
  • It is not surprising that states with high costs of living (e.g., the Northeast and California) have high college costs, since cost takes living expenses into account.
  • Wyoming is the state with both the lowest cost and lowest debt in the country, though that is based on a limited number of data points.
  • Many states in the South and Midwest with low overall college costs have comparatively high post-graduation debt (e.g., Alabama, Mississippi). Given that those states have lower median household incomes, students in those states may still need to take out larger federal loans despite lower college costs.
  • Finally, earnings data by state looks similar to existing data about median household incomes. The District of Columbia has the highest overall earnings ($50,656) in the country, along with many states in the tri-state area. Mississippi has the lowest overall earnings ($33,320), followed by South Carolina ($34,570). This data is based on information from students' W-2 forms, and is not adjusted for cost of living.
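
The state-level figures above come from grouping school-level values by state. A rough sketch of that grouping step in Python follows. The median is used as the per-state aggregate purely as an assumption, since the post does not say whether a mean or a median was mapped, and the rows are invented.

```python
from collections import defaultdict
from statistics import median

# Invented rows: (state abbreviation, median graduate debt, 10-year median earnings).
rows = [
    ("MA", 26000, 52000),
    ("MA", 24000, 48000),
    ("WY", 18000, 41000),
    ("MS", 25000, 33000),
    ("MS", 27000, None),  # missing earnings are skipped, not treated as zero
]


def summarize_by_state(records):
    """Return {state: {"debt": median debt, "earnings": median earnings}}."""
    debt = defaultdict(list)
    earnings = defaultdict(list)
    for state, school_debt, school_earnings in records:
        if school_debt is not None:
            debt[state].append(school_debt)
        if school_earnings is not None:
            earnings[state].append(school_earnings)
    states = sorted(set(debt) | set(earnings))
    return {
        state: {
            "debt": median(debt[state]) if debt[state] else None,
            "earnings": median(earnings[state]) if earnings[state] else None,
        }
        for state in states
    }


summary = summarize_by_state(rows)
# Pick the state whose schools have the highest median debt.
highest_debt_state = max(
    (state for state in summary if summary[state]["debt"] is not None),
    key=lambda state: summary[state]["debt"],
)
print(summary)
print("Highest median debt:", highest_debt_state)
```

States with only a handful of schools, as noted for Wyoming, will produce unstable aggregates, so it can help to report the number of schools next to each state value.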

Outcomes by School Type

Finally, I wanted to see whether there was a difference across school type based on several different variables. Specifically, I wanted to look at demographic and outcomes data for private for-profit colleges, which have received criticism in the United States for their predatory recruitment practices and poor post-graduation opportunities.


Insights included:

  • Interestingly, the median family income of for-profit college students is concentrated at the very low end of the scale and is significantly lower than family income for both public and private nonprofit students. This is notable because, overall, public colleges have a lower price tag than for-profit colleges.
  • Three-year default rates are also higher among for-profit college graduates. The Department of Education withholds federal loans from many for-profit colleges because of their high default rates, so students at those schools who rely on private loans are not even reflected in these figures.
  • Median earnings appear similarly dispersed across all school types, which was unexpected. Earnings are concentrated around $40,000 with tails at both ends.
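
To compare how a variable is distributed within each school type, not just where its center sits, one option is to compute quartiles per group. The sketch below does this for default rates with Python's statistics.quantiles. The rates are invented, and the quartile summary is an illustrative stand-in for the density-style plots described above, not a reproduction of them.

```python
from collections import defaultdict
from statistics import quantiles

# Invented three-year default rates, paired with each school's type.
records = [
    ("Public", 0.03), ("Public", 0.05), ("Public", 0.06), ("Public", 0.08),
    ("Private nonprofit", 0.02), ("Private nonprofit", 0.04),
    ("Private nonprofit", 0.05), ("Private nonprofit", 0.07),
    ("Private for-profit", 0.12), ("Private for-profit", 0.15),
    ("Private for-profit", 0.18), ("Private for-profit", 0.22),
]


def quartiles_by_type(rows):
    """Return {school type: (Q1, median, Q3)} for the given (type, value) rows."""
    grouped = defaultdict(list)
    for school_type, value in rows:
        if value is not None:
            grouped[school_type].append(value)
    # Groups with fewer than two observations are skipped.
    return {
        school_type: tuple(quantiles(values, n=4))
        for school_type, values in grouped.items()
        if len(values) >= 2
    }


default_quartiles = quartiles_by_type(records)
for school_type, (q1, q2, q3) in default_quartiles.items():
    print(f"{school_type}: Q1={q1:.3f}, median={q2:.3f}, Q3={q3:.3f}")
```

The same function can be reused for family income or earnings by swapping the values in the input rows.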


The College Scorecard data definitely has its shortcomings - much of its data is based on students who have federal loans, and it may not completely represent the full undergraduate population. However, it also provides a trove of information that was previously unavailable, including data on student outcomes. While no single data point can capture a school's "value," the College Scorecard is a very useful resource for prospective college students to understand and compare different schools across a variety of important metrics. I invite you to interact with my Shiny App to further explore the data and my insights.

Link to my GitHub.

About Author

Julia Goldstein

Julia has over five years of experience delivering business insight through data analysis and visualization. As an analytics and management consultant, she was responsible for managing projects, identifying solutions, and developing support among senior-level stakeholders. Moving forward, Julia...