FHFA Home Mortgage Data Insights

Radhey Shyam
Posted on May 16, 2016

Contributed by Radhey Shyam. He  is currently in the NYC Data Science Academy 12 week full time Data Science Bootcamp program taking place between April 11th to July 1st, 2016. This post is based on his second class project - R Shiny (due on the 4th week of the program).


Federal Housing Finance Agency(FHFA) is an independent regulatory agency responsible for monitoring the secondary mortgage markets  consisting of government  sponsored enterprises of  Fannie Mae, Freddie Mac and the Federal Home Loan Bank System.  FHFA with its  combined entities provide more than $5.5 trillion in funding for the U.S.  mortgage markets and financial institutions. It was created in July of 2008 under  Housing and Economic Recovery Act of 2008 and under this, the data is made available for  public use.


For this project, I used Single-Family Census Tract File from FHFA website. It consists of data thirty nine columns having detailed information such as gender, race, and age of the borrower as well as the census tract location of the property . I combined the data files from five years from 2010-2014 and final combined consisting of  more than 22 million rows and 39 columns. For analyzing such a huge data, I decided to concentrate on some key factors like age, race and gender of the borrowers and co-borrowers from all 50 states.

My aim  was to find that is there any difference between the gender, race and age profiles of borrowers  and overall trend of the home mortgage data.

First of all, data text files are available separately for Fannie Mae and Freddie Mac enterprises and  by years wise from 2014 to 2010, so I have to load the text files into the local df by year wise and check their dimensionality and repeat the process for years from 2014 to 2010.


There are no missing values in the datasets but there are numerical codes for not available and not available values.For example , there were 171 rows out of 1.89 million rows ( less than 0.010 percent)  that have missing  state codes  in 2014 Fannie Mac data files and there were no missing state codes in Freddie Mac file in all years. Similarly for all years, Fannie Mae files have less than 0.010 percent missing state codes and surprisedly, Freddie  Mac have no missing state codes. So, for practical purposes, I decided to delete missing state codes rows than rather  imputing them.


Repeating the same steps for checking for missing of values for all other years .Finally , I bind all the rows together and final data frame contains more than $22 millions rows.


Now, I decided to created  another data frame having summary of state-wise results because Shiny web was very slow and took lot of time for loading such a large date-frame.  Also, the data contains only state codes, I have to create another table for state codename from R's  state.abb and state.name . But it has information for only 50 states . But I verified and found that our data set  contains 54 state codes , So I have to found additional codes from MSA files for DC,GUAM, ADJUNTAS MUNICIPIO and ST CROIX ISLAND . Then, I manually inserted then into the state  full name table. For finding state wise results, I decide to first group_by year, then by state code and finally by gender and summarize  the final group_by df by gender count.


Designing and creating the Shiny App:

After the data has been summarized into a smaller data frame, it was faster to run the shiny app. As it is a massive data, I decided to add a select box to select the state and user can also select the columns to display in the data or all data can be loaded into data tab. Below is the  snap shop of the app

Screen Shot 2016-05-15 at 9.30.01 PM

By selecting the state,we can see the bar chart of gender profile of Borrowers in the state for last five years.

We can also see the summarized data of all the states.

Screen Shot 2016-05-15 at 9.43.26 PM

We can select any state , the data will load instantly to show the gender profile of the borrowers.I am also in the process of adding some more columns for analysis for example age, race and income profile of the borrowers.


Once the data was clean and then main part challenge was how to organize such a large data ( more than 22 million rows) for viewing. The main challenge was how to design the data for summary .Once I figured out that , then it was smooth process. The Shinny was fun to work and did not require much coding , the only trick part was how to make your data/bar graphs  change to inputs(by making data reactive) .

I would like to add more factors/tabs  to the app in the future.




About Author

Radhey Shyam

Radhey Shyam

Radhey Shyam received his Ph.D. in Physics from Clemson University, USA and M.S. (Computer Science & Engineering) from GJU of Science & Technology, India. Radhey Shyam did his master's thesis on Network Monitor System, where he applied innovative...
View all posts by Radhey Shyam >

Leave a Comment

No comments found.

View Posts by Categories

Our Recent Popular Posts

View Posts by Tags

#python #trainwithnycdsa 2019 airbnb Alex Baransky alumni Alumni Interview Alumni Reviews Alumni Spotlight alumni story Alumnus API Application artist aws beautiful soup Best Bootcamp Best Data Science 2019 Best Data Science Bootcamp Best Data Science Bootcamp 2020 Best Ranked Big Data Book Launch Book-Signing bootcamp Bootcamp Alumni Bootcamp Prep Bundles California Cancer Research capstone Career Career Day citibike clustering Coding Course Demo Course Report D3.js data Data Analyst data science Data Science Academy Data Science Bootcamp Data science jobs Data Science Reviews Data Scientist Data Scientist Jobs data visualization Deep Learning Demo Day Discount dplyr employer networking feature engineering Finance Financial Data Science Flask gbm Get Hired ggplot2 googleVis Hadoop higgs boson Hiring hiring partner events Hiring Partners Industry Experts Instructor Blog Instructor Interview Job Job Placement Jobs Jon Krohn JP Morgan Chase Kaggle Kickstarter lasso regression Lead Data Scienctist Lead Data Scientist leaflet linear regression Logistic Regression machine learning Maps matplotlib Medical Research Meet the team meetup Networking neural network Neural networks New Courses nlp NYC NYC Data Science nyc data science academy NYC Open Data NYCDSA NYCDSA Alumni Online Online Bootcamp Open Data painter pandas Part-time Portfolio Development prediction Prework Programming PwC python python machine learning python scrapy python web scraping python webscraping Python Workshop R R language R Programming R Shiny r studio R Visualization R Workshop R-bloggers random forest Ranking recommendation recommendation system regression Remote remote data science bootcamp Scrapy scrapy visualization seaborn Selenium sentiment analysis Shiny Shiny Dashboard Spark Special Special Summer Sports statistics streaming Student Interview Student Showcase SVM Switchup Tableau team TensorFlow Testimonial tf-idf Top Data Science Bootcamp twitter visualization web scraping Weekend Course What to expect word cloud word2vec XGBoost yelp