Loan Data Analysis and Visualization using Lending Club Data

Linlin Cheng
Posted on Jul 23, 2016

I. Introduction

LendingClub Corp. (LC) is the largest, and one of the earliest, online Peer-to-Peer (“P2P”) platforms facilitating the lending and borrowing of unsecured loans ranging from $1,000 to $35,000. Aiming to offer lower transaction costs than other financial intermediaries, LendingClub had one of the largest tech-sector IPOs of 2014.

This project analyzes the LendingClub personal loan payment dataset, available on Kaggle.com, to better understand the best borrower profile for investors.

The dataset covers an extensive amount of borrower-side information that was originally available to lenders when they made their investment choices. By segmenting the loan dataset into finished cases and currently outstanding loans, this project breaks down the composition of the default cases and examines the correlation among indicators. The goal is to provide investors and borrowers, as well as LendingClub, with additional insights regarding investment opportunities and contingent loan collection advice. (Please note that, to keep the diagrams simple and readable, this project re-coded some categories with few or no observations.)

II. Analysis

II.1 Interest Rate Vs. Number of Approved Cases

Rplot02

Figure 2. Time Series Plot of Approved Loans Count

Interest rates charged on loan issuance can generally be regarded as a cost that borrowers incur, and the number of approved cases as an indicator of demand. By rough eyeballing, the time series of the average interest rate and of the number of approved loans track each other quite closely. The exceptions are the plunge in interest rates in late 2007 in the figure above, which followed an injection of VC funding, and the fluctuations in the number of approved cases around 2015 in the figure below, which followed the managerial scandals.

Rplot3

Figure 3. Scatterplot of Interest Rate and Approved Loan Counts

Therefore, it comes as no surprise that a scatter plot of interest rates against the number of approved cases over the period shows a positive relationship: all else being equal, increasing demand drives up prices.
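As a rough illustration of the comparison above, the following sketch aggregates loans by issue month and computes the Pearson correlation between the monthly average interest rate and the monthly loan count. It is not the code used for this post: the records are hypothetical toy values, and the function names (`monthly_summary`, `pearson`) are made up for this example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records (issue month, interest rate in %); NOT LendingClub data.
loans = [
    ("2012-01", 11.0), ("2012-01", 12.0),
    ("2012-02", 12.0), ("2012-02", 13.0), ("2012-02", 14.0),
    ("2012-03", 13.0), ("2012-03", 14.0), ("2012-03", 15.0),
    ("2012-03", 16.0), ("2012-03", 17.0),
]

def monthly_summary(records):
    """Return {month: (average interest rate, number of approved loans)}."""
    by_month = defaultdict(list)
    for month, rate in records:
        by_month[month].append(rate)
    return {m: (mean(rates), len(rates)) for m, rates in sorted(by_month.items())}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

summary = monthly_summary(loans)
avg_rates = [avg for avg, _ in summary.values()]
counts = [n for _, n in summary.values()]
r = pearson(avg_rates, counts)
print(summary)
print(round(r, 3))
```

A coefficient near +1 would match the positive relationship seen in the scatter plot. The toy data here is constructed to be positively related, so it says nothing about the real dataset.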

  • Tips for investors: Evaluate the loan funds market the same way you would any other investment opportunity!

II.2 Sample Default Indicator Breakdown

This section briefly discusses two of the indicators as an example of the richness of the dataset: Home Ownership Types and the borrower's rating grade.

As can be inferred from Figure 4, the stack of counts under the status 'Fully Paid' is much higher than the one under 'Default'. Fortunately for LendingClub investors, then, most of them received their funds back with the agreed interest. We can also infer from the chart that relatively more applicants have a mortgage or rent their home than own it outright.

Rplot7

Figure 5. Default Ratios on Borrower's Grade

Rplot6

Figure 6. Default Ratios on Borrower's Home Ownership

As can be seen from the graph above, there is no clear relationship between the type of home ownership and the default rate: on closer examination of the default ratio by home ownership type, the historical probabilities of default are almost identical. In other words, applicants with different housing types have been about equally likely to default. Rating grade, on the other hand, has a more direct relationship to default: the probability of default increases stepwise as we move down the borrowers' rating grades.
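The default ratios discussed above can be sketched as a simple grouped proportion over finished loans. This is a minimal sketch, not the code behind the figures: the records are hypothetical, and treating the statuses 'Default' and 'Charged Off' as defaults is an assumption made for this example.

```python
from collections import Counter

# Assumption for this sketch: which loan statuses count as a default.
DEFAULT_STATUSES = {"Default", "Charged Off"}

# Hypothetical finished loans as (grade, status) pairs; NOT LendingClub data.
finished = (
    [("A", "Fully Paid")] * 4 + [("A", "Charged Off")] * 1
    + [("B", "Fully Paid")] * 3 + [("B", "Charged Off")] * 2
    + [("C", "Fully Paid")] * 2 + [("C", "Default")] * 3
)

def default_ratio_by_group(records):
    """Return {group: share of that group's finished loans that defaulted}."""
    totals, defaults = Counter(), Counter()
    for group, status in records:
        totals[group] += 1
        if status in DEFAULT_STATUSES:
            defaults[group] += 1
    return {g: defaults[g] / totals[g] for g in sorted(totals)}

ratios = default_ratio_by_group(finished)
for grade, ratio in ratios.items():
    print(grade, round(ratio, 2))
```

The same function works for home ownership by passing (home ownership type, status) pairs instead of (grade, status) pairs.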

  • Tips for LendingClub: Apply extra scrutiny to applicants with lower ratings!

II.3 Interest Rate Vs. Default Rate

Rplot4

Figure 7. Average Interest Rate by Month

Rplot8

Figure 8. Spatial Plot for Average Interest Rate

Because interest rates are calculated from an applicant's profile, interest rate plots are a good indication of the quality of the applicant pool. As can be seen above, average observed interest rates differ by month, year, and geography. The lowest average interest rates occurred in July and November, and the highest in June. Applicants from Idaho, Iowa, and Maine received much lower rates on average than those from Indiana and Tennessee.

Rplot9

Figure 9. Spatial Plot for Default Rate

Rplot_10

Figure 10. Scatterplot for Default Rate and Interest Rate

Interestingly, the color shading for the average default rate by state is nearly the opposite of that for the interest rate. Plotting the two together in a scatter plot with a fitted linear model (LM) line shows a clear positive relationship, consistent with a higher risk premium being charged to compensate for higher risk.
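The fitted line in such a scatter plot is an ordinary least squares fit of default rate on average interest rate. Here is a minimal sketch of that fit, assuming one (average interest rate, default rate) pair per state. The four points are hypothetical placeholders rather than values from the dataset, and `fit_line` is a name made up for this example.

```python
from statistics import mean

# Hypothetical state-level points: (average interest rate in %, default rate).
# Placeholder values for illustration only; NOT taken from the dataset.
state_points = [(10.0, 0.05), (12.0, 0.07), (14.0, 0.08), (16.0, 0.10)]

def fit_line(points):
    """Ordinary least squares fit of y = intercept + slope * x."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

slope, intercept = fit_line(state_points)
print(slope, intercept)
```

A positive slope means that states with higher average interest rates also show higher default rates, which is the risk-premium pattern described above.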

  • Tips for the investor: Research the risk and diversify!

II.4  An Example of Expected Loss Prediction

Last but not least, to demonstrate the predictive power of the dataset, this section presents an application of logistic regression to estimate the expected loss, using the segmented data on loans whose status is listed as 'Current'.

The expected loss is defined by the following equation:

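$$\mathrm{EL}_i = \sum_{j \in S_i} p_j \, (A_j - R_j)$$

where the expected loss $\mathrm{EL}_i$ for state $i$ is the sum, over the current loans $S_i$ in that state, of each loan's probability of default $p_j$ times its payment gap, defined as the difference between the total amount of the loan $A_j$ and the amount already paid $R_j$ at a specific point in time.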

The probability of default is obtained by multiplying the loan data matrix by the parameters estimated from a training set and applying the logistic transformation, with annual income, funded amount, home ownership, borrower's grade, and the installment amount as variables. The logit probability cutoff is set at 0.7 for visualization purposes. The results, based on the model assumptions, show that California, Texas, New York, and Florida carry the heaviest risk of large losses, whereas the Midwestern states present a much more optimistic loan payment outlook.
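To make the calculation concrete, here is a minimal sketch of the expected loss per state: a logistic model turns each loan's features into a default probability, which is multiplied by the payment gap and summed by state. The coefficients and loan records below are hypothetical, not the fitted values behind Figure 11. Only two of the variables listed above are used, to keep the example short. The dictionary keys (`annual_inc`, `installment`, `loan_amnt`, `total_pymnt`, `state`) are assumed names for this illustration, and the 0.7 cutoff is not applied here.

```python
import math
from collections import defaultdict

# Hypothetical coefficients for illustration; NOT the values estimated in this project.
COEF = {"intercept": -1.0, "annual_inc_per_10k": -0.5, "installment_per_100": 0.5}

def default_probability(loan):
    """Logistic transformation of the linear predictor for one loan."""
    z = (COEF["intercept"]
         + COEF["annual_inc_per_10k"] * loan["annual_inc"] / 10_000
         + COEF["installment_per_100"] * loan["installment"] / 100)
    return 1.0 / (1.0 + math.exp(-z))

def expected_loss_by_state(loans):
    """Sum of (probability of default x payment gap) over the loans in each state."""
    totals = defaultdict(float)
    for loan in loans:
        gap = loan["loan_amnt"] - loan["total_pymnt"]  # amount still outstanding
        totals[loan["state"]] += default_probability(loan) * gap
    return dict(totals)

# Hypothetical 'Current' loans; placeholder values only.
current_loans = [
    {"state": "CA", "annual_inc": 20_000, "installment": 400,
     "loan_amnt": 10_000, "total_pymnt": 4_000},
    {"state": "CA", "annual_inc": 80_000, "installment": 300,
     "loan_amnt": 12_000, "total_pymnt": 6_000},
    {"state": "IA", "annual_inc": 60_000, "installment": 200,
     "loan_amnt": 5_000, "total_pymnt": 4_000},
]

losses = expected_loss_by_state(current_loans)
for state, loss in sorted(losses.items()):
    print(state, round(loss, 2))
```

Because each loan's contribution scales with its payment gap, states with many large outstanding balances can rank high in expected loss even when their default probabilities are moderate.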

Rplot11

Figure 11. Expected Loss Preview

  • Tips for LendingClub: Allocate more resources to loan collection in the darker-shaded states!

III. Conclusion

This project uses visualization to analyze LendingClub’s loan applicants and extends to an application of logistic regression for future loss estimation. I find that applicant traits can be associated with quite different default probabilities; in particular, the probability of default rises stepwise as the rating grade falls. In addition, average interest rates differ considerably across states and over time, and serve as a good indicator of the quality of the applicant pool. Lastly, the expected loss on currently outstanding loans is much higher in California, Texas, New York, and Florida, so more resources should be allotted to loan collection and to the screening of new applications in these states.
