Loan Data Analysis and Visualization using Lending Club Data
LendingClub Corp (LC) is the first and largest online peer-to-peer (“P2P”) platform facilitating the lending and borrowing of unsecured loans ranging from $1,000 to $35,000. Aiming to offer lower transaction costs than other financial intermediaries, LendingClub had the largest IPO in the tech sector in 2014.
This project analyzes the personal loan payment dataset of LendingClub Corp (LC), available on Kaggle.com, to better understand the best borrower profile for investors.
The dataset covers an extensive amount of information on the borrower's side that was originally available to lenders when they made investment choices. By further segmenting the loan dataset into finished cases and currently outstanding loans, this project breaks down the composition of the default cases and examines the correlation among indicators. In the end, the goal is to provide investors and borrowers, as well as LendingClub, with additional insights regarding investment opportunities and contingent loan collection advice. (Please note that, to keep the diagrams simple and readable, this project recoded some categories that have few or no observations.)
II. Data Analysis
II.1 Interest Rate Vs. Number of Approved Cases
We can generally regard the interest rate charged at loan issuance as a cost that borrowers incur, and the number of approved cases as an indicator of demand. On visual inspection, the time series of the average interest rate and of the number of approved loans track each other quite closely. The exceptions are the plunge in interest rates in late 2007 (figure above), attributed to an injection of VC funding, and the fluctuations in the number of approved cases around 2015 (figure below), attributed to the managerial scandals.
Therefore, it comes as no surprise that a scatter plot of interest rates against the number of approved cases over the period shows a positive relationship: all else being equal, increasing demand drives up prices.
- Tips for investors: Analyze the loan funds market the same way you would any other investment opportunity!
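The positive relationship can also be checked numerically rather than by eye. The following is a minimal sketch, assuming each loan record carries an issue month and an interest rate. The toy records and the helper names `monthly_summary` and `pearson` are hypothetical illustrations, not the data or code behind the original figures.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical toy records: (issue month, interest rate in percent).
# These values are made up purely to illustrate the calculation.
loans = [
    ("2014-01", 12.0), ("2014-01", 13.0),
    ("2014-02", 13.0), ("2014-02", 13.5), ("2014-02", 14.0),
    ("2014-03", 14.0), ("2014-03", 14.5), ("2014-03", 15.0), ("2014-03", 15.5),
]

def monthly_summary(records):
    """Return {month: (number of loans, average interest rate)}."""
    rates_by_month = defaultdict(list)
    for month, rate in records:
        rates_by_month[month].append(rate)
    return {
        month: (len(rates), sum(rates) / len(rates))
        for month, rates in sorted(rates_by_month.items())
    }

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

summary = monthly_summary(loans)
counts = [count for count, _ in summary.values()]
avg_rates = [avg for _, avg in summary.values()]
r = pearson(counts, avg_rates)
print(summary)
print(f"correlation between loan count and average rate: {r:.3f}")
```

A coefficient close to +1 on the real monthly aggregates would support the reading of the scatter plot. A correlation alone does not establish that demand is what drives the rate.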
II.2 Sample Default Indicator Breakdown
This section briefly discusses two of the indicators as an example of the richness of the dataset: Home Ownership Types and the borrower's rating grade.
As can be inferred from Figure 4, the stacked count under the 'Fully Paid' status is much higher than the one under 'Default'. Fortunately for LendingClub investors, most of them recovered their funds at the pre-agreed interest rate. We can also infer from the histogram that relatively more applicants have a mortgage or rent their home than own it outright.
As can be seen from the graph above, there is no apparent relationship between the type of home ownership and the default rate. A closer examination of the ratio of defaults by type of home ownership confirms this: the probability of default in past observations is almost identical across types. In other words, applicants with different housing types have historically been about equally likely to default. Rating grade, on the other hand, has a more direct relationship to default. The probability of default increases stepwise as we move down the rating grades of borrowers.
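The comparison of default ratios by category can be sketched as follows. This is an illustration only: the records are invented so that the output mirrors the pattern described above (flat across home ownership, stepwise across grade), and `default_rate_by` is a hypothetical helper, not the code used in the project.

```python
from collections import Counter

# Hypothetical finished loans: (grade, home ownership, final status).
# Values are made up to illustrate the calculation.
finished = [
    ("A", "MORTGAGE", "Fully Paid"), ("A", "RENT", "Fully Paid"),
    ("A", "OWN", "Fully Paid"), ("A", "MORTGAGE", "Default"),
    ("C", "RENT", "Fully Paid"), ("C", "OWN", "Fully Paid"),
    ("C", "RENT", "Default"), ("C", "OWN", "Default"),
    ("E", "MORTGAGE", "Default"), ("E", "RENT", "Default"),
    ("E", "OWN", "Default"), ("E", "MORTGAGE", "Fully Paid"),
]

def default_rate_by(records, key_index):
    """Share of loans with status 'Default' for each value of one field."""
    totals, defaults = Counter(), Counter()
    for record in records:
        key = record[key_index]
        totals[key] += 1
        if record[2] == "Default":
            defaults[key] += 1
    return {key: defaults[key] / totals[key] for key in sorted(totals)}

by_grade = default_rate_by(finished, 0)  # grade is field 0
by_home = default_rate_by(finished, 1)   # home ownership is field 1
print("default rate by grade:", by_grade)
print("default rate by home ownership:", by_home)
```

Comparing ratios rather than raw counts matters here because the categories have very different sizes, so a taller 'Default' bar does not by itself mean a higher default probability.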
- Tips for LendingClub: Exert extra scrutiny for the applicants with lower ratings!
II.3 Interest Rate Vs. Default Rate
Since interest rates are calculated based on the profile of an applicant, interest rate plots are good indications of the quality of the applicant pool. As can be seen above, average observed interest rates differ by month, year, and geography. The lowest average interest rates occurred in July and November, and the highest in June. Applicants from Idaho, Iowa, and Maine received much lower rates on average than those from Indiana and Tennessee.
Interestingly, the color shading for the average default rate by state is pretty much the opposite of the shading for the interest rate. Plotting the two together in a scatter plot with a fitted linear-model (LM) line shows a clear positive relation, comparable to a risk premium that rises to compensate for greater risk.
- Tips for investors: Research the risk and diversify!
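The scatter plot with a fitted line amounts to averaging both quantities by state and then running a simple least-squares regression. Below is a minimal sketch under those assumptions. The state codes are placeholders, the numbers are invented, and `state_averages` and `least_squares_line` are hypothetical helpers.

```python
from collections import defaultdict

# Hypothetical loans: (state code, interest rate in percent, 1 if defaulted else 0).
# "AA", "BB", "CC" are placeholder codes; all values are made up.
loans = [
    ("AA", 11.0, 0), ("AA", 12.0, 0), ("AA", 13.0, 0), ("AA", 12.0, 1),
    ("BB", 13.0, 0), ("BB", 14.0, 1), ("BB", 15.0, 0), ("BB", 14.0, 1),
    ("CC", 15.0, 1), ("CC", 16.0, 1), ("CC", 17.0, 0), ("CC", 16.0, 1),
]

def state_averages(records):
    """Return {state: (average interest rate, default rate)}."""
    rates, flags = defaultdict(list), defaultdict(list)
    for state, rate, defaulted in records:
        rates[state].append(rate)
        flags[state].append(defaulted)
    return {
        state: (sum(rates[state]) / len(rates[state]),
                sum(flags[state]) / len(flags[state]))
        for state in sorted(rates)
    }

def least_squares_line(xs, ys):
    """Slope and intercept of the ordinary least-squares line y = a*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

averages = state_averages(loans)
avg_rate = [rate for rate, _ in averages.values()]
default_rate = [rate for _, rate in averages.values()]
slope, intercept = least_squares_line(avg_rate, default_rate)
print(averages)
print(f"default_rate = {slope:.3f} * avg_rate + {intercept:.3f}")
```

A positive slope on the real state-level averages corresponds to the upward-sloping line in the scatter plot: states where borrowers pay higher rates also show higher default rates.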
II.4 An Example of Expected Loss Prediction
Last but not least, to demonstrate the predictive power of the dataset, this section presents an application of logistic regression to estimate the expected loss, using the segment of loans whose status is listed as 'Current'.
The expected loss is defined by the following equation:
$$EL_i = \sum_{j \in S_i} p_j \, (A_j - P_j)$$
where $EL_i$ is the expected loss for state $i$, $S_i$ is the set of current loans in that state, $p_j$ is the probability of default of loan $j$, and $A_j - P_j$ is the payment gap, defined as the difference between the total amount of the loan and the amount already paid at a specific point in time.
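The definition translates directly into a grouped sum. The sketch below assumes the default probabilities are already available for each loan. The state codes are placeholders, the figures are invented, and `expected_loss_by_state` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical current loans:
# (state code, probability of default, total loan amount, amount already paid).
# "AA" and "BB" are placeholder codes; all values are made up.
current_loans = [
    ("AA", 0.20, 10000.0, 4000.0),
    ("AA", 0.50, 20000.0, 5000.0),
    ("BB", 0.10, 8000.0, 6000.0),
]

def expected_loss_by_state(loans):
    """Sum probability of default times payment gap over the loans of each state."""
    losses = defaultdict(float)
    for state, p_default, loan_amount, amount_paid in loans:
        payment_gap = loan_amount - amount_paid  # what is still owed
        losses[state] += p_default * payment_gap
    return dict(losses)

expected_loss = expected_loss_by_state(current_loans)
for state, loss in sorted(expected_loss.items()):
    print(f"{state}: expected loss = {loss:,.2f}")
```

Because the measure is a sum over loans, states with many large outstanding loans will rank high even if their borrowers are no riskier than average, which is worth keeping in mind when reading the map.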
The probability of default is obtained by applying the parameters estimated from a training set to each loan's predictor values, using annual income, funded amount, home ownership, borrower's grade, and installment amount as variables. The logit probability cut-off is set at 0.7 for visualization purposes. The results, which depend on the model assumptions, show that California, Texas, New York, and Florida carry the heaviest risk of large losses, whereas the Midwestern states present a much more optimistic loan repayment outlook.
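To make the estimation step concrete, here is a minimal sketch of fitting a logistic regression on a training set and then scoring new loans. It is not the project's actual model: it uses plain batch gradient descent instead of a statistics library, and it replaces the five predictors named above with two hypothetical 0/1 indicators (`low_grade`, `low_income`) on invented data.

```python
from math import exp, log

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def predict_proba(weights, x):
    """Probability of default for features x; weights[0] is the intercept."""
    z = weights[0] + sum(w * xj for w, xj in zip(weights[1:], x))
    return sigmoid(z)

def log_loss(weights, rows, labels):
    """Average negative log-likelihood of the labels under the model."""
    total = 0.0
    for x, y in zip(rows, labels):
        p = predict_proba(weights, x)
        total -= y * log(p) + (1 - y) * log(1 - p)
    return total / len(rows)

def fit_logistic(rows, labels, learning_rate=0.1, epochs=5000):
    """Estimate the weights by batch gradient descent on the log-loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            error = predict_proba(weights, x) - y
            grad[0] += error
            for j, xj in enumerate(x):
                grad[j + 1] += error * xj
        weights = [w - learning_rate * g / len(rows) for w, g in zip(weights, grad)]
    return weights

# Hypothetical training set of finished loans, keyed by
# (low_grade, low_income); each list holds outcomes (1 = default).
cells = {
    (0, 0): [0, 0, 0, 1],
    (1, 0): [0, 0, 1, 1],
    (0, 1): [0, 0, 1, 1],
    (1, 1): [0, 1, 1, 1],
}
rows = [x for x, ys in cells.items() for _ in ys]
labels = [y for ys in cells.values() for y in ys]

weights = fit_logistic(rows, labels)
p_safe = predict_proba(weights, (0, 0))   # high grade, higher income
p_risky = predict_proba(weights, (1, 1))  # low grade, lower income
print(f"weights: {[round(w, 3) for w in weights]}")
print(f"P(default | safe profile)  = {p_safe:.3f}")
print(f"P(default | risky profile) = {p_risky:.3f}")
```

The fitted probabilities for loans still listed as 'Current' are what would then be multiplied by each loan's payment gap. How the 0.7 cut-off enters the original calculation is not specified in the text, so it is left out of this sketch.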
- Tips for LendingClub: Allocate more resources into loan collection for the darker states!
The project uses visualization to analyze LendingClub’s loan applicants and extends to an application of logit regression for estimating future losses. I find that default probabilities can differ considerably across applicant traits; in particular, the probability of default rises stepwise as the rating grade falls.
In addition, average interest rates differ considerably across states and over time, and they serve as a good indicator of the quality of the applicant pool. Lastly, the expected loss on currently outstanding loans is much higher in California, Texas, New York, and Florida, so more resources should be allotted to loan collection and to screening new applications in these states.