Predicting LendingClub Defaults and Returns
Introduction to LendingClub
In the days before peer-to-peer (P2P) lending, if you needed money for personal purposes, you had a few standard options: apply for a loan from a bank, rack up credit card debt, or borrow from friends and family. Each of these approaches carried its own hurdles, complexities, and frictions. In 2007, LendingClub saw an opportunity to disrupt these traditional options by creating a P2P lending platform to directly connect individual borrowers and lenders. Thus, they removed the traditional role of the big corporation or bank (in the case of bank loans and credit cards) while greatly expanding an individual borrower’s reach (in the case of family and friends).
A key aspect of LendingClub’s model was a simple, low-friction experience for both borrower and lender. The application process for borrowers was straightforward: after providing details such as the purpose of the loan, income, and the information required to retrieve credit report data, the borrower would be either approved or denied. If approved, LendingClub would assign a credit rating ranging from A to G, with A representing borrowers with the highest credit quality and G representing the lowest quality. (F and G rated loans were discontinued in 2017 due to high default rates.)
An interest rate commensurate with the loan grade/credit quality would also be assigned to the loan. The process was friendly for the lender as well. After creating and funding an account, one could easily browse and invest in any of the hundreds of thousands of loans seeking funding. The lender could search and filter loans by attributes such as loan grade, income, debt-to-income ratio, and loan purpose to find specific loans that fit the lender’s investment criteria.
Objective
While LendingClub presented investors with an exciting and novel investment opportunity, one of the key concerns was the risk of default. LendingClub loans were unsecured personal loans, meaning that there was no collateral backing the loan. If the borrower defaulted, the lender would generally lose all remaining interest and principal.
While the investment losses incurred in a loan default could be severe, predicting defaults is a task well suited to a classification model. Therefore, our primary objective was to train multiple models to accurately predict loan defaults. With a high-performing model in hand, a LendingClub lender would be able to navigate the thousands of loans with more confidence and achieve higher returns with less risk.
Another important objective of this project was to perform exploratory data analysis (EDA) to understand and analyze the LendingClub loan portfolio. Findings from this process would also be critical in further enhancing investment returns while reducing risk by answering the questions a discerning investor would ask, such as which loan attributes were most correlated with defaults, what the average interest rates were for each loan grade, and how the quality of loans may have changed over time.
Data Exploration & Model Preparation of LendingClub Data - Pt. 1
Before the modeling process and EDA could start, there were 3 major challenges that had to be solved.
First, the dataset contains information about the loans that originated on the LendingClub platform from 2007 to 2018. As a result, it was large: the initial size was over 1.6 GB, with approximately 2.6 million rows (one per loan) and 151 columns. In order to work with such a large dataset, we utilized Dask (a pandas-like library for processing large datasets) and Coiled (a user-friendly framework that sets up clusters to add computing power).
Second, in order to prevent data leakage, it is important to separate information that is known at the time of loan origination from future information that informs loan performance. After parsing through the 151 columns and creating a data dictionary, we decided which fields to keep and which fields to reject.
Data Exploration & Model Preparation of LendingClub Data - Pt. 2
After some investigation, we also found that LendingClub pulled borrowers' credit reports on a regular basis and updated important fields such as the number of credit lines the borrower opened in the last 6 months. Because those values could reflect information from after origination, we concluded that we could not use those fields. At the end of the process, we kept 30 fields that we thought were relevant.
Finally, the imbalanced nature of a loan default dataset posed a problem when creating models. In this dataset, ‘charged off’ indicated loan default and was the minority class with a representation of only 14.4%. Therefore, if an investor invested blindly in a random set of loans on the platform, we would expect about 14.4% of them to default.
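The leakage filter described above can be sketched as a simple allow-list. This is a minimal illustration in plain Python, not the authors' data dictionary: the column names and the example row are assumptions chosen for demonstration.

```python
# Hypothetical allow-list: fields whose values are known when the loan is issued.
# These names are illustrative; the authors' actual 30 fields are not listed in the article.
KNOWN_AT_ORIGINATION = {
    "loan_amnt", "term", "int_rate", "grade", "annual_inc", "dti", "purpose",
}
TARGET = "loan_status"


def drop_leaky_fields(row):
    """Keep only origination-time features plus the target label."""
    keep = KNOWN_AT_ORIGINATION | {TARGET}
    return {name: value for name, value in row.items() if name in keep}


# Made-up example row mixing origination-time and post-origination fields.
raw_row = {
    "loan_amnt": 10000, "term": "36 months", "int_rate": 11.5, "grade": "B",
    "annual_inc": 65000, "dti": 18.2, "purpose": "debt_consolidation",
    "total_pymnt": 4200.0,        # only known as payments come in
    "recoveries": 310.0,          # only non-zero after a charge-off
    "last_pymnt_d": "Mar-2016",   # updated over the life of the loan
    "loan_status": "Charged Off",
}

clean_row = drop_leaky_fields(raw_row)
print(sorted(clean_row))
```

An allow-list is used here rather than a block-list so that any column not explicitly reviewed is excluded by default.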
The imbalanced dataset also posed a problem because the Null Model, i.e., a model that predicts the majority class regardless of the observation, already has an accuracy of 85%. This indicated that accuracy alone would not be a sufficient metric when building our model. We needed to explore other metrics as well, like precision, specificity and sensitivity.
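To make the point concrete, the sketch below scores a null model on a toy portfolio of 1,000 loans with the 14.4% default share quoted above. The portfolio size and the helper names are assumptions for illustration; "charged off" is treated as the positive class.

```python
def confusion_counts(y_true, y_pred):
    """Count outcomes, treating 1 (charged off) as the positive class."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def ratio(numerator, denominator):
    """Return a ratio, or None when it is undefined (zero denominator)."""
    return numerator / denominator if denominator else None


def classification_metrics(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),    # of predicted defaults, share that defaulted
        "sensitivity": ratio(tp, tp + fn),  # share of actual defaults that were caught
        "specificity": ratio(tn, tn + fp),  # share of fully paid loans left alone
    }


# Toy portfolio: 1,000 loans, 144 of them charged off (14.4%).
y_true = [1] * 144 + [0] * 856
null_predictions = [0] * len(y_true)  # null model: always predict "fully paid"

null_metrics = classification_metrics(y_true, null_predictions)
print(null_metrics)
```

The null model scores an accuracy of 0.856 while its sensitivity is 0: it never flags a single default, which is exactly what accuracy alone hides.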
LendingClub Data Visualization
LendingClub Loan Duration
An interesting finding in our general exploratory data analysis involved the duration of loans. For both 36-month and 60-month terms, we found that, on average, loans did not run to the full term due to either default or prepayment.
The trend we found was that duration shortens as the loan grade lowers, for both fully paid and defaulted loans. This suggests that the lower-grade, riskier loans tended to default earlier. Even when they were paid in full, they tended to be repaid faster. This could be explained by the higher interest rates on lower-grade loans: it is in the borrower's best interest to pay off such a loan as fast as possible to avoid paying more in interest.
LendingClub Interest Rate
To analyze the interest rates further, we compared the interest rate of LendingClub against the bank prime loan rates during the same period of time. The prime rate is a commercial bank rate that works as the basis for many loans such as mortgages, small business, personal, and commercial loans.
As shown on the graph above, the prime rate remained the same between 2008 and 2015. However, LendingClub’s loan rates generally increased during this time, indicating that the rise in rates was not due to external factors.
Taken together, the flat prime rate and the rising LendingClub rates point to LendingClub approving riskier borrowers over time, which would justify the higher interest rates. While this would generate more business, it also led to higher default rates.
LendingClub Default Rate
What is any investor's worst fear? Default. This led us to focus much of our EDA on drivers of these increasing default rates evident in the graphs below. Split up by term, one can see default rates slowly trending upwards throughout the time period.
A possible explanation is declining loan quality, as displayed in the graphs of DTI and average FICO score over time. Default rates were calculated as the number of loans charged off divided by the total number of loans issued. In addition, across the whole dataset, lower-grade loans are considered riskier because they exhibit higher default rates.
Default Rate vs Key Features
In order to be able to apply our EDA findings to improve our models, we analyzed key features that drove the default rate. By binning some of the continuous features, it was simple to evaluate the default rates within each subgroup. Below, it is evident that the lower the income, the higher the rate of default.
The graph below displays the loan purposes that borrowers listed when applying for a loan. Small business loans had the highest default rate, significantly higher than that of any other loan purpose.
Another feature we looked at was the type of home ownership the borrower listed on their application. We can see renters defaulting at a higher rate than those who owned their home.
The last key feature analyzed was employment length. Borrowers were asked to list an employment length when applying for their loan and a significantly higher default rate was uncovered for those who did not provide this information in their application. It can be concluded that this data was “missing not at random” (MNAR) and could not be ignored in our analysis.
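The subgroup analysis in this section can be sketched as follows: bin a continuous feature, keep missing values as their own category, and compute the default rate within each group. The loans, field names, and income cut points below are made up for illustration and are not taken from the LendingClub data.

```python
from collections import defaultdict


def income_bin(annual_inc):
    """Assign an income to a coarse bucket (cut points are illustrative)."""
    if annual_inc < 42_000:
        return "<42k"
    if annual_inc < 80_000:
        return "42k-80k"
    return ">=80k"


def default_rate_by(loans, key):
    """Default rate per subgroup = charged-off loans / all loans in the subgroup."""
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for loan in loans:
        group = key(loan)
        totals[group] += 1
        defaults[group] += loan["charged_off"]
    return {group: defaults[group] / totals[group] for group in totals}


# Made-up toy loans; charged_off is 1 for a default and 0 for a fully paid loan.
loans = [
    {"annual_inc": 30_000, "emp_length": None, "charged_off": 1},
    {"annual_inc": 35_000, "emp_length": "2 years", "charged_off": 1},
    {"annual_inc": 40_000, "emp_length": "10+ years", "charged_off": 0},
    {"annual_inc": 55_000, "emp_length": None, "charged_off": 1},
    {"annual_inc": 60_000, "emp_length": "5 years", "charged_off": 0},
    {"annual_inc": 75_000, "emp_length": "3 years", "charged_off": 0},
    {"annual_inc": 90_000, "emp_length": "10+ years", "charged_off": 0},
    {"annual_inc": 120_000, "emp_length": "7 years", "charged_off": 0},
]

by_income = default_rate_by(loans, key=lambda loan: income_bin(loan["annual_inc"]))

# Missing employment length is kept as its own group instead of being dropped,
# so that a "missing not at random" pattern stays visible.
by_employment = default_rate_by(
    loans, key=lambda loan: "missing" if loan["emp_length"] is None else "reported"
)

print(by_income)
print(by_employment)
```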
Models
We used five different models to predict which loans would default (charged off) and which would end up fully paid. The models were:
Logistic Regression
This model was the first one we tried for the classification task due to its efficiency and interpretability. Given the imbalanced nature of the training data, the initial run of the model with only hyperparameter tuning yielded a model that essentially matched the null model. We then utilized Synthetic Minority Oversampling Technique (SMOTE) to oversample the minority class (defaulting loans) in the training data.
This led to mixed results: the model’s precision in predicting fully paid loans improved (91% versus 85% for the null model), but it came at a cost. The model erred too much on the side of safety, predicting that certain loans would default when they would actually be paid in full. This opportunity cost became very apparent when comparing the total return of the model portfolio (8.41%) with that of the null portfolio (9.34%). Interestingly, we found that this underperformance was due to the model conservatively avoiding many of the higher-paying loans (loans with 10+% returns).
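For readers unfamiliar with SMOTE, the core idea is to create synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbors. The sketch below is a simplified, dependency-free illustration of that idea on made-up numeric points; it is not the authors' code, and a production run would more likely rely on an established library implementation.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points by interpolating toward near neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbors of the base point (excluding itself).
        neighbors = sorted(
            (point for point in minority if point is not base),
            key=lambda point: math.dist(base, point),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Made-up training data: 4 minority points (defaults) against 20 majority points.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
n_majority = 20

new_points = smote_sketch(minority, n_new=n_majority - len(minority))
balanced_minority = minority + new_points
print(len(balanced_minority))
```

As in the text, oversampling of this kind belongs on the training split only, so that evaluation still reflects the real class balance.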
Support Vector Machine
After applying grid search to find the best hyperparameters, the linear kernel outperformed the polynomial and radial kernels. Unfortunately, this model was computationally expensive while only producing results similar to the null model.
Linear Discriminant Analysis (LDA)
LDA models the conditional probability that an observation belongs to each of the classes and assigns it to the class with the higher probability. The best results were obtained after tuning the ‘prior’ parameter with ‘balanced accuracy’ as the scoring metric. Sadly, while it performed decently at predicting charged off loans, it was too risk averse and missed out on riskier and more lucrative loans. The return on investment did not exceed that of the null model.
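The effect of tuning the class prior can be seen in a one-feature version of LDA, where each class is a Gaussian with its own mean and a shared variance. The feature, class means, and variance below are hypothetical values chosen only to show how the prior can flip a prediction.

```python
import math


def lda_posterior(x, means, shared_var, priors):
    """Posterior class probabilities for 1-D LDA (Gaussian classes, shared variance)."""
    scores = {
        label: priors[label] * math.exp(-((x - means[label]) ** 2) / (2 * shared_var))
        for label in means
    }
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


# Hypothetical single risk feature: fully paid loans center on 0, defaults on 2.
means = {"fully_paid": 0.0, "charged_off": 2.0}
shared_var = 1.0
x = 1.2  # an observation slightly closer to the "charged_off" mean

# Priors matching the class shares in the data (14.4% defaults).
empirical = lda_posterior(x, means, shared_var, {"fully_paid": 0.856, "charged_off": 0.144})
# Equal priors, one possible outcome of tuning the prior.
equal = lda_posterior(x, means, shared_var, {"fully_paid": 0.5, "charged_off": 0.5})

print(max(empirical, key=empirical.get))
print(max(equal, key=equal.get))
```

With the empirical priors the observation is still assigned to the majority class; with equal priors it is assigned to "charged_off". Raising the minority prior makes the model catch more defaults at the cost of rejecting more loans that would have been repaid.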
Random Forest
A non-linear model we tested was Random Forest. Hyperparameter tuning involved a grid search over the number of trees, max features, and max depth, with the class weight set to ‘balanced’ in order to deal with the imbalanced dataset. However, it performed very similarly to the other models; returns outperformed the null return but only by a small margin. More complex boosting methods were required to improve our results.
CatBoost
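The article does not name the library used, but assuming the ‘balanced’ option follows the scikit-learn heuristic, each class is weighted by n_samples / (n_classes * class_count). The sketch below reproduces that arithmetic on a toy label set with the 14.4% default share from earlier.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Toy labels: 1,000 loans, 144 charged off and 856 fully paid.
labels = ["charged_off"] * 144 + ["fully_paid"] * 856
weights = balanced_class_weights(labels)
print(weights)
```

Each charged-off loan ends up weighted roughly six times as heavily as a fully paid one, so both classes contribute the same total weight during training.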
The final model we used was CatBoost, a tree-based gradient boosting model, similar to XGBoost and LightGBM. Each iteration learned from the previous iterations to improve on the error. The name CatBoost alludes to how it deals internally with categorical features without having to preprocess them with dummification or one-hot encoding. We broke up the CatBoost model into two sections: a run for 36-month loans and a run for 60-month loans. The results were aggregated together to form the final results.
CatBoost Results
Results of each model were compared to the null model where everything was predicted as the majority class (which was fully paid) as well as to the idealized model (where everything was correctly predicted).
Out of the 5 models we tested, the best results came from the CatBoost model.
Because CatBoost is a tree-based model, we were able to obtain a feature importance graph (shown below).
The most important feature was zip code, a proxy for location. Debt-to-Income is a measure of how much a person owes in relation to how much they make. It is known as a coverage ratio. “Days since first credit” shows how long the person has been borrowing money. Annual income is a measure of how much the borrower makes. Issue Date is a proxy for macroeconomic conditions at the time. Installment and Interest rate are characteristics of the loan.
Using the predictions from the CatBoost model, we tried to optimize the portfolio by leveraging the exploratory data analysis work we performed. We removed loans with characteristics that were associated with higher default rates than those of their peers. This led us to make the following exclusions:
No...
- Small business loans
- Loans where employment length was N/A
- Borrowers who were renters
- Nevada residents
- Annual incomes less than $42,000
Removing loans from these borrowers allowed us to increase our total return from 12.44% to 12.65%, a gain of 21 basis points.
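The exclusion rules above can be expressed as a single filter applied to the loans the model selects. This is a minimal sketch: the field names, category codes, and toy loans are assumptions for illustration, and the basis-point helper simply restates the arithmetic in the text (1 percentage point = 100 basis points).

```python
def passes_exclusions(loan):
    """Return True if a loan survives all five exclusion rules."""
    return (
        loan["purpose"] != "small_business"   # no small business loans
        and loan["emp_length"] is not None    # no missing employment length
        and loan["home_ownership"] != "RENT"  # no renters
        and loan["addr_state"] != "NV"        # no Nevada residents
        and loan["annual_inc"] >= 42_000      # no incomes below $42,000
    )


def basis_points(before_pct, after_pct):
    """Difference between two percentages, expressed in basis points."""
    return round((after_pct - before_pct) * 100)


# Made-up candidate loans, e.g. those the model predicted would be fully paid.
candidates = [
    {"id": 1, "purpose": "debt_consolidation", "emp_length": "5 years",
     "home_ownership": "MORTGAGE", "addr_state": "NY", "annual_inc": 70_000},
    {"id": 2, "purpose": "small_business", "emp_length": "3 years",
     "home_ownership": "OWN", "addr_state": "CA", "annual_inc": 95_000},
    {"id": 3, "purpose": "credit_card", "emp_length": None,
     "home_ownership": "RENT", "addr_state": "NV", "annual_inc": 38_000},
]

portfolio = [loan for loan in candidates if passes_exclusions(loan)]
print([loan["id"] for loan in portfolio])
print(basis_points(12.44, 12.65))
```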
Results & Conclusion
While we were satisfied that we had a model that significantly outperformed the null model, it was important to look at the bigger picture and to compare LendingClub overall against traditional investment options.
When viewed against other options such as risk-free treasuries, AAA corporate bonds, or the S&P 500, it became clear that LendingClub loans overall did not offer enough return given the high risks of default and prepayment. Investors could achieve the LendingClub idealized return of 8.12% with far less risk by allocating their assets to a mix of equities, real estate, Treasury bonds, and corporate bonds.
In conclusion, our model is able to improve returns if one were to use LendingClub as an investment option. However, we would recommend less risky investments that would generate similar rates of return.
Supporting code for this article can be found on GitHub.