Peer to Peer Lending Data as an Asset Class for Investments
Github Repo | LinkedIn
The skills the author demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Introduction and Objective
Peer to Peer (P2P) lending is a relatively new business innovation that seeks to empower borrowers and lenders by competing in the lending market with traditional players like banks. P2P lenders such as Lending Club (LC), which has raised close to US$ 400 million to date, including US$ 125 million from Google, act as intermediaries between borrowers and private lenders. LC offers information and guidance to lenders and competitive rates to borrowers without assuming the risk of the loans itself. In this project, I analyzed the public Lending Club data on loans issued from 2007 to 2019, worth over US$ 33 billion.
The objective of the project is to help determine if this new asset class could be of interest to an investment fund.
To achieve this objective, I will use tools from data science and finance to build the highest-performing loan portfolios possible, and then determine whether the returns from those portfolios are adequate for our fund when compared to other asset classes.
Optimizing our portfolios required building two models: one to predict defaults and another to project the internal rate of return (IRR) of each loan. Several other steps, such as train/test splitting and feature engineering, are involved in the process to increase the robustness of the models and improve their accuracy.
Understanding the Data - Pt. 1
Our first objective is to gain an understanding of the data we have at hand in order to build robust and accurate models that align with our business objectives.
First, we will take a look at the value of loans issued:

We can see the exponential growth in loan issuance from 2009 to 2015, after which it stabilizes with more modest growth in the following years. Still, the yearly loan issuance in 2018 is high enough to accommodate modest-sized portfolios for investment funds.
Next, we will explore the target variable of our first model: whether the loan is fully paid or charged off (defaulted).

As expected, we are dealing with an imbalanced dataset, since one class represents only 20% of observations. In this case the null model, always guessing the fully paid class, would have an 80% accuracy. We have to work to improve on this null model. We can use metrics other than accuracy, such as precision, recall (sensitivity), F1, and the area under the receiver operating characteristic curve (ROC AUC), to get a better diagnostic of how well our model is working. In any case, our final performance metric should be one that measures the profit generated by the investment fund.
We might also want to do under-sampling or over-sampling, which either reduces the number of samples from the over-represented class or increases the number from the under-represented class, as sketched below. Some algorithms can also account for imbalanced classes directly.
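To illustrate the under-sampling option, here is a minimal sketch on a toy DataFrame; the `loans` frame and `charged_off` column are placeholders for illustration, not names taken from the repo.

```python
import pandas as pd

# Toy 80/20 dataset standing in for the real loans table.
loans = pd.DataFrame({"charged_off": [0] * 800 + [1] * 200})

majority = loans[loans["charged_off"] == 0]
minority = loans[loans["charged_off"] == 1]

# Random under-sampling: shrink the majority class to the minority's size.
balanced = pd.concat([majority.sample(len(minority), random_state=0), minority])
print(balanced["charged_off"].value_counts(normalize=True))  # now 50/50
```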
Understanding the Data - Pt. 2

The previous graph shows both the average loan value and the share of the borrower's income it represents. This is worth noting because an important factor in the growth of loan issuance over time was the increase in both loan size and the share of income it represented.
This is something we should be careful about as investors because future conditions might continue to change. While we can pick which loans to invest in, we cannot change the available pool of loans.
Next, we will take a look at the share of loans by purpose:

We can see that the two main purposes, which represented 78% of all loans in 2018, are debt consolidation and credit card. Most people use Lending Club loans to reduce their interest payments on other loans. These types of loans tend to be safer, as they are typically taken by borrowers with a more responsible attitude toward their debt obligations.

We can also see the geographical distribution of loans by state in 2018 and corroborate that they are being issued all across the U.S., with a particular focus on the most populated states such as California, Texas, Florida, and New York.
Understanding the Data - Pt. 3
The final graph we will analyze shows how the interest rates, which are set by Lending Club, are distributed over time.

We can see an interesting evolution. As time progressed, Lending Club was able to better segment the interest rates it offers based on the risk of its customers. In the beginning, interest rates tended to be clustered; over time they spread out as LC included higher-risk customers, and additional modes appeared that might represent clusters of similar customers.
Now that we have a good understanding of the types of loans available and how they have evolved, we will move on to creating our first model that will predict the probability of default classified as "Charged Off."
Default Prediction Model - Pt. 1
Using the data dictionary of available features and their descriptions, I reviewed each feature to determine which ones are only obtained after loan origination, so they could be excluded. We should only use information available at loan origination in order to avoid data leakage, which would produce a model that overestimates its predictive power.
Our analysis should only include observations where the target variable, loan_status, has a definitive outcome that can be used to train our models. In this case, loan_status must be either "Fully Paid" or "Charged Off"; observations with any other status will be dropped.
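A minimal sketch of this filtering step; the file name is hypothetical, and the `charged_off` target column is an illustrative name.

```python
import pandas as pd

# Hypothetical path to the Lending Club loan-level CSV.
loans = pd.read_csv("lending_club_loans.csv", low_memory=False)

# Keep only loans with a definitive outcome.
loans = loans[loans["loan_status"].isin(["Fully Paid", "Charged Off"])].copy()

# Binary target: 1 = Charged Off (default), 0 = Fully Paid.
loans["charged_off"] = (loans["loan_status"] == "Charged Off").astype(int)
```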
The next table shows the number of missing values of our features:
Joint-application data, which applies to borrowers who apply together with another person, is mostly missing simply because most loans have only a single applicant.
Next, we will explore the relationship between a missing emp_length value and the default rate to see if there is any correlation, which could indicate that the feature is "missing not at random."

It seems clear that there is a relationship between having a missing value in the emp_length feature and the default rate of the loan, which suggests that it is "missing not at random." In other words, applicants are probably not disclosing this information when they are unemployed or underemployed.
Since we will be using models that can deal with missing values and treat them as their own class, we will not impute the missing values. Still, it is good to know that this effect is probably present.
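Continuing the earlier sketch, this is one simple way such a comparison could be computed: group by whether emp_length is missing and compare the mean of the binary target.

```python
# Default rate when emp_length is missing vs. when it is reported.
missing_mask = loans["emp_length"].isna()

default_rate = loans.groupby(missing_mask)["charged_off"].mean()
print(default_rate)
# index True  -> default rate when emp_length is missing
# index False -> default rate when emp_length is reported
```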
Default Prediction Model - Pt. 2
I will check whether the same effect appears for the other features with missing values, bankruptcies and dti.
There are some differences, but they are not major. In this case the effect seems to be the opposite: when the value is missing, the chance of default goes down. The sample size is much smaller than for employment length, however, so I cannot draw a firm conclusion on this matter.
After this step I prepared the data to be fed into the model by separating the target variable and creating dummy variables where necessary. I chose CatBoost because my research indicates it performs very well in both accuracy and speed. This type of model does not need much pre-processing of the data; for example, categorical features only have to be specified by index or name and do not require label encoding.
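A sketch of this preparation step, continuing the earlier ones. The feature list here is illustrative (the real model uses many more columns, and int_rate is assumed to be numeric), and missing categorical values are filled with an explicit "missing" category, which CatBoost requires.

```python
from catboost import Pool

# Illustrative subset of LC features; the full model uses many more.
features = ["loan_amnt", "int_rate", "grade", "sub_grade", "purpose",
            "emp_length", "dti", "annual_inc"]
X = loans[features].copy()
y = loans["charged_off"]

# CatBoost handles categoricals natively (no label/one-hot encoding needed),
# but missing categorical values must be an explicit category.
cat_features = [c for c in features if X[c].dtype == "object"]
X[cat_features] = X[cat_features].fillna("missing")

train_pool = Pool(X, y, cat_features=cat_features)
```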
Training the Model
CatBoost is an algorithm for gradient boosting on decision trees, a sequential method in which each tree is trained on the errors of the previous ones. It was developed by Yandex and released in 2017. Since its inception it has produced very good results and offers several useful tools, such as training-time estimation, GPU training, and built-in cross-validation.
I took a two-step approach to training this model. First, I took a subsample of 10% of the total dataset to cross-validate the hyperparameters of the model. This was intended to save time in the full training.
In general, CatBoost is known for having particularly good out-of-the-box results, meaning it does not require much hyperparameter tuning. However, I still wanted to test the best values for tree depth and L2 leaf regularization.
Then I trained the model on the full dataset using the selected hyperparameters and enabled the option to auto-balance the classes so that equal importance is given to the minority class. A sketch of this two-step procedure follows.
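A minimal sketch, continuing the earlier ones; the grid values and random seeds are assumptions for illustration rather than the parameters actually used in the project.

```python
from catboost import CatBoostClassifier

# Step 1: tune depth and L2 leaf regularization on a 10% subsample to save time.
sample_idx = X.sample(frac=0.10, random_state=42).index
tuner = CatBoostClassifier(cat_features=cat_features, eval_metric="AUC",
                           verbose=False)
grid = {"depth": [4, 6, 8], "l2_leaf_reg": [1, 3, 5, 9]}  # illustrative grid
best = tuner.grid_search(grid, X.loc[sample_idx], y.loc[sample_idx],
                         cv=3, verbose=False)

# Step 2: train on the full dataset with the selected hyperparameters and
# automatic class balancing, so the minority (Charged Off) class gets equal weight.
clf = CatBoostClassifier(**best["params"],
                         cat_features=cat_features,
                         auto_class_weights="Balanced",
                         eval_metric="AUC",
                         verbose=False)
clf.fit(X, y)
```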
Model Results
The model returned the following feature importances:

We can see that grade, sub grade, and interest rate are the most important features in determining the default probabilities. There is an important caveat to keep in mind: since all of these features are assigned by LC, if LC were to change its standards for them our model would most likely lose predictive power.
We also focused on optimizing the ROC area under the curve because we know that we need different decision thresholds to optimize our profit function (IRR), which we will implement later.

Because we used the auto-balance parameter in the model, we obtained similar recall for both classes: 0.68 for Charged Off and 0.66 for Fully Paid. This is important because we want to give equal weight to predicting defaults, which carry a high cost.
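For reference, a sketch of how these metrics could be computed; `X_test` and `y_test` are assumed to be a held-out split that was not used during training, and the exact numbers reported above are not reproduced here.

```python
from sklearn.metrics import classification_report, roc_auc_score

proba = clf.predict_proba(X_test)[:, 1]   # predicted P(Charged Off)
preds = clf.predict(X_test)

print("ROC AUC:", roc_auc_score(y_test, proba))
# Per-class precision/recall/F1; class 0 = Fully Paid, class 1 = Charged Off.
print(classification_report(y_test, preds,
                            target_names=["Fully Paid", "Charged Off"]))
```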
Backtesting Spread Thresholds - Pt. 1
Using the default probability from the previous model, I created a new feature called spread, defined as the difference between the interest rate and the estimated default probability. In theory, the higher the spread, the more lucrative the loan is on a risk-adjusted basis.
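A minimal sketch of this feature, continuing the earlier ones; it assumes int_rate is stored in percent, and in practice out-of-fold or test-set predictions would be preferable to avoid leakage.

```python
# Spread = nominal interest rate minus estimated default probability,
# both expressed as fractions.
loans["pred_default_proba"] = clf.predict_proba(X)[:, 1]   # ideally out-of-fold
loans["spread"] = loans["int_rate"] / 100 - loans["pred_default_proba"]
```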
With this in mind, I wanted to create loan portfolios with different spread thresholds, increasing the minimum spread required for a loan to be included. For each of these portfolios I calculated the size and the internal rate of return (IRR). One would expect that as the threshold increases, the IRR would increase and the portfolio size would decrease, since fewer loans qualify.
To do so, I had to recreate the cash flows of every loan using the installment, issue date, last payment date, and total amount paid. I assumed a negative cash flow of the loan amount on the issue date, a positive flow of one installment in each of the following months, and, in the month of the last payment, the difference between the total amount paid and the sum of the previously placed installments.
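A sketch of this reconstruction, continuing the earlier ones. It assumes the LC date columns are in "Mon-YYYY" format and uses numpy_financial for the IRR; the portfolio figure printed at the end is the mean of per-loan IRRs, a simplification of the portfolio IRR rather than the aggregation actually used in the project, and the threshold grid is illustrative.

```python
import numpy as np
import pandas as pd
import numpy_financial as npf

# Parse LC dates such as "Dec-2018".
for col in ["issue_d", "last_pymnt_d"]:
    loans[col] = pd.to_datetime(loans[col], format="%b-%Y")

def loan_cashflows(row):
    """Monthly cash flows: -loan amount at issue, installments afterwards,
    and the remainder of the total amount paid in the final month."""
    n_months = (row["last_pymnt_d"].to_period("M")
                - row["issue_d"].to_period("M")).n
    flows = [-row["loan_amnt"]]
    flows += [row["installment"]] * max(n_months - 1, 0)
    flows.append(row["total_pymnt"] - row["installment"] * max(n_months - 1, 0))
    return flows

def loan_irr(row):
    monthly = npf.irr(loan_cashflows(row))
    return (1 + monthly) ** 12 - 1     # annualize the monthly rate

loans["irr"] = loans.apply(loan_irr, axis=1)

# Portfolios at increasing spread thresholds.
for thr in np.arange(0.00, 0.13, 0.01):
    port = loans[loans["spread"] >= thr]
    print(f"spread >= {thr:.2f}: size = {port['loan_amnt'].sum():,.0f}, "
          f"approx. IRR = {port['irr'].mean():.2%}")
```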
Backtesting Spread Thresholds - Pt. 2
The following graph shows the resulting portfolio size and IRR for increasing spread thresholds.

Indeed, as the threshold increases, the IRR also increases at first. Yet other factors, probably related to prepayments, eventually lower the IRR. The maximum IRR we could get with this method is 6.69%, while a randomly selected portfolio gave a 2.73% return.
What I realized is that spread is an important factor in predicting IRR, but it is not the only one. For this reason, I decided to build a second model that predicts each loan's IRR directly and to create portfolios based on those predictions.
Loan IRR Model
For this model I build on the same features as the previous one, adding the spread and the predicted default probability. The target in this case is the IRR, so the model performs a regression instead of a classification. I will again use CatBoost due to the advantages mentioned previously, although experimenting with other models could provide improvements and is left as future work.
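A sketch of this regression step, reusing the frames built in the earlier sketches; in practice the model would be fit and evaluated on separate train/test splits rather than scored on its own training data.

```python
from catboost import CatBoostRegressor

# Same features as the default model, plus the spread and the
# predicted default probability; the target is now each loan's realized IRR.
X_irr = X.assign(spread=loans["spread"],
                 pred_default_proba=loans["pred_default_proba"])

mask = loans["irr"].notna()                  # drop loans whose IRR could not be computed
reg = CatBoostRegressor(loss_function="RMSE",
                        cat_features=cat_features, verbose=False)
reg.fit(X_irr[mask], loans.loc[mask, "irr"])

loans.loc[mask, "pred_irr"] = reg.predict(X_irr[mask])
```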
The following are the feature importances for the model.

We can see that the spread, interest rate, and probability of being fully paid are the most important features, so building the previous model and creating the spread feature turned out to be key for this one.
Backtesting Results
I will now create loan portfolios using different percentile thresholds of predicted IRR: for example, a portfolio composed of the top 1% of loans with the highest predicted IRR, for which I then calculate the portfolio IRR and size, as in the sketch below. The graph that follows shows the results.
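A minimal sketch of the selection step, continuing the earlier ones; the percentile grid is illustrative, and the mean of per-loan IRRs again stands in for the portfolio IRR as a simplification.

```python
# Select the top x% of loans by predicted IRR and report size and realized IRR.
for pct in [0.5, 1.0, 1.5, 2.0, 5.0, 10.0]:
    cutoff = loans["pred_irr"].quantile(1 - pct / 100)
    port = loans[loans["pred_irr"] >= cutoff]
    print(f"top {pct:.1f}%: size = {port['loan_amnt'].sum():,.0f}, "
          f"approx. IRR = {port['irr'].mean():.2%}")
```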
We can also visualize the top results with the following table.

Conclusions
The previous table can be used by the investment firm to decide whether to allocate its funds based on an IRR objective and/or a budget constraint. For example, using the loans at the 1.5% percentile threshold, we could allocate approximately US$ 119.04 million of loans in 2018 and obtain an IRR of 13.82%.
At this IRR, this investment asset, when properly allocated into the right loans, can be a very attractive alternative. The S&P 500 has had an average return of around 10.7% over the last 30 years and 13.9% over the last 10. An investment firm could get similar or higher returns on a diversified portfolio of Lending Club loans, which, depending on the risk profile of the fund, might be more than adequate.
Having said this, I would be wary of our models' dependence on subjective features assigned by Lending Club, such as grade, subgrade, and interest rate. As mentioned before, if LC were to change its standards for these ratings, our predictions could become inaccurate, and we could have no warning of the change.
Future Work
One thing to keep in mind for future work is that the train/test split of the data was done with a random draw, so although the models never saw the test loans during training, they did use future loans to predict past ones. I don't think this materially affected the results, but it would be better to train on past loans and test on future ones to better represent reality.
We could further improve the robustness of our models by using rolling time-window train/test splits and graphing the results to check that they stay similar over time, as sketched below.
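A rough sketch of what such a rolling split could look like; the year boundaries are assumptions for illustration.

```python
# Train on loans issued up to a cutoff year, test on the following year,
# and track how model and portfolio performance drift over time.
for cutoff_year in range(2012, 2018):
    train = loans[loans["issue_d"].dt.year <= cutoff_year]
    test = loans[loans["issue_d"].dt.year == cutoff_year + 1]
    # ...fit the default and IRR models on `train`, build portfolios on `test`...
    print(cutoff_year, len(train), len(test))
```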
Finally, another area of future work would be to test other models, such as neural networks that could further improve the performance of our portfolios.