Advice for LendingClub Investors
Github Repository
Introduction
LendingClub is a fintech company that provides people with an alternative to taking out loans from banks. It is a peer-to-peer lending company where borrowers take out unsecured personal loans that are funded by investors. Investors make money from the interest on the loans (and from other fees, like late fees, if they come up), and LendingClub makes money by charging the borrowers an origination fee and the investors a service fee.
Investors are able to look through loan listings on LendingClub's website and select loans that they want to invest in. They are provided with a range of information about the borrower and the loan, so they can decide which loans to invest in based on this information.
The goal of an investor is to make as much profit as they can on the money they lend to the borrower. In other words, the goal is to yield the highest profit percentage possible. The problem is that the loans with the highest interest rates, which should yield the greatest profit percentage, often don't, because the loan gets charged off (it ends early because the borrower stops paying). Therefore, one shouldn't necessarily choose the loans with the highest interest rates, because there is a great risk of profiting less than expected, not profiting at all, or even losing money.
The objective of my project is to analyze LendingClub data and present findings that will help investors choose loans with the highest likelihood of optimal profit percentage. The other objective is to create a reliable model for investors that can predict whether a loan will be highly profitable.
LendingClub Dataset and Modeling Goal
LendingClub has a publicly available dataset with 151 features concerning over a million loans from 2007-2018. These features include information from the borrower's credit reports and loan application, scores and grades based on the information, records after the loan takes place, and more. Before a loan becomes active, potential investors are provided with the information that is available before loan origination.
My objective was to build a model, using the data from before loan origination, that can predict whether an investment will yield a high profit percentage. I would then analyze the model and figure out what an investor should look for in a borrower/loan when choosing what to invest in.
Data Cleaning
The data cleaning decisions described below were informed by considerable research into the LendingClub data.
I filtered out loans that weren't yet completed, because the goal of the modeling was to predict the end result of loans. This left me with loans of two statuses: "Fully Paid" and "Charged Off" (a loan for which there is no longer a reasonable expectation of further payments, so it ends). I then kept only loans that were entirely funded by investors, so that this would be a consistent factor in the data. This removed a very small percentage of loans.
I then filtered out loans that had a joint application, i.e., two borrowers. The goal of the model was to find the characteristics of an ideal borrower/loan, so loans with two borrowers would make the modeling inconsistent and less accurate. These were a small percentage of the loans. Lastly, I filtered out loans from before 2016, because these loans didn't have as many features as the rest of the loans. This left me with about half a million loans.
I removed features that didn't pertain to the majority of the data and features that were available only after loan origination (apart from what was needed for the target feature). I also removed features with a significant amount of data missing, seemingly at random. These features weren't essential anyway, because they were subsumed in the FICO score or grade features. The feature "employment length" had a significant number of null values, which I imputed with 0 because they appeared to be null due to unemployment. After all this, I removed loans that still had null values, which were a tiny percentage of the loans.
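The row filters described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the field names (`loan_status`, `loan_amnt`, `funded_amnt_inv`, `application_type`, `issue_year`, `emp_length`) and the sample records are hypothetical, and the project itself may have used different names and tools.

```python
# Hypothetical sample records; field names and values are assumptions for illustration.
loans = [
    {"loan_status": "Fully Paid", "loan_amnt": 10000, "funded_amnt_inv": 10000,
     "application_type": "Individual", "issue_year": 2016, "emp_length": 3},
    {"loan_status": "Current", "loan_amnt": 12000, "funded_amnt_inv": 12000,
     "application_type": "Individual", "issue_year": 2017, "emp_length": 5},
    {"loan_status": "Charged Off", "loan_amnt": 8000, "funded_amnt_inv": 8000,
     "application_type": "Individual", "issue_year": 2017, "emp_length": None},
    {"loan_status": "Fully Paid", "loan_amnt": 5000, "funded_amnt_inv": 4500,
     "application_type": "Individual", "issue_year": 2016, "emp_length": 1},
    {"loan_status": "Fully Paid", "loan_amnt": 9000, "funded_amnt_inv": 9000,
     "application_type": "Joint App", "issue_year": 2018, "emp_length": 10},
    {"loan_status": "Fully Paid", "loan_amnt": 7000, "funded_amnt_inv": 7000,
     "application_type": "Individual", "issue_year": 2014, "emp_length": 2},
]

COMPLETED = {"Fully Paid", "Charged Off"}


def clean(records):
    """Apply the row filters from the text and impute missing employment length."""
    kept = []
    for loan in records:
        if loan["loan_status"] not in COMPLETED:
            continue  # loan has not finished yet
        if loan["funded_amnt_inv"] != loan["loan_amnt"]:
            continue  # not entirely funded by investors
        if loan["application_type"] != "Individual":
            continue  # joint application
        if loan["issue_year"] < 2016:
            continue  # older loans have fewer features
        row = dict(loan)  # copy so the input is left untouched
        if row["emp_length"] is None:
            row["emp_length"] = 0  # null treated as unemployment, per the text
        kept.append(row)
    return kept


cleaned = clean(loans)
print(len(cleaned), [row["emp_length"] for row in cleaned])
```

The same conditions translate directly into boolean masks if the data is held in a dataframe.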
Feature Engineering
Each borrower had a FICO score range. Instead of keeping two features, the lower end of the range and the higher end of the range, I created one feature of the average FICO score. I then created a feature of the investor's profit percentage. The formula for profit percentage was the total amount of payment received by the investor, minus the loan amount, divided by the amount funded by the investor, all multiplied by 100. I rounded this number to the nearest whole percent.
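As a small worked example of the two engineered features, here is a sketch of the formulas as described. The function and parameter names are hypothetical, and the dollar amounts are made up for illustration.

```python
def average_fico(fico_low, fico_high):
    """Midpoint of the borrower's FICO score range."""
    return (fico_low + fico_high) / 2


def profit_percentage(total_payment, loan_amount, funded_amount):
    """(total payment - loan amount) / amount funded * 100, rounded to a whole percent."""
    return round((total_payment - loan_amount) / funded_amount * 100)


# A fully paid loan: 11,500 received on a 10,000 loan.
paid = profit_percentage(11500, 10000, 10000)

# A charged-off loan: only 6,000 received on a 10,000 loan.
charged_off = profit_percentage(6000, 10000, 10000)

print(average_fico(700, 704), paid, charged_off)
```

Because only fully funded loans were kept, the loan amount and the funded amount are the same number in practice.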
The target feature that I engineered was the profit percentage range. This divided the data into groups of different ranges of profit percentage.
I encoded the categorical features into numbers so that the machine learning models could use them.
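The text does not say which encoding was used, so the following is only one plausible option: an ordinal encoding for the sub-grade, which runs from A1 (best) to G5. The function name `encode_sub_grade` is hypothetical.

```python
def encode_sub_grade(sub_grade):
    """Map a sub-grade such as 'D2' to an integer rank: A1 -> 0, ..., G5 -> 34."""
    if len(sub_grade) != 2:
        raise ValueError(f"unexpected sub-grade: {sub_grade!r}")
    letter, digit = sub_grade[0], sub_grade[1]
    if letter not in "ABCDEFG" or digit not in "12345":
        raise ValueError(f"unexpected sub-grade: {sub_grade!r}")
    return (ord(letter) - ord("A")) * 5 + (int(digit) - 1)


codes = {grade: encode_sub_grade(grade) for grade in ["A1", "C5", "D1", "D2", "G5"]}
print(codes)
```

An ordinal code preserves the ordering of the grades, which a one-hot encoding would not. Whether that matters depends on the model.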
In the end, there were 52 features to use for modeling.
Machine Learning and Results
I split the data by various combinations of profit percentage ranges, and in the end the split that was able to be effectively predicted through machine learning was the split of the ~5% of loans with highest profit percentage against the rest of the data. This meant that there was something unique about the top 5% of loans, and information about what made these loans different from the rest of the data would be greatly beneficial for investors.
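To illustrate the kind of split described above, here is a minimal sketch that labels roughly the top 5% of loans by profit percentage. The exact cutoff used in the project is not stated, so the threshold rule, the function name `top_fraction_labels`, and the sample values are all assumptions.

```python
def top_fraction_labels(profits, fraction=0.05):
    """Label loans at or above the cutoff for the top `fraction` of profit percentages.

    Ties at the cutoff are all labeled 1, so slightly more than `fraction`
    of the loans can end up in the positive class.
    """
    n_top = max(1, round(len(profits) * fraction))
    cutoff = sorted(profits, reverse=True)[n_top - 1]
    labels = [1 if p >= cutoff else 0 for p in profits]
    return labels, cutoff


# Twenty made-up profit percentages from -10 to 9.
profits = list(range(-10, 10))
labels, cutoff = top_fraction_labels(profits)
print(cutoff, sum(labels))
```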
The model that predicted best was imbalanced-learn's balanced bagging classifier. The score on the training data was 83%, and 82% on the testing data, which are good accuracy measures and indicate no overfitting issue.
On the testing data, the sensitivity was 93% and the specificity was 81%. This means that if a loan is not in the top 5%, the model is very likely to figure that out (81% of the time), and if it is in the top 5%, the model is even more likely to figure that out (93% of the time).
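With only about 5% of loans in the positive class, an ordinary classifier can be swamped by the majority class. The general idea behind balanced bagging is to train each member of the ensemble on a class-balanced sample. The sketch below shows only that resampling step, in plain Python. It is not imbalanced-learn's implementation, and the function name `balanced_bag_indices` is hypothetical.

```python
import random


def balanced_bag_indices(labels, rng):
    """Return row indices for one bag: every minority row plus an
    equally sized random sample of majority rows (without replacement).

    Assumes class 1 is the minority and class 0 has at least as many rows.
    """
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    sampled_majority = rng.sample(majority, k=len(minority))
    return minority + sampled_majority


# 100 made-up loans, 5 of which are in the top 5%.
labels = [1] * 5 + [0] * 95
rng = random.Random(0)
bags = [balanced_bag_indices(labels, rng) for _ in range(3)]

for bag in bags:
    positives = sum(labels[i] for i in bag)
    print(len(bag), positives)
```

Each bag is half positive and half negative, and different bags see different majority rows, so the ensemble as a whole still uses much of the data.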
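As a sanity check on how these figures fit together, the arithmetic below combines the reported sensitivity and specificity with an assumed positive rate of exactly 5% on the testing data. That 5% figure is an approximation taken from the "~5%" split above, so the outputs are rough estimates rather than reported results, and the function name `implied_rates` is hypothetical.

```python
def implied_rates(sensitivity, specificity, prevalence):
    """Derive overall accuracy and precision from per-class rates."""
    true_pos = sensitivity * prevalence               # top-5% loans that are flagged
    true_neg = specificity * (1 - prevalence)         # other loans correctly rejected
    false_pos = (1 - specificity) * (1 - prevalence)  # other loans wrongly flagged
    accuracy = true_pos + true_neg
    precision = true_pos / (true_pos + false_pos)
    return accuracy, precision


accuracy, precision = implied_rates(sensitivity=0.93, specificity=0.81, prevalence=0.05)
print(round(accuracy, 3), round(precision, 3))
```

Under that assumption the implied accuracy is about 0.82, which agrees with the reported test score. The implied precision is about 0.20, meaning that most flagged loans would not actually be in the top 5%, since the positive class is so small.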
LendingClub Analysis - Pt. 1
Here are the feature importance measures of the model:

We see here that the sub-grade is by far the most important feature in determining whether the loan is in the top 5%. The interest rate and length of loan are also important, and the rest of the features have very little importance. An analysis of the top 3 features would give us the best insight into what an investor should look for in a borrower/loan in order to yield the highest profit percentage possible.

As seen in the bar chart, D2, C5, D1, D3, and D4, in descending order, are the most common sub-grades in the top 5% of loans. We can see that these sub-grades are not nearly as common in the remaining loans.
Investors should look out for these sub-grades, in that order, when choosing what to invest in.

As seen in the red box-plot, which represents the top 5% of loans, the densest range of interest rates, where 25% of the loans fall, is 16.59%-18.99%, and the second densest range, where another 25% of the loans fall, is 18.99%-22.45%. As seen in the blue box-plot, which represents the remaining loans, the densest range of interest rates lies below these numbers, the interquartile range being 9.44%-15.31%.
LendingClub Analysis - Pt. 2
An important caveat is that interest rates are directly dependent on sub-grade. Not all of the loans with the same sub-grade in the dataset have the same interest rate, because LendingClub recalculates and updates the interest rate for each sub-grade a few times a year. Because the interest rate is dependent on the sub-grade, once a person chooses to look for a loan with a certain sub-grade, they won't have a choice of interest rates.
From analyzing the box-plots, we see that investors should ideally look for interest rates between 16.59% and 18.99%, secondly between 18.99% and 22.45%, and thirdly as close to 22.45% as possible. So, I'd conclude that if one of the ideal sub-grades mentioned earlier has an interest rate in these ranges, that should be the sub-grade an investor looks for when choosing a loan. But as noted earlier, choosing a good sub-grade is the priority, because it is by far the most important feature in the model.

As seen in the bar chart, the top 5% have a far greater proportion of 60-month loans than the remaining loans do. The significance of this for an investor is that 60-month loans are more strongly associated with a higher profit percentage than 36-month loans are. Therefore, it may be to their benefit to choose a longer loan of 60 months.
LendingClub Risk Analysis
There can be risk in relying on a model that is not 100% accurate.

As seen in the box-plots, the false positives, i.e., loans falsely predicted as being in the top 5%, have a higher likelihood of a low profit percentage than loans predicted as being in the lower 95%. The mean profit percentage for false positives is -18.5%, and the mean for predicted negatives is -1.8%. So, there is some risk in relying on the 82% accurate model.
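A comparison like this one can be computed by grouping loans on their actual and predicted labels and averaging the profit percentage within each group. The sketch below uses made-up records and a hypothetical function name, `mean_profit_by_group`. It does not reproduce the figures above.

```python
from statistics import mean


def mean_profit_by_group(records):
    """Average profit percentage for false positives, true positives,
    and everything the model predicted as negative.

    Each record is (actual_label, predicted_label, profit_percentage).
    """
    groups = {}
    for actual, predicted, profit in records:
        if predicted == 1 and actual == 0:
            key = "false positive"
        elif predicted == 1:
            key = "true positive"
        else:
            key = "predicted negative"
        groups.setdefault(key, []).append(profit)
    return {key: mean(values) for key, values in groups.items()}


# Made-up (actual, predicted, profit %) triples for illustration only.
records = [
    (0, 1, -30), (0, 1, -10),          # flagged, but not in the top 5%
    (1, 1, 25),                        # flagged, and in the top 5%
    (0, 0, 5), (0, 0, -9), (1, 0, 22)  # not flagged
]
means = mean_profit_by_group(records)
print(means)
```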
Conclusion
In conclusion, we have a model that is effective at identifying the top 5% of loans: if a loan is in the top 5%, the model is very likely to flag it (93% sensitivity). As the risk analysis shows, however, some of the loans it flags are false positives, so a positive prediction is not a guarantee.
As for advice for investors, for the highest likelihood of yielding a high profit percentage, the sub-grade should be the main thing that the investor looks at. They should look for sub-grades of D2, C5, D1, D3, and D4, in that order. The next best category to look at is interest rate, and they should look for interest rates of 16.59%-18.99% or 18.99%-22.45%, in that order, and above 22.45% if neither of those ranges is available for one of the ideal sub-grades. Finally, an investor should lean toward choosing a loan with a 60-month term.