Lending Peer to Peer Default Predictability for Lending Club
The skills I demoed here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Lending Club is a peer-to-peer lending company based in the United States, in which investors provide funds for potential borrowers and earn a profit depending on the risk they take (the borrower's credit score). Lending Club provides the platform that bridges investors and borrowers.
The dataset used is the complete loan data for all loans issued from 2007 through 2015, including the current loan status (Current, Late, Fully Paid, etc.) and the latest payment information. The dataset contains roughly 890,000 loan observations (rows) and roughly 75 features (columns). Additional features include credit scores, number of finance inquiries, address information (zip code and state), and collections, among others.
- Most of the loans issued were in the range of $10,000 - $20,000.
- 2015 was the year in which the most loans were issued.
- Loan issuance increased year over year, possibly due to the recovery of the U.S. economy.
- The amounts applied for by potential borrowers, the amounts issued to borrowers, and the amounts funded by investors are similarly distributed, meaning that qualified borrowers most likely receive the loan they applied for.
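The checks behind these observations can be sketched with pandas. This is a minimal illustration, not the original analysis: the column names (`loan_amnt`, `funded_amnt`, `funded_amnt_inv`, `issue_d`), the bin edges, and the tiny in-memory sample are all assumptions.

```python
import pandas as pd

# Hypothetical sample; the real analysis would load the full Lending Club file.
loans = pd.DataFrame({
    "loan_amnt":       [12000, 15000, 8000, 18000, 25000, 10000],
    "funded_amnt":     [12000, 15000, 8000, 18000, 24000, 10000],
    "funded_amnt_inv": [11950, 15000, 7900, 18000, 24000, 10000],
    "issue_d": ["Dec-2013", "Jan-2014", "Mar-2015", "Jun-2015", "Aug-2015", "Nov-2015"],
})

# Share of loans in each loan-amount bracket (left-closed bins, assumed edges).
amount_share = (
    pd.cut(loans["loan_amnt"], bins=[0, 10000, 20000, 40000],
           labels=["<10k", "10k-20k", ">=20k"], right=False)
    .value_counts(normalize=True)
    .sort_index()
)

# Number of loans issued per year.
loans["year"] = pd.to_datetime(loans["issue_d"], format="%b-%Y").dt.year
loans_per_year = loans["year"].value_counts().sort_index()

# Side-by-side summary of requested, issued, and investor-funded amounts.
amount_summary = loans[["loan_amnt", "funded_amnt", "funded_amnt_inv"]].describe()
```

Comparing the three columns of `amount_summary` is one simple way to see whether the requested, issued, and funded amounts follow similar distributions.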
Credit Scores vs. Loan Grades
Credit scores are important metrics for evaluating the overall level of risk. In this section we analyze the level of risk as a whole and how many loans were bad loans for each grade derived from the customer's credit score.
What we need to know:
i.) The lower the grade of the credit score, the higher the risk for investors.
ii.) Different factors influence the level of risk of a loan.
Determination of a Bad Loan
The primary factors that increase overall loan risk for investors are low annual income, a high debt-to-income ratio, high interest rates, and a low grade. Over the last year, most types of bad loans tended to decline, with the exception of late payments (which might indicate an economic recovery).
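One way to make the notion of a bad loan concrete is to map each loan status to a binary `loan_condition` label. The sketch below is an assumption-laden illustration: the exact set of statuses treated as bad is not listed in this post, so `BAD_STATUSES` and the sample rows are hypothetical.

```python
import pandas as pd

# Assumed set of statuses treated as "bad"; not taken from the original post.
BAD_STATUSES = {
    "Charged Off",
    "Default",
    "In Grace Period",
    "Late (16-30 days)",
    "Late (31-120 days)",
}

def loan_condition(status: str) -> str:
    """Return 'Bad Loan' if the status is in BAD_STATUSES, else 'Good Loan'."""
    return "Bad Loan" if status in BAD_STATUSES else "Good Loan"

# Hypothetical sample rows.
loans = pd.DataFrame({
    "loan_status": ["Current", "Fully Paid", "Charged Off", "Late (31-120 days)", "Current"],
    "year": [2014, 2014, 2014, 2015, 2015],
})
loans["loan_condition"] = loans["loan_status"].apply(loan_condition)

# Count of each bad status per year, to see which types rise or decline.
bad_by_year = (
    loans[loans["loan_condition"] == "Bad Loan"]
    .groupby(["year", "loan_status"])
    .size()
    .unstack(fill_value=0)
)
```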
Other notable points are:
- Among loans considered bad, borrowers whose home ownership value was Mortgage accounted for the highest amount borrowed.
- There is a slight increase in loan applicants who have mortgages.
- People who have a mortgage (depending on other factors as well within the mortgage) are more likely to ask for
Analysis by Region
The regional analysis of loans gives us an idea of geographic distribution (and not the risk level of loans based on differing categories). Below are the main points relating to regional analysis observed throughout the project.
- The West and SouthEast regions have the highest share of undesirable loan statuses, though only slightly higher than the NorthEast region.
- The West and SouthEast had a higher percentage in most of the Grade B "bad" loans.
- The NorthEast region had a higher percentage in the Grace Period and Does Not Meet Credit Policy loan statuses (neither is considered as bad as a default, for instance).
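A regional comparison like the one above can be sketched by mapping states to regions and building a row-normalized status table. The state-to-region mapping below is partial and assumed for illustration, as are the `addr_state` and `loan_status` column names and the sample rows.

```python
import pandas as pd

# Partial, assumed state-to-region mapping for illustration only.
STATE_TO_REGION = {
    "CA": "West", "WA": "West", "OR": "West",
    "FL": "SouthEast", "GA": "SouthEast", "NC": "SouthEast",
    "NY": "NorthEast", "MA": "NorthEast", "NJ": "NorthEast",
}

# Hypothetical sample rows.
loans = pd.DataFrame({
    "addr_state": ["CA", "CA", "FL", "NY", "NY", "GA", "WA", "MA"],
    "loan_status": ["Charged Off", "Current", "Default", "In Grace Period",
                    "Current", "Charged Off", "Current", "Fully Paid"],
})
loans["region"] = loans["addr_state"].map(STATE_TO_REGION)

# For each region, the percentage of its loans that fall in each status.
status_pct = pd.crosstab(loans["region"], loans["loan_status"], normalize="index") * 100
```

Normalizing by row keeps regions with very different loan volumes comparable, which matters when the differences between regions are only a few percentage points.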
Loans Issued by State
- California, Texas, New York and Florida are the states in which the highest amount of loans were issued. Not surprisingly, these four states also compose the largest share of the U.S. annual gross domestic product (GDP).
- Interestingly, all four states have average interest rates of about 13%, in line with the average interest rate across all states (13.24%).
- California, Texas and New York are all above the average annual income (Florida is the exception), which might help explain why most loans are issued in these states.
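The per-state figures above can be computed with a single grouped aggregation. This is a minimal sketch on made-up rows; the column names (`addr_state`, `loan_amnt`, `int_rate`, `annual_inc`) are assumptions, and the numbers are not the real dataset values.

```python
import pandas as pd

# Hypothetical sample rows.
loans = pd.DataFrame({
    "addr_state": ["CA", "CA", "TX", "NY", "FL", "OH"],
    "loan_amnt":  [15000, 20000, 12000, 18000, 10000, 8000],
    "int_rate":   [13.0, 12.5, 13.5, 13.2, 13.1, 14.0],
    "annual_inc": [90000, 85000, 80000, 95000, 60000, 55000],
})

# Total amount issued, average interest rate, and average income per state.
by_state = loans.groupby("addr_state").agg(
    total_issued=("loan_amnt", "sum"),
    avg_int_rate=("int_rate", "mean"),
    avg_income=("annual_inc", "mean"),
)

# Flag states whose average income exceeds the average over all loans.
overall_income = loans["annual_inc"].mean()
by_state["above_avg_income"] = by_state["avg_income"] > overall_income

top_states = by_state.sort_values("total_issued", ascending=False)
```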
Analysis by Income Category
Distinct income categories were created in order to detect important patterns and do an in-depth analysis of this segment. Some important points to consider are:
- Low income category: Borrowers that have an annual income less than or equal to $100,000.
- Medium income category: Borrowers that have an annual income higher than $100,000 but less than or equal to $200,000.
- High income category: Borrowers that have an annual income higher than $200,000.
- Borrowers in the high income category took out larger loans than people in the low and medium income categories. Of course, people with higher annual incomes are more likely to be able to repay larger loans. (First row to the left of the subplots)
- Loans borrowed by the Low income category had a slightly higher chance of becoming a bad loan. (First row to the right of the subplots)
- Borrowers with High and Medium annual incomes had a longer employment length than people with lower incomes. (Second row to the left of the subplots)
- Borrowers with a lower income had on average higher interest rates while people with a higher annual income had lower interest rates on their loans. (Second row to the right of the subplots)
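The three income categories defined above map directly onto right-closed bins, which `pandas.cut` produces by default. The sketch below assumes a column named `annual_inc` and uses made-up rows; note that with these bins an income of exactly 0 would fall outside every category.

```python
import pandas as pd

# Hypothetical sample rows, including the two boundary incomes.
loans = pd.DataFrame({
    "annual_inc": [45_000, 100_000, 100_001, 200_000, 250_000],
    "loan_amnt":  [8_000, 12_000, 20_000, 24_000, 35_000],
    "int_rate":   [14.5, 13.9, 12.8, 12.5, 11.9],
})

# Right-closed bins: (0, 100k] Low, (100k, 200k] Medium, (200k, inf) High.
loans["income_category"] = pd.cut(
    loans["annual_inc"],
    bins=[0, 100_000, 200_000, float("inf")],
    labels=["Low", "Medium", "High"],
)

# Average loan amount and interest rate per income category.
by_income = loans.groupby("income_category", observed=True)[["loan_amnt", "int_rate"]].mean()
```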
Good/Bad Loan Summary
- Bad Loans Count: Loans taken out for educational and small business purposes tend to have a higher risk of becoming bad loans (percentage-wise).
- Most frequent purpose: The most common reason clients applied for a loan was to consolidate debt.
- Least frequent purpose: Clients applied least often for educational purposes, across all three income categories.
- Interest Rates: For all application purposes except medical, small business and credit card, the low income category has a higher interest rate. One possible explanation is the amount of capital needed by the other income categories, which might explain why the low income category's interest rates for these purposes are lower.
- Bad/Good Ratio: For every purpose except educational, the bad/good ratio is lower the higher the income category. The spike for educational loans in the high income category occurs because only two such loans were issued and one was a bad loan, which pushed the ratio to 50%.
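The bad-loan share per purpose and income category can be sketched as a grouped mean over a boolean flag. Reporting the group size next to the share makes small-sample spikes, like the two-loan educational case, easy to spot. Column names, purpose values, and rows below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample rows.
loans = pd.DataFrame({
    "purpose": ["educational", "educational", "debt_consolidation",
                "debt_consolidation", "debt_consolidation", "debt_consolidation"],
    "income_category": ["High", "High", "Low", "Low", "Low", "High"],
    "loan_condition": ["Bad Loan", "Good Loan", "Bad Loan",
                       "Good Loan", "Good Loan", "Good Loan"],
})

# The mean of a boolean flag is the share of bad loans in each group.
loans["is_bad"] = loans["loan_condition"].eq("Bad Loan")
summary = (
    loans.groupby(["purpose", "income_category"])["is_bad"]
    .agg(n_loans="count", bad_share="mean")
    .reset_index()
)
```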
Data is oversampled using the SMOTE technique prior to performing the Logistic Regression.
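To illustrate the idea, here is a hand-written, simplified SMOTE-style oversampler in NumPy followed by a scikit-learn logistic regression. It is a sketch under stated assumptions, not the code used in this project: the helper `smote_oversample`, its parameters, and the synthetic two-feature data are all hypothetical. In practice the `SMOTE` class from the imbalanced-learn package is the usual choice, applied to the training split only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def smote_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic minority rows by interpolating between a
    minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Hypothetical imbalanced data: 92 good loans (0) and 8 bad loans (1).
rng = np.random.default_rng(42)
X_good = rng.normal(0.0, 1.0, size=(92, 2))
X_bad = rng.normal(2.0, 1.0, size=(8, 2))

# Add synthetic bad loans until both classes have the same size.
X_synth = smote_oversample(X_bad, n_new=len(X_good) - len(X_bad))
X_res = np.vstack([X_good, X_bad, X_synth])
y_res = np.array([0] * len(X_good) + [1] * (len(X_bad) + len(X_synth)))

model = LogisticRegression().fit(X_res, y_res)
```

Because each synthetic row lies on the segment between two real minority rows, the oversampled class stays inside the region the minority class already occupies.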
Feature Engineering & Neural Network
Some features are redundant (as shown at the beginning of this kernel in the distribution subplots) and have no effect on the "loan_condition" label, so we need to drop these features.
Use StratifiedShuffleSplit to keep approximately the same ratio of bad loans to good loans in both the training and testing data. Remember that over 92% of the loans are considered good loans, so it is important to have this same ratio across the training and testing sets.
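A minimal sketch of a stratified split with scikit-learn follows. The label vector and the placeholder feature matrix are synthetic, and the 80/20 split and `random_state` are assumptions rather than the settings used in this project.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical labels: 92% good loans (0), 8% bad loans (1).
y = np.array([0] * 920 + [1] * 80)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature matrix

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y))

# Share of bad loans in each split; both should be close to 8%.
train_bad_ratio = y[train_idx].mean()
test_bad_ratio = y[test_idx].mean()
```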
Scale numeric features and encode categorical features from our dataframe.
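Scaling and encoding can be combined in one preprocessing object. The sketch below uses scikit-learn's `ColumnTransformer`; the choice of `StandardScaler` and `OneHotEncoder`, the column names, and the sample rows are assumptions, since the post does not say which scaler or encoder was used.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical sample rows.
X = pd.DataFrame({
    "loan_amnt": [10000, 20000, 15000, 5000],
    "int_rate": [13.0, 9.5, 15.2, 11.1],
    "home_ownership": ["MORTGAGE", "RENT", "OWN", "RENT"],
})

numeric_cols = ["loan_amnt", "int_rate"]
categorical_cols = ["home_ownership"]

# Standardize numeric columns and one-hot encode categorical columns.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
X_prepared = preprocess.fit_transform(X)
```

Fitting the transformer on the training split and only calling `transform` on the test split keeps test-set statistics out of the scaling.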
Run our Neural Network, which consists of an input layer sized to the number of input features, 2 hidden layers (first: 15 nodes, second: 5 nodes), and an output layer with 2 outputs.
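The layer sizes described here can be illustrated with an untrained forward pass in NumPy. This is only a sketch of the architecture's shape: the number of inputs, the ReLU activations, the softmax output, and the helper names `init_layer` and `forward` are assumptions, and no training step is shown.

```python
import numpy as np

def init_layer(rng, n_in, n_out):
    """Small random weights and zero biases for one dense layer."""
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)

def forward(X, layers):
    """Forward pass: ReLU on hidden layers, softmax on the 2-unit output."""
    h = X
    for W, b in layers[:-1]:
        h = np.maximum(0.0, h @ W + b)
    W, b = layers[-1]
    logits = h @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_inputs = 20                 # assumed number of prepared features
sizes = [n_inputs, 15, 5, 2]  # inputs -> 15 nodes -> 5 nodes -> 2 outputs
layers = [init_layer(rng, a, b) for a, b in zip(sizes[:-1], sizes[1:])]

X_batch = rng.normal(size=(4, n_inputs))
probs = forward(X_batch, layers)  # shape (4, 2); each row sums to 1
```

The two output columns can be read as the predicted probabilities of a good loan and a bad loan once the weights have been trained.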