Default Predictability with Lending Club

Posted on May 28, 2020

Introduction

Lending Club is a peer-to-peer lending company based in the United States. Investors fund loans to borrowers and earn a return that depends on the risk they take on (reflected in the borrower's credit score). Lending Club provides the platform that connects investors and borrowers.

Data Summary

The dataset contains the complete loan data for all loans issued from 2007 through 2015, including the current loan status (Current, Late, Fully Paid, etc.) and the latest payment information. It has over 890,000 loan observations (rows) and over 75 features (columns). Additional features include credit scores, number of finance inquiries, address information (zip code and state), and collections, among others.

First Observations

  • Most of the loans issued were in the range of $10,000 - $20,000.
  • 2015 was the year in which the most loans were issued.
  • Loan issuance increased year over year, possibly due to the U.S. economic recovery.
  • The amounts applied for by borrowers, the amounts issued to them, and the amounts funded by investors are similarly distributed, which suggests that qualified borrowers most likely receive the loan amount they applied for.

 

Credit Scores vs. Loan Grades

Credit scores are important metrics for evaluating the overall level of risk. In this section we analyze the level of risk as a whole and how many loans were bad loans in each grade, where the grade is based on the customer's credit score.

What we need to know:

i.) The lower the grade assigned from the credit score, the higher the risk for investors.

ii.) Several different factors influence the level of risk of a loan.

Determination of a Bad Loan

The primary factors that increase overall loan risk for investors are low annual income, a high debt-to-income ratio, high interest rates, and a low grade. In the last year of the data, most types of bad loans tend to decline, except for late payments (which might indicate an economic recovery).

Other notable points are: 

  • Among the home ownership categories, Mortgage accounted for the highest amount borrowed within loans that were considered to be bad.
  • There is a slight increase in the number of people with mortgages who apply for a loan.
  • People who have a mortgage (depending on other factors as well within the mortgage) are more likely to ask for

Analysis by Region

The regional analysis of loans gives us an idea of geographic distribution (and not the risk level of loans based on differing categories). Below are the main points relating to regional analysis observed throughout the project. 

  • The West and SouthEast regions have the largest share of loans with an undesirable status, but only by a slightly higher percentage than the NorthEast region.
  • The West and SouthEast regions had a higher percentage of most Grade B "bad" loans.
  • The NorthEast region had a higher percentage of loans with the Grace Period and Does not meet Credit Policy statuses (neither of which is considered as bad as a default, for instance).

Loans Issued by State

- California, Texas, New York and Florida are the states in which the most loans were issued. Not surprisingly, these four states also make up the largest share of U.S. annual gross domestic product (GDP).

- Interestingly, all four states have average interest rates of about 13%, on par with the average interest rate across all states (13.24%).

- California, Texas and New York all have above-average annual incomes (Florida does not), which might help explain why most loans are issued in these states.

Analysis by Income Category

Distinct income categories were created in order to detect important patterns and do an in-depth analysis of this segment. Some important points to consider are: 

- Low income category: Borrowers with an annual income lower than or equal to $100,000.

- Medium income category: Borrowers with an annual income higher than $100,000 but lower than or equal to $200,000.

- High income category: Borrowers with an annual income higher than $200,000.

  • Borrowers in the high income category took out larger loans than borrowers in the low and medium income categories. This is expected, since people with higher annual incomes are more likely to be able to repay larger loans. (First row to the left of the subplots)
  • Loans taken out by the low income category had a slightly higher chance of becoming a bad loan. (First row to the right of the subplots)
  • Borrowers with high and medium annual incomes had a longer employment length than borrowers with lower incomes. (Second row to the left of the subplots)
  • Borrowers with a lower income had, on average, higher interest rates, while borrowers with a higher annual income had lower interest rates on their loans. (Second row to the right of the subplots)
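
As a rough illustration of the income categories defined above, the thresholds can be expressed as right-closed bins with pandas. This is only a sketch: the post does not show its code, and the sample rows and the column name "annual_inc" are assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample rows; "annual_inc" is an assumed column name.
loans = pd.DataFrame({"annual_inc": [45_000, 100_000, 100_001, 200_000, 350_000]})

# Right-closed bins match the definitions in the text:
# (-inf, 100k] -> Low, (100k, 200k] -> Medium, (200k, inf) -> High.
loans["income_category"] = pd.cut(
    loans["annual_inc"],
    bins=[float("-inf"), 100_000, 200_000, float("inf")],
    labels=["Low", "Medium", "High"],
)

print(loans)
```

Because pd.cut closes each bin on the right by default, an income of exactly $100,000 lands in Low and exactly $200,000 lands in Medium, which agrees with the "lower or equal to" wording of the definitions.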

Good/Bad Loan Summary

- Bad Loans Count: Loans taken out for educational and small business purposes tend to have a higher risk of becoming bad loans (percentage-wise).

- Most frequent purpose: The most common reason clients applied for a loan was to consolidate debt.

- Least frequent purpose: Across all three income categories, clients applied least often for educational purposes.

- Interest Rates: For every loan purpose except medical, small business and credit card, the low income category has the highest interest rate. One possible explanation for the exceptions is the amount of capital that the other income categories need for these purposes, which might explain why the low income category's interest rate is lower there.

- Bad/Good Ratio: For all purposes except educational, the higher the income category, the lower the bad/good ratio. The exception is a spike in the high income category for educational loans: only two such loans were issued and one was a bad loan, which pushed the ratio to 50%.
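
The small-sample spike described in the last point is easy to reproduce. The sketch below computes the share of bad loans per purpose and income category on made-up rows. The "loan_condition" label comes from the post, but its values ("Bad Loan", "Good Loan"), the other column names and the data are assumptions. Reporting the group size next to the share makes it obvious when a percentage rests on only a couple of loans.

```python
import pandas as pd

# Hypothetical toy data; only the "loan_condition" name comes from the post.
loans = pd.DataFrame({
    "purpose": ["educational", "educational"] + ["debt_consolidation"] * 4,
    "income_category": ["High", "High", "Low", "Low", "Low", "Low"],
    "loan_condition": ["Bad Loan", "Good Loan", "Good Loan",
                       "Good Loan", "Good Loan", "Bad Loan"],
})

# Boolean flag so that the mean of the flag is the share of bad loans.
loans["is_bad"] = loans["loan_condition"].eq("Bad Loan")

summary = (
    loans.groupby(["purpose", "income_category"])["is_bad"]
    .agg(n_loans="count", bad_share="mean")
    .reset_index()
)

print(summary)
```

In this toy data the educational/High group has two loans and one bad loan, so its share is 50%, the same situation the post describes.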

Logistic Regression

Data is oversampled using the SMOTE technique prior to performing the Logistic Regression. 
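
The post does not show how SMOTE was applied, and libraries such as imbalanced-learn provide a full implementation. To illustrate the core idea only, here is a minimal NumPy sketch: each synthetic row is placed at a random point on the line between a minority-class row and one of its k nearest minority-class neighbours. The function name smote_sketch, its parameters and the random data are all hypothetical.

```python
import numpy as np

def smote_sketch(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    n = len(X)
    if n < 2:
        raise ValueError("need at least two minority rows to interpolate")
    k = min(k, n - 1)

    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                       # random minority rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour each
    gap = rng.random((n_new, 1))                                # weight in [0, 1)
    return X[base] + gap * (X[picked] - X[base])

# Hypothetical minority class ("bad" loans) with 3 numeric features.
X_bad = np.random.default_rng(1).normal(size=(20, 3))
synthetic = smote_sketch(X_bad, n_new=80)
print(synthetic.shape)
```

Oversampling is usually applied to the training portion only, so that the test set keeps the original class balance.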

Feature Engineering & Neural Network

Some features are redundant (as shown at the beginning of this kernel in the distribution subplots) and have no effect on the "loan_condition" label, so we drop them.

Use StratifiedShuffleSplit so that the training and testing data have approximately the same ratio of bad loans to good loans. Remember that over 92% of the loans are considered good loans, so it is important to keep this same ratio across the training and testing sets.
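
A minimal sketch of such a split with scikit-learn's StratifiedShuffleSplit follows. The synthetic features, the 0/1 encoding of the label, the 80/20 split and the random seeds are assumptions for illustration, not the settings used in the project.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical data: 1000 loans, 4 numeric features, 92% good loans.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
y = np.array([1] * 80 + [0] * 920)  # 1 = bad loan, 0 = good loan (assumed encoding)

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# The share of bad loans should be close to 8% in both sets.
print(y_train.mean(), y_test.mean())
```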

Scale numeric features and encode categorical features from our dataframe.
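
One common way to do both steps at once is a scikit-learn ColumnTransformer that standardizes the numeric columns and one-hot encodes the categorical ones. The sketch below assumes this approach, since the post does not say which scaler or encoder it used. The column names and values are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical rows; column names are assumptions.
loans = pd.DataFrame({
    "annual_inc": [45_000.0, 82_000.0, 150_000.0, 230_000.0],
    "int_rate": [15.2, 13.1, 11.4, 9.8],
    "home_ownership": ["RENT", "MORTGAGE", "MORTGAGE", "OWN"],
})

numeric_cols = ["annual_inc", "int_rate"]
categorical_cols = ["home_ownership"]

preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# 2 scaled numeric columns followed by one column per home_ownership value.
features = preprocess.fit_transform(loans)
print(features.shape)
```

In practice the transformer would be fitted on the training set only and then applied to the test set with transform, so that no information from the test data leaks into the scaling.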

Run our neural network, which has an input layer sized to the number of inputs, 2 hidden layers (first: 15 nodes, second: 5 nodes), and an output layer with 2 outputs.
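
To make the layer sizes concrete, here is a NumPy sketch of a forward pass through a network of that shape. It is untrained and is not the project's model: the framework, the ReLU activations, the softmax output, the weight initialization and the choice of 10 input features are all assumptions, and the helper names init_params and forward are hypothetical.

```python
import numpy as np

def init_params(n_inputs, hidden=(15, 5), n_outputs=2, seed=0):
    """Random weights and zero biases for inputs -> 15 -> 5 -> 2."""
    rng = np.random.default_rng(seed)
    sizes = [n_inputs, *hidden, n_outputs]
    return [
        (rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
        for m, n in zip(sizes[:-1], sizes[1:])
    ]

def forward(params, X):
    """Return class probabilities for each row of X."""
    a = X
    for W, b in params[:-1]:
        a = np.maximum(a @ W + b, 0.0)  # ReLU (assumed activation)
    W, b = params[-1]
    logits = a @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)  # softmax over the 2 outputs

# Hypothetical batch: 4 loans with 10 preprocessed features each.
params = init_params(n_inputs=10)
X_batch = np.random.default_rng(1).normal(size=(4, 10))
probs = forward(params, X_batch)

print([W.shape for W, _ in params])
print(probs.shape)
```

Each row of probs holds two probabilities that sum to 1, one per output node (for example, good loan and bad loan).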
