Peer-to-Peer Lending: Default Predictability for Lending Club

Posted on May 28, 2020
The skills I demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.


Lending Club is a peer-to-peer lending company based in the United States, in which investors provide funds for potential borrowers and earn a profit that depends on the risk they take (the borrower's credit score). Lending Club provides the platform that connects investors and borrowers.

Data Summary

The dataset contains the complete loan data for all loans issued from 2007 through 2015, including the current loan status (Current, Late, Fully Paid, etc.) and the latest payment information. It has roughly 890,000 loan observations (rows) and over 75 features (columns). Additional features include credit scores, number of finance inquiries, address information (zip code and state), and collections, among others.

First Observations

  • Most of the loans issued were in the range of $10,000 - $20,000.
  • 2015 was the year in which the most loans were issued.
  • The number of loans issued increased steadily over time, possibly due to the recovery of the U.S. economy.
  • The loan amounts applied for by borrowers, the amounts issued to borrowers, and the amounts funded by investors are similarly distributed, which suggests that qualified borrowers are likely to receive the loan they applied for.



Credit Scores vs. Loan Grades

Credit scores are important metrics for evaluating the overall level of risk. In this section we analyze the level of risk as a whole and how many loans were bad loans for each grade derived from the customer’s credit score.

What we need to know:

i.) The lower the grade assigned from the credit score, the higher the risk for investors.

ii.) Several different factors influence the level of risk of a loan.


Determination of a Bad Loan

The primary factors that increase overall loan risk for investors are low annual income, a high debt-to-income ratio, high interest rates, and a low grade. In the last year of the data, every type of bad loan tends to decline except late payments, which might indicate an economic recovery.

Other notable points are: 

  • Mortgage was the home ownership category with the highest amount borrowed among loans considered to be bad.
  • There is a slight increase in the number of people with mortgages who are applying for a loan.
  • People who have a mortgage (depending on other factors as well within the mortgage) are more likely to ask for

Analysis by Region

The regional analysis of loans gives us an idea of geographic distribution (and not the risk level of loans based on differing categories). Below are the main points relating to regional analysis observed throughout the project. 

  • The West and SouthEast regions have the most undesirable loan statuses, but only by a slightly higher percentage than the NorthEast region.
  • The West and SouthEast regions had a higher percentage of most Grade B "bad" loans.
  • The NorthEast region had a higher percentage of loans with the Grace Period and Does not meet Credit Policy loan statuses (neither is considered as bad as a default, for instance).

Loans Issued by State

- California, Texas, New York and Florida are the states in which the highest amount of loans was issued. Not surprisingly, these four states also make up the largest share of U.S. annual gross domestic product (GDP).

- Interestingly, all four states have average interest rates of about 13%, in line with the average interest rate for all states (13.24%).

- California, Texas and New York are all above the average annual income (Florida is not), which may help explain why most loans are issued in these states.

Analysis by Income Category

Distinct income categories were created in order to detect important patterns and do an in-depth analysis of this segment. Some important points to consider are: 

- Low income category: Borrowers with an annual income less than or equal to $100,000.

- Medium income category: Borrowers with an annual income greater than $100,000 but less than or equal to $200,000.

- High income category: Borrowers with an annual income greater than $200,000.

  • Borrowers in the high income category took out larger loans than people in the low and medium income categories. Of course, people with higher annual incomes are more likely to be able to repay larger loans. (First row to the left of the subplots)
  • Loans taken out by the low income category had a slightly higher chance of becoming bad loans. (First row to the right of the subplots)
  • Borrowers with high and medium annual incomes had longer employment lengths than people with lower incomes. (Second row to the left of the subplots)
  • Borrowers with lower incomes had, on average, higher interest rates, while people with higher annual incomes had lower interest rates on their loans. (Second row to the right of the subplots)
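The income thresholds defined above can be written as a small helper. This is only a sketch of the binning rule; the function name and the sample incomes are made up for illustration and are not taken from the original analysis.

```python
def income_category(annual_income):
    """Map an annual income in USD to the categories defined in the text."""
    if annual_income <= 100_000:
        return "Low"
    if annual_income <= 200_000:
        return "Medium"
    return "High"


# Made-up incomes, chosen to sit on and around the category boundaries.
sample_incomes = [45_000, 100_000, 150_000, 200_000, 250_000]
sample_categories = [income_category(x) for x in sample_incomes]
print(sample_categories)
```

Note that both boundaries are inclusive on the lower category, so an income of exactly $100,000 is Low and exactly $200,000 is Medium.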

Good/Bad Loan Summary

- Bad Loans Count: Loans taken out for educational and small business purposes tend to have a higher risk of becoming bad loans (percentage-wise).

- Most frequent purpose: The most common reason clients applied for a loan was to consolidate debt.

- Least frequent purpose: Clients in all three income categories applied least often for educational purposes.

- Interest Rates: For every loan purpose except medical, small business and credit card, the low income category has a higher interest rate. One possible explanation is the amount of capital that the other income categories need for these purposes, which might explain why the low income category's interest rates for these purposes are lower.

- Bad/Good Ratio: For every purpose except educational, the bad/good ratio is lower the higher the income category. The spike for educational loans in the high income category occurs because only two loans were issued and one was a bad loan, which pushed the ratio to 50%.
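To see why a tiny group can distort the ratio, here is a minimal sketch that computes the share of bad loans per (purpose, income category) group and keeps the group size next to it. The records are made up, and the share is computed as bad loans divided by all loans in the group, which is the reading consistent with the 50% figure above (one bad loan out of two).

```python
from collections import defaultdict


def bad_loan_share(loans):
    """Return {(purpose, income_category): (share_of_bad_loans, n_loans)}."""
    counts = defaultdict(lambda: [0, 0])  # [bad, total] per group
    for purpose, category, is_bad in loans:
        counts[(purpose, category)][0] += int(is_bad)
        counts[(purpose, category)][1] += 1
    return {key: (bad / total, total) for key, (bad, total) in counts.items()}


# Made-up records: (purpose, income category, is_bad)
loans = [
    ("educational", "High", True),
    ("educational", "High", False),
    ("debt_consolidation", "High", False),
    ("debt_consolidation", "High", False),
    ("debt_consolidation", "High", False),
    ("debt_consolidation", "High", True),
]

shares = bad_loan_share(loans)
for group, (share, n_loans) in shares.items():
    print(group, f"{share:.0%} bad out of {n_loans} loans")
```

Reporting the group size alongside the share makes it easier to spot ratios that rest on only a handful of loans.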

Logistic Regression

Data is oversampled using the SMOTE technique prior to performing the Logistic Regression. 
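The idea behind SMOTE is to create synthetic minority-class samples by interpolating between an existing minority sample and one of its nearest minority neighbours. The snippet below is a simplified, standard-library sketch of that idea on made-up two-dimensional points. It is not the implementation used in the original project, and the function name and parameters are hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Made-up minority-class ("bad loan") points with two numeric features.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5), (1.2, 2.2)]
synthetic = smote_sketch(minority, n_new=6)
print(len(synthetic), "synthetic samples created")
```

Oversampling is usually applied to the training split only, so that the evaluation data keeps the original class balance.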


Feature Engineering & Neural Network

Some features are redundant (as shown at the beginning of this kernel in the distribution subplots) and have no effect on the "loan_condition" label, so we need to drop them.

Use StratifiedShuffleSplit to keep approximately the same ratio of bad loans to good loans in both the training and testing data. Remember that over 92% of the loans are considered good loans, so it is important to have this same ratio across the training and testing sets.
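What a stratified split does can be illustrated without any library: shuffle the indices of each class separately and take the same fraction of each class for the test set. The sketch below is a hand-written illustration on toy labels, not the scikit-learn implementation; the function name, the 20% test fraction and the 92/8 class mix are assumptions chosen for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Toy labels: 92% good loans and 8% bad loans.
labels = ["good"] * 920 + ["bad"] * 80
train_idx, test_idx = stratified_split(labels)

train_bad_share = sum(labels[i] == "bad" for i in train_idx) / len(train_idx)
test_bad_share = sum(labels[i] == "bad" for i in test_idx) / len(test_idx)
print(f"bad share: train {train_bad_share:.1%}, test {test_bad_share:.1%}")
```

Both splits end up with the same share of bad loans, which a plain random split does not guarantee when one class is rare.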

Scale numeric features and encode categorical features from our dataframe.
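As a rough illustration of this step, the sketch below standardizes one numeric column (zero mean, unit variance) and one-hot encodes one categorical column using only the standard library. The source does not say which scaler or encoder was used, so standardization and one-hot encoding are assumptions here, and the column values are made up.

```python
import statistics


def standardize(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std if std else 0.0 for v in values]


def one_hot(values):
    """Return (levels, rows), with one 0/1 column per distinct level."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# Made-up example columns.
interest_rates = [10.0, 20.0, 30.0]
home_ownership = ["RENT", "MORTGAGE", "RENT"]

scaled_rates = standardize(interest_rates)
levels, encoded = one_hot(home_ownership)
print(scaled_rates)
print(levels, encoded)
```

In a real pipeline the mean, standard deviation and category levels would be computed on the training set and then reused for the test set.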

Run our neural network, which consists of an input layer sized to the number of input features, 2 hidden layers (first: 15 nodes, second: 5 nodes), and an output layer with 2 nodes.
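The layer sizes described above can be sketched as a plain forward pass. Everything beyond the 15-5-2 layout is an assumption made for illustration: the source does not state the activation functions (ReLU and softmax are used here), the number of input features (8 is a placeholder), or the framework, and the weights below are random rather than trained.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Random weights (n_out rows of n_in values) and zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer):
    """Affine transform: one output per row of the weight matrix."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(v):
    return [max(0.0, a) for a in v]


def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, layers):
    """Inputs -> 15 hidden nodes -> 5 hidden nodes -> 2 class probabilities."""
    h1 = relu(dense(x, layers[0]))
    h2 = relu(dense(h1, layers[1]))
    return softmax(dense(h2, layers[2]))


rng = random.Random(0)
n_inputs = 8  # hypothetical number of features after preprocessing
layers = [
    init_layer(n_inputs, 15, rng),
    init_layer(15, 5, rng),
    init_layer(5, 2, rng),
]
probs = forward([0.1] * n_inputs, layers)
print(probs)  # two probabilities, one per class (good loan / bad loan)
```

The two outputs sum to 1 and can be read as the predicted probabilities of the two loan conditions once the network has been trained.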
