Loan Default Detection

Shamroz Qureshi
Posted on Oct 28, 2020


In 2005, a Taiwanese bank conducted a study on the likelihood of clients defaulting on their loan payments. The motivation behind the study was the increase in amounts of credit being offered by banks to customers, regardless of their repayment capabilities. This led to customers accumulating significant amounts of debt, which in turn resulted in defaults.

The goal was to use basic information about customers, along with their past repayment history, to predict their likelihood of default. Our objective is to use the previous 6 months of repayment history to predict whether a customer will default the following month.


The dataset was collected by a Taiwanese bank in October 2005 and can be downloaded from UCI's Machine Learning Repository. Its features include:

  • Marriage: Marital status
  • Education: Level of education
  • Limit balance: Amount of credit (NT Dollars)
  • Payment status (month): Current repayment status
  • Bill statement (month): The amount of bill statements (NT Dollars)
  • Previous payment (month): Previous payment amount (NT Dollars)
  • Default payment next month: The target variable indicating whether the customer defaulted on the payment the following month.

Exploratory Data Analysis

The correlation matrix shows that age appears to be uncorrelated with the other features. Multicollinearity would normally be a concern; however, it has a negligible effect on the tree-based models we intend to use.


The pairplot doesn't show much difference in the shape of the distributions by gender. We can also see limit balances decrease as age increases beyond 55.


Model Selection


We will fit tree-based classifiers for our binary classification problem. A confusion matrix summarizes all possible combinations of predicted and actual target values:

  • True positive (TP): The model predicts a default, and the client defaulted.
  • False positive (FP): The model predicts a default, but the client did not default.
  • True negative (TN): The model predicts a good customer, and the client did not default.
  • False negative (FN): The model predicts a good customer, but the client defaulted.
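
The four outcomes listed here can be counted directly from paired actual and predicted labels. The sketch below is illustrative only: the function name `confusion_counts` and the eight example labels are hypothetical and are not taken from the credit dataset.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN and FN, treating 1 as 'default' and 0 as 'no default'."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1  # predicted default, client defaulted
        elif predicted == 1 and actual == 0:
            fp += 1  # predicted default, client did not default
        elif predicted == 0 and actual == 0:
            tn += 1  # predicted good customer, client did not default
        else:
            fn += 1  # predicted good customer, client defaulted
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}


# Hypothetical labels for eight clients (1 = defaulted, 0 = did not default).
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

counts = confusion_counts(y_true, y_pred)
print(counts)
```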

We can use these values to derive additional evaluation criteria for our model.

  • Accuracy: The fraction of all observations whose class the model predicts correctly.
  • Precision: Out of all default predictions, how many observations actually defaulted.
  • Recall: Out of all actual defaults (positive cases), how many were predicted correctly.
  • Specificity: Out of all clients who did not default (negative cases), how many were correctly predicted as non-defaults.
  • F-1 Score: The harmonic mean of precision and recall.
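
Each of these metrics follows directly from the four confusion matrix counts. As a rough sketch, with a hypothetical helper `classification_metrics` and made-up counts rather than results from the credit data:

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive the five metrics from confusion matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Guard against empty denominators, e.g. a model that never predicts default.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Made-up counts for 100 clients, for illustration only.
metrics = classification_metrics(tp=30, fp=10, tn=50, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```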

Understanding these metrics is critical for properly evaluating our model's performance. Which criterion to optimize depends on the bank's priorities. In terms of risk management, the bank may prefer to mitigate risk by declining more applications rather than taking on riskier loans that may result in larger losses. In this case, we would aim for recall as high as possible, which yields fewer false negatives at the cost of more false positives. Conversely, if the bank believes it can hand out loans aggressively and still profit despite additional defaults, it can aim for higher precision, which yields fewer false positives at the cost of more false negatives. Ultimately, the metric we optimize should be selected based on the use case.
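
One common way to move along this trade-off is to change the probability threshold above which a client is flagged as a likely default. This is an assumption for illustration, not something the analysis itself describes; the helper name, scores and labels below are all made up.

```python
def precision_recall_at(threshold, y_true, scores):
    """Precision and recall when scores >= threshold are flagged as defaults."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for a, p in zip(y_true, preds) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(y_true, preds) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(y_true, preds) if a == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up default probabilities and outcomes (1 = defaulted).
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.55, 0.4, 0.35, 0.2, 0.1]

# A low threshold flags more clients: fewer false negatives, more false positives.
cautious = precision_recall_at(0.3, y_true, scores)
# A high threshold flags fewer clients: fewer false positives, more false negatives.
aggressive = precision_recall_at(0.7, y_true, scores)

print("threshold 0.3 -> precision %.2f, recall %.2f" % cautious)
print("threshold 0.7 -> precision %.2f, recall %.2f" % aggressive)
```

On this toy data the lower threshold catches every default but flags some good customers, while the higher threshold flags only true defaults but misses half of them.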


We will start with a basic decision tree model and follow with a more sophisticated random forest model.
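
A minimal sketch of such a comparison with scikit-learn is shown here. It runs on synthetic data from `make_classification`, not the credit dataset, so it does not reproduce the results of this analysis. The class weights, split size and default hyperparameters are assumptions chosen only to make the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary data standing in for the real features and target.
X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.78, 0.22], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

models = {
    "decision tree": DecisionTreeClassifier(random_state=42),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
    }

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```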


The results from the base decision tree and the random forest show a drastic increase in accuracy and precision when we apply the more sophisticated model.

Hyperparameter Tuning

We will apply a grid search to tune the model's hyperparameters and achieve better performance. The idea is to create a grid of possible hyperparameter combinations and train the model on each of them. The search then identifies the best-performing combination within the grid.
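
As a sketch of the idea, scikit-learn's `GridSearchCV` trains and cross-validates one model per combination in the grid. The estimator, the grid values, the `recall` scoring choice and the synthetic data below are assumptions for illustration; the grid actually searched in this analysis is not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced data standing in for the credit dataset.
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.78, 0.22], random_state=0
)

# Hypothetical grid: 2 x 3 = 6 hyperparameter combinations.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, 6, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="recall",  # rank combinations by recall on the positive class
    cv=3,              # 3-fold cross-validation for each combination
)
search.fit(X, y)

best_params = search.best_params_
best_recall = search.best_score_
print("best combination:", best_params)
print("mean cross-validated recall: %.3f" % best_recall)
```

Because every combination is trained once per fold, the cost grows multiplicatively with each hyperparameter added to the grid.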


Tuning the hyperparameters with an exhaustive grid search led to higher accuracy and precision than our previous models.



We selected the best-performing decision tree model based on recall: the percentage of all actual defaults that the model correctly identifies. This evaluation metric makes the most sense given the imbalance in the target, i.e., the uneven ratio of default to non-default observations. To predict defaults, we decided that we could accept the cost of more false positives in return for fewer false negatives.
