Data Analysis on Credit Card Default Detection

Posted on Oct 28, 2020
The skills I demonstrate here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

Motivation

In 2005, a Taiwanese bank conducted a data study on the likelihood of clients defaulting on their loan payments. The motivation behind the study was the increase in amounts of credit being offered by banks to customers, regardless of their repayment capabilities. This led to customers accumulating significant amounts of debt, which in turn resulted in defaults.

The goal was to use basic information about customers along with their past repayment history to predict their likelihood of default. Our objective is to use the previous 6 months of repayment history to try and predict whether the customer will default the following month.

Dataset

The dataset was collected by a Taiwanese bank in October 2005 and can be downloaded from UCI's Machine Learning Repository. Its fields include:

  • Marriage: Marital status
  • Education: Level of education
  • Limit balance: Amount of credit (NT Dollars)
  • Payment status (month): Current repayment status
  • Bill statement (month): The amount of bill statements (NT Dollars)
  • Previous payment (month): Previous payment amount (NT Dollars)
  • Default payment next month: The target variable indicating whether the customer defaulted on the payment the following month.

Exploratory Data Analysis

The correlation matrix shows that age appears to be uncorrelated with the other features. Multicollinearity among the remaining features would normally be a concern, but it has a negligible effect on the tree-based models we intend to use.

CORRELATION MATRIX
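
As a rough illustration of how a correlation matrix like this can be computed, here is a minimal sketch using pandas. The data is synthetic and the column names (`AGE`, `LIMIT_BAL`, `BILL_AMT1`, `BILL_AMT2`) are hypothetical stand-ins, not the actual columns or values from the bank dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; column names and values are hypothetical.
rng = np.random.default_rng(0)
n = 500
age = rng.integers(21, 70, size=n)
limit_bal = rng.normal(150_000, 50_000, size=n)
# Consecutive bill statements are built to be strongly related to each other.
bill_amt1 = 0.4 * limit_bal + rng.normal(0, 10_000, size=n)
bill_amt2 = 0.9 * bill_amt1 + rng.normal(0, 5_000, size=n)

df = pd.DataFrame(
    {"AGE": age, "LIMIT_BAL": limit_bal, "BILL_AMT1": bill_amt1, "BILL_AMT2": bill_amt2}
)

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()
print(corr.round(2))
```

In this toy data the two bill columns come out highly correlated, which is the kind of multicollinearity the paragraph above refers to.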

The pairplot doesn't show much of a difference in the shape of the distributions by gender. We can also see limit balances decrease as age increases beyond 55.

PAIRPLOT

Data on Model Selection

CLASSIFICATION

We will fit tree-based classifiers for our binary classification problem. A confusion matrix summarizes every combination of predicted and actual class:

  • True positive (TP): The model predicts a default, and the client defaulted.
  • False positive (FP): The model predicts a default, but the client did not default.
  • True negative (TN): The model predicts a good customer, and the client did not default.
  • False negative (FN): The model predicts a good customer, but the client defaulted.
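
The four outcomes above can be counted directly from a list of actual and predicted labels. The sketch below is a plain-Python illustration; the helper name `confusion_counts` and the labels are made up for the example, with 1 meaning default and 0 meaning no default.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = default, 0 = no default)."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1  # predicted default, client defaulted
        elif predicted == 1 and actual == 0:
            fp += 1  # predicted default, client did not default
        elif predicted == 0 and actual == 0:
            tn += 1  # predicted good customer, client did not default
        else:
            fn += 1  # predicted good customer, client defaulted
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}


# Made-up labels for illustration only (not from the bank dataset).
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

counts = confusion_counts(y_true, y_pred)
print(counts)  # {'TP': 2, 'FP': 1, 'TN': 4, 'FN': 1}
```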

We can use these values to create additional evaluation criteria for our model.

  • Accuracy: Measures the model's overall ability to correctly predict the class of the observation.
  • Precision: Out of all default predictions, how many observations indeed defaulted.
  • Recall: Out of all clients who actually defaulted, how many the model predicted as defaults.
  • Specificity: Out of all clients who actually did not default, how many the model predicted as non-defaults.
  • F-1 Score: The harmonic mean of precision and recall.
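
To make these definitions concrete, here is a small sketch that derives each metric from the four confusion-matrix counts. The function name `classification_metrics` and the counts passed to it are illustrative assumptions, not results from the models in this post.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive the standard metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=2, fp=1, tn=4, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy is 6/8 = 0.75, precision and recall are both 2/3, and specificity is 4/5 = 0.8.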

Understanding these metrics is critical for properly evaluating our model's performance. Which criterion to optimize depends on the bank's priorities. In terms of risk management, the bank may prefer to mitigate risk by declining more applications rather than taking on riskier loans that may result in larger losses.

In that case, we would aim for as high a recall as possible, which yields fewer false negatives at the cost of more false positives. Conversely, if the bank believes it can extend credit aggressively and still profit despite additional defaults, it can aim for higher precision, which yields fewer false positives at the cost of more false negatives. Ultimately, the metric we optimize should be selected based on the use case.
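
One common way to move along this trade-off is to change the probability threshold at which a client is flagged as a default. The post does not say this is how the trade-off was handled here, so treat the sketch below as an assumption; the scores, labels, and the helper `precision_recall_at` are made up for illustration.

```python
def precision_recall_at(threshold, y_true, scores):
    """Precision and recall when scores >= threshold are flagged as defaults."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for a, p in zip(y_true, preds) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(y_true, preds) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(y_true, preds) if a == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up predicted default probabilities and true labels (1 = default).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.6, 0.35, 0.4, 0.2, 0.1, 0.45, 0.55]

strict = precision_recall_at(0.5, y_true, scores)   # flags fewer clients
lenient = precision_recall_at(0.3, y_true, scores)  # flags more clients

print("threshold 0.5 -> precision %.2f, recall %.2f" % strict)
print("threshold 0.3 -> precision %.2f, recall %.2f" % lenient)
```

In this toy example, lowering the threshold from 0.5 to 0.3 raises recall from 0.75 to 1.00 while precision drops from 1.00 to about 0.67.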

Models

We will start with a basic decision tree model and follow with a more sophisticated random forest model.

BASE TREE
RANDOM FOREST

Above are the results from a base tree model and a random forest model. Accuracy and precision both increase markedly when we apply the more sophisticated model.
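
A comparison of this kind could be set up roughly as follows with scikit-learn. This is a sketch on synthetic data from `make_classification`, not the bank dataset, and the sample size, class weights, and model settings are arbitrary assumptions, so the numbers it prints will not match the figures above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary data standing in for the real dataset.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8,
    weights=[0.8, 0.2], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
    }

for name, scores in results.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
```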

Hyperparameter Tuning

We will apply a grid search to tune the hyperparameters of each model in order to achieve better performance. The idea is to create a grid of possible hyperparameter combinations and train the model using each one of them. The search identifies the best-performing combination within the grid.
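
The grid search described above can be sketched with scikit-learn's `GridSearchCV`. The grid values, the choice of `scoring="recall"`, the 5-fold cross-validation, and the synthetic data are all assumptions made for illustration; the post does not list the grid that was actually searched.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data standing in for the real dataset.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.8, 0.2], random_state=0
)

# Hypothetical grid: 3 x 3 = 9 hyperparameter combinations.
param_grid = {
    "max_depth": [3, 5, 8],
    "min_samples_leaf": [1, 10, 50],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="recall",  # assumed metric; swap for "precision" or "accuracy" as needed
    cv=5,
)
search.fit(X, y)  # trains and cross-validates every combination in the grid

print("best params:", search.best_params_)
print("best cross-validated recall:", round(search.best_score_, 3))
```

Because the search is exhaustive, its cost grows with the product of the list lengths in `param_grid`, so it helps to keep each list short.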

TUNED BASE TREE
TUNED RANDOM FOREST

Tuning the hyperparameters with an exhaustive grid search increased accuracy and precision relative to our previous models.

Conclusion

MODEL SUMMARY

We selected the best-performing decision tree model based on recall, the percentage of all actual defaults that the model correctly identifies. This metric makes the most sense given the target imbalance, i.e., the uneven ratio of default to non-default cases. To predict defaults, we decided we could accept more false positives in return for fewer false negatives.
