Data Analysis on Credit Card Default Detection
The skills I demonstrate here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
In 2005, a Taiwanese bank conducted a data study on the likelihood of clients defaulting on their loan payments. The motivation behind the study was the increase in amounts of credit being offered by banks to customers, regardless of their repayment capabilities. This led to customers accumulating significant amounts of debt, which in turn resulted in defaults.
The goal was to use basic information about customers along with their past repayment history to predict their likelihood of default. Our objective is to use the previous 6 months of repayment history to try and predict whether the customer will default the following month.
The dataset was collected by a Taiwanese bank in October 2005 and can be downloaded from UCI's Machine Learning Repository. The fields discussed in this post include:
- Marriage: Marital status
- Education: Level of education
- Limit balance: Amount of credit (NT Dollars)
- Payment status (month): Current repayment status
- Bill statement (month): The amount of bill statements (NT Dollars)
- Previous payment (month): Previous payment amount (NT Dollars)
- Default payment next month: The target variable indicating whether the customer defaulted on the payment the following month.
Exploratory Data Analysis
The correlation matrix shows that age appears to be uncorrelated with the other features. Multicollinearity would normally be a concern; however, it has a negligible effect on the tree-based models we intend to use.
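For readers who want to see what sits behind a correlation matrix, here is a minimal sketch that computes pairwise Pearson correlations by hand. The column names and values are made up for illustration and are not taken from the dataset; in practice a library such as pandas would do this in one call.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy columns (assumed names, invented values).
columns = {
    "bill_amt_1": [100, 200, 300, 400, 500],
    "bill_amt_2": [110, 190, 310, 390, 520],
    "age": [40, 25, 55, 30, 35],
}

# Correlation matrix as a nested dict: matrix[row][col].
matrix = {
    a: {b: pearson(columns[a], columns[b]) for b in columns}
    for a in columns
}

for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

In this toy data the two bill columns move together (a coefficient close to 1) while age is close to 0 against both, which is the kind of pattern the matrix is used to spot.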
The pairplot doesn't show much of a difference in the shape of the distributions by gender. We can also see limit balances decrease as age increases beyond 55.
Data on Model Selection
We will fit tree-based classifiers for our binary classification problem. A confusion matrix summarizes all possible combinations of the predicted values against the actual target:
- True positive (TP): The model predicts a default, and the client defaulted.
- False positive (FP): The model predicts a default, but the client did not default.
- True negative (TN): The model predicts a good customer, and the client did not default.
- False negative (FN): The model predicts a good customer, but the client defaulted.
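The four cells above can be counted directly from paired labels. This is a minimal sketch that assumes defaults are encoded as 1 and non-defaults as 0; the labels are invented for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN, where 1 = default and 0 = no default."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1   # predicted default, client defaulted
        elif predicted == 1 and actual == 0:
            fp += 1   # predicted default, client did not default
        elif predicted == 0 and actual == 0:
            tn += 1   # predicted good customer, client did not default
        else:
            fn += 1   # predicted good customer, client defaulted
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}

# Made-up labels for illustration only (not taken from the dataset).
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

counts = confusion_counts(y_true, y_pred)
print(counts)  # {'TP': 2, 'FP': 1, 'TN': 4, 'FN': 1}
```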
We can use these values to create additional evaluation criteria for our model.
- Accuracy: Measures the model's overall ability to correctly predict the class of the observation.
- Precision: Out of all default predictions, how many observations actually defaulted.
- Recall: Out of all actual defaults, how many were predicted correctly.
- Specificity: Out of all clients who did not default, how many were correctly predicted as non-defaults.
- F1 score: The harmonic mean of precision and recall.
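As a rough sketch, each metric in the list can be written as a ratio of confusion-matrix counts. The counts passed in below are hypothetical and are not results from our models.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive the standard metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Guard against empty denominators (e.g. no positive predictions).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }

# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=30, fp=10, tn=50, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy is 0.800, precision and recall are both 0.750, specificity is 0.833, and the F1 score is 0.750.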
Understanding these metrics is critical for properly evaluating our model's performance. Which criterion to optimize depends on the bank's priorities. In terms of risk management, the bank may prefer to mitigate risk by declining more applications, as opposed to taking on riskier loans, which may result in larger losses.
In this case, we would try to achieve as high a recall as possible. This yields fewer false negatives, at the cost of more false positives. Conversely, if the bank believes it can aggressively hand out loans and still profit regardless of additional defaults, then it can aim for higher precision. This yields fewer false positives, at the cost of more false negatives. Ultimately, the metric we optimize should be selected based on the use case.
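One common way to move along this trade-off, assuming the classifier outputs a default probability, is to change the cutoff above which a client is flagged. The sketch below is not how our models were tuned; the probabilities and labels are invented to show the direction of the effect.

```python
def precision_recall_at(threshold, y_true, y_prob):
    """Precision and recall when flagging clients with prob >= threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(y_true, y_pred) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Invented labels (1 = default) and predicted default probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.7, 0.45, 0.3, 0.6, 0.4, 0.35, 0.2, 0.1, 0.05]

for threshold in (0.5, 0.25):
    precision, recall = precision_recall_at(threshold, y_true, y_prob)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

Lowering the cutoff from 0.5 to 0.25 raises recall from 0.50 to 1.00 in this toy data, while precision falls from 0.67 to 0.57: more defaults are caught, but more good customers are flagged.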
We will start with a basic decision tree model and follow with a more sophisticated random forest model.
The results above are from a base decision tree model and a random forest model. We see a drastic increase in accuracy and precision when we apply the more sophisticated model.
We will apply a grid search to tune the hyperparameters of each model in order to achieve better performance. The idea is to create a grid of possible hyperparameter combinations and train the model using each one of them. The search then identifies the best-performing combination within the grid.
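The mechanics of an exhaustive grid search can be sketched in a few lines. This is an illustration of the idea rather than the code behind our results: `toy_score` is a hypothetical stand-in for "train the model with these settings and return a validation score", and the hyperparameter names and values are assumptions, since the grid we searched is not listed here. In scikit-learn, `GridSearchCV` plays this role and adds cross-validation.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Score every combination in param_grid; return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)  # in practice: fit and validate a model
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    """Hypothetical score surface that peaks at depth 5, leaf size 20."""
    depth_penalty = (params["max_depth"] - 5) ** 2
    leaf_penalty = (params["min_samples_leaf"] - 20) ** 2 / 100
    return -(depth_penalty + leaf_penalty)

# Assumed grid: 4 x 4 = 16 combinations are evaluated.
param_grid = {
    "max_depth": [3, 5, 7, 9],
    "min_samples_leaf": [1, 10, 20, 50],
}

best_params, best_score = grid_search(param_grid, toy_score)
print(best_params)  # {'max_depth': 5, 'min_samples_leaf': 20}
```

Because every combination is trained, the cost grows multiplicatively with each hyperparameter added to the grid, and the result can only be as good as the values the grid contains.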
TUNED BASE TREE
TUNED RANDOM FOREST
Tuning the hyperparameters increased accuracy and precision relative to our previous models. This was the result of optimizing each model with an exhaustive grid search.
We selected the best-performing decision tree model based on recall: the percentage of all defaults correctly identified by the model. This evaluation metric makes the most sense given the target imbalance, i.e., the uneven ratio of default to non-default observations, under which accuracy alone can look strong even when many defaults are missed. To predict defaults, we decided that we could accept the cost of more false positives in return for reducing the number of false negatives.
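To see why accuracy can mislead on an imbalanced target, consider a baseline that never predicts a default. The class mix below (20 defaults out of 100 clients) is an assumed figure for illustration, not the ratio measured in our data.

```python
# Assumed class mix for illustration: 20 defaults (1) and 80 non-defaults (0).
y_true = [1] * 20 + [0] * 80

# A trivial baseline that labels every client as a good customer.
y_pred = [0] * len(y_true)

correct = sum(1 for a, p in zip(y_true, y_pred) if a == p)
accuracy = correct / len(y_true)

tp = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)
fn = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 0)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
# accuracy=0.80, recall=0.00
```

The baseline scores 80% accuracy while catching none of the defaults, which is why we rank models by recall rather than accuracy.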