Ranking #1 on Kaggle for Predicting Consumer Debt Default
Introduction to Predicting Credit Default
The goal of this challenge is two-pronged: to build a model that borrowers can use to make better financial decisions, and that lenders can use to foresee when a borrower might fall into financial distress. The intent is to improve on the state of the art in credit scoring by predicting the probability of credit default in the next two years. This is an extremely complex and difficult Kaggle challenge, as banks and other lending institutions are constantly searching for and fine-tuning the best credit scoring algorithms available. Credit scoring is the model banks use to determine whether or not a loan should be granted. With this information, banks can make better decisions, and borrowers can do better financial planning to mitigate possible default in the future.
This challenge allowed the team to employ the best of ensemble models and algorithms (like XGBoost, Gradient Boosting, Random Forest, Restricted Boltzmann Machine neural networks, and AdaBoost) along with advanced stacking techniques and voting classifiers to accurately predict the probability of default.
Submissions are measured and ranked strictly by the area under the ROC curve (AUC). We also used a very disciplined Agile process to ensure that we executed critical-path tasks in parallel and in small chunks, failing fast and iterating fast to maximize productivity. Using sophisticated Bayesian Optimizers, we were able to find the best hyperparameter settings while cutting down on overall testing and cross-validation time, supercharging our climb up the ranking toward the highest AUC score possible.
With the suite of tools we utilized, the synergies and teamwork, and a process that boosted our productivity, we not only reached the top tier of the leaderboard, we surpassed the existing #1 score to take the top position in the challenge.
A SWOT analysis allows our team to organize our thoughts and focus on leveraging our greatest strengths, understand our weaknesses, take advantage of opportunities, and have awareness of any threats to our undertaking.
SWOT allows us to put ourselves on the right track right away, and saves us from a lot of headaches later on.
Strengths - Leveraging what we already know
- Experience to leverage advanced stacking techniques and Agile
- A synergistic team with very complementary skills and experiences
- Lessons learned from a previous Kaggle challenge
- Experience in the use of Agile process to parallel track work queues
Weaknesses - Areas that we need to improve upon
- Extreme time constraints would limit the scope/depth of exploration
- Unfamiliarity with new tools and models will limit firepower
- Learning while executing will slow down the entire process
- Sparse resources on some of the newer technologies employed
Opportunities - Chance to apply & learn from experience
- Learn how to strategize, optimize, and fine-tune models, algorithms, and parameters to achieve the best default prediction possible
- Gain hands-on experience using Bayesian Optimizers
- Explore Deep Learning (Theano/Keras) in predicting default
Threats - Risks that we need to mitigate and manage
- Small dataset size presents tremendous challenges on generalization that would impact modeling and ultimately, prediction accuracy
- Extremely tight tolerances in the AUC scores, down to the fourth decimal place
- Top 5% target goal presents a high risk – no idea how feasible this is
We used an "Agile Process", created by Bernard, as the pipeline for this project. It is enhanced specifically for machine learning, as it needs to incorporate how data is processed, modeled, and tested alongside a more traditional development life cycle.
The Agile Process leverages the concept of "chunking" out work tasks in rapid fashion, incorporating parallel tracks to fail fast and iterate fast, exert powerful momentum to maximize productivity, and was one of the main factors that got us to quickly achieve the #1 ranking within a short two weeks.
What is the Agile Process?
The Agile Process is a standard, industry-wide software development and engineering life cycle based on iterative development, where requirements and solutions evolve through collaboration between self-organizing, cross-functional teams. The one adapted for this undertaking is a modified form of Agile specifically designed for machine learning initiatives. It has similar components but is executed mostly in parallel instead of sequentially. The components of this process include data preprocessing (including missing value imputation), exploratory data analysis (e.g. univariate distributions, bivariate distributions, correlation analysis), feature engineering (e.g. adding features, removing features, PCA), algorithm selection (i.e. supervised), hyperparameter tuning, model fitting, model evaluation, model re-engineering, cross-validation testing, prediction, and finally, submission. Unlike a traditional sequential machine learning pipeline, where models are selected and tuned one at a time and model fitting can't start before the imputation method is decided, the Agile Process takes full advantage of the fact that multiple people are working on the project by running missing data imputation, feature engineering, and model fitting in parallel.
Why Agile Process?
Under the guidance of our team lead, Bernard, we had the entire Agile Process in mind from day one. Because most of the components were done in parallel, and each of us had very specific tasks assigned, nobody was waiting around for someone else to finish their work. We could evaluate models and tactical approaches at high speed, failing fast and iterating fast while learning what works and what doesn't. In terms of results, by the middle of the first week we already ranked in the top 100 of the leaderboard. By the end of the first week, we were in the #2 position, which left us a whole week to get to #1.
Exploratory Data Analysis
From the missing-value plot below, we can see that the variables 'Debt Ratio' and 'Number of Dependents' have ~20% and ~3% missing rates, respectively. We tried different imputation methods, including KNN, mean, random, and median imputation. However, we found that the models actually performed better without imputation.
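As a minimal sketch (not our exact pipeline), the simple strategies can be compared with scikit-learn's `SimpleImputer`; the column names follow the competition data, but the values here are synthetic with roughly the observed missing rates:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Synthetic stand-in columns with ~20% and ~3% missing values.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "DebtRatio": rng.random(1000),
    "NumberOfDependents": rng.integers(0, 4, 1000).astype(float),
})
df.loc[df.sample(frac=0.20, random_state=0).index, "DebtRatio"] = np.nan
df.loc[df.sample(frac=0.03, random_state=1).index, "NumberOfDependents"] = np.nan

# Fit one imputer per strategy and keep the filled frames for comparison.
imputed = {}
for strategy in ("mean", "median"):
    filled = SimpleImputer(strategy=strategy).fit_transform(df)
    imputed[strategy] = pd.DataFrame(filled, columns=df.columns)
```

In our case the no-imputation variant won, so a comparison like this is worth running before committing to any strategy.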
We also performed PCA, which showed that the first five principal components explain only 66% of the total variance. Since there are only 10 features, PCA may not work well for us.
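The explained-variance check is a few lines in scikit-learn; this sketch uses a random 10-feature matrix in place of the real training set, so the fraction it reports is illustrative only:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 10))  # stand-in for the 10-feature training matrix

pca = PCA(n_components=5).fit(X)
explained = pca.explained_variance_ratio_.sum()  # fraction of total variance kept
```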
One interesting property of this data set is that almost every column except 'age' has problems with outliers. For example, the columns 'Number of 30-59 Days Past Due Not Worse', 'Number of 60-89 Days Past Due Not Worse', and 'Number of 90 Days Past Due' each have around 0.2% of outliers equal to either 96 or 98. These numbers are actually codes entered in the survey to indicate that the consumer was not willing to reveal the information, and they distort the univariate distributions considerably. The variable 'RevolvingUtilizationOfUnsecuredLines' is the total balance on credit cards and personal lines of credit (excluding real estate and installment debt such as car loans) divided by the sum of credit limits, so its normal range lies between 0 and 1. However, it has outliers larger than 5000. Thus we had to handle this problem.
For each column, we detect outliers using three methods (95th-percentile-based, median-based, and standard-deviation-based) and take a majority vote: if at least two of the three methods agree that a point is an outlier, we flag it as one. We then replace each outlier with either the column median or the least extreme outlier value, depending on the column.
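A minimal sketch of the majority-vote detector described above; the thresholds and the synthetic column (normal values plus the 96/98 survey codes) are illustrative, not the exact values we used:

```python
import numpy as np

def outlier_votes(x, pct=95, z_thresh=3.0, mad_thresh=3.5):
    """Flag a point as an outlier when at least 2 of 3 detectors agree."""
    x = np.asarray(x, dtype=float)
    votes = np.zeros(len(x), dtype=int)
    votes += (x > np.percentile(x, pct)).astype(int)        # percentile-based
    med = np.median(x)
    mad = np.median(np.abs(x - med)) or 1.0                 # guard against MAD = 0
    votes += (0.6745 * np.abs(x - med) / mad > mad_thresh).astype(int)  # median-based
    votes += (np.abs(x - x.mean()) > z_thresh * x.std()).astype(int)    # std-based
    return votes >= 2

# Mostly well-behaved values plus the 96/98 survey codes at the end.
rng = np.random.default_rng(0)
col = np.concatenate([rng.normal(1.0, 0.5, 200), [96.0, 98.0]])
mask = outlier_votes(col)
cleaned = np.where(mask, np.median(col), col)  # median-replacement variant
```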
Filtering the outliers did change the structure of the data. Previously, the three delinquency variables were highly correlated. After the filtering, the correlation no longer exists, as shown by the snapshot below.
Below is the variable importance plot, which shows that 'RevolvingUtilizationOfUnsecuredLines', 'DebtRatio', and 'MonthlyIncome' are the three most important variables.
This is important for feature engineering. Below is the workflow we followed for feature engineering.
We tried different methods. We combined some columns to create a new column and dropped the old columns. For example, Debt Ratio multiplied by Monthly Income gives us Monthly Debt. We also assigned weights to the delinquency variables: for each delinquency variable, we fit a simple logistic regression and used the resulting R² divided by the sum of all three R² values as the weight. In the end, we created 7 training datasets and 7 test datasets. These datasets improved the Naive Bayes and Logistic Regression AUC scores from ~0.7 to ~0.85. However, for tree-based models, the datasets did not help that much.
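The two feature ideas above can be sketched as follows. This is a hedged illustration on synthetic data: the delinquency column names are shortened stand-ins, and McFadden's pseudo-R² is used here as a stand-in for the R² in our actual pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "DebtRatio": rng.random(n),
    "MonthlyIncome": rng.integers(2000, 9000, n).astype(float),
    "Delinq30to59": rng.integers(0, 3, n),   # shortened illustrative names
    "Delinq60to89": rng.integers(0, 2, n),
    "Delinq90plus": rng.integers(0, 2, n),
})
y = rng.integers(0, 2, n)

# Combine two columns into one new feature, then drop the originals.
df["MonthlyDebt"] = df["DebtRatio"] * df["MonthlyIncome"]
df = df.drop(columns=["DebtRatio", "MonthlyIncome"])

# Weight each delinquency column by its standalone pseudo-R^2.
r2 = {}
null_ll = -log_loss(y, np.full(n, y.mean()), normalize=False)  # intercept-only model
for col in ("Delinq30to59", "Delinq60to89", "Delinq90plus"):
    model = LogisticRegression().fit(df[[col]], y)
    model_ll = -log_loss(y, model.predict_proba(df[[col]])[:, 1], normalize=False)
    r2[col] = 1.0 - model_ll / null_ll       # McFadden's pseudo-R^2
weights = {c: v / sum(r2.values()) for c, v in r2.items()}
```

By construction the weights sum to 1, so they can scale the three delinquency columns against each other.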
Model Execution Strategy
There were 4 models built and evaluated for predictive accuracy as part of this challenge. The team employed a parallel-tracking process in which all models were built simultaneously; every time a better parameter setting was found through automated optimization, those parameters were fed back into the entire process cycle, so gains propagated immediately.
Single and Ensemble models: As a first step, Logistic Regression and Naive Bayes models were trained, and the accuracy (area under the curve, or AUC) was found to be ~0.7 for both. These single models provided a good baseline for comparing the scores of more sophisticated models such as the Stacking, Voting, and Blended models. Gradient Boosting and Random Forest models were evaluated as ensemble models, and their AUC scores were documented.
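A baseline of this kind takes only a few lines; here `make_classification` stands in for the competition data (with a similar class imbalance), so the exact AUC values will differ from the ~0.7 we observed:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# ~7% positive class, similar to the competition's default rate.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.93],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

aucs = {}
for model in (LogisticRegression(max_iter=1000), GaussianNB()):
    proba = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    aucs[type(model).__name__] = roc_auc_score(y_te, proba)
```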
Advanced Stacking Models: A stacking model combines the base classifiers in a non-linear fashion. The combining task, called a meta-learner, integrates the independently computed base classifiers into a higher-level classifier, called a meta-classifier, by learning over a meta-level training set. Our two-layer stacking model consisted of Gradient Boosting and XGBoost as base classifiers, whose probabilities were then fed to a Logistic Regression model acting as the meta-classifier. The stacking models helped enhance the score to ~0.8685, ranking #30 on the Kaggle leaderboard.
Voting Classifier Models: Voting models classify an unlabeled instance according to the class that obtains the highest number of votes; ours used a weighted average of the individual classifiers' probabilities to calculate the final output probability for a prediction. Though the team started with 2 classifiers initially, the final model consisted of 12 classifiers: 7 Gradient Boosters, 1 Naive Bayes, 3 Random Forests, and 1 AdaBoost classifier. The number of classifiers was gradually increased based on test results that contributed toward an increase in cross-validation scores. The voting models helped increase the AUC score to ~0.869, ranking #8 on the Kaggle leaderboard.
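A reduced sketch of the soft-voting ensemble (4 classifiers instead of 12, with illustrative weights rather than our tuned ones):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=600, n_features=10, random_state=0)

vote = VotingClassifier(
    estimators=[("gbm", GradientBoostingClassifier(random_state=0)),
                ("rf", RandomForestClassifier(random_state=0)),
                ("nb", GaussianNB()),
                ("ada", AdaBoostClassifier(random_state=0))],
    voting="soft",            # average the predicted probabilities
    weights=[3, 2, 1, 1],     # illustrative weights, not the tuned ones
)
scores = cross_val_score(vote, X, y, cv=3, scoring="roc_auc")
```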
Voting and Stacking Blended models: The model that gave us the much-needed boost to surpass the existing rank #1 was a blend of Voting and Stacking models. It consisted of 2 Gradient Boosters, 2 Random Forests, 1 AdaBoost, and 1 Restricted Boltzmann Machine (RBM); the addition of the neural network algorithm (RBM) helped enhance the score to 0.869574 and placed Team Eigenauts in the #1 position on the Kaggle leaderboard.
What is Bayesian Optimization used for?
Almost every machine learning algorithm includes some number of hyperparameters, also called tuning parameters. These differ from regular parameters in that they are not learned from the data during model fitting; they are tuned through a separate process altogether. Examples include the regularization parameter lambda for ridge and lasso regression, the C term for Support Vector Machines, and the number of trees for tree-based methods (e.g. Random Forest, Gradient Boosting Machine).
There are currently 4 types of hyperparameter optimization methods: grid search, random search, gradient-based optimization, and Bayesian Optimization. Of these, we were able to test grid search, random search, and Bayesian Optimization, and we found Bayesian Optimization to be the most efficient, automated approach.
Why is Bayesian Optimization more efficient than grid search and random search?
When searching for the best hyperparameter value, a few things are needed. First and foremost, any algorithm needs an objective function to maximize over or a loss function to minimize over. Then, a search space needs to be identified, usually through an upper bound and a lower bound. There may also be algorithm-specific arguments, such as step size for grid search.
Grid search is perhaps the most widely used hyperparameter search algorithm because of its simplicity. It finds the best value by going through every point in the search space and returning the value that produces the highest (or, for a loss function, the lowest) value of the objective function. With a wide search space and a small step size, grid search is guaranteed to find the global maximum (or minimum). However, a big problem with grid search is that it is very computationally expensive, especially when many hyperparameters need to be tuned (e.g. ~8 for Random Forest). Therefore, in practice the user often first tries a wide search space and a large step size to get an idea of where the global maximum (or minimum) might lie, then gradually shrinks both the search space and the step size to obtain values at a much more granular level. Although this approach decreases the run time, because the objective function is often non-convex (see Figure 1), it is highly likely that the user ends up missing the global maximum (or minimum), having settled on a local one in the first pass.
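The coarse-then-fine pattern described above looks roughly like this with scikit-learn's `GridSearchCV` (the grids and the synthetic data are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Pass 1: wide search space, large step size.
coarse = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 150, 250], "max_depth": [3, 9, 15]},
    scoring="roc_auc", cv=3,
).fit(X, y)

# Pass 2: shrink the space around the coarse winner
# (risking, as noted above, convergence to a local optimum).
best = coarse.best_params_
fine = GridSearchCV(
    RandomForestClassifier(random_state=0, max_depth=best["max_depth"]),
    param_grid={"n_estimators": [best["n_estimators"] - 25,
                                 best["n_estimators"],
                                 best["n_estimators"] + 25]},
    scoring="roc_auc", cv=3,
).fit(X, y)
```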
Random search uses a similar idea to grid search, but instead of checking every value from the lower bound to the upper bound, it randomly samples points in the search space. The rationale is that if the number of randomly sampled points is large enough, the global maximum (or minimum), or some nearby value, will also be found. By randomly sampling points, random search is usually faster than grid search. But similar to the faster, coarse-to-fine version of grid search, the result is not guaranteed.
Bayesian Optimization tries to find the set of values that produces the global maximum quite differently from grid search and random search. While grid search and random search ignore information obtained from previously sampled points when evaluating new ones, Bayesian Optimization fully utilizes that information. It finds the value that gives rise to the global maximum (or minimum) by learning the shape of the objective function, which it does by assuming a random function with some prior distribution. Each evaluation of the objective function at a newly sampled point is used to update the prior, yielding a posterior distribution over the objective function. The algorithm then samples the next point where, according to the posterior distribution, the global maximum (or minimum) is most likely to be located.
One major caveat of Bayesian Optimization is that once it finds a local maximum (or minimum), it will keep sampling points in that region, so it can easily get trapped in a local optimum. To help alleviate this problem, the Bayesian Optimization algorithm aims to strike a balance between exploration and exploitation.
Exploration means sampling points from regions that have not yet been sampled; exploitation means sampling points where the global maximum (or minimum) is most likely to be, according to the posterior distribution.
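The explore/exploit loop can be sketched with a Gaussian process surrogate and an upper-confidence-bound rule; this is a from-scratch illustration of the idea on a toy 1-D objective, not the `bayes_opt` package itself:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    return -(x - 2.0) ** 2  # toy objective with its maximum at x = 2

rng = np.random.default_rng(0)
grid = np.linspace(-4, 4, 400).reshape(-1, 1)   # candidate points
X_obs = rng.uniform(-4, 4, 3).reshape(-1, 1)    # a few random initial samples
y_obs = objective(X_obs).ravel()

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True)
for _ in range(15):
    gp.fit(X_obs, y_obs)                          # update the posterior
    mu, sigma = gp.predict(grid, return_std=True)
    ucb = mu + 2.0 * sigma      # mean term exploits, std term explores
    x_next = grid[np.argmax(ucb)]
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, objective(x_next[0]))

best_x = float(X_obs[np.argmax(y_obs), 0])
```

The acquisition function `mu + 2.0 * sigma` is high both where the posterior mean is large (exploitation) and where uncertainty is large (exploration), which is exactly the balance described above.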
The package we used for Bayesian Optimization is a Python package called “bayes_opt”. To see how “bayes_opt” balances exploration and exploitation, watch the video below.
When might Bayesian Optimization fail to return the best values?
Bayesian Optimization, although better than grid search and random search, is no magic: it still needs to be configured correctly. Based on our experience, it is crucial to have a good ratio between the number of iterations (i.e. points sampled) and the size of the search space. To take an extreme case: imagine you are tuning 2 hyperparameters, each ranging from 1 to 1000, and you set the number of iterations to 2. The algorithm will almost certainly return a poor value because it hasn't had time to learn the objective function.
Path to Top Score
ROC/AUC Results Curve
Plotting the Receiver Operating Characteristic (ROC) curve helped visualize the performance of the binary classifier in predicting the probability of default vs. no default. This graph was plotted for the final blended model that produced the best result on the Kaggle private leaderboard. In a ROC curve, the true positive rate (sensitivity) is plotted as a function of the false positive rate (1 - specificity) for different cut-off points.
The area under the ROC curve (AUC) measures how well the classifier can distinguish between the two diagnostic groups (default vs. no default). Based on this plot, the AUC for our best model was found to be ~0.89. This means that a randomly selected observation from the training data labeled '1' (likely to default) receives a higher score than a randomly chosen observation labeled '0' (not likely to default) 89% of the time.
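The curve and score come straight from `sklearn.metrics`; in this sketch, synthetic scores stand in for the blended model's predictions:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 1000)
# Informative but noisy scores, mimicking a reasonable classifier.
y_score = 0.6 * y_true + 0.8 * rng.random(1000)

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # x-axis: 1 - specificity
auc = roc_auc_score(y_true, y_score)               # area under that curve
```

Plotting `tpr` against `fpr` reproduces a graph like the one above; the diagonal `fpr == tpr` line corresponds to a classifier no better than chance (AUC 0.5).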
Results and Findings
The following are our results and findings based on the Feature engineering and Predictive modeling performed on the Consumer Debt Default data:
- The models seemed to perform better without imputation of the missing values than with it
- Stacking, Voting, and the combination of the two generally tend to have much higher predictive power than plain ensemble models
- Feature Engineering improved the AUC score for single models (Naive Bayes and Logistic Regression) from ~0.7 to ~0.85 but did not have much impact on the Tree based methods
- The incremental increase in predictive accuracy (AUC) is on the order of 0.0001 as we move toward the top of the Kaggle leaderboard (top 2%), and the tuning gets a lot harder
Lessons Learned and Insights
This project helped the team learn some invaluable lessons pertaining to Machine Learning, Predictive Modeling and what it takes to achieve #1 rank in a highly competitive Data Science challenge:
- Hyperparameter tuning is a very time consuming process and it is better to have the team split this effort and work in parallel
- Cross Validation is very critical and it is worth spending time testing the impact of various folds on the model accuracy
- The model needs to be tuned at a much more granular level as the dataset gets smaller in size (both in terms of number of features and observations)
- Following an Agile parallel process has continued to be a proven factor for maximizing success
In terms of what we would have liked to do more of, a few things come to mind; these tasks can be addressed as future enhancements.
- Tune the parameters for Deep Learning using Theano / Keras and compare the predictive accuracy and performance against Stacking / Voting models
- Explore the possibility of adding new polynomial and transformed features, and evaluate the predictive accuracy.
We met our objectives in this machine learning challenge, applying various models, algorithms, and strategies to achieve strong predictions. Our final private leaderboard score of 0.869574 earned the #1 position out of 925 participants in this Kaggle challenge. This is a huge accomplishment given that we had only 2 weeks to prepare and deliver results.