A Walk-Through The Kaggle Allstate Insurance Claim Challenge
Can you predict my financial pain? In a way, Allstate was asking this question via a Kaggle challenge they sponsored at the end of 2016. Specifically, the challenge was to predict the cost of insurance claims. It is a quest for a model of ever-increasing accuracy.
So let's dive in and begin a walk-through of the process of seeking an answer to this predictive question.
The Data
Allstate provided a heavily anonymized training and test set. Let's take a look.
[code language="r"]
dim(train)
[1] 188318    132

sum(is.na(train))
[1] 0

as.data.frame(table(sapply(train, class)))
     Var1 Freq
1  factor  116
2 integer    1
3 numeric   15
[/code]
We have 188,318 complete observations comprising 116 categorical variables, an integer id variable, and 15 continuous variables (14 are features and the last is our response variable, loss).
Let's take a closer look at our 14 continuous features.
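The train.continuous data frame used below isn't constructed in the snippets above; here is a minimal sketch of how it could be built (the e1071 library is an assumption about where the skewness() function used later comes from):

[code language="r"]
library(e1071)  # assumed source of the skewness() function used below

# Pull the numeric columns (id, cont1-cont14, and loss) into their own frame.
train.continuous <- train[, sapply(train, is.numeric)]
names(train.continuous)
[/code]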
[code language="r"]
sapply(train.continuous, function(cl) list(means = mean(cl, na.rm = TRUE),
                                           sds   = sd(cl, na.rm = TRUE)))
      id       cont1     cont2     cont3     cont4     cont5     cont6     cont7     cont8
means 294136   0.4938614 0.5071884 0.4989185 0.4918123 0.4874277 0.4909445 0.4849702 0.4864373
sds   169336.1 0.1876402 0.2072017 0.2021046 0.2112922 0.2090268 0.2052726 0.1784502 0.1993705
      cont9     cont10    cont11    cont12    cont13    cont14    loss
means 0.4855063 0.4980659 0.493511  0.4931504 0.4931376 0.495717  3037.338
sds   0.1816602 0.1858767 0.2097365 0.2094266 0.2127772 0.2224875 2904.086
[/code]
With all the primary features having an approximate mean of 0.5 and standard deviation of 0.2, it appears the data set has already been pre-processed. The average loss comes in at just a bit over $3,000. How do the histograms of these features look?
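The histogram figure isn't reproduced here, but a faceted plot along these lines would generate it (ggplot2 and reshape2 are assumptions, not necessarily what was used for the original figure):

[code language="r"]
library(ggplot2)
library(reshape2)

# Melt the 14 cont features into long form and draw one histogram per feature.
cont.long <- melt(train.continuous[, paste0("cont", 1:14)], variable.name = "feature")
ggplot(cont.long, aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ feature, scales = "free")
[/code]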
And let's also check the skewness.
[code language="r"]
apply(train.continuous[, -c(1)], 2, skewness)
      cont1       cont2       cont3       cont4       cont5       cont6       cont7       cont8
 0.51641579 -0.31093630 -0.01000212  0.41608940  0.68161158  0.46120692  0.82603967  0.67662329
      cont9      cont10      cont11      cont12      cont13      cont14        loss
 1.07241164  0.35499529  0.28081695  0.29198743  0.38073614  0.24867013  3.79489792
[/code]
Loss's skewness value of ~3.79 falls outside the commonly cited acceptable range of -2 to +2 (George & Mallery, 2010). What does its density plot look like?
OK, so the data is right-skewed. We'll apply a log transform and check it again.
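The transform itself is one line; something like the following sketch (using log1p rather than log is an assumption, chosen to guard against zero losses and matching the maeOnLog1pData naming that appears later):

[code language="r"]
# Log-transform loss, then re-check its skewness and density.
train$loss <- log1p(train$loss)
skewness(train$loss)
ggplot(train, aes(x = loss)) + geom_density()
[/code]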
Looks good. How about correlations?
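The correlation plot isn't included here; a sketch of one way to produce it (corrplot is an assumed package choice):

[code language="r"]
library(corrplot)

# Correlation matrix of the 14 cont features, drawn as a color heatmap.
cont.features <- train.continuous[, paste0("cont", 1:14)]
corrplot(cor(cont.features), method = "color")
[/code]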
There are some strong correlations between the features in the bottom-right quadrant of the plot. That is not great for trying linear regression models without some feature work.
Now let's move on and examine our categorical variables.
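The level summary isn't shown above; one quick way to get it, assuming the columns follow the data set's standard cat1 through cat116 naming:

[code language="r"]
# Count the levels of each categorical variable and tabulate the counts.
level.counts <- sapply(train[, grep("^cat", names(train))], nlevels)
table(level.counts)
sort(level.counts, decreasing = TRUE)[1:3]
[/code]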
Looks like the vast majority of these variables have only 2 levels. A couple have more than 100.
Feature Engineering
We have a bit of feature engineering to do before we can run our selected models. We started by doing a log transform on the training set's loss feature. We also have 116 categorical variables that need to be encoded. Combining the test and training sets so both get consistent encodings, I tested two methods for transforming the categorical features.
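The combined set referenced in the code below was built roughly like this (a sketch; it assumes test shares every column with train except loss):

[code language="r"]
# Stack train and test (minus id and loss) so both get identical encodings.
features <- setdiff(names(train), c("id", "loss"))
train_and_test <- rbind(train[, features], test[, features])
[/code]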
The first was to cycle through the columns and explicitly convert each categorical variable into an integer vector encoding its levels. This took only about 3.5 seconds on my laptop.
[code language="r"]
# Integer-encode each categorical column: its levels become the integers 1..k.
# (The check covers both factor and character columns, since the class depends
# on how the combined set was read in.)
for (f in features) {
  if (is.factor(train_and_test[[f]]) || is.character(train_and_test[[f]])) {
    levels <- unique(train_and_test[[f]])
    train_and_test[[f]] <- as.integer(factor(train_and_test[[f]], levels = levels))
  }
}
[/code]
The second method was to use the caret library's dummyVars function. I terminated this run after 3 hours due to time constraints.
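For reference, that second approach looked roughly like this; dummyVars builds a full one-hot design matrix across all 116 factors, which is what makes it so expensive:

[code language="r"]
library(caret)

# One-hot encode every categorical column (run on the original factor columns,
# before any integer encoding). This is the run that was terminated after 3 hours.
dummies <- dummyVars(~ ., data = train_and_test)
train_and_test.onehot <- predict(dummies, newdata = train_and_test)
[/code]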
The Models
The really nice thing about using the caret library is its uniform structure for handling nearly all phases of machine learning (data preprocessing, training, and scoring). Here is a sample training/prediction setup for my eXtreme Gradient Boosting model.
[code language="r"]
# maeOnLogMetric and maeOnLog1pData are custom summary/scoring functions.
train.control <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 3,
                              summaryFunction = maeOnLogMetric,
                              search = "grid")

xgbTree.tune.grid <- expand.grid(eta = c(0.05, 0.075, 0.1),
                                 nrounds = c(50, 75, 100),
                                 max_depth = 6:8,
                                 min_child_weight = c(2.0, 2.25, 2.5),
                                 colsample_bytree = c(0.3, 0.4, 0.5),
                                 gamma = 0,
                                 subsample = 1)

xgbTree_model <- train(loss ~ .,
                       data = sample.train,
                       method = "xgbTree",
                       metric = "MAE",
                       trControl = train.control,
                       tuneGrid = xgbTree.tune.grid)

xgbTree_predictions <- predict(xgbTree_model, sample.test)
maeOnLog1pData(xgbTree_predictions, sample.test$loss)
[/code]
All my models used the same training control: repeated 10-fold cross-validation (3 repeats) with a grid search over a range of parameter values. Leaving off the self-defined tuneGrid parameter lets caret choose a range of values for a given model. This is a great option to have when you are not entirely sure which values are most appropriate for the model and data. At the very least, it provides a starting place for further tuning a model toward optimal values, since tuning appears to be a significant component of predictive power.
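For example, the same xgbTree model could be trained with caret's own grid (a sketch; tuneLength just controls how many values of each tuning parameter caret tries):

[code language="r"]
# Let caret build its own tuning grid instead of supplying tuneGrid.
xgbTree_default <- train(loss ~ .,
                         data = sample.train,
                         method = "xgbTree",
                         metric = "MAE",
                         trControl = train.control,
                         tuneLength = 3)
[/code]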
I ultimately trained and tested four regression models: eXtreme Gradient Boosting, Glmnet, Linear Regression, and a Neural Network. The results:
MODEL | MAE | RMSE |
eXtreme Gradient Boosting (xgbTree) | 1306 | NA |
Lasso and Elastic-Net Regularized Generalized Linear Models (glmnet) | 1467 | 2536 |
Linear Regression (lm) | 1329 | 2346 |
Neural Network (nnet) | 1300 | NA |
Although the results from a single model are not terrible, they are clearly insufficient compared to the results many others have achieved by stacking models (MAE under 1100).
To Be Continued...
So clearly, the processing time of these models on large data sets can be very significant depending on your computational resources. Working on a mid-range laptop was excruciating, to say the least, so I upgraded to a 36-CPU Amazon EC2 instance. The real fun begins when you have the processing power to really experiment with different ensembles of models in various cloud processing environments. I also started testing on Microsoft's Azure ML platform using the straight R script execution method. It has potential, but its inputs can be fairly limited without some creative bundling. The same goes for supported libraries: although the base set of supported libraries is fairly large, it will invariably be missing some important ones you need (e.g., xgboost and data.table), and again you have to be very creative to bundle them in (basically, build and zip up the libraries yourself).
With this in mind, my next steps are to use this data set as a baseline to explore caret's modeling capacity and the equivalent in Python's scikit-learn. Creating an efficient pipeline for the various models is a must, and I intend to leverage several cloud platforms for testing and comparison.
Besides investigating the prediction accuracy that various pipelines offer, as a software engineer I'm especially curious how these ever-growing data sets can be efficiently modeled using parallel programming (a nice feature of xgboost, and of caret with doSNOW in general), algorithmic enhancements (Big-O complexity), and other intrinsic/environmental factors.
Let the fun begin.