A Walk-Through of the Kaggle Allstate Insurance Claim Challenge

Posted on Jun 15, 2017

Can you predict my financial pain? In a way, Allstate was asking this question via a Kaggle challenge they sponsored at the end of 2016. Specifically, the challenge was to predict the cost of claims. It is a quest for a model of ever-increasing accuracy.

So let's dive in and walk through the process of answering this predictive question.

The Data

Allstate provided a heavily anonymized training and test set. Let's take a look.

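[code language="r"]
# Dimensions of the training set (rows, columns)
[1] 188318    132

# Missing values in the training set (inferred from the text that follows)
[1] 0

as.data.frame(table(sapply(train, class)))

     Var1 Freq
1  factor  116
2 integer    1
3 numeric   15
[/code]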

We have 188318 complete observations comprising 116 categorical variables, an integer id variable, and 15 continuous variables (14 are features and the last is our ultimate response variable, loss).

Let's take a closer look at our 14 continuous features.

[code language="r"]
sapply(train.continuous, function(cl) list(means=mean(cl,na.rm=TRUE), sds=sd(cl,na.rm=TRUE)))

      id       cont1     cont2     cont3     cont4     cont5     cont6     cont7     cont8 
means 294136   0.4938614 0.5071884 0.4989185 0.4918123 0.4874277 0.4909445 0.4849702 0.4864373
sds   169336.1 0.1876402 0.2072017 0.2021046 0.2112922 0.2090268 0.2052726 0.1784502 0.1993705
      cont9     cont10    cont11    cont12    cont13    cont14    loss 
means 0.4855063 0.4980659 0.493511  0.4931504 0.4931376 0.495717  3037.338
sds   0.1816602 0.1858767 0.2097365 0.2094266 0.2127772 0.2224875 2904.086
[/code]

With all the primary features having a mean of approximately 0.5 and a standard deviation of approximately 0.2, it appears the data set has already been pre-processed. The average loss comes in at just a bit over $3,000. How do the histograms of these features look?

And let's also check the skewness.

[code language="r"]
apply(train.continuous[,-c(1)], 2, skewness)

      cont1       cont2       cont3      cont4      cont5      cont6      cont7      cont8 
 0.51641579 -0.31093630 -0.01000212 0.41608940 0.68161158 0.46120692 0.82603967 0.67662329 
      cont9     cont10     cont11     cont12     cont13     cont14       loss 
 1.07241164 0.35499529 0.28081695 0.29198743 0.38073614 0.24867013 3.79489792 
[/code]

The skewness of loss, ~3.79, falls well outside the commonly accepted range of -2 to +2 (George & Mallery, 2010). What does its density plot look like?

Ok, so the data is skewed right. We'll do a log transform on it, and check it out again.
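As a rough illustration of why the log transform helps, the sketch below computes a moment-based skewness before and after taking logs. The data are simulated from a log-normal distribution as a stand-in for loss (an assumption, not the Allstate data), and `skew` is a hypothetical base R helper rather than the `skewness` function used above.

```r
# Simulated right-skewed data standing in for the loss column (assumption:
# log-normal; the real loss values are not reproduced here).
set.seed(42)
fake_loss <- rlnorm(10000, meanlog = 7.7, sdlog = 0.8)

# Hypothetical helper: moment-based sample skewness, m3 / m2^(3/2).
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw <- skew(fake_loss)       # strongly positive: long right tail
skew_log <- skew(log(fake_loss))  # close to zero: roughly symmetric

print(c(raw = skew_raw, log = skew_log))
```

On data like this, the raw values show a large positive skewness while the logged values sit near zero, which is the behavior we are hoping for with loss.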

Looks good. How about correlations?

There are some strong correlations between features in the bottom right quadrant. Not great for trying any linear regression models without some feature work.
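One way to list strongly correlated feature pairs without reading them off a plot is sketched below. It uses simulated columns (an assumption, with two of them deliberately tied together) and an arbitrary cutoff of 0.8; neither comes from the original analysis.

```r
# Simulated stand-in for the continuous features (assumption: three columns,
# two of which are deliberately made to move together).
set.seed(1)
n <- 1000
x1 <- runif(n)
x2 <- 0.9 * x1 + 0.1 * runif(n)   # nearly a linear function of x1
x3 <- runif(n)                    # independent of the others
toy <- data.frame(x1, x2, x3)

cor_mat <- cor(toy)

# Keep each pair once (upper triangle) and flag |r| above a chosen cutoff.
cutoff <- 0.8                     # arbitrary threshold for illustration
idx <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
print(high_pairs)
```

Pairs flagged this way are candidates for dropping or combining before fitting a linear model.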

Now let's move on and examine our categorical variables.


Looks like the vast majority of these variables only have 2 levels. A couple have more than 100.
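A count like this can be produced in a couple of lines. The sketch below uses a small invented data frame (the column names and values are assumptions, not the Allstate data) to show the idea.

```r
# Toy data frame standing in for the categorical columns (assumption: the
# column names and values are invented for illustration).
toy_cat <- data.frame(
  cat1 = c("A", "B", "A", "B", "A", "A"),
  cat2 = c("A", "A", "B", "B", "A", "B"),
  cat3 = c("A", "B", "C", "D", "E", "F"),
  stringsAsFactors = TRUE
)

# Number of distinct levels in each column.
level_counts <- sapply(toy_cat, nlevels)
print(level_counts)

# How many columns share each level count.
print(table(level_counts))
```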


Feature Engineering

We have a little bit of feature engineering to do before we can run our selected models. We started by log-transforming the training set's loss variable. Now we have 116 categorical variables that need to be dummified. Combining the test and training sets for completeness, I tested two methods for transforming the categorical features.

The first was to cycle through the columns and explicitly turn each categorical variable into an integer vector in which every value is replaced by the index of its level. This only took about 3.5 seconds on my laptop.

[code language="r"]
# Replace each character column with the integer index of its level
for (f in features) {
  if (is.character(train_and_test[[f]])) {
    levels <- unique(train_and_test[[f]])
    train_and_test[[f]] <- as.integer(factor(train_and_test[[f]], levels = levels))
  }
}
[/code]

The second method was to use the caret library's dummyVars function. I actually terminated this procedure after 3 hours due to time constraints.
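To make the difference between the two methods concrete, here is a small sketch on invented data. It uses base R's `model.matrix` as a stand-in for caret's dummyVars (an assumption made so the example runs without extra packages), so it illustrates the shape of the output rather than the exact call I ran.

```r
# Toy categorical data (assumption: invented values, not the Allstate data).
toy <- data.frame(
  cat1 = factor(c("A", "B", "A", "B")),
  cat2 = factor(c("X", "Y", "Z", "X"))
)

# Method 1: integer (label) encoding keeps one column per variable.
label_encoded <- as.data.frame(lapply(toy, as.integer))

# Method 2: one-hot encoding creates one column per level. model.matrix is
# used here as a base R stand-in for caret's dummyVars; contrasts = FALSE
# keeps every level instead of dropping a reference level.
one_hot <- model.matrix(
  ~ . - 1, data = toy,
  contrasts.arg = lapply(toy, contrasts, contrasts = FALSE)
)

print(dim(label_encoded))  # 4 rows, 2 columns
print(dim(one_hot))        # 4 rows, 5 columns (2 + 3 levels)
```

With 116 categorical variables, some of them with more than 100 levels, the one-hot matrix grows far wider than the integer-encoded one, which may be part of why the second method ran so much longer.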

The Models

The really nice thing about using the caret library is its uniform structure for handling nearly all phases of machine learning (data preprocessing, training, and scoring). Here is a sample of the training/execution setup for my eXtreme Gradient Boosting model.

[code language="r"]
train.control <- trainControl(method = "repeatedcv",
 number = 10,
 repeats = 3,
 summaryFunction = maeOnLogMetric,
 search = "grid")

xgbTree.tune.grid <- expand.grid(eta = c(0.05, 0.075, 0.1),
 nrounds = c(50, 75, 100),
 max_depth = 6:8,
 min_child_weight = c(2.0, 2.25, 2.5),
 colsample_bytree = c(0.3, 0.4, 0.5),
 gamma = 0,
 subsample = 1)

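xgbTree_model <- train(loss ~ .,
 data = sample.train,
 method = "xgbTree",
 metric = "MAE",
 maximize = FALSE, # lower MAE is better; set explicitly because some caret versions maximize custom metrics by default
 trControl = train.control,
 tuneGrid = xgbTree.tune.grid)

# Predict on the held-out sample
xgbTree_predictions <- predict(xgbTree_model, sample.test)
[/code]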

All my models used the same training control: repeated 10-fold cross-validation, run 3 times, with a grid search to examine a range of parameter values. Leaving off the self-defined tuneGrid argument lets caret choose a range of values for a specific model. This is a great option to have when you are not entirely sure what values are most appropriate for the model and data. At the very least, it provides a starting place for tuning a specific model further, since tuning appears to be a significant component of predictive power.
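The definition of `maeOnLogMetric`, the summary function passed to trainControl, is not shown in this post. The sketch below is one guess at what such a function might look like, assuming the model is trained on log(loss) and that the goal is to report mean absolute error back on the dollar scale. The function signature follows the convention caret uses for summary functions; the body is an assumption.

```r
# Hypothetical sketch of a caret-style summary function. caret calls it with
# a data frame holding the observed values in `obs` and the predictions in
# `pred`; here both are assumed to be on the log scale.
maeOnLogMetric <- function(data, lev = NULL, model = NULL) {
  # Undo the log transform, then average the absolute errors in dollars.
  mae <- mean(abs(exp(data$obs) - exp(data$pred)))
  c(MAE = mae)  # the name must match the `metric` argument passed to train()
}

# Small check on made-up values: true losses vs. predictions, both logged.
toy <- data.frame(obs = log(c(1000, 2000, 4000)),
                  pred = log(c(1100, 1900, 4500)))
print(maeOnLogMetric(toy))
```

Scoring on the original scale keeps the cross-validation numbers comparable to the leaderboard, which is measured in dollars rather than log-dollars.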

I ultimately trained and tested four regression models: eXtreme Gradient Boosting, Glmnet, Linear Regression, and a Neural Network. The results:

eXtreme Gradient Boosting (xgbTree)                                   1306  NA
Lasso and Elastic-Net Regularized Generalized Linear Models (glmnet)  1467  2536
Linear Regression (lm)                                                1329  2346
Neural Network (nnet)                                                 1300  NA

Although the test results from single models are not terrible, they clearly fall short of the results many others have achieved by stacking models (less than 1100).

To Be Continued...

Clearly, the processing time of these models on large data sets can be very significant, depending on your computational resources. Working on a mid-range laptop was excruciating, to say the least, so I upgraded to a 36-CPU Amazon EC2 instance. The real fun begins, however, when you have the processing power to experiment with different ensembles of models in various cloud processing environments. I also started testing on Microsoft's Azure ML platform using the straight R Script Execution method. It has potential, but its inputs can be fairly limited without some creative bundling. The same goes for supported libraries. Although the base set of supported libraries is fairly large, it will invariably be missing some very important ones you need (e.g., xgboost and data.table), and again you have to be creative to bundle them in (basically, build and zip up the libraries yourself).

With this in mind, my next steps are to use this dataset as a baseline to explore caret's modeling capacity and the equivalent in Python's scikit-learn. Creating an efficient pipeline for the various models is a must. I intend to leverage several cloud platforms for testing and comparison.

Besides investigating the prediction accuracy various pipelines offer, as a software engineer, I'm especially curious how these ever-growing data sets can be efficiently modeled using parallel programming (a nice feature of XGBoost, and caret/doSNOW in general), algorithmic enhancements (Big O complexity), and other intrinsic/environmental features.

Let the fun begin.



