A Walk-Through of the Kaggle Allstate Insurance Claim Challenge

Posted on Jun 15, 2017

Can you predict my financial pain? In a way, Allstate was asking this question via a Kaggle Challenge they sponsored at the end of 2016. Specifically, the challenge was to predict the cost of claims. It is a quest for a model of ever-increasing accuracy.

So let's dive in and walk through the process of seeking an answer to this predictive mission.

The Data

Allstate provided a heavily anonymized training and test set. Let's take a look.

[code language="r"]
dim(train)

[1] 188318    132

sum(is.na(train))

[1] 0

as.data.frame(table(sapply(train, class)))

     Var1 Freq
1  factor  116
2 integer    1
3 numeric   15
[/code]

We have 188,318 complete observations comprising 116 categorical variables, an integer id variable, and 15 continuous variables (14 are features and the last is our ultimate response variable, loss).

Let's take a closer look at our 14 continuous features.

[code language="r"]
sapply(train.continuous, function(cl) list(means=mean(cl,na.rm=TRUE), sds=sd(cl,na.rm=TRUE)))

      id       cont1     cont2     cont3     cont4     cont5     cont6     cont7     cont8 
means 294136   0.4938614 0.5071884 0.4989185 0.4918123 0.4874277 0.4909445 0.4849702 0.4864373
sds   169336.1 0.1876402 0.2072017 0.2021046 0.2112922 0.2090268 0.2052726 0.1784502 0.1993705
      cont9     cont10    cont11    cont12    cont13    cont14    loss 
means 0.4855063 0.4980659 0.493511  0.4931504 0.4931376 0.495717  3037.338
sds   0.1816602 0.1858767 0.2097365 0.2094266 0.2127772 0.2224875 2904.086
[/code]

With all the primary features having a mean of approximately 0.5 and a standard deviation of approximately 0.2, it appears the data set has already been pre-processed. The average loss comes in at just a bit over $3,000. How do the histograms of these features look?


And let's also check the skewness.

[code language="r"]
# skewness() is not part of base R; packages such as e1071 and moments provide one
apply(train.continuous[, -1], 2, skewness)

      cont1       cont2       cont3      cont4      cont5      cont6      cont7      cont8 
 0.51641579 -0.31093630 -0.01000212 0.41608940 0.68161158 0.46120692 0.82603967 0.67662329 
      cont9     cont10     cont11     cont12     cont13     cont14       loss 
 1.07241164 0.35499529 0.28081695 0.29198743 0.38073614 0.24867013 3.79489792 
[/code]

The skewness of loss, about 3.79, falls well outside the -2 to +2 range commonly treated as acceptable (George & Mallery, 2010). What does its density plot look like?

Ok, so the data is skewed right. We'll do a log transform on it, and check it out again.
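
As a rough illustration of that transform, here is a minimal sketch on synthetic data (not the competition data). It assumes log1p() is the log transform, which the post does not state explicitly, and it uses a hand-rolled skew() helper (a hypothetical name) so that no extra package is needed.

[code language="r"]
# Synthetic right-skewed "loss" values; the real data is not used here
set.seed(42)
loss <- rlnorm(10000, meanlog = 7.7, sdlog = 0.8)

# Moment-based sample skewness (hypothetical helper, base R only)
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Assumption: log1p(), i.e. log(1 + x), is the transform applied
log_loss <- log1p(loss)

skew_raw <- skew(loss)      # strongly positive for right-skewed data
skew_log <- skew(log_loss)  # close to zero after the transform

# expm1() inverts log1p(), which is needed to report predictions in dollars
recovered <- expm1(log_loss)
[/code]

Whatever transform is used on the response, predictions have to be mapped back with its inverse before an error in dollars is computed.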

Looks good. How about correlations?

There are some strong correlations between features in the bottom right quadrant. That is not great for trying any linear regression models without some feature work.
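
One way to list strongly correlated pairs without reading them off a plot is sketched below. The data frame is synthetic and the 0.8 cutoff is an arbitrary choice of mine, not a value from the analysis above.

[code language="r"]
# Synthetic stand-in for the continuous features
set.seed(1)
n <- 500
a <- runif(n)
toy <- data.frame(
  cont_a = a,
  cont_b = a + rnorm(n, sd = 0.05),  # nearly a copy of cont_a
  cont_c = runif(n)                  # independent of the others
)

cor_mat <- cor(toy)

# Keep each pair once (upper triangle) and flag |r| above an arbitrary cutoff
idx <- which(upper.tri(cor_mat) & abs(cor_mat) > 0.8, arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
print(strong_pairs)
[/code]

Pairs flagged this way are candidates for dropping or combining before fitting a linear model.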

Now let's move on and examine our categorical variables.

 

Looks like the vast majority of these variables have only 2 levels. A couple have more than 100.
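
The level counts can be tallied directly. This is a small sketch on a made-up data frame; the column names only mimic the anonymized naming and the values are invented.

[code language="r"]
# Toy data frame: three categorical columns and one continuous column
toy <- data.frame(
  cat1  = factor(c("A", "B", "A", "B", "A", "B")),
  cat2  = factor(c("A", "A", "B", "B", "A", "A")),
  cat3  = factor(c("A", "B", "C", "D", "E", "F")),
  cont1 = c(0.1, 0.4, 0.5, 0.2, 0.9, 0.7)
)

# Pick out the factor columns and count the levels of each
is_cat <- sapply(toy, is.factor)
level_counts <- sapply(toy[is_cat], nlevels)

# Distribution of level counts: how many variables have 2 levels, 3 levels, ...
table(level_counts)
[/code]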

 

Feature Engineering

We have a little bit of feature engineering to do before we can run our selected models. We started by log-transforming the training set's loss response. Now we have 116 categorical variables that need to be encoded numerically. Combining the test and training sets for completeness, I tested two methods for transforming the categorical features.

The first was to cycle through the columns and explicitly turn each categorical variable into an integer vector, where each integer is the index of that observation's level (an integer encoding rather than true dummy variables). This only took about 3.5 seconds on my laptop.

[code language="r"]
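# 'features' holds the names of the columns to encode
for (f in features) {
  if (is.character(train_and_test[[f]])) {
    # Map each distinct value to an integer code, in order of first appearance
    lvls <- unique(train_and_test[[f]])
    train_and_test[[f]] <- as.integer(factor(train_and_test[[f]], levels = lvls))
  }
}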
[/code]

The second method was to use the caret package's dummyVars function. I actually terminated this procedure after 3 hours due to time constraints.
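
To make the difference between the two encodings concrete, here is a small sketch on a toy column. Base R's model.matrix() is swapped in for caret's dummyVars so the example runs without extra packages; it is not the call used above.

[code language="r"]
toy <- data.frame(cat1 = c("B", "A", "C", "A"), stringsAsFactors = FALSE)

# Method 1: integer encoding, one column per variable
lvls <- unique(toy$cat1)
label_encoded <- as.integer(factor(toy$cat1, levels = lvls))
# "B" is seen first, so B = 1, A = 2, C = 3

# Method 2: one-hot (dummy) encoding, one column per level.
# model.matrix() stands in for caret's dummyVars here;
# "- 1" drops the intercept so every level keeps its own column.
one_hot <- model.matrix(~ cat1 - 1,
                        data = data.frame(cat1 = factor(toy$cat1)))
[/code]

Integer encoding keeps one column per variable, while one-hot encoding produces one column per level. With 116 categorical variables, a couple of them with more than 100 levels, the one-hot matrix becomes much wider, which may help explain the gap in run time.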

The Models

The really nice thing about using the caret package is its uniform structure for handling nearly all phases of machine learning (data preprocessing, training, and scoring). Here is a sample of the training and scoring setup for my eXtreme Gradient Boosting model.

[code language="r"]
library(caret)

# maeOnLogMetric and maeOnLog1pData are custom helper functions (definitions not shown)
train.control <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 3,
                              summaryFunction = maeOnLogMetric,
                              search = "grid")

xgbTree.tune.grid <- expand.grid(eta = c(0.05, 0.075, 0.1),
                                 nrounds = c(50, 75, 100),
                                 max_depth = 6:8,
                                 min_child_weight = c(2.0, 2.25, 2.5),
                                 colsample_bytree = c(0.3, 0.4, 0.5),
                                 gamma = 0,
                                 subsample = 1)

xgbTree_model <- train(loss ~ .,
                       data = sample.train,
                       method = "xgbTree",
                       metric = "MAE",
                       maximize = FALSE,  # lower MAE is better; stated explicitly to be safe
                       trControl = train.control,
                       tuneGrid = xgbTree.tune.grid)

# Score the tuned model on the held-out sample
xgbTree_predictions <- predict(xgbTree_model, sample.test)
maeOnLog1pData(xgbTree_predictions, sample.test$loss)
[/code]
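
For a sense of what this setup costs, the size of the search can be worked out from the grid and the resampling scheme. The sketch below rebuilds the same grid with base R only; the variable names are mine.

[code language="r"]
# Same parameter grid as above, rebuilt with base R's expand.grid()
grid <- expand.grid(eta = c(0.05, 0.075, 0.1),
                    nrounds = c(50, 75, 100),
                    max_depth = 6:8,
                    min_child_weight = c(2.0, 2.25, 2.5),
                    colsample_bytree = c(0.3, 0.4, 0.5),
                    gamma = 0,
                    subsample = 1)

n_candidates  <- nrow(grid)  # five parameters with three values each
n_resamples   <- 10 * 3      # 10 folds x 3 repeats
n_evaluations <- n_candidates * n_resamples
[/code]

Every candidate is scored on every resample, so the work grows multiplicatively with each extra parameter value. caret may avoid a full refit for some parameters (nrounds, as far as I know), so the number of actual model fits can be lower than the number of evaluations.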

All my models used the same training control: repeated 10-fold cross-validation (3 repeats), with a grid search to examine a range of parameter values. Leaving out the tuneGrid argument lets caret choose a range of values for a specific model. This is a great option to have when you are not entirely sure what values are most appropriate for the model and data. At the very least, it provides a starting point for further tuning, which appears to be a significant component of predictive power.
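
The maeOnLogMetric helper passed as summaryFunction is not shown in this post. As a hedged sketch only, a function of that kind could look like the following. The name mae_on_log_metric is hypothetical, and the use of expm1() assumes the response was transformed with log1p().

[code language="r"]
# Hypothetical stand-in for the maeOnLogMetric helper.
# caret calls a summaryFunction with a data frame holding 'obs' and 'pred'
# columns and expects a named numeric vector back.
mae_on_log_metric <- function(data, lev = NULL, model = NULL) {
  # Assumption: the model was trained on log1p(loss), so undo it with expm1()
  obs  <- expm1(data$obs)
  pred <- expm1(data$pred)
  c(MAE = mean(abs(obs - pred)))
}

# Quick check on made-up values
held_out <- data.frame(obs  = log1p(c(1000, 2000, 4000)),
                       pred = log1p(c(1100, 1800, 4000)))
result <- mae_on_log_metric(held_out)
[/code]

The name of the returned element ("MAE") is what the metric argument of train refers to, so the two have to match.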

I ultimately trained and tested four regression models: eXtreme Gradient Boosting, Glmnet, Linear Regression, and a Neural Network. The results:

MODEL                                                                 MAE   RMSE
eXtreme Gradient Boosting (xgbTree)                                   1306  NA
Lasso and Elastic-Net Regularized Generalized Linear Models (glmnet)  1467  2536
Linear Regression (lm)                                                1329  2346
Neural Network (nnet)                                                 1300  NA

Although these single-model test results are not terrible, they clearly fall short of the results many others have achieved by stacking models (an MAE of less than 1100).

To Be Continued...

Processing time for these models on large data sets can be very significant, depending on your computational resources. Working on a mid-range laptop was excruciating, to say the least, so I upgraded to a 36-CPU Amazon EC2 instance. The real fun begins when you have the processing power to experiment with different ensembles of models in various cloud processing environments. I also started testing on Microsoft's Azure ML platform using the straight R script execution method. It has potential, but its inputs can be fairly limited without some creative bundling. The same goes for supported libraries: although the base set of supported libraries is fairly large, it will invariably be missing some very important ones you need (e.g., xgboost and data.table), and again you have to be very creative to bundle them in (basically build and zip up the libraries yourself).

With this in mind, my next steps are to use this dataset as a baseline to explore caret's modeling capacity and the equivalent in Python's scikit-learn. Creating an efficient pipeline for the various models is a must. I intend to leverage several cloud platforms for testing and comparison.

Besides investigating the prediction accuracy various pipelines offer, as a software engineer, I'm especially curious how these ever-growing data sets can be efficiently modeled using parallel programming (a nice feature of XGBoost, and caret/doSNOW in general), algorithmic enhancements (Big O complexity), and other intrinsic/environmental features.

Let the fun begin.