Kaggle competition: Liberty Mutual Group Property Inspection Prediction

Posted on Aug 19, 2015

Machine Learning with Liberty Mutual Group Property Inspection Prediction Kaggle Data

– Authors: Claire Tu, Sumanth Reddy, Teresa Venezia, Xavier Capdepon, Zeyu Zhang.

– Claire, Sumanth, Teresa, Xavier and Zeyu were students of the Data Science Bootcamp#2 (B002) – Data Science, Data Mining and Machine Learning – from June 1st to August 24th 2015. Teachers: Andrew, Bryan, Jason, Sam and Vivian.

– The post is based on their Kaggle Competition project final submission.


Project:

Liberty Mutual Group Kaggle was a team project embarked on by Claire Tu, Sumanth Reddy, Teresa Venezia, Xavier Capdepon, and Zeyu Zhang.  The challenge was to build a model to predict the “hazard score” using a dataset of property information to “enable Liberty Mutual to more accurately identify high risk homes that require additional examination to confirm their insurability.”  Here is the link to the Kaggle competition website: https://www.kaggle.com/c/liberty-mutual-group-property-inspection-prediction

Exploring the Data

We first performed exploratory data analysis (EDA) to gain an understanding of the data, visualize any patterns or trends, and make decisions about features to use as inputs in the model. The training data contained the “Hazard” response variable along with 32 predictor variables. The Hazard variable appeared as a numerical, discrete data type, although Liberty Mutual indicated that you “can think of the hazard score as a continuous number that represents the condition of the property as determined by the inspection.” The predictor variables appeared to be either numerical, discrete data types or categorical data types. This was difficult to discern because Liberty Mutual anonymized these predictor variables, preventing us from intuitively grouping them into categories (such as property or safety conditions, geography, and weather). Not knowing the nature of these variables presented challenges that became readily apparent at this early stage. For example, it was not clear whether the 15 predictor variables that appeared as numeric were true measurements or instead encoded categorical characteristics of the property. Additionally, the 17 categorical variables fell into two groups: 12 using single letters with no clear indication of ranking or importance, and 5 using Yes/No values. Furthermore, none of the predictor variables displayed a strong correlation with the target Hazard, with only one showing a correlation greater than 0.1.


figure 1: Correlation Matrix Visualization of numerical variables
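As a minimal sketch of the correlation check described above, the snippet below computes the Pearson correlation of each numeric column with the target and ranks the columns by absolute strength. The column names and the toy values are hypothetical stand-ins, not the competition data, and this is not necessarily how the original matrix was produced.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def correlations_with_target(columns, target):
    """Return (name, r) pairs sorted by |r|, strongest first."""
    scored = [(name, pearson(values, target)) for name, values in columns.items()]
    return sorted(scored, key=lambda pair: abs(pair[1]), reverse=True)

# Toy stand-in data; the column names are hypothetical.
hazard = [1, 1, 4, 7, 2, 10]
numeric_columns = {
    "feat_a": [3, 2, 5, 9, 4, 12],
    "feat_b": [7, 7, 7, 7, 7, 7],
    "feat_c": [5, 1, 4, 2, 6, 3],
}

ranking = correlations_with_target(numeric_columns, hazard)
for name, r in ranking:
    print(f"{name}: r = {r:+.3f}")
```

Sorting by absolute value keeps strongly negative correlations near the top, which matters when the direction of an anonymized feature is unknown.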

Closer Look at the Hazard Scores

The histogram of Hazard shows the values ranging from 1 through 70, with more than a third of the training data having a value of 1.

figure 2: Histogram of the Hazard score

With this in mind, we considered using a one-vs-all approach (i.e., Hazard scores of 1 vs. all other scores). But our early exploration of this approach ran into various challenges. For example, when we used a K-Nearest Neighbors method in Python and R, our computers crashed, as the high dimensionality of the data made the computation extremely expensive. Other methods, such as logistic regression, ran to completion but presented different challenges with the multi-level categorical variables. We considered transforming these categorical variables; however, without knowing their nature, a transformation could fabricate a structure in the variance that did not initially exist. Finally, we tried logistic regression using only the numerical variables, but the model was not a good fit and we decided not to pursue the one-vs-all idea.

Feature Selection

We also focused on reducing the dimensionality of the data through Feature Selection.  First we examined the variance of each predictor variable, hoping to remove any variables with distinguishably low variance. However, most of the variables displayed similar variances.


figure 3:  Low variance features selection graph for the numerical variables
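A low-variance filter of the kind described above can be sketched in a few lines. The threshold value and the column names here are assumptions chosen for illustration. In practice the features would need to be on comparable scales before their variances are compared.

```python
def variance(values):
    """Population variance of a sequence of numbers."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def low_variance_features(columns, threshold):
    """Names of columns whose variance is at or below `threshold`."""
    return [name for name, values in columns.items()
            if variance(values) <= threshold]

# Toy stand-in data; the names and the 0.2 cut-off are hypothetical.
columns = {
    "feat_a": [1, 5, 9, 3, 7],
    "feat_b": [2, 2, 2, 2, 3],
    "feat_c": [4, 4, 4, 4, 4],
}

to_drop = low_variance_features(columns, threshold=0.2)
print("candidates for removal:", to_drop)
```

When most variables have similar variances, as was the case here, no threshold separates them cleanly and the filter removes little or nothing.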

We enjoyed some success when implementing the Random Forest method, which attempts to weight the importance of each predictor.  With this method, 4 predictor variables displayed limited or negative importance relative to the rest, and we considered removing those features.


figure 4:  Random Forest features selection graph for all variables

Our findings were similar when running univariate feature selection with the χ² test and F test for the numerical variables.


figure 5:  Univariate features selection graph with χ² test for the numerical variables


figure 6:  Univariate features selection graph with F test for the numerical variables

Exploring Certain Models

Support Vector Machine (SVM):

We considered building a classifier using a Support Vector Machine method. An SVM model represents data points from a training data set in a multi-dimensional space, mapped so that the data points of different categories are divided by a clear gap that is as wide as possible. The SVM builds a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification. When used on the test data set, the test data points are mapped into that same space and predicted to belong to a category based on which side of the gap they fall on. Given the numerous categorical variables, the model was relatively slow to train. The SVM classifier also produces categorical outputs, which made it a poor fit for this exercise, as a lot of information is lost when the hazard score is treated as a set of categories. In fact, the description of the competition mentioned that the hazard score should be seen as a continuous variable, and given that our first few results scored only around 0.3188, we decided not to pursue this approach.

Gradient boosting machine (GBM):

In gradient boosting machines, the learning procedure fits new models one after another to provide a more accurate estimate of the response variable. The principal idea behind this algorithm is to construct each new base-learner to be maximally correlated with the negative gradient of the loss function associated with the whole ensemble. It is particularly interesting in this exercise as the nature of the features is unknown. We achieved a score of 0.368800 after running a grid of parameters for several hours. Using this method, we spent considerable time “tuning” or tweaking the parameter values to generate a better fit, with runs taking up to 3 hours depending on the parameter combinations. We also observed that the running time for GBM in R was considerably reduced by parallelizing the “caret” package, which launches the GBM R package, across 7 CPUs using the “doParallel” package.
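To make the "negative gradient" idea concrete, here is a minimal sketch of gradient boosting for squared-error loss on a single feature, using one-split regression stumps as base-learners. For the loss ½(y − F)², the negative gradient is simply the residual y − F, so each round fits a stump to the current residuals. This is an illustration under those assumptions, not the R GBM package the team ran, and the toy data is made up.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump: one threshold, one mean per side."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all x values identical: fall back to a constant
        mean = sum(residuals) / len(residuals)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def fit_gbm(xs, ys, n_rounds=50, learning_rate=0.1):
    """Boost stumps on the residuals (negative gradient of squared loss)."""
    base = sum(ys) / len(ys)          # initial constant model
    stumps = []
    preds = [base] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)
    return predict

def mse(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Toy data, purely illustrative.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 1, 2, 2, 5, 6, 9, 10]

mean_y = sum(ys) / len(ys)
baseline_mse = mse(lambda x: mean_y, xs, ys)
model = fit_gbm(xs, ys)
boosted_mse = mse(model, xs, ys)
print(f"baseline MSE {baseline_mse:.3f} -> boosted MSE {boosted_mse:.3f}")
```

The number of rounds and the learning rate are the kind of parameters a tuning grid would search over, which is where most of the running time goes.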

Model Selection: xgboost and Random Forest Ensemble

Even after removing features that we found to be “less important”, we were still left with up to 30 predictor variables that showed no clear linear relationship with the Hazard target variable.


figure 7: “stripplot” graphs of the 5 top features selected by random forest

Random Forest:

Due to this fact as well as our results with other models, we decided to use the Random Forest Regressor model in scikit-learn. We ran two nested for loops to tune the two most important parameters in the random forest model: n_estimators, which determines the number of trees in the forest, and max_depth, which determines the complexity or maximum depth of each tree. Our Random Forest Regressor model gave us a normalized Gini score of 0.36 on the public leaderboard, which ranked in the top 40 percent. It was a good result; however, we were not satisfied.
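The two-loop tuning described above might look roughly like the sketch below. The synthetic data, the candidate values, and the use of mean squared error on a held-out split are all assumptions made to keep the example small and self-contained. The competition itself was scored with the normalized Gini coefficient.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Liberty Mutual data).
rng = np.random.RandomState(0)
X = rng.rand(300, 5)
y = 10 * X[:, 0] + 5 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=300)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Hypothetical candidate values for the two parameters.
results = {}
for n_estimators in (10, 50):
    for max_depth in (2, 6):
        model = RandomForestRegressor(
            n_estimators=n_estimators,
            max_depth=max_depth,
            random_state=0,
        )
        model.fit(X_train, y_train)
        error = mean_squared_error(y_valid, model.predict(X_valid))
        results[(n_estimators, max_depth)] = error

# Lower validation error is better.
best_params = min(results, key=results.get)
print("best (n_estimators, max_depth):", best_params)
```

Scoring on a held-out split rather than on the training data keeps the loops from simply rewarding the deepest trees.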

Xgboost:

When searching the Kaggle forum, we found that most people on the leaderboard were using xgboost (extreme gradient boosting), a recently developed gradient boosting library. A random forest grows its trees independently and averages them, whereas xgboost builds trees sequentially, with each new tree fitted to reduce the error of the ensemble so far. We trained two xgboost models, one with all 32 features, and one excluding the four least important features. By taking a weighted sum of the two models' predictions, we reduced the risk of overfitting. The final xgboost model led us to the top 10 percent of the leaderboard.
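Combining two models by a weighted sum of their predictions can be sketched as follows. The model names, weights, and prediction values are hypothetical, since the post does not state the weights that were actually used.

```python
def blend(predictions_by_model, weights):
    """Weighted sum of per-model predictions; weights are normalized to sum to 1."""
    total_weight = sum(weights[name] for name in predictions_by_model)
    n_rows = len(next(iter(predictions_by_model.values())))
    blended = [0.0] * n_rows
    for name, preds in predictions_by_model.items():
        if len(preds) != n_rows:
            raise ValueError(f"{name} has {len(preds)} rows, expected {n_rows}")
        share = weights[name] / total_weight
        for i, value in enumerate(preds):
            blended[i] += share * value
    return blended

# Hypothetical predictions from the two models on three test rows.
predictions = {
    "all_features": [2.0, 4.0, 6.0],
    "reduced_features": [4.0, 4.0, 2.0],
}
weights = {"all_features": 0.75, "reduced_features": 0.25}

blended = blend(predictions, weights)
print(blended)
```

Normalizing the weights keeps the blended predictions on the same scale as the inputs, whatever raw weights are supplied.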

Ensembling RandomForest and xgboost:

The idea of the ensemble method used in xgboost inspired us to ensemble the xgboost model and the Random Forest model. We were encouraged to consider combining a lower-accuracy model with a higher-accuracy one by the MLWave blog post on ensembling approaches for Kaggle competitions ( http://mlwave.com/kaggle-ensembling-guide/ ). In that post, the author provides a detailed discussion of why adding a lower-accuracy model to an ensemble can improve the final result, pointing to several Kaggle competitions as examples. Following the guidance of the blog post, we currently rank 139 out of 2,109 (top 7%).
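One technique covered in the MLWave guide is rank averaging, which suits metrics that depend only on the ordering of predictions. The sketch below shows the idea with made-up numbers. The post does not say which combination method the team used, so this is an illustration of the general approach, not a reconstruction of their submission.

```python
def to_ranks(values):
    """Rank of each value (0 = smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks

def rank_average(prediction_lists):
    """Average the per-model ranks row by row."""
    ranked = [to_ranks(preds) for preds in prediction_lists]
    return [sum(row) / len(ranked) for row in zip(*ranked)]

# Hypothetical predictions on four rows; the two models use different scales.
rf_preds = [3.1, 4.0, 2.2, 9.5]
xgb_preds = [0.30, 0.45, 0.20, 0.40]

combined = rank_average([rf_preds, xgb_preds])
print(combined)
```

Working with ranks means the two models do not need to be calibrated to the same scale before they are combined.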

Comparison of Model Scores

Below is a table that compares the scores and ranks of our models with the current top of the Kaggle leaderboard. We are continuing to improve our model!

 

Model               Score     Rank
Best Competitor     0.395812  1
Xgboost (ensemble)  0.391169  139
Random Forest       0.373147  1086
GBM                 0.3688    1160
SVM                 0.3188    1740
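The scores above are normalized Gini values, the competition's evaluation metric. A minimal sketch of how such a score can be computed is shown below, following the formulation commonly shared for Kaggle competitions: sort the actual values by predicted value, accumulate them, and divide by the result obtained for a perfect ordering. The sample values are made up, and the sketch assumes the actual values are not all identical.

```python
def gini(actual, pred):
    """Unnormalized Gini: how well `pred` orders `actual` (largest first)."""
    n = len(actual)
    # Sort row indices by prediction, descending; ties keep original order.
    order = sorted(range(n), key=lambda i: (-pred[i], i))
    total = sum(actual)
    running = 0
    cumulative_sum = 0
    for i in order:
        running += actual[i]
        cumulative_sum += running
    return (cumulative_sum / total - (n + 1) / 2) / n

def normalized_gini(actual, pred):
    """Gini of the predictions divided by the Gini of a perfect ordering."""
    return gini(actual, pred) / gini(actual, actual)

# Made-up hazard scores and predictions.
actual = [1, 1, 4, 7, 2]
noisy_pred = [0.2, 0.9, 0.5, 0.8, 0.1]

print(normalized_gini(actual, actual))                 # perfect ordering
print(normalized_gini(actual, [-a for a in actual]))   # reversed ordering
print(normalized_gini(actual, noisy_pred))             # somewhere in between
```

Because only the ordering of the predictions matters, any order-preserving rescaling of the predictions leaves the score unchanged.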

 

Leave a Comment

xavier August 22, 2015
we used python: import seaborn as sns; sns.corrplot(libMut_train)
Harvey August 21, 2015
What did you use to create the correlation matrix?
