Kaggle competition: Liberty Mutual Group Property Inspection
The skills we demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.
Machine Learning with Liberty Mutual Group Property Inspection Prediction Kaggle Data
– Authors: Claire Tu, Sumanth Reddy, Teresa Venezia, Xavier Capdepon, Zeyu Zhang.
– Claire, Sumanth, Teresa, Xavier and Zeyu were students of the Data Science Bootcamp#2 (B002) – Data Science, Data Mining and Machine Learning – from June 1st to August 24th 2015. Teachers: Andrew, Bryan, Jason, Sam and Vivian.
– The post is based on their Kaggle Competition project final submission.
Liberty Mutual Group Kaggle was a team project embarked on by Claire Tu, Sumanth Reddy, Teresa Venezia, Xavier Capdepon, and Zeyu Zhang. The challenge was to build a model to predict the “hazard score” using a dataset of property information to “enable Liberty Mutual to more accurately identify high risk homes that require additional examination to confirm their insurability.” Here is the link to the Kaggle competition website: https://www.kaggle.com/c/liberty-mutual-group-property-inspection-prediction
Exploring the Data
We first performed exploratory data analysis (EDA) to gain an understanding of the data, visualize any patterns or trends, and make decisions about features to use as inputs in the model. The training data contained the “Hazard” response variable along with 32 predictor variables. The Hazard variable appeared as a numerical, discrete data type, although Liberty Mutual indicated that you “can think of the hazard score as a continuous number that represents the condition of the property as determined by the inspection.”
The predictor variables appeared to be either numerical, discrete data types or categorical data types. This was difficult to discern because Liberty Mutual anonymized these predictor variables, preventing us from intuitively grouping these variables into categories (such as property or safety conditions, geography, and weather). Not knowing the nature of these variables presented challenges that became readily apparent at this early stage.
For example, it was not clear if the 15 predictor variables that appeared as numeric might instead represent characteristics of the property. Additionally, the 17 categorical data types were broken down into two categories: 12 categorical data types using single letters with no clear indication of ranking or importance, and 5 categorical Yes/No data types. Furthermore, none of the predictor variables displayed strong correlations with the target Hazard, with only one showing a correlation greater than 0.1.
figure 1: Correlation Matrix Visualization of numerical variables
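As a rough illustration of the correlation check described above, the following sketch computes each numeric predictor's Pearson correlation with the target using pandas. The data is synthetic and the column names (num_a, num_b, num_c, cat_a) are placeholders, not the competition's actual fields.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data; all column names are hypothetical.
rng = np.random.default_rng(0)
n = 1000
train = pd.DataFrame({
    "num_a": rng.integers(1, 20, size=n),
    "num_b": rng.integers(1, 25, size=n),
    "num_c": rng.integers(1, 100, size=n),
    "cat_a": rng.choice(list("ABCD"), size=n),  # dropped by the numeric filter
})
train["Hazard"] = 1 + rng.poisson(3, size=n)

# Pearson correlation of every numeric column with the target.
numeric = train.select_dtypes(include="number")
corr_with_target = numeric.corr()["Hazard"].drop("Hazard")

# Predictors whose absolute correlation exceeds the 0.1 level mentioned above.
notable = corr_with_target[corr_with_target.abs() > 0.1]
print(corr_with_target.sort_values(ascending=False))
print("Predictors with |r| > 0.1:", list(notable.index))
```

Pearson correlation only captures linear association, so a low value here does not rule out a non-linear relationship with the target.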
Closer Look at the Hazard Scores
The histogram of Hazard shows the values ranging from 1 through 70, with more than a third of the training data having a value of 1.
figure 2: Histogram of the Hazard score
With this in mind, we considered using a one-vs-all approach (i.e., Hazard scores of 1 vs. all other scores). However, our early exploration of this approach ran into several challenges. For example, when we used a K-Nearest Neighbors method in Python and R, our computers crashed, as the high dimensionality of the data made the computation extremely expensive.
Other methods, such as logistic regression, ran to completion but presented different challenges with the multi-level categorical variables. We considered transforming these categorical variables; however, without knowing the nature of these variables, a transformation could fabricate a structure in the variance that did not initially exist. Finally, we tried logistic regression using only the numerical variables, but the model was not a good fit, and we decided not to pursue the one-vs-all idea.
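The one-vs-all setup can be sketched as follows: the target becomes a binary flag (Hazard equal to 1 or not) and a logistic regression is cross-validated on it. This is only an illustration on synthetic data with hypothetical column names. One-hot encoding is used here as one possible categorical transformation that does not impose an ordering on the letters; the post does not say which transformations the team actually tried.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data; all column names are hypothetical.
rng = np.random.default_rng(6)
n = 500
train = pd.DataFrame({
    "num_a": rng.integers(1, 20, size=n),
    "cat_a": rng.choice(list("ABCDE"), size=n),   # single-letter category
    "flag_a": rng.choice(["Y", "N"], size=n),     # Yes/No category
})
train["Hazard"] = 1 + rng.poisson(2, size=n)

# One-vs-all target: 1 if the hazard score is exactly 1, else 0.
is_one = (train["Hazard"] == 1).astype(int)

# One-hot encode the categorical columns; numeric columns pass through unchanged.
features = pd.get_dummies(train.drop(columns="Hazard"), drop_first=True).astype(float)

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, features, is_one, cv=5, scoring="roc_auc")
print("Mean ROC AUC:", scores.mean())
```

Because the synthetic target is unrelated to the features, the AUC here should hover around 0.5; the point is the shape of the pipeline, not the score.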
We also focused on reducing the dimensionality of the data through Feature Selection. First we examined the variance of each predictor variable, hoping to remove any variables with distinguishably low variance. However, most of the variables displayed similar variances.
figure 3: Low variance features selection graph for the numerical variables
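A minimal sketch of low-variance filtering, assuming scikit-learn's VarianceThreshold (the post does not say which tool produced the figure). The data and the 0.01 threshold are made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Synthetic numeric features: two informative-looking columns and one constant.
rng = np.random.default_rng(1)
n = 500
X = np.column_stack([
    rng.normal(0.0, 1.0, size=n),
    rng.normal(0.0, 2.0, size=n),
    np.full(n, 3.0),               # zero variance, should be dropped
])

# The threshold is an arbitrary illustrative value, not one from the post.
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)

print("Per-feature variances:", selector.variances_)
print("Kept columns:", selector.get_support())
```

Variance depends on each feature's scale, so comparing variances across features is only meaningful when they are on comparable scales. When most features show similar variances, as reported above, this filter removes little.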
We enjoyed some success when implementing the Random Forest method, which attempts to weight the importance of each predictor. With this method, 4 predictor variables displayed limited or negative importance relative to the rest, and we considered removing those features.
figure 4: Random Forest features selection graph for all variables
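The ranking step can be sketched with scikit-learn's RandomForestRegressor and its feature_importances_ attribute. The data below is synthetic, and this is not necessarily the exact procedure the team used.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: only the first two of five columns drive the target.
rng = np.random.default_rng(2)
n = 600
X = rng.normal(size=(n, 5))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=n)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances; they are non-negative and sum to 1.
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]
for idx in ranking:
    print(f"feature_{idx}: {importances[idx]:.3f}")
```

scikit-learn's impurity-based importances are never negative. Negative values like those mentioned above typically come from permutation-style measures, where shuffling a feature happens to improve the score slightly.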
Our findings were similar when running univariate feature selection with the χ² (chi-squared) test and the F test for the numerical variables.
figure 5: Univariate features selection graph with χ² test for the numerical variables
figure 6: Univariate features selection graph with F test for the numerical variables
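A sketch of the two univariate tests, assuming scikit-learn's SelectKBest with chi2 and f_regression as the scoring functions. The data is synthetic, and k=3 is an arbitrary choice. Note that chi2 requires non-negative features and treats the discrete target as class labels.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, f_regression

# Synthetic non-negative integer features; only column 0 influences the target.
rng = np.random.default_rng(3)
n = 800
X = rng.integers(0, 10, size=(n, 6)).astype(float)
y = 1 + X[:, 0].astype(int) // 3 + rng.integers(0, 2, size=n)  # discrete target

# Chi-squared test: y is treated as a set of class labels.
chi2_selector = SelectKBest(score_func=chi2, k=3).fit(X, y)

# F test: y is treated as a continuous response.
f_selector = SelectKBest(score_func=f_regression, k=3).fit(X, y)

print("chi2 scores:", np.round(chi2_selector.scores_, 2))
print("F scores:   ", np.round(f_selector.scores_, 2))
print("Kept by chi2:", chi2_selector.get_support())
print("Kept by F:   ", f_selector.get_support())
```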
Exploring Certain Models
Support Vector Machine (SVM):
We considered building a classifier using a Support Vector Machine method. An SVM model represents data points from a training data set in a multi-dimensional space, mapped so that the data points of different categories are divided by a clear gap that is as wide as possible. The SVM builds a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification.
When the model is applied to the test data set, the test data points are mapped into that same space and predicted to belong to a category based on which side of the gap they fall on. Given the large number of categorical variables, the model was relatively slow to process. The SVM method also produced categorical outputs, which made it a poor fit for this exercise because a lot of information is lost in a categorical model.
In fact, the competition description states that the hazard score can be thought of as a continuous variable. Given that, and given that our first few results scored only around 0.3188, we decided not to pursue this approach.
Gradient boosting machine (GBM):
In gradient boosting machines, the learning procedure sequentially fits new models to provide a more accurate estimate of the response variable. The principal idea behind this algorithm is to construct each new base learner to be maximally correlated with the negative gradient of the loss function associated with the whole ensemble. It is particularly interesting in this exercise, as the nature of the features is unknown.
We achieved a score of 0.368800 after running a grid of parameters for several hours. Using this method, we spent considerable time tuning the parameter values to generate a better fit; a run took up to 3 hours depending on the parameter combinations. We also observed that the running time for GBM in R was considerably reduced by parallelizing the “caret” package, which launches the GBM R package, across 7 CPUs using the “doParallel” package.
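The team ran this grid search in R with caret and doParallel. As a rough Python analogue (a swapped-in technique, not the authors' code), the sketch below uses scikit-learn's GradientBoostingRegressor with GridSearchCV, where n_jobs=-1 spreads the parameter combinations across the available CPU cores. The grid values, the synthetic data, and the mean-squared-error scoring are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic regression problem so the search finishes quickly.
rng = np.random.default_rng(4)
n = 400
X = rng.normal(size=(n, 8))
y = 1 + np.abs(2 * X[:, 0] + X[:, 1] * X[:, 2]) + rng.normal(scale=0.3, size=n)

# Hypothetical grid; a real search would likely cover wider ranges.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",  # simple default, not the competition metric
    cv=3,
    n_jobs=-1,                         # evaluate parameter combinations in parallel
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV mean squared error:", -search.best_score_)
```

The cost of a grid search grows multiplicatively with each added parameter value, which is consistent with the multi-hour runs described above.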
Model Selection: xgboost and Random Forest Ensemble
Even after removing features that we found to be “less important”, we were still left with up to 30 predictor variables that showed no clear linear relationship with the Hazard target variable.
figure 7: “stripplot” graphs of the 5 top features selected by random forest
Due to this fact, as well as our results with other models, we decided to use the Random Forest Regressor model in scikit-learn. We ran two for loops to tune the two most important parameters in the random forest model: n_estimators, which determines the number of trees in the forest, and max_depth, which determines the maximum depth (and hence the complexity) of each tree. Our Random Forest Regressor model gave us a normalized Gini score of 0.36 on the public leaderboard, which ranked in the top 40 percent. It was a good result, but we were not satisfied.
When searching the Kaggle forum, we found that most people on the leaderboard were using xgboost (extreme gradient boosting), a recently developed gradient boosting library. A random forest averages many trees that are each grown independently on a random sample of the data and features; xgboost instead builds its trees sequentially, with each new tree fitted to reduce the error of the ensemble built so far.
We trained two xgboost models, one with all 32 features and one excluding the four least important features. By taking a weighted sum of the two models' predictions, we reduced the risk of overfitting. This final xgboost ensemble took us into the top 10 percent of the leaderboard.
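The tuning loop can be sketched as below. The normalized_gini function follows the formulation commonly shared for this kind of competition metric (cumulative share of the actual values when rows are sorted by prediction, divided by the same quantity for a perfect ordering). The data, the candidate parameter values, and the single train/validation split are assumptions for illustration, not the team's actual settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def gini(actual, pred):
    """Unnormalized Gini: how well sorting by `pred` concentrates `actual`."""
    actual = np.asarray(actual, dtype=float)
    pred = np.asarray(pred, dtype=float)
    order = np.argsort(-pred, kind="stable")      # highest predictions first
    cum_share = np.cumsum(actual[order]) / actual.sum()
    n = len(actual)
    return cum_share.sum() / n - (n + 1) / (2 * n)

def normalized_gini(actual, pred):
    """Gini of the predictions divided by the Gini of a perfect ordering."""
    return gini(actual, pred) / gini(actual, actual)

# Synthetic count-like target that depends on the first two features.
rng = np.random.default_rng(5)
n = 600
X = rng.normal(size=(n, 6))
y = 1 + rng.poisson(np.exp(0.5 * X[:, 0] + 0.3 * X[:, 1]))
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Two for loops over hypothetical candidate values for the two parameters.
best_score, best_params = -np.inf, None
for n_estimators in (50, 100):
    for max_depth in (3, 6):
        model = RandomForestRegressor(
            n_estimators=n_estimators, max_depth=max_depth, random_state=0
        )
        model.fit(X_train, y_train)
        score = normalized_gini(y_val, model.predict(X_val))
        if score > best_score:
            best_score, best_params = score, (n_estimators, max_depth)

print("Best (n_estimators, max_depth):", best_params)
print("Validation normalized Gini:", round(best_score, 4))
```

Because the metric depends only on the ordering of the predictions, any monotonic rescaling of the model output leaves the score unchanged.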
Ensembling RandomForest and xgboost:
The ensemble idea behind xgboost inspired us to combine the xgboost model with the Random Forest model. We were encouraged to consider combining a lower-accuracy model with a higher-accuracy one by the MLWave blog post on ensembling approaches for Kaggle competitions ( http://mlwave.com/kaggle-ensembling-guide/ ).
In that post, the author provides a detailed discussion of why adding a lower-accuracy model to an ensemble can improve the final score, pointing to several Kaggle competitions as examples. Following the guidance of the blog post, we are currently ranked 139/2109 (top 7%).
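A weighted blend of two models' predictions can be sketched as follows. The prediction values and the 0.7/0.3 weights are made up for illustration; the post does not state the weights the team used.

```python
import numpy as np

def blend(predictions, weights):
    """Weighted average of several models' predictions.

    predictions: sequence of equal-length arrays, one per model.
    weights: one non-negative weight per model (normalized to sum to 1).
    """
    preds = np.asarray(predictions, dtype=float)   # shape (n_models, n_samples)
    w = np.asarray(weights, dtype=float)
    if preds.ndim != 2 or preds.shape[0] != w.shape[0]:
        raise ValueError("need exactly one weight per model")
    return (w / w.sum()) @ preds

# Hypothetical predictions from the stronger and the weaker model.
xgb_pred = np.array([1.2, 4.8, 2.9, 7.1])
rf_pred = np.array([1.6, 4.0, 3.5, 6.3])

blended = blend([xgb_pred, rf_pred], weights=[0.7, 0.3])
print(blended)
```

In practice the weights would be chosen on held-out data rather than fixed in advance, since weights tuned against the public leaderboard can overfit it.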
Comparison of Model Scores
Below is a table that shows the scores and ranks of our models against the current Kaggle leaderboard. We are continuing to improve our model!