Property Casualty Loss Cost Case Study
Contributed by Ruonan Ding and Frank Wang. They are currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program taking place between April 11th and July 1st, 2016. This post is based on their final class project, the Capstone (due in the 12th week of the program).
Introduction:
Two years ago, Liberty Mutual Insurance provided a fire insurance dataset in a Kaggle competition with the goal of predicting loss cost. Within the insurance industry, fire losses account for a significant portion of total property losses. The dataset is typical property insurance data: high severity, low frequency, and inherently volatile, which makes modeling difficult. If we can model the expected loss correctly, we will be able to accurately identify each policyholder's risk exposure and tailor the insurance coverage and premium level to each individual.
Challenge:
The goal of this competition is to predict the loss cost of insurance policies, which is the ratio of total claims to the total value the owner insured the property for. There are two major challenges with this dataset. First, the event of filing a fire claim is extremely rare: close to 0.2% of policies in the overall training dataset. Second, there are many features provided in the dataset: four categories totaling more than 300 features.
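As a quick sanity check on that rarity, a couple of lines of pandas can compute the share of policies with a non-zero target. This is a minimal sketch; the train.csv file name and the target column name are assumptions, not confirmed in the post.

```python
import pandas as pd

# Assumed file and column names for the Kaggle training data
train = pd.read_csv("train.csv")

# The target is the loss cost (total claims / total insured value),
# so the vast majority of policies have a target of exactly zero
event_rate = (train["target"] > 0).mean()
print(f"Share of policies with a fire claim: {event_rate:.2%}")  # roughly 0.2% per the post
```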
Feature Engineering:
Feature engineering is crucial in this case. After many iterations of different feature selections and reductions, we were able to boost our ranking and overall scores.
The raw data includes four blocks of features: policy characteristics (17 variables), weather (236 variables), crime rate (9 variables), and geodemographics (37 variables).
After all the trimming and reduction, we settled on the following strategies for the different blocks of variables (a code sketch follows the list).
- Crime Rate and Geodemographic Variables:
These variables were reduced to two dimensions each using principal component analysis (PCA). Those two components explained about 80% of the variance. PCA also eliminates the correlations between the different variables.
- Weather Variables:
The weather block has the most variables. We began with PCA here as well; however, too many dimensions were needed to capture the variance in the data. Therefore, Lasso regression (L1 penalty) was used to reduce the weather variables instead.
- Policy Characteristics:
The policy block contains variables that are essential in pricing. We first dropped the variables with more than 50% missing values, and then converted the categorical variables to dummy variables.
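The column names are not listed in the post, so the sketch below only illustrates the strategy described above, with assumed column-name prefixes for the four blocks: PCA down to two components for the crime and geodemographic blocks, Lasso selection for the weather block, and missing-value filtering plus dummy encoding for the policy block.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

train = pd.read_csv("train.csv")  # assumed file name
y = train["target"]               # assumed target column name

# Assumed column-name prefixes for the four raw feature blocks
crime_cols   = [c for c in train.columns if c.startswith("crimeVar")]
geo_cols     = [c for c in train.columns if c.startswith("geodemVar")]
weather_cols = [c for c in train.columns if c.startswith("weatherVar")]
policy_cols  = [c for c in train.columns if c.startswith("var")]

def pca_block(df, cols, prefix, n_components=2):
    """Collapse a correlated block of variables into a couple of principal components."""
    filled = df[cols].fillna(df[cols].median())
    scaled = StandardScaler().fit_transform(filled)
    comps = PCA(n_components=n_components).fit_transform(scaled)
    return pd.DataFrame(comps, index=df.index,
                        columns=[f"{prefix}_pc{i + 1}" for i in range(n_components)])

def lasso_select(df, cols, target):
    """Keep only the weather variables that receive a non-zero Lasso (L1) coefficient."""
    X = df[cols].fillna(0)
    lasso = LassoCV(cv=5).fit(X, target)
    kept = [c for c, coef in zip(cols, lasso.coef_) if coef != 0]
    return df[kept]

def clean_policy(df, cols, max_missing=0.5):
    """Drop mostly-missing policy variables, then dummy-encode the categoricals."""
    block = df[cols]
    block = block.loc[:, block.isnull().mean() <= max_missing]
    return pd.get_dummies(block, drop_first=True)

features = pd.concat([pca_block(train, crime_cols, "crime"),
                      pca_block(train, geo_cols, "geo"),
                      lasso_select(train, weather_cols, y),
                      clean_policy(train, policy_cols)], axis=1)
```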
The final feature set after this engineering combines the two principal components from each of the crime and geodemographic blocks, the Lasso-selected weather variables, and the cleaned, dummy-encoded policy characteristics.
Modeling:
We tried both regression and classification models. Among the regression methods, Elastic Net regression is helpful in feature selection, while Gradient Boosting regression is more robust and can provide better predictions.
In Elastic Net regression, the optimized tuning parameters (α, ρ) are small. Cross-validation was used to find the best parameters. The Gini score peaks at α = 1e-7, which indicates a small penalty on the coefficients. The Gini score on the cross-validation test set is high, at 0.35, and the public score when submitted on Kaggle is similar. However, the private score drops to 0.25, which indicates some overfitting or sensitivity to the dataset.
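The cross-validated tuning with a Gini-based scorer can be reproduced roughly as follows. This is a minimal sketch, not the authors' actual code: it reuses the features and y defined in the feature-engineering sketch above and searches an illustrative parameter grid around the reported optimum.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

def normalized_gini(actual, pred):
    """Gini of the ranking induced by pred, normalized by the Gini of a perfect ranking."""
    def gini(a, p):
        order = np.argsort(p)[::-1]                  # highest predictions first
        a_sorted = np.asarray(a, dtype=float)[order]
        n = len(a_sorted)
        cum_share = np.cumsum(a_sorted) / a_sorted.sum()
        return cum_share.sum() / n - (n + 1) / (2.0 * n)
    return gini(actual, pred) / gini(actual, actual)

gini_scorer = make_scorer(normalized_gini, greater_is_better=True)

# Small illustrative grid around the reported optimum (alpha near 1e-7)
param_grid = {"alpha": [1e-8, 1e-7, 1e-6, 1e-5], "l1_ratio": [0.1, 0.5, 0.9]}
search = GridSearchCV(ElasticNet(max_iter=10000), param_grid, scoring=gini_scorer, cv=5)
search.fit(features, y)
print(search.best_params_, search.best_score_)
```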
The optimized Gradient Boosting regression parameters have a slow learning rate and a small number of iterations: learning_rate=0.05 and n_estimators=100. This model performs better than Elastic Net regression and is more robust, with public/private scores of 0.359/0.286.
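With those reported hyperparameters, the corresponding scikit-learn model is straightforward; the sketch below again reuses features, y, and normalized_gini from the earlier sketches and leaves all other settings at their defaults.

```python
from sklearn.ensemble import GradientBoostingRegressor

# Reported hyperparameters: a slow learning rate and a modest number of trees
gbr = GradientBoostingRegressor(learning_rate=0.05, n_estimators=100)
gbr.fit(features, y)

# In-sample check only; the CV and Kaggle scores quoted above are the real benchmarks
print(normalized_gini(y, gbr.predict(features)))
```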
For classification, all non-zero responses are converted to 1. This is a valid framing given how rare the event is: only 0.26% of responses are positive. Logistic regression and the Gradient Boosting Classifier give similar performance in terms of private score.
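A sketch of that classification setup, under the same assumptions (reusing features and y from the sketches above, default hyperparameters):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

# Turn the continuous loss cost into a binary "had a claim" indicator
y_binary = (y > 0).astype(int)

logit = LogisticRegression(max_iter=1000).fit(features, y_binary)
gbc = GradientBoostingClassifier().fit(features, y_binary)

# Rank policies by predicted claim probability; the Gini metric only cares about ranking
claim_prob = gbc.predict_proba(features)[:, 1]
```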
XGBoost outperforms the other models in two respects: it has the best private score and it is the most robust. The public and private scores are 0.35/0.30, respectively. 'count:poisson' was chosen as the objective; for such a rare event, a Poisson distribution is more appropriate to represent the response.
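For reference, an XGBoost regressor with the Poisson objective could be configured roughly as below. Only the objective comes from the post; the other hyperparameters are illustrative assumptions.

```python
import xgboost as xgb

# 'count:poisson' treats the response as Poisson-distributed, which suits a rare event
# with a mass of zeros; the remaining hyperparameters are assumptions, not from the post
model = xgb.XGBRegressor(objective="count:poisson",
                         learning_rate=0.05,
                         n_estimators=200,
                         max_depth=4)
model.fit(features, y)
pred = model.predict(features)
```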
Conclusion and Takeaways:
- Feature Engineering is KEY:
- Extracting value from blocks of features;
- Reduce correlation between variables - PCA
- Reduce noise by significance - L1 penalty
- Our scores improved by about 15% on average just from feature selection.
- Sampling technique is important with such a rare event (see the sketch after this list):
- A sample of about 100K records: all 1,188 non-zero responses plus randomly sampled zero-loss policies
- Cross-validation
- A Poisson distribution as the link function is suitable for rare count events.
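A minimal sketch of that sampling step, assuming the train.csv file name and target column from the earlier sketches: keep every non-zero response and fill the rest of a roughly 100K-row sample with randomly drawn zero-loss records.

```python
import pandas as pd

def downsample(train, target_col="target", total_rows=100_000, seed=42):
    """Keep every non-zero response and fill the rest of the sample with zero-loss rows."""
    nonzero = train[train[target_col] > 0]               # about 1,188 rows in this dataset
    zeros = train[train[target_col] == 0]
    n_zeros = min(max(total_rows - len(nonzero), 0), len(zeros))
    sample = pd.concat([nonzero, zeros.sample(n=n_zeros, random_state=seed)])
    return sample.sample(frac=1, random_state=seed)      # shuffle before modeling

sample = downsample(pd.read_csv("train.csv"))            # assumed file name
```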