Property Casualty Loss Cost Data Type Case Study

Frank Wang
Posted on Jul 5, 2016

Contributed by Ruonan Ding and Frank Wang. They were students in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program that took place between April 11th and July 1st, 2016. This post is based on their final class project, the Capstone, due in the 12th week of the program.


Two years ago, Liberty Mutual Insurance provided a fire insurance dataset on Kaggle for predicting loss ratios.  Within the insurance industry, fire losses account for a significant portion of total property losses.  The dataset is typical property insurance data: high severity, low frequency, and inherently volatile, which makes modeling difficult.  If we can model the expected loss correctly, we can accurately identify each policyholder's risk exposure and tailor the insurance coverage and premium level to each individual.


The goal of this competition is to predict the loss cost of insurance policies, defined as the ratio of total claims to the total insured value of the property.  There are two major challenges in this dataset.  First, filing a fire claim is an extremely rare event: close to 0.2% of the overall training dataset.  Second, the dataset provides many features: four categories totaling more than 300 features.

Feature Engineering:

Feature engineering is crucial in this case.  After many iterations of feature selection and reduction, we were able to boost our ranking and overall scores.

The raw features come in four blocks: policy characteristics (17 variables), weather (236 variables), crime rate (9 variables), and geodemographics (37 variables).

[Figure: overview of the four blocks of raw features]

After all the trimming and reduction, we settled on the following strategies for the different blocks of variables.

  • CrimeRate and Geodemographic Variables:

These variables were reduced to two dimensions each using principal component analysis.  The first two components explained about 80% of the variance, and the projection also eliminates correlations between the variables.
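A minimal sketch of this reduction, using a synthetic stand-in for the 9 correlated crime-rate variables (the actual variable names are anonymized in the competition data):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the 9 correlated crime-rate variables:
# two latent factors mixed into nine observed columns, plus small noise.
base = rng.normal(size=(1000, 2))
crime = base @ rng.normal(size=(2, 9)) + 0.1 * rng.normal(size=(1000, 9))

# Standardize, then keep the first two principal components.
pca = PCA(n_components=2)
crime_pc = pca.fit_transform(StandardScaler().fit_transform(crime))

print(crime_pc.shape)                        # (1000, 2)
print(pca.explained_variance_ratio_.sum())   # most variance retained
```

The same recipe applies to the geodemographic block; only the input columns change.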

  • Weather Variables:

The weather block has by far the most variables.  We began with PCA here as well, but too many components were needed to capture the variance in the data.  Instead, an L1-penalized Lasso regression was used to reduce the weather variables.
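As a hedged sketch of L1-based selection (again with synthetic data standing in for the 236 weather variables), the penalty zeroes out the coefficients of uninformative columns, and only the survivors are kept:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic stand-in: 236 weather columns, only two of which carry signal.
X = rng.normal(size=(500, 236))
y = X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.normal(size=500)

# Fit a Lasso (pure L1 penalty) on standardized features.
lasso = Lasso(alpha=0.1)
lasso.fit(StandardScaler().fit_transform(X), y)

# Keep only the variables the L1 penalty left with nonzero coefficients.
kept = np.flatnonzero(lasso.coef_)
print(len(kept))   # far fewer than 236
```

The `alpha` value here is illustrative; in practice it would be tuned by cross-validation.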

  • Policy Characteristics:

The policy block contains variables that are essential to pricing.  We first dropped the variables with more than 50% missing values, then converted the categorical variables to dummy variables.
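The two cleaning steps can be sketched in pandas; the column names below are hypothetical stand-ins, not the actual (anonymized) competition fields:

```python
import pandas as pd

# Toy stand-in for the policy block; column names are hypothetical.
policies = pd.DataFrame({
    "construction": ["frame", "masonry", None, "frame"],
    "sprinklers":   [None, None, None, "yes"],   # 75% missing -> dropped
    "tiv":          [100.0, 250.0, 80.0, 120.0],
})

# Step 1: drop variables with more than 50% missing values.
policies = policies.loc[:, policies.isna().mean() <= 0.5]

# Step 2: convert the remaining categorical variables to dummy variables.
policies = pd.get_dummies(policies, columns=["construction"])
print(list(policies.columns))
```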

So here are the final variables after feature engineering:

[Figure: the final variables after feature engineering]


We tried both regression and classification models. Among the regression methods, Elastic Net is helpful for feature selection, while Gradient Boosting regression is more robust and provides better predictions.

In Elastic Net regression, the optimal tuning parameters (the penalty strength alpha and the L1 mixing ratio) are small. Cross-validation was used to find the best parameters. The Gini score peaks at alpha = 1e-7, which indicates only a small penalty on the coefficients. The cross-validated Gini score is high, at 0.35, and the public score on Kaggle submission was similar. However, the private score drops to 0.25, indicating some overfitting or sensitivity to the dataset.
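A sketch of this cross-validated tuning on synthetic data. The scorer below is the plain (unweighted) normalized Gini commonly used in Kaggle competitions; the competition's official metric was a weighted variant, so treat this as an approximation:

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

def normalized_gini(y_true, y_pred):
    """Plain normalized Gini: Gini of actuals ordered by prediction,
    divided by the Gini of a perfect ordering."""
    def gini(actual, pred):
        a = np.asarray(actual, dtype=float)[np.argsort(-np.asarray(pred))]
        n = len(a)
        return np.cumsum(a).sum() / a.sum() / n - (n + 1) / (2 * n)
    return gini(y_true, y_pred) / gini(y_true, y_true)

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 5))
y = np.maximum(X[:, 0] + 0.5 * rng.normal(size=400), 0.0)  # mostly small losses

# Grid-search alpha (penalty strength) and l1_ratio (L1/L2 mix) by CV Gini.
grid = GridSearchCV(
    ElasticNet(max_iter=10_000),
    {"alpha": [1e-7, 1e-5, 1e-3, 1e-1], "l1_ratio": [0.2, 0.5, 0.8]},
    scoring=make_scorer(normalized_gini),
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```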

The optimized Gradient Boosting regression uses a slow learning rate and a small number of iterations: learning_rate=0.05 and n_estimators=100. This model performs better than Elastic Net regression and is more robust, with public/private scores of 0.359/0.286.
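With those two parameters taken from the post (everything else left at scikit-learn defaults, which is an assumption on our part), the fit looks like:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5))
y = np.maximum(X[:, 0] + X[:, 1] + rng.normal(size=1000), 0.0)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Slow learning rate and a small number of trees, as reported in the post.
gbr = GradientBoostingRegressor(learning_rate=0.05, n_estimators=100,
                                random_state=0)
gbr.fit(X_tr, y_tr)
print(round(gbr.score(X_te, y_te), 3))  # held-out R^2
```

A small learning rate shrinks each tree's contribution, trading training speed for robustness, which matches the better private-score generalization reported above.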

For classification, all nonzero responses are converted to 1. This is a reasonable framing given how rare the event is: only 0.26% of responses are positive.  Logistic regression and the Gradient Boosting classifier give similar performance in terms of private score.
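The binarization step, sketched with synthetic losses. The `class_weight="balanced"` option is our addition to handle the imbalance, not something stated in the post:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(2000, 4))
# Synthetic rare losses driven by the first feature.
p = 1.0 / (1.0 + np.exp(-(X[:, 0] - 3.0)))
loss = np.where(rng.random(2000) < p, rng.gamma(2.0, 50.0, size=2000), 0.0)

# Binarize: any nonzero loss becomes the positive class.
y = (loss > 0).astype(int)

# class_weight="balanced" (our addition) reweights the rare positives.
clf = LogisticRegression(class_weight="balanced")
clf.fit(X, y)
risk = clf.predict_proba(X)[:, 1]   # rank policies by predicted risk
print(round(y.mean(), 3))           # small positive rate
```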

XGBoost outperforms the other models in two respects: it has the best private score and it is the most robust. Its public and private scores are 0.35/0.30, respectively.  "count:poisson" was chosen as the objective; for such a rare event, the Poisson distribution is a more appropriate representation.

[Figure: comparison of model scores]

Conclusion and Takeaways:

  • Feature Engineering is KEY:
    • Extracting value from blocks of features;
    • Reduce correlation between variables - PCA
    • Reduce noise by significance - L1 penalty
    • Our score improved by about 15% on average through feature selection alone.
  • Sampling technique is important with very rare event:
    • Downsample to about 100K records: all 1,188 nonzero responses plus sampled zero-loss policies
    • cross validation
  • The Poisson distribution is a suitable choice of objective for rare count events.
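The sampling takeaway above can be sketched with a synthetic loss vector (the 0.3% claim rate and 400K population here are illustrative, not the competition's exact figures):

```python
import numpy as np

rng = np.random.default_rng(6)
# Synthetic loss vector: roughly 0.3% of 400K policies have a claim.
losses = np.where(rng.random(400_000) < 0.003, 1.0, 0.0)

# Keep every nonzero-loss record, then top up to 100K rows with
# randomly sampled zero-loss records.
pos = np.flatnonzero(losses > 0)
zeros = np.flatnonzero(losses == 0)
sample = np.concatenate(
    [pos, rng.choice(zeros, size=100_000 - len(pos), replace=False)]
)
print(len(sample))                                   # 100000
print(int((losses[sample] > 0).sum()) == len(pos))   # True: all positives kept
```

Keeping every positive record while thinning the zeros raises the effective event rate the model sees, which is what makes training on such a rare event tractable.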

About Authors


Ruonan Ding

Ruonan Ding has more than five years of experience in actuarial science and finance across the asset management and insurance sectors. She was a pricing actuary for a property and casualty company, a lead analyst in capital...
View all posts by Ruonan Ding >
Frank Wang

Frank (Lanfa) Wang has worked in several research laboratories as a physicist. He has over a decade of experience in modeling and scientific computing and has had access to the large supercomputer NERSC. He has participated in several national/international projects: Japanese...
View all posts by Frank Wang >
