Property Casualty Loss Cost Data Type Case Study

Posted on Jul 5, 2016

Contributed by Ruonan Ding and Frank Wang. They are currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program taking place from April 11th to July 1st, 2016. This post is based on their final class project, the Capstone (due in the 12th week of the program).


Two years ago, Liberty Mutual Insurance provided a fire insurance dataset for a Kaggle competition to predict the loss ratio. Within the insurance industry, fire losses account for a significant portion of total property losses. The dataset is typical of property and casualty insurance data: high severity, low frequency, and inherently volatile, which makes modeling difficult. If we can model the expected loss correctly, we will be able to identify each policyholder’s risk exposure accurately and tailor the insurance coverage and premium level to each individual.


The goal of this competition is to predict the loss cost relative to the total insured value of each policy, that is, the ratio of total claims to the value the owner insured on the property. The dataset poses two major challenges. First, fire claims are extremely rare: close to 0.2% of records in the overall training dataset. Second, the dataset provides many features: about 300, grouped into 4 categories.

Feature Engineering:

Feature engineering is crucial in this case. After many iterations of feature selection and reduction, we were able to boost our ranking and overall scores.

The raw features include four blocks of features: policy characteristic (17 variables), weather (236 variables), crime rate (9 variables), and geodemographics (37 variables).


After all the trimming and reduction, we settled on the following strategy for each block of variables.

  • CrimeRate and Geodemographic Variables:

These variables were reduced to two dimensions per block using principal component analysis (PCA). Those two dimensions explained about 80% of the variance. PCA also eliminates the correlations between the variables.

  • Weather Variables:

There are many weather variables. We began with PCA here as well. However, many dimensions were needed to capture most of the variance in the data, so we used Lasso (L1-penalized) regression to reduce the weather variables instead.

  • Policy Characteristics:

The policy block contains some variables that are essential in pricing. We first dropped the variables with more than 50% missing values, and then converted the categorical variables to dummy variables.
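The three strategies above can be sketched roughly as follows. This is a minimal illustration assuming pandas and scikit-learn, not the authors' actual code: the data is synthetic, and the column names, the Lasso penalty (alpha=0.05), and the block sizes are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in data; every column name here is hypothetical.
crime = pd.DataFrame(rng.normal(size=(n, 9)),
                     columns=[f"crime_{i}" for i in range(9)])
weather = pd.DataFrame(rng.normal(size=(n, 20)),
                       columns=[f"weather_{i}" for i in range(20)])
policy = pd.DataFrame({
    "policy_cat": rng.choice(["A", "B", "C"], size=n),
    "policy_num": rng.normal(size=n),
    "mostly_missing": np.where(rng.random(n) < 0.7, np.nan, 1.0),
})
# Toy target that depends on a single weather variable.
target = 0.5 * weather["weather_0"] + rng.normal(scale=0.1, size=n)

# 1) Crime block: standardize, then keep two principal components.
crime_pca = PCA(n_components=2)
crime_pcs = pd.DataFrame(
    crime_pca.fit_transform(StandardScaler().fit_transform(crime)),
    columns=["crime_pc1", "crime_pc2"],
)
explained = crime_pca.explained_variance_ratio_.sum()

# 2) Weather block: keep only variables with a non-zero Lasso coefficient.
lasso = Lasso(alpha=0.05).fit(StandardScaler().fit_transform(weather), target)
kept_weather = weather.columns[lasso.coef_ != 0]

# 3) Policy block: drop columns with more than 50% missing values,
#    then convert the categorical column to dummy variables.
missing_share = policy.isna().mean()
policy_kept = policy.loc[:, missing_share <= 0.5]
policy_dummies = pd.get_dummies(policy_kept, columns=["policy_cat"])

# Combine the reduced blocks into one feature matrix.
features = pd.concat([crime_pcs, weather[kept_weather], policy_dummies], axis=1)
```

On real data, `explained` is the quantity to check when deciding whether two components per block are enough.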

So here are the final variables after feature engineering:

[Screenshot: final variables after feature engineering]


We tried both regression and classification models. Among the regression methods, Elastic Net regression is helpful for feature selection, while Gradient Boosting regression is more robust and provides better predictions.

In Elastic Net regression, the optimized tuning parameters (a, r) are small. Cross-validation (CV) was used to find the best parameters. The Gini score peaks at a=1e-7, which indicates a small penalty on the coefficients. The Gini score on the CV test folds is high at 0.35, and the public score on Kaggle is similar. However, the private score drops to 0.25, which indicates some overfitting or sensitivity to the dataset.
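The post does not define the Gini score. Assuming it refers to the normalized Gini coefficient commonly used to score rankings in Kaggle insurance competitions, an unweighted version can be sketched with NumPy as below. The function names are hypothetical, and the competition's official metric may have used per-record weights that this sketch omits.

```python
import numpy as np

def gini(actual, pred):
    """Unnormalized Gini: how much better than random the ranking by pred is."""
    actual = np.asarray(actual, dtype=float)
    pred = np.asarray(pred, dtype=float)
    n = len(actual)
    # Sort records from highest to lowest prediction (stable for ties).
    order = np.argsort(-pred, kind="stable")
    sorted_actual = actual[order]
    # Cumulative share of total losses captured as we move down the ranking.
    cum_share = np.cumsum(sorted_actual) / sorted_actual.sum()
    # Area between that curve and the diagonal of a random ranking.
    return cum_share.sum() / n - (n + 1) / (2 * n)

def normalized_gini(actual, pred):
    """Scale by the Gini of a perfect ranking, so a perfect model scores 1."""
    return gini(actual, pred) / gini(actual, actual)

losses = np.array([0.0, 0.0, 0.0, 1.0])
perfect = normalized_gini(losses, losses)
reversed_rank = normalized_gini(losses, np.array([4.0, 3.0, 2.0, 1.0]))
```

Because the metric only depends on the ordering of the predictions, a model can score well even when its predicted loss amounts are badly calibrated.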

The optimized Gradient Boosting regression parameters use a slow learning rate and a small number of iterations: learning_rate=0.05 and n_estimators=100. This model performs better than Elastic Net regression and is more robust, with public/private scores of 0.359/0.286.
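For reference, a model with those two settings could be set up as below. This is a sketch assuming scikit-learn's GradientBoostingRegressor (the post does not name the library); the data is synthetic and every other parameter is left at its default.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
# Synthetic rare, non-negative target: most rows are exactly zero.
y = np.where(rng.random(2000) < 0.05, np.exp(X[:, 0]), 0.0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# learning_rate and n_estimators are the values quoted in the post.
gbr = GradientBoostingRegressor(
    learning_rate=0.05, n_estimators=100, random_state=0
)
gbr.fit(X_train, y_train)
pred = gbr.predict(X_test)
```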

For classification, all non-zero responses are converted to 1. This is a valid approach given that a claim is a rare event: only 0.26% of responses are positive. Logistic regression and the Gradient Boosting classifier give similar performance in terms of private score.
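The conversion to a binary target can be sketched as follows, assuming scikit-learn and synthetic data. The variable names are hypothetical, and the 2% claim rate is chosen only so the toy example has enough positives.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 4))
# Synthetic loss: zero for most rows, positive for a small share.
loss = np.where(rng.random(n) < 0.02, rng.gamma(2.0, 1.0, size=n), 0.0)

# Convert every non-zero response to 1.
y_binary = (loss > 0).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X, y_binary)
# The predicted claim probability can be used to rank policies by risk.
claim_prob = clf.predict_proba(X)[:, 1]
```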

XGBoost outperforms the other models in two respects: it has the best private score and it is the most robust. The public and private scores are 0.35 and 0.30, respectively. “count:poisson” was chosen as the objective. For such a rare event, a Poisson distribution is a more appropriate representation of the response.
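To give a feel for what a Poisson objective optimizes, the sketch below writes out the Poisson negative log-likelihood under a log link, along with its gradient and Hessian with respect to the raw score. It is an illustration in NumPy with hypothetical function names, not XGBoost's internal implementation, whose details may differ.

```python
import numpy as np

def poisson_nll(raw_score, y):
    """Poisson negative log-likelihood with a log link (log(y!) term dropped)."""
    mu = np.exp(raw_score)  # predicted mean is always positive
    return np.sum(mu - y * raw_score)

def poisson_grad_hess(raw_score, y):
    """Per-record first and second derivatives with respect to the raw score."""
    mu = np.exp(raw_score)
    return mu - y, mu

# Toy data: one event in four records.
y = np.array([0.0, 0.0, 0.0, 1.0])
raw = np.zeros_like(y)

# Newton steps on a single shared intercept, the simplest possible "model".
for _ in range(20):
    grad, hess = poisson_grad_hess(raw, y)
    raw = raw - grad.sum() / hess.sum()

mean_pred = np.exp(raw[0])  # converges to the observed event rate
```

The exponential link keeps predictions positive, which fits a target that is zero for almost every record and never negative.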


Conclusion and Takeaways:

  • Feature Engineering is KEY:
    • Extracting value from blocks of features;
    • Reduce correlation between variables - PCA
    • Reduce noise by significance - L1 penalty
    • Our score improved by 15% on average from feature selection alone.
  • Sampling technique is important with very rare events:
    • a 100K sample made up of zero-loss records and all 1188 non-zero responses
    • cross validation
  • A Poisson objective is suitable for rare count events.
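The sampling takeaway can be sketched with pandas as below: keep every non-zero response and draw a random subset of the zero-loss records. The helper name, the column name, and the sample sizes are hypothetical, and the authors' exact sampling procedure is not described in the post.

```python
import numpy as np
import pandas as pd

def downsample_zeros(df, target_col, n_zeros, seed=0):
    """Keep all rows with a positive target plus a random sample of zero rows."""
    positives = df[df[target_col] > 0]
    zeros = df[df[target_col] == 0]
    sampled = zeros.sample(n=min(n_zeros, len(zeros)), random_state=seed)
    # Shuffle so the positives are not grouped at the top.
    return pd.concat([positives, sampled]).sample(frac=1.0, random_state=seed)

# Toy data: 10,000 records, 30 of them with a non-zero loss.
target = np.zeros(10_000)
target[:30] = 1.5
toy = pd.DataFrame({"feature": np.arange(10_000), "target": target})

sample = downsample_zeros(toy, "target", n_zeros=1_000)
```

Downsampling changes the event rate in the training data, so predicted frequencies from a model fit on the sample would need rescaling before being read as absolute rates. A ranking metric such as Gini is not affected by that rescaling.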

About Authors

Ruonan Ding

Ruonan Ding has more than five years of experience in the actuarial science and financial field across asset management and insurance sectors. She was a pricing actuary for a property and casualty company, a lead analyst in capital...

Frank Wang

Frank (Lanfa) Wang has worked in several research laboratories as a physicist. He has over a decade of experience in modeling and scientific computing and had access to the large supercomputer NERSC. He participated in several national/international projects: Japanese...
