Insurance Claim: Data Analysis
Predicting Claim or No-Claim: A Classification Problem
The objective of this project is to use data to predict whether a customer made a claim upon an insurance policy. This is a binary classification problem, and the model outputs a number anywhere from 0.0 to 1.0, representing the predicted probability of a claim. The dataset used here is synthetic, and the features are anonymized for privacy reasons.
A training dataset is provided with 118 features and one target, “claim”, with a value of 1 or 0. We are required to implement a binary classification algorithm that predicts, for each example in the test dataset, whether a customer made a claim upon an insurance policy. A value of '1' means a claim was made, and '0' means a claim was not made.
Exploratory Data Analysis:
There are a total of 957,919 training examples with 118 features, ranging from 'f1' to 'f118', and one target column, claim, which indicates whether a claim was made (1) or not (0).
All the features in the dataset are of type float64, and the ground truth column, claim, is of type int64.
Target Column: Now, let's look at how the target column claim is distributed throughout the dataset.
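As a quick sanity check, here is a minimal pandas sketch (the DataFrame name train and the file name train.csv are my assumptions for illustration):

```python
import pandas as pd

# Load the training data (file name assumed for illustration)
train = pd.read_csv("train.csv")

# All features should be float64; the target 'claim' should be int64
print(train.dtypes.value_counts())

# How the target column 'claim' is distributed
print(train["claim"].value_counts(normalize=True))
```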
Feature Engineering:
Generally, when we talk about feature engineering, we mean combining the existing features or engineering new features from them. However, for this dataset, we have no knowledge of what the features represent, nor of their impact on the target. So, new features won't help us much here.
Furthermore, there is no need to classify features into different datatypes (which usually helps later when processing the dataset), as all features are of type float.
Distribution Data Analysis:
Let's see how the features are distributed with respect to the target variable. NOTE: Since we have a very large dataset, we will plot these distributions using a small sample. For reasonable estimates, we will take a random sample, preferably a 1/100 fraction of the original dataset. This also makes the plots much faster to generate.
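A minimal sketch of this sampling-and-plotting step, assuming the DataFrame train from above and using seaborn density plots (the particular features shown are just an illustrative subset):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Take a ~1% random sample for faster plotting
sample = train.sample(frac=0.01, random_state=42)

# Plot the distribution of a few features, split by the target
features_to_plot = ["f1", "f2", "f3", "f4"]  # illustrative subset
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, col in zip(axes.ravel(), features_to_plot):
    sns.kdeplot(data=sample, x=col, hue="claim", common_norm=False, ax=ax)
    ax.set_title(col)
plt.tight_layout()
plt.show()
```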
Correlation Data Analysis:
We noticed earlier that the relationship between the features and the target variable is most likely weak. To check this further, we'll use a correlation plot, which will also show which features are strongly related to one another.
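One way to produce such a plot, sketched here on the same 1% sample to keep it fast (a static seaborn heatmap stands in for whatever interactive plot was used originally):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pearson correlation matrix on the sampled data
corr = sample.corr()

# Heatmap of all pairwise correlations (tick labels hidden: 119 columns)
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", center=0, xticklabels=False, yticklabels=False)
plt.title("Feature correlation matrix")
plt.show()

# Features most strongly correlated with the target
print(corr["claim"].drop("claim").abs().sort_values(ascending=False).head())
```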
Even the few relatively strong correlations have very small correlation coefficient values from a general point of view. To elaborate, the color scale on the right of the plot shows that the upper bound on positive correlations is approximately 0.04 and the lower bound on negative correlations is approximately -0.06. These two bounds are far too small to declare a strong correlation between the features.
P.S.: Here, I define a strong correlation as one with a correlation coefficient greater than 0.6 (strong positive correlation) or less than -0.6 (strong negative correlation). Of course, these thresholds are the author's choice.
We can now safely say that none of the features have a strong correlation among one another, or with the target variable. This marks the end of a fruitless correlation analysis.
Data Cleaning:
Before proceeding any further, it is recommended to split the dataset into a training set and a hold-out validation set. This ensures that the model we build won't be adversely affected by data leakage.
Any feature whose value would not actually be available in practice at the time you'd want to use the model to make a prediction is a feature that can introduce leakage into your model.
Now, when we talk about splitting the data into train and test sets, we normally have two options:
- Splitting the dataset into a training set and a test set using train_test_split().
- Using k-fold cross-validation sets.
Which of the two works better typically depends on the size of the dataset.
- When we have a small or limited training dataset, using K-fold CV is recommended. This is because we always want to maximize the amount of data our model is trained on. Also, for a small training set, a single train_test_split could lead to inconsistent estimates of test-set performance.
- On the other hand, for a large dataset like the one we're given here, using K-fold would greatly compromise the computation speed of our workflow. Also, since we have many training samples, a single train_test_split should give our model enough information to properly learn its parameters.
So, as a conclusion to the above two points, we will use train_test_split for this dataset.
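A sketch of this split, assuming the DataFrame train from above (the 80/20 ratio and stratification are my choices, not something fixed by the project):

```python
from sklearn.model_selection import train_test_split

X = train.drop(columns=["claim"])
y = train["claim"]

# Hold out 20% of the data as a validation set; stratify to preserve the class balance
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```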
Missing Values:
As we saw earlier, most of the features have missing values. We will take care of that now.
Luckily, for the given dataset, we have only numerical features and hence imputation will be a lot simpler. For numerical data, the two most suitable imputation techniques here are mean imputation and median imputation. I will try both techniques and compare their performance on the validation set.
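A minimal sketch of preparing both strategies with scikit-learn's SimpleImputer, fitting on the training split only so no information leaks from the validation split:

```python
from sklearn.impute import SimpleImputer

imputed = {}
for strategy in ["mean", "median"]:
    imputer = SimpleImputer(strategy=strategy)
    # Fit on the training split only, then transform both splits
    X_train_imp = imputer.fit_transform(X_train)
    X_val_imp = imputer.transform(X_val)
    imputed[strategy] = (X_train_imp, X_val_imp)
```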
Model Selection and Fitting:
Logistic Regression:
When we talk about a binary classification problem, the simplest model that comes to mind is logistic regression. So, first off, we will use logistic regression as our base model for comparison. Naturally, it won't do very well on such a complex dataset, but it's always better to start from the bottom and build up.
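A sketch of the baseline fit, using the mean-imputed splits from above and ROC-AUC as the metric (the scaling step and the max_iter value are my assumptions):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler

X_train_imp, X_val_imp = imputed["mean"]

# Standardize the features; it helps logistic regression converge
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train_imp)
X_val_sc = scaler.transform(X_val_imp)

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_sc, y_train)

val_probs = log_reg.predict_proba(X_val_sc)[:, 1]
print("Logistic Regression ROC-AUC:", roc_auc_score(y_val, val_probs))
```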
It seems that the score for a simple logistic regression fit on a sample of the training set is close to the score for the model fit on the entire training set, while cutting training time dramatically (approx. 100x faster). This is likely because the dataset is well balanced with respect to the target variable, so even a small sample conveyed much of the useful information about the dataset to our model.
Of course, this may not hold for more complex models, and I don't generally recommend cutting down on training examples in this fashion. But for this dataset, I believe that if we train our models on a sample of the training set, we can compare different models and tweak hyperparameters frequently without having to wait many minutes every time. Still, this is a gamble, and I wouldn't adopt this technique when the goal is the best possible results.
Naive Bayes Classifier:
Another well-known classifier is the Naive Bayes classifier. It has the additional advantage of being very fast on large datasets such as the one we're working with.
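A sketch using scikit-learn's GaussianNB (one of several Naive Bayes variants; this particular choice is my assumption), reusing the scaled splits from the logistic-regression step:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score

nb = GaussianNB()
nb.fit(X_train_sc, y_train)

val_probs = nb.predict_proba(X_val_sc)[:, 1]
print("Gaussian Naive Bayes ROC-AUC:", roc_auc_score(y_val, val_probs))
```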
We got a tiny 1% improvement in our score. However, this is far from the performance we would expect from our final model.
XGB Classifier:
Now let's get serious and train a more capable model. One good practice is to always train a baseline model first on the training set and observe its performance; it gives us a starting point against which to compare later models.
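A sketch of fitting XGBClassifier on the imputed splits (the hyperparameter values shown are illustrative, not the tuned values from this project):

```python
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

xgb = XGBClassifier(
    n_estimators=300,
    learning_rate=0.1,
    max_depth=6,
    eval_metric="auc",
    n_jobs=-1,
    random_state=42,
)
xgb.fit(X_train_imp, y_train)

val_probs = xgb.predict_proba(X_val_imp)[:, 1]
print("XGBClassifier ROC-AUC:", roc_auc_score(y_val, val_probs))
```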
Conclusion:
None of the models show signs of overfitting. Of the three, XGBClassifier gave the best results on the validation set. All things considered, if computational speed is an important priority, XGB can still be picked at the cost of a slightly lower ROC-AUC score.
Future Scope:
Due to GPU incompatibility issues, I couldn't train more advanced models such as LightGBM and CatBoost on this data. I hope to extend this work to achieve better performance.