Kaggle Competition: BNP Paribas Cardif Claims Management
Contributed by Ablat, Bin Lin, Claudia Huang, and Sung Pil Moon. They took the NYC Data Science Academy 12-week full-time Data Science Bootcamp program from Jan. 11th to April 1st, 2016. This post is based on their fourth class project, Machine Learning (due in the 8th week of the program).
1. Overview
As part of a Kaggle competition, we were challenged to help BNP Paribas Cardif accelerate its claims management process in order to provide better service to its customers.
In today's world, everything moves faster and faster. When facing unexpected events, customers expect their insurer to support them as soon as possible. However, claims management may require different levels of checks before a claim can be approved and a payment can be made. In this challenge, we classify claims into the following two categories:
 claims for which approval could be accelerated leading to faster payments
 claims for which additional information is required before approval
In this project, we have done:
 Exploratory data analysis
 Feature engineering
 Missing data imputation
 Building prediction models with various machine learning algorithms, including Logistic Regression, Decision Trees, Random Forest, XGBoost, and Neural Networks.
2. The Dataset
2.1. Data Summary
The data provided by BNP Paribas Cardif are:
 TRAIN.CSV: training dataset
 TEST.CSV: test dataset
 SAMPLE_SUBMISSION.CSV: example of submission format
These are the basic descriptions of the training dataset:
 Total: 111,432 observations (rows) x 133 columns
 Data Types: float, integer, string
 Column "ID": the ID of each row, not used as a predictor
 Column "target": the target of each row, not used as a predictor
 Number of Text-based (Categorical) Predictors: 19
 Number of Numeric Predictors: 112
 Number of Columns with Missing Values: 119
 Highly Correlated Columns: 123 pairs with absolute correlation > 0.8; 63 pairs with absolute correlation > 0.9
2.2. Exploratory Data Analysis
As the first step, we performed exploratory data analysis to better understand the data. These are the initial findings throughout the EDA process.
 Anonymized Data: All data (both categorical and continuous) is anonymized without any description
| ID | target | v1 | v2 | v3 | ... | v127 | v128 | v129 | v130 | v131 |
|----|--------|----|----|----|-----|------|------|------|------|------|
| 3 | 1 | 1.335739 | 8.727474 | C | ... | 3.113719 | 2.024285 | 0 | 0.636365 | 2.857144 |
| 4 | 1 | NaN | NaN | C | ... | NaN | 1.957825 | 0 | NaN | NaN |
| 5 | 1 | 0.943877 | 5.310079 | C | ... | 3.922193 | 1.120468 | 2 | 0.883118 | 1.176472 |
| 6 | 1 | 0.797415 | 8.304757 | C | ... | 2.954381 | 1.990847 | 1 | 1.677108 | 1.034483 |
| 8 | 1 | NaN | NaN | C | ... | NaN | NaN | 0 | NaN | NaN |
Table 1: Data First Peek
 Too Many Missing Values: Approximately 40% of the data is missing, as can be seen in Figure 1.
 High Correlations: As shown in the correlation matrix plot (Figure 2), some variables are highly correlated (not just one-to-many but also many-to-many). Correlation values range between -1 and 1; red indicates positive correlation while blue indicates negative correlation.
3. Data Pre-Processing
Data preprocessing is an important step before we can even start building our models. Our preprocessing includes data cleaning, transformation, and feature selection.
3.1. Data Cleaning
3.1.1. Treatment of Integer Variables with Low Numbers of Unique Values
Four of the integer variables have low numbers of unique values: v38, v62, v72, and v129. These are treated as categorical variables and factorized into binary dummy variables. A short sketch of how such columns can be identified follows Table 2.
| variable | data type | cardinality | missings | sample values |
|----------|-----------|-------------|----------|---------------|
| v38 | int64 | 12 | 0 | [0, 4] |
| v62 | int64 | 8 | 0 | [1, 2] |
| v72 | int64 | 13 | 0 | [1, 2] |
| v129 | int64 | 10 | 0 | [0, 2] |
Table 2: Integer Variables with Low Cardinality
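The scan behind Table 2 can be reproduced with a few lines of pandas. The sketch below is a minimal reconstruction, not the original code: it assumes the training data is in a DataFrame named df, and the cutoff of 20 unique values is an illustrative choice.

import pandas as pd

def low_cardinality_int_columns(df, max_unique=20):
    # Summarize integer columns that have only a handful of distinct values
    rows = []
    for col in df.select_dtypes(include=['int64']).columns:
        if col in ('ID', 'target'):
            continue
        n_unique = df[col].nunique()
        if n_unique <= max_unique:
            rows.append({'variable': col,
                         'data_type': str(df[col].dtype),
                         'cardinality': n_unique,
                         'missings': int(df[col].isnull().sum()),
                         'sample_values': sorted(df[col].dropna().unique())[:2]})
    return pd.DataFrame(rows)

# Example (assuming df = pd.read_csv('train.csv')):
# print(low_cardinality_int_columns(df))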
3.1.2. Imputation
Due to the high missingness in the dataset, we spent a lot of time on exploratory analysis and on figuring out the best imputation methods. The following are the methods we tried.
 Numeric Imputation with Mean: We chose to impute the numeric variables with the mean to start our model training.
 Numeric Imputation with Linear Interpolation
 Numeric Imputation with -999: impute with an extreme value
 Categorical Imputation with "NA": Dropping all observations with missing categorical values does not improve predictability. Missingness itself can carry information; for a question like "How long have you been in a marriage?", NA effectively means 0 years of marriage.
 Prediction Models: use supervised algorithms (linear regression, KNN, tree-based methods, etc.) to predict the variables with missingness based on the other variables. When we tried this with KNN and a Decision Tree, it took too long to run and never finished; due to the tight timeline, we ended up giving up on this method.
After some experimentation, we went with imputing numeric variables with -999 and categorical variables with the string "NA", as these two options gave us better results.
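A minimal sketch of this imputation, assuming a DataFrame df and a list cate_variables holding the names of the categorical columns (both names are illustrative):

# Numeric columns: fill missing values with the extreme value -999
num_variables = [c for c in df.columns
                 if c not in cate_variables and c not in ('ID', 'target')]
df[num_variables] = df[num_variables].fillna(-999)

# Categorical columns: treat missingness as its own level, labelled "NA"
df[cate_variables] = df[cate_variables].fillna('NA')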
3.2. Data Transformation
Many machine learning tools, including the scikit-learn library we used in Python, accept only numbers as input, so the categorical columns had to be converted. Fortunately, the pandas package has a get_dummies() function, which converts a categorical variable into dummy/indicator variables.
Sample code: convert categorical data into binary dummy variables
import pandas as pd

# convert text-based columns to dummies
for var_name in cate_variables:
    dummies = pd.get_dummies(df[var_name], prefix=var_name)
    # Drop the original column and append the dummy columns;
    # iloc[:, 1:] leaves out the first dummy level to avoid perfect collinearity.
    df_new = pd.concat([df_new.drop(var_name, 1), dummies.iloc[:, 1:]], axis=1)
3.3. Feature Selection
3.3.1. Remove Highly Correlated Variables
As mentioned earlier, there are 63 variable pairs with absolute correlation > 0.9. For each pair, the variable with higher missingness was removed from the training data. While this step may not be strictly necessary for tree-based algorithms, having fewer features improves performance. A sketch of this screening follows Table 3.
| var1 | var2 | correlation | var1 NAs | var2 NAs |
|------|------|-------------|----------|----------|
| v12 | v10 | 0.912 | 49851 | 84 |
| v25 | v8 | 0.943 | 49840 | 48619 |
| v32 | v15 | 0.908 | 48619 | 49836 |
| v40 | v34 | 0.903 | 49832 | 111 |
| v41 | v29 | 0.904 | 49832 | 49832 |
| v43 | v26 | 0.903 | 49832 | 49832 |
| ... | ... | ... | ... | ... |
| v115 | v69 | 0.994 | 49843 | 49895 |
| v116 | v43 | 0.978 | 49832 | 49836 |
| v118 | v97 | 0.962 | 49843 | 49843 |
| v121 | v33 | 0.949 | 48654 | 49832 |
| v121 | v83 | 0.966 | 48654 | 49832 |
| v128 | v108 | 0.957 | 49832 | 48624 |
| v128 | v109 | 0.903 | 49832 | 48624 |
Table 3: Highly Correlated Variables
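The screening behind Table 3 can be sketched as follows. This is an illustrative reconstruction (the function and variable names are ours, not from the original code): it computes pair-wise absolute correlations between the numeric columns and, for each pair above the cutoff, marks the member with more missing values for removal.

def correlated_columns_to_drop(df, num_variables, cutoff=0.9):
    corr = df[num_variables].corr().abs()          # pair-wise absolute correlations
    na_counts = df[num_variables].isnull().sum()   # missing-value count per column
    to_drop = set()
    for i, var1 in enumerate(num_variables):
        for var2 in num_variables[i + 1:]:
            if corr.loc[var1, var2] > cutoff:
                # drop whichever member of the pair has more missing values
                to_drop.add(var1 if na_counts[var1] >= na_counts[var2] else var2)
    return sorted(to_drop)

# drop_list = correlated_columns_to_drop(df, num_variables, cutoff=0.9)
# df = df.drop(drop_list, axis=1)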
3.3.2. Tree-based Feature Selection
Many tree-based methods provide feature importances, which can be used to discard irrelevant features. We used the ExtraTrees classifier and coupled it with SelectFromModel from the scikit-learn feature_selection module. SelectFromModel can be used along with any model that has a coef_ or feature_importances_ attribute after fitting, and it allows us to set the threshold for feature selection: features whose importance is greater than or equal to the threshold are kept while the others are discarded. We picked 0.0003 as the threshold and got the following features returned:
['v1','v10','v101','v102','v105','v107','v108','v109','v110','v112', 'v113','v117','v119','v123','v124','v125','v129','v131','v14','v16', 'v2','v21','v22','v23','v24','v3','v30','v31','v34','v36','v38','v45', 'v47','v5','v50','v51','v52','v56','v58','v62','v64','v66','v69','v70', 'v71','v72','v74','v75','v78','v79','v82','v85','v87','v9','v91','v98']
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

def select_features_from_model(X_train, target_train, threshold):
    # Create an ExtraTreesClassifier with initial parameters
    model = ExtraTreesClassifier(
        n_estimators = 200,             # Number of trees
        max_features = 0.8,             # Fraction of features considered for each tree
        max_depth = 10,                 # Depth of the tree
        min_samples_split = 4,          # Minimum number of samples required to split
        min_samples_leaf = 2,           # Minimum number of samples in a leaf
        min_weight_fraction_leaf = 0,   # Minimum weighted fraction of the input samples required to be at a leaf node
        criterion = 'gini',             # Use gini, not going to tune it
        random_state = 27,
        n_jobs = 1)
    model.fit(X_train, target_train)
    # Keep features whose importance is greater than or equal to the threshold
    model_select = SelectFromModel(model, threshold=threshold, prefit=True)
    features_selected = X_train.columns[model_select.get_support()]
    features_dropped = X_train.columns[~model_select.get_support()]
    return (features_selected, features_dropped)
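For reference, a usage sketch of the function above with the 0.0003 threshold mentioned in the text (X_train and target_train stand for the training predictors and labels):

features_selected, features_dropped = select_features_from_model(
    X_train, target_train, threshold=0.0003)
print("{} features kept".format(len(features_selected)))
X_train_reduced = X_train[features_selected]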
3.3.3. Univariate Feature Selection
We also tried a univariate feature selection method, which selects the best features based on univariate statistical tests. From the same package, we used the SelectPercentile method to perform an ANOVA F-test (appropriate for classification tasks) and retrieve the best features. Knowing that the important features from the ExtraTrees classifier numbered around 50, which was approximately 50% of the features remaining after removing the highly correlated variables, we were interested in the top 50% of the features. We retrieved 47 features this time.
['v103','v105','v108','v109','v110','v113','v117','v119','v122', 'v123','v124','v129','v13','v130','v14','v16','v21','v23','v24', 'v28','v31','v36','v38','v45','v47','v5','v51','v58','v62','v66', 'v68','v69','v70','v72','v73','v74','v78','v79','v80','v82','v83', 'v84','v85','v86','v87','v9','v98']
from sklearn.feature_selection import SelectPercentile, f_classif

def select_features_univariate(X_train, target_train, percent):
    # Keep the top `percent` percent of features according to the ANOVA F-test
    model_select = SelectPercentile(f_classif, percentile=percent)
    model_select.fit(X_train, target_train)
    features_selected = X_train.columns[model_select.get_support()]
    features_dropped = X_train.columns[~model_select.get_support()]
    return (features_selected, features_dropped)
4. Model Building
4.1. Machine Learning Methods
In this competition, we tried a number of different machine learning methods. Below is a quick review of the methods with a general description of each.
| Classifier | Description |
|------------|-------------|
| Logistic Regression | Uses regression to solve binary classification problems by fitting a sigmoid function to estimate class probabilities. |
| Random Forest | Builds a number of decision trees on random subsets of observations and features from the training data, then uses averaging or voting to improve predictive accuracy and control overfitting. |
| Extra Trees Classifier | Similar to Random Forest: fits a number of randomized decision trees using random subsets of features; when choosing a split, the samples used are drawn from the entire training set, and split thresholds are chosen completely at random from the range of values in the sample. |
| GBM (Gradient Boosting Model) | Trains trees sequentially, each tree using information from the previous ones, then combines the whole ensemble to provide a stronger prediction; it uses all of the features and does not ignore 'weak' learners. |
| XGBoost (Extreme Gradient Boosting) | An advanced implementation of the gradient boosting algorithm, with regularization and other enhancements that address weaknesses of GBM. |
Table 4: Summary of Machine Learning Methods
4.2. Procedure
After data cleaning, data imputation, and feature selection, we started building our models. There were a few common steps in building our classification models:
 Training data sampling: We first trained our models with the full training set. However, for some algorithms it took a very long time to tune the parameters and fit the model, so only about 30%-40% of the training data was used (a small sampling sketch follows this list).
 Model fitting for the first try: This gave us a first taste of how the models performed with some initial parameters.
 Parameter tuning: For parameter tuning, we did grid search using GridSearchCV from scikit-learn. Since a complete grid search is time-consuming on a big dataset on a local computer, we used a coarse grid search first. After identifying a "better" region of the grid, a finer grid search on that region can be conducted. Also, for some of the algorithms (e.g. ExtraTreesClassifier and XGBoost) we broke the parameters into groups and tuned them group by group.
 Fit the best estimator with the full training set: After we found the best estimator, we refitted the model with the full training data to get a better model; usually, more data is better.
 Prediction with test data: As the final step, the test data is passed to the model to get the predicted probabilities.
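As referenced in the first step above, a minimal sketch of the subsampling, assuming the 30% figure and preserving the class ratio with a stratified split (variable names are illustrative):

from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older versions

# Keep roughly 30% of the rows for parameter tuning, preserving the target ratio
X_sample, _, target_sample, _ = train_test_split(
    X_train, target_train, train_size=0.3, stratify=target_train, random_state=27)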
4.3. Parameter Tuning
Due to the time constraint, we didn't get a chance to finish building models for all of the methods. The three models we were able to finish are Logistic Regression, ExtraTreesClassifier, and XGBoost. Below are the details of our parameter tuning for these models.
4.3.1. ExtraTreesClassifier
We first built an ExtraTreesClassifier model, initialized the parameters with some reasonable values, and fit it with our training samples. It returned a score (using the log loss metric) of 0.4432684.
############ Create a ExtraTreesClassifier with initial parameters ########
model = ExtraTreesClassifier(
    n_estimators = 100,             # Number of trees
    max_features = 0.8,             # Number of features for each tree
    max_depth = 10,                 # Depth of the tree
    min_samples_split = 2,          # Minimum number of samples required to split
    min_samples_leaf = 1,           # Minimum number of samples in a leaf
    min_weight_fraction_leaf = 0,   # Minimum weighted fraction of the input samples required to be at a leaf node
    criterion = 'gini',             # Use gini, not going to tune it
    random_state = 27,
    n_jobs = 1)
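The log loss figure quoted above can be obtained with cross-validation; the snippet below is a sketch of one way to do it, not necessarily the exact evaluation we ran. Note that scikit-learn's 'log_loss' scorer returns negated values (greater is better), so the sign is flipped for reporting:

import numpy as np
from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer versions,
                                                       # where the scorer is named 'neg_log_loss'

cv_scores = cross_val_score(model, X_train, target_train, scoring='log_loss', cv=5)
print("mean log loss: {:.7f}".format(-np.mean(cv_scores)))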
Next, we used GridSearchCV to tune the parameters. For ExtraTreesClassifier, we tuned the following five parameters. Note that we didn't tune n_estimators (the number of trees) at first, since a larger number of trees means a slower process; we left finding a reasonable value for n_estimators to the end.
Tuning parameters:
 n_estimators (not tuned at the beginning)
 max_features
 max_depth
 min_samples_split
 min_samples_leaf
We started with a coarse parameter search based on the initial model. We assigned three values to each parameter (except n_estimators), which gave us 81 combinations.
####### Coarse Tune Parameters #######
para_grid = [{
    'max_features': [0.6, 0.75, 0.9],    # Number of features for each tree
    'max_depth': [5, 15, 25],            # Depth of the tree
    'min_samples_split': [5, 10, 50],    # Minimum number of samples required to split an internal node
    'min_samples_leaf': [5, 10, 50]      # Minimum number of samples in a leaf
}]

start = datetime.datetime.now()
para_search = GridSearchCV(model, para_grid, scoring = 'log_loss', cv = 5, n_jobs = 4).fit(X_train, target_train)
end = datetime.datetime.now()
print "model training time: {}".format(end - start)
The coarse grid search gave the best combination {'max_features': 0.9, 'min_samples_split': 5, 'max_depth': 25, 'min_samples_leaf': 50}, with a best score of 0.4740551.
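For reference, the best combination and score can be read straight off the fitted GridSearchCV object:

print(para_search.best_params_)
# e.g. {'max_features': 0.9, 'min_samples_split': 5, 'max_depth': 25, 'min_samples_leaf': 50}
print(-para_search.best_score_)   # the 'log_loss' scorer is negated, so flip the sign back
# e.g. 0.4740551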
After finding the best combination from the coarse search, we performed a finer search in a narrower range around those values.
####### Now Fine Tune Parameters #######
para_grid = [{
    'max_features': [0.85, 0.9, 0.95],   # Number of features for each tree
    'max_depth': [20, 25, 30],           # Depth of the tree
    'min_samples_split': [3, 5, 7],      # Minimum number of samples required to split
    'min_samples_leaf': [45, 50, 55]     # Minimum number of samples in a leaf
}]
This grid search gave the best combination {'max_features': 0.85, 'min_samples_split': 3, 'max_depth': 25, 'min_samples_leaf': 45}, and the score improved to 0.4736519. We can see that max_depth remained 25 in this round, so we used 25 as the final value for max_depth (we could have continued to tune around 25 if we had had more time). However, the other three parameters all changed and thus needed further searching.
We repeated the same steps to fine-tune the rest of the parameters. In the last round, we searched n_estimators over the range [100, 300, 500, 700]; the grid search showed that 700 gave the better score. Given that 700 is already a large number of trees, we stopped there without testing bigger values, which would have taken a very long time to train. The table below shows the ranges used in each round of the grid search as well as the result of each round.
| n_estimators | max_features | max_depth | min_samples_split | min_samples_leaf | best combination | score |
|---|---|---|---|---|---|---|
| 100 | [0.6, 0.75, 0.9] | [5, 15, 25] | [5, 10, 50] | [5, 10, 50] | max_features=0.9, max_depth=25, min_samples_split=5, min_samples_leaf=50 | 0.4740551 |
| 100 | [0.85, 0.9, 0.95] | [20, 25, 30] | [3, 5, 7] | [45, 50, 55] | max_features=0.85, max_depth=25, min_samples_split=3, min_samples_leaf=45 | 0.4736519 |
| 100 | [0.83, 0.85, 0.87] | [25] | [2, 3, 4] | [42, 45, 48] | max_features=0.87, min_samples_split=2, min_samples_leaf=45 | 0.4735396 |
| 100 | [0.86, 0.87, 0.88] | [25] | [2] | [45] | max_features=0.87 | 0.4735396 |
| [100, 300, 500, 700] | [0.87] | [25] | [2] | [45] | n_estimators=700 | 0.4733141 |
Table 5: ExtraTrees Parameters Search
We then fitted the model with the full training data and got a much better training score of 0.468914.
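For reference, a sketch of that refit: the best estimator from the grid search is taken, n_estimators is raised to the 700 chosen above, and the model is retrained on the full training set (X_full and target_full are illustrative names for the complete training data):

extratree_final = para_search.best_estimator_     # best parameters from the last grid search
extratree_final.set_params(n_estimators = 700)    # the value chosen in the final round
extratree_final.fit(X_full, target_full)          # refit on the full training set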
4.3.2. XGBoost
For the XGBoost model, we also built a base model with some reasonable values for the parameters and fit it with our training samples. It returned a score (using the log loss metric) of 0.434604.
######### Set initial values for the model ######
xgb_model = XGBClassifier(
    learning_rate = 0.1,
    n_estimators = 1000,
    max_depth = 5,
    min_child_weight = 1,
    gamma = 0,
    subsample = 0.8,
    colsample_bytree = 0.8,
    objective = 'binary:logistic',
    nthread = 8,
    scale_pos_weight = 1,
    seed = 27)
The fitted model, after the number of boosting rounds has been estimated from the initial fit (see below), looks like this:

XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=0.8,
              gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=5,
              min_child_weight=1, missing=None, n_estimators=97, nthread=8,
              objective='binary:logistic', reg_alpha=0, reg_lambda=1,
              scale_pos_weight=1, seed=27, silent=True, subsample=0.8)
For XGBoost, we tuned the following seven parameters.
 learning_rate: learning rate (not tuned at the beginning)
 max_depth: maximum depth of the trees
 min_child_weight: minimum sum of instance weights needed in a child node (roughly, the minimum number of observations in a node)
 gamma: minimum loss reduction required to make a split
 subsample: fraction of rows sampled for each tree
 colsample_bytree: fraction of features sampled for each tree
 reg_alpha: L1 regularization term on weights (analogous to Lasso regression)
One really useful feature of XGBoost is its built-in cross-validation with early stopping: it stops adding trees once additional trees no longer reduce the error, so we can estimate n_estimators from the initial fit without performing a grid search for it (as sketched below). We also used a fixed learning rate of 0.1 for faster training, since a lower learning rate makes the model take longer to fit; we left tuning it to the end.
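A sketch of how the number of trees can be estimated this way, using xgboost's built-in cv function with early stopping (this mirrors the approach in the guide we followed; the 50-round early-stopping window is an illustrative choice, and xgb_model is the initial model defined above):

import xgboost as xgb

# Estimate the number of boosting rounds with cross-validation and early stopping
xgb_params = xgb_model.get_xgb_params()
dtrain = xgb.DMatrix(X_train.values, label=target_train.values)
cv_result = xgb.cv(xgb_params, dtrain,
                   num_boost_round=1000, nfold=5,
                   metrics='logloss', early_stopping_rounds=50)

# Use the number of rounds actually kept (97 in our case) as n_estimators
xgb_model.set_params(n_estimators=cv_result.shape[0])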
Since the number of combinations is large with seven parameters, we took a step-by-step approach. We tuned max_depth and min_child_weight first, as they have the highest impact on the model outcome. We started with wider ranges and then performed another iteration over smaller ranges.
################# Tune max_depth and min_child_weight ##############
param_test1 = {
    'max_depth': range(3, 10, 2),
    'min_child_weight': range(1, 6, 2)
}
gsearch1 = GridSearchCV(
    estimator = XGBClassifier(learning_rate = 0.1, n_estimators = 100, max_depth = 5,
                              min_child_weight = 3, gamma = 0.3, subsample = 0.8,
                              colsample_bytree = 0.8, objective = 'binary:logistic',
                              nthread = 8, scale_pos_weight = 1, seed = 27),
    param_grid = param_test1, scoring = 'log_loss', n_jobs = 4, iid = False, cv = 5)
Here, we see that the best combination is max_depth = 5 and min_child_weight = 5, with a best score of 0.4710678. We then went one step deeper and looked for optimum values by searching one value above and below the previous optima.
param_test2 = {
    'max_depth': [4, 5, 6],
    'min_child_weight': [4, 5, 6]
}
Here, the best combination is max_depth = 6 and min_child_weight = 6, with a best score of 0.4706758, which is slightly better. Since the score improved with higher values, we continued to increase both parameters in the next search.
param_test3 = {
    'max_depth': [6, 7],
    'min_child_weight': [6, 7, 8]
}
Now we see that the optimal values are max_depth = 6 and min_child_weight = 7, with a best score of 0.470626.
We then applied a similar approach to tune gamma, subsample, colsample_bytree, and reg_alpha separately.
Grid search gamma:
| gamma | best value | score (log loss) |
|---|---|---|
| [0, 0.2, 0.4, 0.6, 0.8] | 0.4 | 0.4702224 |
| [0.3, 0.4, 0.5] | 0.4 | 0.4702224 |
Table 6: XGBoost Tune Parameter gamma
Grid search subsample and colsample_bytree:
| subsample | colsample_bytree | best combination | score |
|---|---|---|---|
| [0.6, 0.7, 0.8, 0.9] | [0.6, 0.7, 0.8, 0.9] | subsample=0.8, colsample_bytree=0.8 | 0.4702224 |
| [0.75, 0.8, 0.85] | [0.75, 0.8, 0.85] | subsample=0.85, colsample_bytree=0.85 | 0.4701294 |
Table 7: XGBoost Tune Parameters subsample & colsample_bytree
Grid search reg_alpha:
| reg_alpha | best value | score |
|---|---|---|
| [1e-5, 1e-2, 0.1, 1, 100] | 1e-5 | 0.4701294 |
| [1e-6, 1e-5, 1e-4] | 0.0001 | 0.47001335 |
Table 8: XGBoost Tune Parameter reg_alpha
Our last grid search ended with:
 colsample_bytree: 0.85
 gamma: 0.4
 max_depth: 6
 min_child_weight: 7
 reg_alpha: 0.0001
 subsample: 0.85
 Score: 0.47001335
We then fitted the model with the full training data and got a much better training score of 0.408851.
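Putting the tuned values together, a sketch of the final model refit on the full training set (X_full, target_full, and X_test are illustrative names for the full training predictors, labels, and test predictors; learning_rate and n_estimators are kept at the values used during tuning):

from xgboost import XGBClassifier

xgb_final = XGBClassifier(
    learning_rate = 0.1,
    n_estimators = 97,
    max_depth = 6,
    min_child_weight = 7,
    gamma = 0.4,
    subsample = 0.85,
    colsample_bytree = 0.85,
    reg_alpha = 0.0001,
    objective = 'binary:logistic',
    nthread = 8,
    scale_pos_weight = 1,
    seed = 27)
xgb_final.fit(X_full, target_full)
test_prob = xgb_final.predict_proba(X_test)[:, 1]   # probability of the positive class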
4.3.3. Logistic Regression
Parameter tuning for Logistic Regression is easier than for the tree-based models above: we just did a one-time grid search over three parameters.
 penalty: Used to specify the norm used in the penalization (L1 or L2)
 fit_intercept: Specifies if a constant should be added to the decision function
 C: Inverse of regularization strength; smaller values specify stronger regularization.
logit = LogisticRegression(random_state = 27, n_jobs = 1)
para_grid = [{'penalty': ['l1', 'l2'],
              'fit_intercept': [False, True],
              'C': np.logspace(-5, 5, 10)}]
para_search = GridSearchCV(logit, para_grid, scoring = 'log_loss', cv = 5).fit(X_train, target_train)
We got the best combination:
 penalty = 'l1'
 C = 0.27825594022071259
 fit_intercept = True
This gave us a score of 0.494125. We then fit the model with the full training data and got a score of 0.493337, which is not a big improvement.
4.4. Result
We finally performed predictions on the test data using our three different models and submitted the results to Kaggle. Unsurprisingly, given the training scores, XGBoost gave the best test score as well.
We also used model ensembling to combine our models and, hopefully, produce improved results; ensemble methods usually produce more accurate solutions than a single model would. We used a simple ensemble method: weighted voting. Using scikit-learn's VotingClassifier class, specific weights can be assigned to each classifier via the weights parameter. When weights are provided, the predicted class probabilities for each classifier are collected, multiplied by the classifier weight, and averaged.
Sample Code for voting ensembling with equal weight:
############ Try ensemble voting with average ##########
ensemble_avg = VotingClassifier(estimators = [('lr', model_logit),
                                              ('xgb', model_xgb),
                                              ('extratree', model_extratree)],
                                voting = 'soft', weights = [1, 1, 1])
ensemble_avg.fit(X_train, target_train)
We tried a few different combinations of models and weights; most of them didn't outscore the single XGBoost model. Only the combination of XGB + ExtraTrees with weights 3:2 slightly improved the score (a sketch is shown below).
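That best-performing combination can be expressed with the same VotingClassifier pattern, just with weights 3 and 2 (model and variable names follow the sample code above and are illustrative):

from sklearn.ensemble import VotingClassifier

ensemble_best = VotingClassifier(
    estimators=[('xgb', model_xgb), ('extratree', model_extratree)],
    voting='soft', weights=[3, 2])
ensemble_best.fit(X_train, target_train)
test_prob = ensemble_best.predict_proba(X_test)[:, 1]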
The scores for the different models and ensembles are summarized below.

| Classifiers | Weight | Training Score | Test Score |
|---|---|---|---|
| Logistic Regression | N/A | 0.49334 | 0.49533 |
| ExtraTrees | N/A | 0.46891 | 0.47071 |
| XGB | N/A | 0.408851 | 0.46311 |
| Logistic Regression + ExtraTrees + XGB | 1:1:1 | 0.4372 | 0.46883 |
| Logistic Regression + ExtraTrees + XGB | 1:3:2 | 0.4249 | 0.46501 |
| XGB + ExtraTrees | 3:2 | 0.4134 | 0.46280 |
Table 9: Scores for Different Models and Ensembles
5. Conclusion
The BNP Paribas Cardif Claims Management competition is challenging because of its anonymized variables and large amount of missing data. During the competition, quite a few different methods were used for data cleaning, data imputation, and feature selection. Different machine learning classifiers such as Logistic Regression, ExtraTrees, and XGBoost (Random Forest and GBM models were started but not completed) were used for training and prediction. Grid search was applied to fine-tune the parameters in order to get the best model for each classifier. We also used the "weighted voting" model ensembling method, which slightly improved the score for one of the combinations.
Many of the Kagglers in the competition used methods similar to ours. However, some of them scored higher than us and the rest of the participants, and we believe that what made the difference is feature engineering. Though we applied some simple feature selection techniques such as tree-based feature importance and univariate feature selection, those were obviously not enough to make a big improvement in prediction accuracy. Therefore, for future work, we would like to look deeper into the data and do more feature engineering.
Reference
Jain, Aarshay (2016, March 1). Complete Guide to Parameter Tuning in XGBoost (with codes in Python). Retrieved from http://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/