
Kaggle Competition: BNP Paribas Cardif Claims Management

Bin Lin, Sung Pil Moon and Yun-Ju Huang
Posted on Apr 24, 2016

Contributed by Ablat, Bin Lin, Claudia Huang and Sung Pil Moon. They took the NYC Data Science Academy 12-week full-time Data Science Bootcamp program from Jan. 11th to April 1st, 2016. This post is based on their fourth class project - Machine Learning (due in the 8th week of the program).

 

1. Overview

As part of a Kaggle competition, we were challenged to help BNP Paribas Cardif accelerate its claims management process in order to provide better service to its customers.

In today's world, everything is moving faster and faster. When facing unexpected events, customers expect their insurer to support them as soon as possible. However, claims management may require different levels of checks before a claim can be approved and a payment can be made. In this challenge, we classify claims into the following two categories:

  1. claims for which approval could be accelerated leading to faster payments
  2. claims for which additional information is required before approval

In this project, we did the following:

  • Exploratory data analysis
  • Feature engineering
  • Missing data imputation
  • Building prediction models with various machine learning algorithms, including Logistic Regression, Decision Trees, Random Forest, XGBoost, and Neural Networks

 

2. The Data set

2.1. Data Summary

The data provided by BNP Paribas Cardif are:

  • TRAIN.CSV: training dataset
  • TEST.CSV: test dataset
  • SAMPLE_SUBMISSION.CSV: example of submission format

Here is a basic description of the training dataset:

  • Total: 111,432 observations (rows) x 133 features (columns)
  • Data Types: float, integer, string
  • Column "ID": the ID of each row, not being used as predictor
  • Column "target": the target of each row, not being used as predictor
  • Number of Text-based Predictors: 112
  • Number of Numeric Predictors: 19
  • Number of Columns with Missing Values: 119
  • Number of Highly Correlated Column Pairs: 123 pairs with absolute correlation > 0.8; 63 pairs with absolute correlation > 0.9 (a sketch of how these summary numbers can be computed follows)
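
The sketch below (not from the original post) shows one way such summary numbers can be gathered with pandas; it assumes the Kaggle training file is named train.csv and sits in the working directory.

# A sketch of computing the dataset summary above; assumes train.csv is the
# Kaggle training file in the working directory.
import numpy as np
import pandas as pd

train = pd.read_csv('train.csv')

print(train.shape)                           # (observations, columns)
print(train.dtypes.value_counts())           # counts per data type
print((train.isnull().sum() > 0).sum())      # number of columns with missing values

# Count predictor pairs with high absolute correlation
numeric = train.drop(['ID', 'target'], axis=1).select_dtypes(include=[np.number])
corr = numeric.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
print((upper > 0.9).sum().sum())             # pairs with |correlation| > 0.9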

2.2. Exploratory Data Analysis

As a first step, we performed exploratory data analysis to better understand the data. These are the initial findings from the EDA process.

  • Anonymized Data: All data (both categorical and continuous) is anonymized, without any description.

ID  target  v1        v2        v3   ...  v127      v128      v129  v130      v131
3   1       1.335739  8.727474  C    ...  3.113719  2.024285  0     0.636365  2.857144
4   1       NaN       NaN       C    ...  NaN       1.957825  0     NaN       NaN
5   1       0.943877  5.310079  C    ...  3.922193  1.120468  2     0.883118  1.176472
6   1       0.797415  8.304757  C    ...  2.954381  1.990847  1     1.677108  1.034483
8   1       NaN       NaN       C    ...  NaN       NaN       0     NaN       NaN
Table 1: Data First Peek
  • Too Many Missing Values: Approximately 40% of the data is missing, as shown in Figure 1.

Figure 1: Missingness in the BNP data set

  • High correlations: As shown in the correlation matrix plot (Figure 2), some variables are highly correlated (not just one-to-many but also many-to-many). Correlation values range between -1 and 1; red indicates positive correlation while blue indicates negative correlation.

Figure 2: Correlation Matrix Plot

 

3. Data Pre-Processing

Data pre-processing is an important step before we can even start building our models. Our pre-processing includes data cleaning, data transformation, and feature selection.

3.1. Data Cleaning

3.1.1. Treatment of Integer Variables with Low Numbers of Unique Values

Four integer variables have low numbers of unique values: v38, v62, v72, and v129. These are treated as categorical variables and factorized into binary dummies. (A sketch of how such columns can be identified follows Table 2.)

index  variable  data_type  cardinality  missings  sample_values
39     v38       int64      12           0         [0, 4]
63     v62       int64      8            0         [1, 2]
73     v72       int64      13           0         [1, 2]
130    v129      int64      10           0         [0, 2]
Table 2: Integer Variables with Low Cardinality
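
As a quick illustration (not from the original post), such columns can be flagged with a simple cardinality check; the cutoff of 30 unique values is an arbitrary illustrative choice, and the train DataFrame is assumed from the earlier sketch.

# A sketch of flagging low-cardinality integer columns; the cutoff of 30 is illustrative.
int_cols = [c for c in train.select_dtypes(include=['int64']).columns
            if c not in ('ID', 'target')]

low_cardinality = [c for c in int_cols if train[c].nunique() < 30]
print(low_cardinality)   # expected to include v38, v62, v72, v129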

3.1.2. Imputation

Due to the high missingness in the dataset, we spent a lot of time on data exploration and on figuring out the best imputation methods. The following are the methods we tried.

  • Numeric Imputation with Mean: We chose to impute the numeric variables with the mean to get our model training started.
  • Numeric Imputation with Interpolate Linear
  • Numeric Imputation with -999: impute with extreme numbers
  • Categorical Imputation with "NA": Dropping all missing values in the categorical variables does not help improve predictability. Missingness itself can carry information; for a question like "How long have you been married?", NA can mean 0 years of marriage.
  • Prediction Model: uses supervised algorithms (linear regression, KNN, tree-based methods, etc.) to predict the variables with missingness based on the other variables. When we tried this method with KNN and a Decision Tree, it took forever to run and never finished. Due to the tight timeline, we ended up giving up on this method.

After some experimentation, we went with imputing numeric variables with -999 and categorical variables with the text "NA", as these two options gave us better results. A minimal sketch follows.
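
A minimal sketch of this imputation scheme, assuming the predictors live in a DataFrame named df (hypothetical name):

# Impute numeric columns with -999 and categorical (object) columns with the string "NA".
import numpy as np

num_cols = df.select_dtypes(include=[np.number]).columns.drop(['ID', 'target'], errors='ignore')
cat_cols = df.select_dtypes(include=['object']).columns

df[num_cols] = df[num_cols].fillna(-999)
df[cat_cols] = df[cat_cols].fillna('NA')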

 

3.2. Data Transformation

Many machine learning tools, including the scikit-learn estimators we used in Python, only accept numbers as input. This was an issue since Python was used in our project. Fortunately, the pandas package has the get_dummies() function, which converts a categorical variable into dummy/indicator variables.

Sample code: convert categorical data into binary dummy variables

# Convert text-based columns to dummies
import pandas as pd

df_new = df.copy()
for var_name in cate_variables:
    # Build dummy columns for this categorical variable
    dummies = pd.get_dummies(df_new[var_name], prefix=var_name)

    # Drop the original column and append the dummy columns
    # (drop the first dummy level to avoid perfect collinearity)
    df_new = pd.concat([df_new.drop(var_name, axis=1), dummies.iloc[:, 1:]], axis=1)

 

3.3. Feature Selection

3.3.1. Remove Highly Correlated Variables

As mentioned earlier, there are 63 variable pairs with absolute correlation > 0.9. Within each pair, the variable with the higher missingness is removed from the training set. While this step may not be strictly necessary for tree-based algorithms, having fewer features helps improve performance. (A sketch of this pruning rule follows Table 3.)

index _var1 _var2 _var_corr var1_na var2_na
0 v12 v10 0.912 49851 84
1 v25 v8 0.943 49840 48619
2 v32 v15 0.908 48619 49836
3 v40 v34 -0.903 49832 111
4 v41 v29 0.904 49832 49832
5 v43 v26 0.903 49832 49832
... ... ... ... ... ...
56 v115 v69 -0.994 49843 49895
57 v116 v43 0.978 49832 49836
58 v118 v97 0.962 49843 49843
59 v121 v33 0.949 48654 49832
60 v121 v83 0.966 48654 49832
61 v128 v108 0.957 49832 48624
62 v128 v109 0.903 49832 48624
Table 3: Highly Correlated Variables
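
A sketch of the pruning rule described above, assuming the pairs are stored in a DataFrame named corr_pairs with the same columns as Table 3 (hypothetical name):

# For each highly correlated pair, drop the member with more missing values.
to_drop = set()
for _, row in corr_pairs.iterrows():
    if row['var1_na'] >= row['var2_na']:
        to_drop.add(row['_var1'])
    else:
        to_drop.add(row['_var2'])

df = df.drop(columns=sorted(to_drop))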

3.3.2. Tree-based Feature Selection

Many tree-based methods provide feature importances, which can be used to discard irrelevant features. We used the ExtraTrees classifier coupled with SelectFromModel from the scikit-learn feature_selection module. SelectFromModel can be used along with any model that has a coef_ or feature_importances_ attribute after fitting. It also lets us set a threshold for feature selection: features whose importance is greater than or equal to the threshold are kept while the others are discarded. We picked 0.0003 as the threshold and got the following features returned:

['v1','v10','v101','v102','v105','v107','v108','v109','v110','v112',
'v113','v117','v119','v123','v124','v125','v129','v131','v14','v16',
'v2','v21','v22','v23','v24','v3','v30','v31','v34','v36','v38','v45',
'v47','v5','v50','v51','v52','v56','v58','v62','v64','v66','v69','v70',
'v71','v72','v74','v75','v78','v79','v82','v85','v87','v9','v91','v98']
Sample code: tree-based feature selection with SelectFromModel

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

def select_features_from_model(X_train, target_train, threshold):
    # Create an ExtraTreesClassifier with initial parameters
    model = ExtraTreesClassifier(
        n_estimators = 200,             # Number of trees
        max_features = 0.8,             # Number of features for each tree
        max_depth = 10,                 # Depth of the tree
        min_samples_split = 4,          # Minimum number of samples required to split
        min_samples_leaf = 2,           # Minimum number of samples in a leaf
        min_weight_fraction_leaf = 0,   # Minimum weighted fraction of the input samples required to be at a leaf node
        criterion = 'gini',             # Use gini, not going to tune it
        random_state = 27,
        n_jobs = -1)

    model.fit(X_train, target_train)

    # Keep features whose importance is >= threshold
    model_select = SelectFromModel(model, threshold=threshold, prefit=True)

    features_selected = X_train.columns[model_select.get_support()]
    features_dropped = X_train.columns[~model_select.get_support()]

    return (features_selected, features_dropped)
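
A hypothetical call of the helper above with the 0.0003 threshold mentioned in the text:

# Keep only the features selected by the tree-based importance filter.
features_selected, features_dropped = select_features_from_model(
    X_train, target_train, threshold=0.0003)

X_train_reduced = X_train[features_selected]
print(len(features_selected))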

 

3.3.3. Univariate Feature Selection

We also tried a univariate feature selection method, which selects the best features based on univariate statistical tests. Again from the same package, we used the SelectPercentile method to perform an ANOVA F-test for classification and retrieve the best features. Knowing that the important features from the ExtraTrees classifier numbered around 50, which was approximately 50% of the features left after removing the highly correlated variables, we were interested in the top 50% of the features. We retrieved 47 features this time:

['v103','v105','v108','v109','v110','v113','v117','v119','v122',
'v123','v124','v129','v13','v130','v14','v16','v21','v23','v24',
'v28','v31','v36','v38','v45','v47','v5','v51','v58','v62','v66',
 'v68','v69','v70','v72','v73','v74','v78','v79','v80','v82','v83',
 'v84','v85','v86','v87','v9','v98']
Sample code: univariate feature selection with SelectPercentile

from sklearn.feature_selection import SelectPercentile, f_classif

def select_features_univariate(X_train, target_train, percent):
    # Keep the top `percent` percent of features ranked by the ANOVA F-test
    model_select = SelectPercentile(f_classif, percentile=percent)
    model_select.fit(X_train, target_train)

    features_selected = X_train.columns[model_select.get_support()]
    features_dropped = X_train.columns[~model_select.get_support()]

    return (features_selected, features_dropped)
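
A hypothetical call keeping the top 50% of features, as described above:

# Univariate selection of the top 50% of features by ANOVA F-score.
features_selected, features_dropped = select_features_univariate(
    X_train, target_train, percent=50)

print(len(features_selected))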

 

 

4. Model Building

4.1. Machine Learning Methods

In this competition, we have tried a number of different machine learning methods. Below is a quick review of the methods with a general description, their advantages and disadvantages.

Classifier: Logistic Regression
Description: Uses regression to solve binary classification problems by fitting a sigmoid function to calculate class probabilities.
Advantages:
  • Intrinsically simple and fast
  • Easy to interpret
  • Does not need to follow all Linear Regression assumptions (normality, constant variance, linearity)
  • Low variance
Disadvantages:
  • May suffer from multicollinearity (without L2 regularization)
  • Doesn't perform well with a large number of features
  • Lack of flexibility, e.g. when a linear decision boundary is not valid

Classifier: Random Forest
Description: Uses random subsets of observations and features from the training data to build a number of decision trees; uses averaging or voting to improve predictive accuracy and control over-fitting.
Advantages:
  • Prevents overfitting
  • Good with very large data sets
  • No transformation needed
  • Robust against outliers
Disadvantages:
  • Low interpretability
  • Less accurate than boosted tree models

Classifier: Extra Trees Classifier
Description: Similar to Random Forest; fits a number of randomized decision trees on random subsets of observations and features. When choosing variables at a split, samples are drawn from the entire training set, and splits are chosen completely at random from the range of values in the sample at each split.
Advantages:
  • The same as Random Forest
  • More random, thus lower variance
Disadvantages:
  • The same as Random Forest
  • Computation can grow much bigger

Classifier: GBM (Gradient Boosting Machine)
Description: Trains trees sequentially, using information from the previous trees, then combines the whole set with averaging to provide a better prediction; it uses all of the features and does not ignore 'weak' learners.
Advantages:
  • Uses all features ('weak' learners meaning it does not ignore less important features)
  • Better predictability (higher accuracy)
  • Lower possibility of overfitting with more trees
Disadvantages:
  • Susceptible to outliers since the model uses all the features
  • Lack of interpretability and higher complexity compared to linear classifiers
  • Harder to tune hyperparameters than other models
  • Slow to train or score

Classifier: XGBoost (Extreme Gradient Boosting)
Description: An advanced implementation of the gradient boosting algorithm; it has the boosting capabilities of GBM plus additional components that overcome GBM's weaknesses.
Advantages:
  • Regularization: reduces overfitting (standard GBM has no regularization)
  • Supports parallel processing
  • Built-in routine to handle missing values
  • Tree pruning: XGBoost makes splits up to the specified max_depth and then prunes the tree backwards, removing splits beyond which there is no positive gain
  • Built-in cross validation
Disadvantages:
  • The same as GBM (except in general faster than GBM)

Table 4: Summary of Machine Learning Methods

 

4.2. Procedure

After data cleaning, data imputation, and features selections, we started building our models. There were a few common steps during building our classification models:

  1. Training data sampling: We first trained our models with the full training set. However, for some algorithms it took a very long time to tune the parameters and fit the model, so about 30%-40% of the training data was used instead (see the sketch after this list).
  2. Model fitting for the first try: This gave us a first taste of how the models performed with some initial parameters.
  3. Parameter tuning: For parameter tuning, we did a grid search using the GridSearchCV class from scikit-learn. Since a complete grid search is time-consuming on a big dataset on a local computer, we used a coarse grid search first. After identifying a "better" region on the grid, a finer grid search on that region can be conducted. For some of the algorithms (e.g. ExtraTreesClassifier and XGBoost) we also broke the parameters into groups and tuned them group by group.
  4. Fit the best estimator with the full training set: After we found the best estimator, we refitted the model with the full training data to get a better model; usually more data is better.
  5. Prediction with test data: As the final step, the test data is passed to the model to get the predicted probabilities.
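
A sketch of step 1, drawing a stratified ~30% sample of the training data for tuning (the fraction and variable names are illustrative):

from sklearn.model_selection import train_test_split

# Keep ~30% of the training data, preserving the class balance, for parameter tuning.
X_sample, _, y_sample, _ = train_test_split(
    X_train, target_train,
    train_size=0.3,
    stratify=target_train,
    random_state=27)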

 

4.3. Parameters Tuning

Due to time constraints, we didn't get a chance to finish building models for all of the methods. The three models we were able to finish are Logistic Regression, ExtraTreesClassifier, and XGBoost. Below are the parameter tuning details for these models.

4.3.1. ExtraTreesClassifier

We first built an ExtraTreesClassifier model and initialized the parameters with some reasonable values. We fit the model with our training samples. It returned a score (using the log loss metric) of 0.4432684.

############ Create an ExtraTreesClassifier with initial parameters ########
from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier(
    n_estimators = 100,             # Number of trees
    max_features = 0.8,             # Number of features for each tree
    max_depth = 10,                 # Depth of the tree
    min_samples_split = 2,          # Minimum number of samples required to split
    min_samples_leaf = 1,           # Minimum number of samples in a leaf
    min_weight_fraction_leaf = 0,   # Minimum weighted fraction of the input samples required to be at a leaf node
    criterion = 'gini',             # Use gini, not going to tune it
    random_state = 27,
    n_jobs = -1)

 

Next, we used GridSearchCV to tune the parameters. For ExtraTreesClassifier, we tuned the following five parameters. Note that we didn't tune n_estimators (the number of trees) at first, since a larger number of trees means a slower process; we left finding a reasonable value for n_estimators to the end.

Tuning parameters:

  • n_estimators (not tuned at the beginning)
  • max_features
  • max_depth
  • min_samples_split
  • min_samples_leaf

We started a coarse parameter search based on the initial model. We assigned three values to each parameter (except n_estimators), which gave us 81 combinations.

####### Coarse Tune Parameters #######
import datetime
from sklearn.model_selection import GridSearchCV

para_grid = [{
    'max_features': [0.6, 0.75, 0.9],  # Number of features for each tree
    'max_depth': [5, 15, 25],          # Depth of the tree
    'min_samples_split': [5, 10, 50],  # Minimum number of samples required to split an internal node
    'min_samples_leaf': [5, 10, 50]    # Minimum number of samples in a leaf
    }]

start = datetime.datetime.now()
para_search = GridSearchCV(model, para_grid, scoring = 'log_loss', cv = 5, n_jobs = 4).fit(X_train, target_train)
end = datetime.datetime.now()
print("model training time: {}".format(end - start))

The coarse grid search gave the best combination {'max_features': 0.9, 'min_samples_split': 5, 'max_depth': 25, 'min_samples_leaf': 50}, with a best score of 0.4740551.

After we found the best combination from coarse search, we performed a finer search in a narrower search range around the values we found previously.

####### Now Fine Tune Parameters #######
para_grid = [{    
    'max_features': [0.85, 0.9, 0.95],# Number of features for each tree
    'max_depth': [20, 25, 30],        # Depth of the tree
    'min_samples_split': [3, 5, 7],   # Minimum number of samples required to split
    'min_samples_leaf': [45, 50, 55]  # Minimum number of samples in a leaf
    }]

This grid search gave the best combination {'max_features': 0.85, 'min_samples_split': 3, 'max_depth': 25, 'min_samples_leaf': 45}, and the score improved to 0.4736519. We can see that max_depth remained at 25 in this round, so we used 25 as the final value for max_depth (we could have continued tuning around 25 if we had more time). However, the other three parameters all changed and thus needed further searching.

We repeated the same steps to fine-tune the rest of the parameters. In the last round, we searched n_estimators over the range [100, 300, 500, 700]. The grid search showed that 700 gave the best score. Given that 700 was already a large number of trees, we stopped there rather than testing an even bigger number, which would have taken a very long time to train. The table below shows the ranges we used for the grid search in each round as well as the result of each round.

Round 1
  Search ranges: n_estimators = 100; max_features in [0.6, 0.75, 0.9]; max_depth in [5, 15, 25]; min_samples_split in [5, 10, 50]; min_samples_leaf in [5, 10, 50]
  Best combination: max_features = 0.9, min_samples_split = 5, max_depth = 25, min_samples_leaf = 50
  Score: 0.4740551

Round 2
  Search ranges: n_estimators = 100; max_features in [0.85, 0.9, 0.95]; max_depth in [20, 25, 30]; min_samples_split in [3, 5, 7]; min_samples_leaf in [45, 50, 55]
  Best combination: max_features = 0.85, min_samples_split = 3, max_depth = 25, min_samples_leaf = 45
  Score: 0.4736519

Round 3
  Search ranges: n_estimators = 100; max_features in [0.83, 0.85, 0.87]; max_depth = 25; min_samples_split in [2, 3, 4]; min_samples_leaf in [42, 45, 48]
  Best combination: max_features = 0.87, min_samples_split = 2, max_depth = 25, min_samples_leaf = 45
  Score: 0.4735396

Round 4
  Search ranges: n_estimators = 100; max_features in [0.86, 0.87, 0.88]; max_depth = 25; min_samples_split = 2; min_samples_leaf = 45
  Best combination: max_features = 0.87, min_samples_split = 2, max_depth = 25, min_samples_leaf = 45
  Score: 0.4735396

Round 5
  Search ranges: n_estimators in [100, 300, 500, 700]; max_features = 0.87; max_depth = 25; min_samples_split = 2; min_samples_leaf = 45
  Best combination: n_estimators = 700, max_features = 0.87, min_samples_split = 2, max_depth = 25, min_samples_leaf = 45
  Score: 0.4733141

Table 5: ExtraTrees Parameters Search

We then fitted the model with the full training data and got a much better training score of 0.468914. A sketch of this refit step follows.
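
A sketch of this refit step, assuming para_search is the final GridSearchCV object and X_train_full / target_train_full hold the full training set (hypothetical names):

from sklearn.metrics import log_loss

# Refit the best estimator on the full training set and check the training log loss.
best_model = para_search.best_estimator_
best_model.fit(X_train_full, target_train_full)

train_pred = best_model.predict_proba(X_train_full)[:, 1]
print(log_loss(target_train_full, train_pred))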

4.3.2. XGBoost

For XGBoost, we also built a base model with some reasonable initial parameter values and fit it with our training samples. It returned a score (using the log loss metric) of 0.434604.

######### Set initial values for the model ######
from xgboost import XGBClassifier

xgb_model = XGBClassifier(
    learning_rate =0.1,
    n_estimators=1000,
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective= 'binary:logistic',
    nthread=8,
    scale_pos_weight=1,
    seed=27)

Figure 3: XGBoost Feature Importance

The base model after the initial fit, with the number of trees settling at 97:

XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=0.8,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=5,
       min_child_weight=1, missing=None, n_estimators=97, nthread=8,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=27, silent=True, subsample=0.8)

For XGBoost, we tuned the following seven parameters.

  • learning_rate: learning rate (not tuned at the beginning)
  • max_depth: depth of the trees
  • min_child_weight: minimum number of observations in a node
  • gamma: minimum loss reduction to make a split
  • subsample: number of selected rows for training sample
  • colsample_bytree: number of features for training sample
  • reg_alpha: L1 regularization term on weight (analogous to Lasso regression)

One really useful feature of XGBoost is its built-in cross validation with early stopping: it stops generating more trees at the point where additional trees no longer reduce the error. Therefore we can estimate n_estimators from the initial fit without performing a grid search for it (a sketch follows). We also used a fixed learning rate of 0.1 for faster runs, since a lower learning rate takes longer to fit the model; we left tuning it to the end.
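
A sketch of estimating n_estimators this way with xgboost's cv function (in the spirit of the guide listed in the reference); the 1000-round cap and 50-round early stopping window are illustrative choices:

import xgboost as xgb

# Run XGBoost's built-in CV with early stopping to find a good number of boosting rounds.
xgb_params = xgb_model.get_xgb_params()
dtrain = xgb.DMatrix(X_train, label=target_train)

cv_result = xgb.cv(xgb_params, dtrain,
                   num_boost_round=1000,
                   nfold=5,
                   metrics='logloss',
                   early_stopping_rounds=50)

# The number of rounds kept becomes n_estimators (this is how it ended up at 97 above).
xgb_model.set_params(n_estimators=cv_result.shape[0])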

Since the number of combinations is large with seven parameters, we took a step-by-step approach.

We tuned max_depth and min_child_weight first, as they have the highest impact on the model outcome. We started with wider ranges and then performed further iterations over narrower ranges.

################# Tune max_depth and min_child_weight ##############
param_test1 = {
    'max_depth':range(3,10,2),
    'min_child_weight':range(1,6,2)
}

gsearch1 = GridSearchCV(estimator = XGBClassifier(learning_rate =0.1, n_estimators=100, max_depth=5, \
    min_child_weight=3, gamma=0.3, subsample=0.8, colsample_bytree=0.8, \
    objective= 'binary:logistic', nthread=8, scale_pos_weight=1, seed=27), \
    param_grid = param_test1, scoring='log_loss',n_jobs=4,iid=False, cv=5)

Here, the best combination is max_depth = 5 and min_child_weight = 5, with a best score of 0.4710678. We then went one step deeper, searching for values one above and one below the optimum values.

param_test2 = {
    'max_depth':[4,5,6],
    'min_child_weight':[4,5,6]
}

Here, the best combination is max_depth = 6 and min_child_weight = 6, with a best score of 0.4706758, which is slightly better. Since the score improved with higher values, we continued to increase both parameters in the next search.

param_test3 = {
    'max_depth':[6,7],
    'min_child_weight':[6,7,8]
}

Now the optimal values are max_depth = 6 and min_child_weight = 7, with a best score of 0.470626.

We then applied a similar approach to tune gamma, subsample and colsample_bytree, and reg_alpha separately.

Grid search gamma:

gamma search range        best value  score (log loss)
[0, 0.2, 0.4, 0.6, 0.8]   0.4         0.4702224
[0.3, 0.4, 0.5]           0.4         0.4702224
Table 6: XGBoost Tune Parameter gamma

Grid search subsample and colsample_bytree:

subsample search range  colsample_bytree search range  best combination                             score
[0.6, 0.7, 0.8, 0.9]    [0.6, 0.7, 0.8, 0.9]           subsample = 0.8, colsample_bytree = 0.8      0.4702224
[0.75, 0.8, 0.85]       [0.75, 0.8, 0.85]              subsample = 0.85, colsample_bytree = 0.85    0.4701294
Table 7: XGBoost Tune Parameters subsample & colsample_bytree

Grid search reg_alpha:

reg_alpha search range       best value  score
[1e-5, 1e-2, 0.1, 1, 100]    1e-5        0.4701294
[1e-6, 1e-5, 1e-4]           0.0001      0.47001335
Table 8: XGBoost Tune Parameter reg_alpha

Our last grid search ended with:

  • colsample_bytree: 0.85
  • gamma: 0.4
  • max_depth: 6
  • min_child_weight: 7
  • reg_alpha: 0.0001
  • subsample: 0.85
  • Score: 0.47001335

We then fitted the model with the full training data and got a much better training score of 0.408851. The final tuned model is sketched below.
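
For reference, a sketch of the final XGBoost model assembled from the tuned values above; learning_rate is kept at 0.1 and n_estimators at 97 from the initial fit, since the post does not report different final values for them, and X_train_full / target_train_full are hypothetical names for the full training set.

# Final XGBoost model with the tuned parameters (learning_rate and n_estimators
# are kept from the earlier steps for illustration).
xgb_final = XGBClassifier(
    learning_rate=0.1,
    n_estimators=97,
    max_depth=6,
    min_child_weight=7,
    gamma=0.4,
    subsample=0.85,
    colsample_bytree=0.85,
    reg_alpha=0.0001,
    objective='binary:logistic',
    nthread=8,
    scale_pos_weight=1,
    seed=27)

xgb_final.fit(X_train_full, target_train_full)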

4.3.3. Logistic Regression

Parameter tuning for Logistic Regression is easier than for the tree-based models above. We did a one-time grid search over three parameters:

  • penalty: Used to specify the norm used in the penalization (L1 or L2)
  • fit_intercept: Specifies if a constant should be added to the decision function
  • C: Inverse of regularization strength; smaller values specify stronger regularization.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

logit = LogisticRegression(random_state=27, n_jobs = -1)

para_grid = [{'penalty': ['l1', 'l2'],
              'fit_intercept': [False, True],
              'C': np.logspace(-5, 5, 10)}]

para_search = GridSearchCV(logit, para_grid, scoring='log_loss', cv=5).fit(X_train, target_train)

We got the best combination:

  • penalty = 'l1'
  • C = 0.27825594022071259
  • fit_intercept = True

This gave us a score of 0.494125. We then fit the model with the full training data and got a score of 0.493337, which is not a big improvement.

 

4.4. Result

We finally performed predictions on the test data using our three different models and submitted them to Kaggle. Unsurprisingly, given the training scores, XGBoost gave the best test score as well.

We also used model ensembling to combine our models and, hopefully, produce improved results. Ensemble methods usually produce more accurate solutions than a single model would. We used a simple ensemble method: weighted voting. Using scikit-learn's VotingClassifier class, specific weights can be assigned to each classifier via the weights parameter. When weights are provided, the predicted class probabilities for each classifier are collected, multiplied by the classifier weight, and averaged.

Sample Code for voting ensembling with equal weight:

############ Try ensemble voting with average ##########
from sklearn.ensemble import VotingClassifier

ensemble_avg = VotingClassifier(
    estimators=[('lr', model_logit), ('xgb', model_xgb), ('extratree', model_extratree)],
    voting='soft', weights=[1, 1, 1])

ensemble_avg.fit(X_train, target_train)

 

We tried a few different combinations of models and weights; most of them didn't outscore the single XGBoost model. Only the combination of XGB + ExtraTrees with weights 3:2 slightly improved the score (sketched below).
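
A sketch of that best-performing combination, reusing the fitted model objects from the ensembling code above (X_test is the hypothetical test feature matrix):

from sklearn.ensemble import VotingClassifier

# XGB + ExtraTrees, soft voting with weights 3:2.
ensemble_xgb_et = VotingClassifier(
    estimators=[('xgb', model_xgb), ('extratree', model_extratree)],
    voting='soft',
    weights=[3, 2])

ensemble_xgb_et.fit(X_train, target_train)
test_pred = ensemble_xgb_et.predict_proba(X_test)[:, 1]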

Classifiers                              Weights  Training Score  Test Score
Logistic Regression                      N/A      0.49334         0.49533
ExtraTrees                               N/A      0.46891         0.47071
XGB                                      N/A      0.408851        0.46311
Logistic Regression + ExtraTrees + XGB   1:1:1    0.4372          0.46883
Logistic Regression + ExtraTrees + XGB   1:3:2    0.4249          0.46501
XGB + ExtraTrees                         3:2      0.4134          0.46280
Table 9: Scores for Different Models

 

5. Conclusion

The BNP Paribas Cardif Claims Management competition is challenging due to its anonymized variables and the large amount of missing data. During the competition, quite a few different methods were used for data cleaning, data imputation, and feature selection. Different machine learning classifiers such as Logistic Regression, ExtraTrees, and XGBoost (Random Forest and GBM models were being built but not yet completed) were used for training and prediction. Grid search was applied to fine-tune the parameters in order to get the best model for each classifier. We also used the model ensembling method "weighted voting", which slightly improved the score for one of the combinations.

Many of the Kagglers in the competition used methods similar to ours. However, some of them achieved higher scores than us and the rest of the participants. We believe that what made the difference is feature engineering. Though we applied some simple feature selection techniques such as tree-based feature importance and univariate feature selection, those were obviously not enough to make a big improvement in prediction accuracy. Therefore, for future work, we would like to look deeper at the data and do more feature engineering.

 

Reference

Jain, Aarshay (2016, March 1). Complete Guide to Parameter Tuning in XGBoost (with codes in Python). Retrieved from http://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/

About Authors

Bin Lin

Bin is a former software development professional now positioned for a Data Scientist role. With both strong programming skills and newly acquired machine learning skills, Bin is able to apply scientific thinking to predict or uncover insights...

Sung Pil Moon

Sung Moon is a recent graduate from the Ph.D. program in Human-Computer Interaction, School of Informatics, Indiana University (Indianapolis, IN). Through several startup activities and various research projects collaborating with MITRE, a research corporation, he found opportunities to...

Yun-Ju Huang

A recent graduate of the NYU Integrated Marketing program, Claudia Huang specializes in marketing analysis. Immersed in a creative, trend-sensitive environment, she learned to integrate marketing channels, acquired data analytical skills, and forged a branding mindset. When she...
