Kaggle Competition: BNP Paribas Cardif Claims Management
Contributed by Ablat, Bin Lin, Claudia Huang, and Sung Pil Moon. They took the NYC Data Science Academy 12-week full-time Data Science Bootcamp program from Jan. 11th to April 1st, 2016. This post is based on their fourth class project - Machine Learning (due in the 8th week of the program).
1. Overview
As part of a Kaggle competition, we were challenged to help BNP Paribas Cardif accelerate its claims management process in order to provide better service to its customers.
In today's world, everything moves faster and faster. When facing unexpected events, customers expect their insurer to support them as soon as possible. However, claims management may require different levels of checks before a claim can be approved and a payment made. In this challenge, we classify claims into the following two categories:
- claims for which approval could be accelerated leading to faster payments
- claims for which additional information is required before approval
In this project, we have done:
- Exploratory data analysis
- Feature engineering
- Missing data imputation
- Building prediction models with various machine learning algorithms, including Logistic Regression, Decision Trees, Random Forest, XGBoost, and Neural Networks
2. The Data set
2.1. Data Summary
The data provided by BNP Paribas Cardif consist of:
- TRAIN.CSV: training dataset
- TEST.CSV: test dataset
- SAMPLE_SUBMISSION.CSV: example of submission format
Here is a basic description of the training dataset:
- Total: 111,432 observations (rows) x 133 features (columns)
- Data Types: float, integer, string
- Column "ID": the ID of each row, not being used as predictor
- Column "target": the target of each row, not being used as predictor
- Number of text-based predictors: 19
- Number of numeric predictors: 112
- Number of columns with missing values: 119
- Highly correlated columns: there are 123 pairs with absolute correlation > 0.8, and 63 pairs with absolute correlation > 0.9.
2.2. Exploratory Data Analysis
As a first step, we performed exploratory data analysis to better understand the data. Below are our initial findings from the EDA process.
- Anonymized Data: All data (both categorical and continuous) is anonymized without any description
ID | target | v1 | v2 | v3 | ... | v127 | v128 | v129 | v130 | v131 |
---|---|---|---|---|---|---|---|---|---|---|
3 | 1 | 1.335739 | 8.727474 | C | ... | 3.113719 | 2.024285 | 0 | 0.636365 | 2.857144 |
4 | 1 | NaN | NaN | C | ... | NaN | 1.957825 | 0 | NaN | NaN |
5 | 1 | 0.943877 | 5.310079 | C | ... | 3.922193 | 1.120468 | 2 | 0.883118 | 1.176472 |
6 | 1 | 0.797415 | 8.304757 | C | ... | 2.954381 | 1.990847 | 1 | 1.677108 | 1.034483 |
8 | 1 | NaN | NaN | C | ... | NaN | NaN | 0 | NaN | NaN |
Table 1: Data First Peek
- Too Many Missing Values: Approximately 40% of the data is missing, as shown in Figure 1.
- High correlations: As shown in the correlation matrix plot (Figure 2), some variables are highly correlated (not just 1-to-many but also many-to-many). Correlation values are between -1 and 1. Red color means positive correlation while blue color means negative correlation.
3. Data Pre-Processing
Data pre-processing is an important step before we can even start building our models. Our pre-processing includes data cleaning, data transformation, and feature selection.
3.1. Data Cleaning
3.1.1. Treatment of Integer Variables with Low Numbers of Unique Values
Four of the integer variables have a low number of unique values: v38, v62, v72, and v129. These are treated as categorical variables and factorized into binary dummies.
index | variable | data type | cardinality | missings | sample values |
---|---|---|---|---|---|
39 | v38 | int64 | 12 | 0 | [0, 4] |
63 | v62 | int64 | 8 | 0 | [1, 2] |
73 | v72 | int64 | 13 | 0 | [1, 2] |
130 | v129 | int64 | 10 | 0 | [0, 2] |
Table 2: Integer Variables with Low Cardinality
3.1.2. Imputation
Due to the high missingness in the dataset, we spent a lot of time on exploratory analysis and on figuring out the best imputation methods. The following are the methods we tried.
- Numeric Imputation with Mean: We started by imputing the numeric variables with their means to get the training going.
- Numeric Imputation with Linear Interpolation
- Numeric Imputation with -999: impute with an extreme value
- Categorical Imputation with "NA": Dropping all observations with missing categorical values does not improve predictability. Missingness itself can carry information; for example, for a question like "How long have you been in a marriage?", NA can mean 0 years of marriage.
- Prediction Model: use supervised algorithms (linear regression, KNN, tree-based methods, etc.) to predict the variables with missingness from the other variables. When we tried this method with KNN and a decision tree, it took too long to run and never finished; due to the tight timeline, we ended up giving up on it.
After some experimentation, we went with imputing numeric variables with -999 and categorical variables with the text "NA", as these two options gave us better results. A minimal sketch of this scheme is shown below.
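A minimal sketch of this final imputation scheme, assuming df is the raw training dataframe (the num_variables and cate_variables names are illustrative):

import pandas as pd

# Split columns by type (ID and target are excluded from the predictors)
num_variables = df.select_dtypes(include=['float64', 'int64']).columns.drop(['ID', 'target'])
cate_variables = df.select_dtypes(include=['object']).columns

# Impute numeric columns with an extreme value and categorical columns with the text "NA"
df[num_variables] = df[num_variables].fillna(-999)
df[cate_variables] = df[cate_variables].fillna("NA")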
3.2. Data Transformation
Many machine learning tools, including the scikit-learn estimators we used in Python, accept only numeric input, so the text-based columns had to be converted. Fortunately, the pandas package provides the get_dummies() function, which converts a categorical variable into dummy/indicator variables.
Sample code: convert categorical data into binary dummy variables
# convert text-based columns to dummies
df_new = df.copy()   # start from a copy of the original dataframe
for var_name in cate_variables:
    dummies = pd.get_dummies(df[var_name], prefix=var_name)
    # Drop the current variable, then concat/append the dummy dataframe to the dataframe
    # (the first dummy level is skipped to avoid perfect collinearity).
    df_new = pd.concat([df_new.drop(var_name, axis=1), dummies.iloc[:, 1:]], axis=1)
3.3. Feature Selection
3.3.1. Remove Highly Correlated Variables
As mentioned earlier, there are 63 variable pairs with absolute correlation > 0.9. For each pair, the variable with the higher missingness was removed from the training set. While this may not be strictly necessary for tree-based algorithms, having fewer features helps improve performance (a sketch of this step is given after Table 3).
index | var1 | var2 | correlation | var1 NA count | var2 NA count |
---|---|---|---|---|---|
0 | v12 | v10 | 0.912 | 49851 | 84 |
1 | v25 | v8 | 0.943 | 49840 | 48619 |
2 | v32 | v15 | 0.908 | 48619 | 49836 |
3 | v40 | v34 | -0.903 | 49832 | 111 |
4 | v41 | v29 | 0.904 | 49832 | 49832 |
5 | v43 | v26 | 0.903 | 49832 | 49832 |
... | ... | ... | ... | ... | ... |
56 | v115 | v69 | -0.994 | 49843 | 49895 |
57 | v116 | v43 | 0.978 | 49832 | 49836 |
58 | v118 | v97 | 0.962 | 49843 | 49843 |
59 | v121 | v33 | 0.949 | 48654 | 49832 |
60 | v121 | v83 | 0.966 | 48654 | 49832 |
61 | v128 | v108 | 0.957 | 49832 | 48624 |
62 | v128 | v109 | 0.903 | 49832 | 48624 |
Table 3: Highly Correlated Variables
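A rough sketch of this step, assuming df holds the training data and num_variables lists the numeric predictors (both illustrative names):

# Find highly correlated pairs and, for each pair, drop the variable with more missing values
corr = df[num_variables].corr()                  # pairwise correlations
na_counts = df[num_variables].isnull().sum()     # missing-value count per column

to_drop = set()
cols = corr.columns
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        if abs(corr.iloc[i, j]) > 0.9:           # highly correlated pair
            v1, v2 = cols[i], cols[j]
            to_drop.add(v1 if na_counts[v1] >= na_counts[v2] else v2)

df_reduced = df.drop(list(to_drop), axis=1)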
3.3.2. Tree-based Feature Selection
Many tree-based methods provide feature importances, which can be used to discard irrelevant features. We used the ExtraTrees classifier coupled with SelectFromModel from the scikit-learn feature_selection module. SelectFromModel can be used along with any model that has a coef_ or feature_importances_ attribute after fitting, and it lets us set a threshold for feature selection: features whose importance is greater than or equal to the threshold are kept, while the others are discarded. We picked 0.0003 as the threshold and got 53 features back:
['v1','v10','v101','v102','v105','v107','v108','v109','v110','v112', 'v113','v117','v119','v123','v124','v125','v129','v131','v14','v16', 'v2','v21','v22','v23','v24','v3','v30','v31','v34','v36','v38','v45', 'v47','v5','v50','v51','v52','v56','v58','v62','v64','v66','v69','v70', 'v71','v72','v74','v75','v78','v79','v82','v85','v87','v9','v91','v98']
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

def select_features_from_model(X_train, target_train, threshold):
    # Create an ExtraTreesClassifier with initial parameters
    model = ExtraTreesClassifier(
        n_estimators = 200,             # Number of trees
        max_features = 0.8,             # Number of features for each tree
        max_depth = 10,                 # Depth of the tree
        min_samples_split = 4,          # Minimum number of samples required to split
        min_samples_leaf = 2,           # Minimum number of samples in a leaf
        min_weight_fraction_leaf = 0,   # Minimum weighted fraction of the input samples required to be at a leaf node
        criterion = 'gini',             # Use gini, not going to tune it
        random_state = 27,
        n_jobs = -1)
    model_fit = model.fit(X_train, target_train)
    model_select = SelectFromModel(model, threshold, prefit=True)
    model_select.transform(X_train)
    features_selected = X_train.columns[model_select.get_support()]
    features_dropped = X_train.columns[~model_select.get_support()]
    return (features_selected, features_dropped)
3.3.3. Univariate Feature Selection
We also tried univariate feature selection, which selects the best features based on univariate statistical tests. Again from the same package, we used the SelectPercentile method with an ANOVA F-test for classification to retrieve the best features. Knowing that the ExtraTrees classifier returned around 50 important features, roughly 50% of the features remaining after removing the highly correlated variables, we were interested in the top 50% of the features here as well. We retrieved 47 features this time.
['v103','v105','v108','v109','v110','v113','v117','v119','v122', 'v123','v124','v129','v13','v130','v14','v16','v21','v23','v24', 'v28','v31','v36','v38','v45','v47','v5','v51','v58','v62','v66', 'v68','v69','v70','v72','v73','v74','v78','v79','v80','v82','v83', 'v84','v85','v86','v87','v9','v98']
from sklearn.feature_selection import SelectPercentile, f_classif

def select_features_univariate(X_train, target_train, percent):
    # Keep the top `percent` percent of features by ANOVA F-score
    model_select = SelectPercentile(f_classif, percent)
    model_select.fit_transform(X_train, target_train)
    features_selected = X_train.columns[model_select.get_support()]
    features_dropped = X_train.columns[~model_select.get_support()]
    return (features_selected, features_dropped)
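For illustration, both helpers might be called like this, with X_train and target_train being the cleaned predictor matrix and target vector:

# Keep features with importance >= 0.0003 according to the ExtraTrees model
tree_selected, tree_dropped = select_features_from_model(X_train, target_train, 0.0003)

# Keep the top 50% of features by ANOVA F-score
uni_selected, uni_dropped = select_features_univariate(X_train, target_train, 50)

# Features picked by both approaches
common_features = set(tree_selected) & set(uni_selected)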
4. Model Building
4.1. Machine Learning Methods
In this competition, we tried a number of different machine learning methods. Below is a quick review of each method with a general description.
Classifier | Description |
---|---|
Logistic Regression | Uses regression to solve binary classification problems by fitting a sigmoid function to calculate class probabilities. |
Random Forest | Uses random subsets of observations and features from the training data to create a number of decision trees; uses averaging or voting to improve predictive accuracy and control over-fitting. |
Extra Trees Classifier | Similar to Random Forest: fits a number of randomized decision trees on random subsets of observations and features; when choosing variables at a split, samples are drawn from the entire training set, and splits are chosen completely at random from the range of values in the sample at each split. |
GBM (Gradient Boosting Model) | Trains trees sequentially, each using information from the previous trees, then combines the whole set to provide a stronger prediction; it uses all the features and does not ignore 'weak' learners. |
XGBoost (Extreme Gradient Boost) | An advanced implementation of the gradient boosting algorithm, with efficient boosting capabilities and additional components that overcome weaknesses of GBM. |
Table 4: Summary of Machine Learning Methods
4.2. Procedure
After data cleaning, data imputation, and feature selection, we started building our models. There were a few common steps in building our classification models:
- Training data sampling: We first trained our models with the full training set. However, for some algorithms it took a very long time to tune the parameters and fit the model, so only about 30%-40% of the training data was used for tuning.
- Model fitting for the first try: This gave us a first taste of how each model performed with some initial parameter values.
- Parameter tuning: For parameter tuning, we did a grid search using the GridSearchCV class from scikit-learn. Since a complete grid search is time-consuming on a big dataset on a local computer, we used a coarse grid search first; after identifying a "better" region on the grid, a finer grid search on that region can be conducted. Also, for some of the algorithms (e.g. ExtraTreesClassifier and XGBoost) we broke the parameters into groups and tuned them group by group.
- Fit the best estimator with the full training set: After we found the best estimator, we refitted it on the full training data to get a better model; usually more data is better.
- Prediction with test data: As the final step, the test data is passed to the model to get the predicted probabilities.
4.3. Parameters Tuning
Due to time constraints, we didn't get a chance to finish building models for all of the methods. The three models we were able to finish are Logistic Regression, ExtraTreesClassifier, and XGBoost. Below are the details of our parameter tuning for these models.
4.3.1. ExtraTreesClassifier
We first built an ExtraTreesClassifier model and initialized its parameters with some reasonable values, then fit it with our training samples. It returned a score (using the log loss metric) of 0.4432684.
############ Create a ExtraTreesClassifier with initial parameters ########
model = ExtraTreesClassifier(
    n_estimators = 100,             # Number of trees
    max_features = 0.8,             # Number of features for each tree
    max_depth = 10,                 # Depth of the tree
    min_samples_split = 2,          # Minimum number of samples required to split
    min_samples_leaf = 1,           # Minimum number of samples in a leaf
    min_weight_fraction_leaf = 0,   # Minimum weighted fraction of the input samples required to be at a leaf node
    criterion = 'gini',             # Use gini, not going to tune it
    random_state = 27,
    n_jobs = -1)
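For reference, a cross-validated log loss like the one quoted above can be obtained with a sketch along these lines (the scoring string is 'neg_log_loss' in current scikit-learn; the older versions used here accepted 'log_loss'):

from sklearn.model_selection import cross_val_score   # sklearn.cross_validation in older versions

# 5-fold cross-validated log loss for the initial model
scores = cross_val_score(model, X_train, target_train,
                         scoring='neg_log_loss', cv=5, n_jobs=-1)
print("mean log loss: {:.7f}".format(-scores.mean()))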
Next, we used GridSearchCV to tune the parameters. For ExtraTreesClassifier, we tuned the following five parameters. Note that we didn't tune n_estimators (number of trees) at first, since a larger number of trees means a slower process; we left finding a reasonable value for n_estimators to the end.
Tuning parameters:
- n_estimators (not tuned at the beginning)
- max_features
- max_depth
- min_samples_split
- min_samples_leaf
We started with a coarse parameter search based on the initial model. We assigned three values to each parameter (except n_estimators), which gave us 81 combinations.
import datetime
from sklearn.grid_search import GridSearchCV   # sklearn.model_selection in newer versions

####### Coarse Tune Parameters #######
para_grid = [{
    'max_features': [0.6, 0.75, 0.9],     # Number of features for each tree
    'max_depth': [5, 15, 25],             # Depth of the tree
    'min_samples_split': [5, 10, 50],     # Minimum number of samples required to split an internal node
    'min_samples_leaf': [5, 10, 50]       # Minimum number of samples in a leaf
}]
start = datetime.datetime.now()
para_search = GridSearchCV(model, para_grid, scoring = 'log_loss', cv = 5, n_jobs = 4).fit(X_train, target_train)
end = datetime.datetime.now()
print "model training time: {}".format(end - start)
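The best parameter combination and its score can then be read off the fitted GridSearchCV object, for example:

print(para_search.best_params_)   # best parameter combination found by the search
print(para_search.best_score_)    # its mean cross-validated score (the log-loss scorer reports a negated value, so the sign may need flipping)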
The coarse grid search gave the best combination {'max_features': 0.9, 'min_samples_split': 5, 'max_depth': 25, 'min_samples_leaf': 50}, with a best score of 0.4740551.
After we found the best combination from the coarse search, we performed a finer search in a narrower range around those values.
####### Now Fine Tune Parameters #######
para_grid = [{
    'max_features': [0.85, 0.9, 0.95],    # Number of features for each tree
    'max_depth': [20, 25, 30],            # Depth of the tree
    'min_samples_split': [3, 5, 7],       # Minimum number of samples required to split
    'min_samples_leaf': [45, 50, 55]      # Minimum number of samples in a leaf
}]
This grid search gave the best combination {'max_features': 0.85, 'min_samples_split': 3, 'max_depth': 25, 'min_samples_leaf': 45}, and the score improved to 0.4736519. Since max_depth remained 25 in this round, we used 25 as its final value (we could have continued to tune around 25 if we had more time). The other three parameters all changed, however, and thus needed further searching.
We repeated the same steps to fine-tune the remaining parameters. In the last round, we searched n_estimators over the range [100, 300, 500, 700]; the grid search showed that 700 gave the best score. Given that 700 was already a large number of trees, we stopped there without testing bigger values, which would have taken a very long time to train. The table below shows the ranges we used for the grid search in each round, as well as the result of each round.
n_estimators | max_features | max_depth | min_samples_split | min_samples_leaf | best combination | score |
---|---|---|---|---|---|---|
100 | [0.6, 0.75, 0.9] | [5, 15, 25] | [5, 10, 50] | [5, 10, 50] | max_features=0.9, max_depth=25, min_samples_split=5, min_samples_leaf=50 | 0.4740551 |
100 | [0.85, 0.9, 0.95] | [20, 25, 30] | [3, 5, 7] | [45, 50, 55] | max_features=0.85, max_depth=25, min_samples_split=3, min_samples_leaf=45 | 0.4736519 |
100 | [0.83, 0.85, 0.87] | [25] | [2, 3, 4] | [42, 45, 48] | max_features=0.87, min_samples_split=2, min_samples_leaf=45 | 0.4735396 |
100 | [0.86, 0.87, 0.88] | [25] | [2] | [45] | max_features=0.87 | 0.4735396 |
[100, 300, 500, 700] | [0.87] | [25] | [2] | [45] | n_estimators=700 | 0.4733141 |
Table 5: ExtraTrees Parameters Search
We then fitted the model with the full training data and got a much better training score of 0.468914.
4.3.2. XGBoost
For the XGBoost model, we likewise built a base model with some reasonable parameter values and fit it with our training samples. It returned a score (using the log loss metric) of 0.434604.
from xgboost import XGBClassifier

######### Set initial values for the model ######
xgb_model = XGBClassifier(
    learning_rate = 0.1,
    n_estimators = 1000,
    max_depth = 5,
    min_child_weight = 1,
    gamma = 0,
    subsample = 0.8,
    colsample_bytree = 0.8,
    objective = 'binary:logistic',
    nthread = 8,
    scale_pos_weight = 1,
    seed = 27)
The fitted estimator (note that n_estimators has come down from 1000 to 97, as explained below) was:

XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=0.8,
    gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=5,
    min_child_weight=1, missing=None, n_estimators=97, nthread=8,
    objective='binary:logistic', reg_alpha=0, reg_lambda=1,
    scale_pos_weight=1, seed=27, silent=True, subsample=0.8)
For XGB, we were tuning the following seven parameters.
- learning_rate: learning rate (not tuned at the beginning)
- max_depth: depth of the trees
- min_child_weight: minimum sum of instance weights (roughly, the minimum number of observations) needed in a child node
- gamma: minimum loss reduction required to make a split
- subsample: fraction of observations sampled for each tree
- colsample_bytree: fraction of features sampled for each tree
- reg_alpha: L1 regularization term on weights (analogous to Lasso regression)
One really useful feature of XGBoost is built-in early stopping: it stops generating more trees once additional trees no longer reduce the error. Therefore we can estimate n_estimators from an initial fit without performing a grid search for it. We also used a fixed learning rate of 0.1 for faster training, since a lower learning rate takes longer to fit the model; we left tuning it to the end.
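A sketch of how such an estimate can be made, following the approach of the XGBoost tuning guide in the reference and assuming X_train and target_train are pandas objects:

import xgboost as xgb

# Estimate n_estimators via cross-validation with early stopping:
# stop adding trees once the CV log loss has not improved for 50 rounds.
xgb_params = xgb_model.get_xgb_params()
dtrain = xgb.DMatrix(X_train.values, label=target_train.values)
cv_result = xgb.cv(xgb_params, dtrain, num_boost_round=1000, nfold=5,
                   metrics='logloss', early_stopping_rounds=50)
xgb_model.set_params(n_estimators=cv_result.shape[0])   # e.g. 97 rounds in our case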
Since the number of combinations is large with seven parameters, we took a step-by-step approach.
We tuned the max_depth and min_child_weight parameters first, as they have the highest impact on the model outcome. We started with wide ranges and then performed further iterations over smaller ranges.
################# Tune max_depth and min_child_weight ##############
param_test1 = {
    'max_depth': range(3, 10, 2),
    'min_child_weight': range(1, 6, 2)
}
gsearch1 = GridSearchCV(
    estimator = XGBClassifier(learning_rate = 0.1, n_estimators = 100, max_depth = 5,
                              min_child_weight = 3, gamma = 0.3, subsample = 0.8, colsample_bytree = 0.8,
                              objective = 'binary:logistic', nthread = 8, scale_pos_weight = 1, seed = 27),
    param_grid = param_test1, scoring = 'log_loss', n_jobs = 4, iid = False, cv = 5)
Here, we see that the best combination is max_depth = 5 and min_child_weight = 5, with a best score of 0.4710678. We then went one step deeper and looked for the optimum values, searching one value above and one below the current optimum.
param_test2 = { 'max_depth':[4,5,6], 'min_child_weight':[4,5,6] }
Here, the best combination is max_depth = 6 and min_child_weight = 6, with a best score of 0.4706758, slightly better than before. Since the score improved with higher values, we continued to increase both parameters in the next search.
param_test3 = { 'max_depth':[6,7], 'min_child_weight':[6,7,8] }
Now we see that the optimal values are max_depth = 6 and min_child_weight = 7, with a best score of 0.470626.
We then applied a similar approach to tune gamma, subsample and colsample_bytree, and reg_alpha separately.
Grid search gamma:
gamma | best value | score (log loss) |
---|---|---|
[0, 0.2, 0.4, 0.6, 0.8] | 0.4 | 0.4702224 |
[0.3, 0.4, 0.5] | 0.4 | 0.4702224 |
Table 6: XGBoost Tune Parameter gamma
Grid search subsample and colsample_bytree:
subsample | colsample_bytree | best combination | score |
---|---|---|---|
[0.6, 0.7, 0.8, 0.9] | [0.6, 0.7, 0.8, 0.9] | subsample=0.8, colsample_bytree=0.8 | 0.4702224 |
[0.75, 0.8, 0.85] | [0.75, 0.8, 0.85] | subsample=0.85, colsample_bytree=0.85 | 0.4701294 |
Table 7: XGBoost Tune Parameters subsample & colsample_bytree
Grid search reg_alpha:
reg_alpha | best value | score |
---|---|---|
[1e-5, 1e-2, 0.1, 1, 100] | 1e-5 | 0.4701294 |
[1e-6, 1e-5, 1e-4] | 0.0001 | 0.47001335 |
Table 8: XGBoost Tune Parameter reg_alpha
Our last grid search ended with the following values (a sketch of the corresponding final estimator follows the list):
- colsample_bytree: 0.85
- gamma: 0.4
- max_depth: 6
- min_child_weight: 7
- reg_alpha: 0.0001
- subsample: 0.85
- Score: 0.47001335
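Putting these values together, the final estimator might look like the following sketch (n_estimators is kept at the earlier early-stopping estimate):

# Final XGBoost model with the tuned parameter values (a sketch)
xgb_final = XGBClassifier(
    learning_rate = 0.1,
    n_estimators = 97,          # from the early-stopping estimate above
    max_depth = 6,
    min_child_weight = 7,
    gamma = 0.4,
    subsample = 0.85,
    colsample_bytree = 0.85,
    reg_alpha = 0.0001,
    objective = 'binary:logistic',
    nthread = 8,
    scale_pos_weight = 1,
    seed = 27)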
We then fitted the model with the full training data and got a much better training score of 0.408851.
4.3.3. Logistic Regression
Parameter tuning for Logistic Regression is easier than for the tree-based models above; we did a one-time grid search over three parameters:
- penalty: Used to specify the norm used in the penalization (L1 or L2)
- fit_intercept: Specifies if a constant should be added to the decision function
- C: Inverse of regularization strength; smaller values specify stronger regularization.
import numpy as np
from sklearn.linear_model import LogisticRegression

logit = LogisticRegression(random_state = 27, n_jobs = -1)
para_grid = [{'penalty': ['l1', 'l2'],
              'fit_intercept': [False, True],
              'C': np.logspace(-5, 5, 10)}]
para_search = GridSearchCV(logit, para_grid, scoring = 'log_loss', cv = 5).fit(X_train, target_train)
We got the best combination:
- penalty = 'l1'
- C = 0.27825594022071259
- fit_intercept = True
This gave us a score of 0.494125. We then fit the model with the full training data and got a score of 0.493337, which is not a big improvement.
4.4. Result
Finally, we performed predictions on the test data using our three models and submitted the results to Kaggle. Unsurprisingly, given the training scores, XGB gave the best test score as well.
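As an illustration, a submission file could be produced roughly like this (model_xgb being a fitted classifier, X_test the processed test predictors, and test_ids the ID column of TEST.CSV; the PredictedProb column name follows the sample submission format):

# Predict the probability of the positive class and write the Kaggle submission file
pred_prob = model_xgb.predict_proba(X_test)[:, 1]
submission = pd.DataFrame({'ID': test_ids, 'PredictedProb': pred_prob})
submission.to_csv('submission_xgb.csv', index=False)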
We also used model ensembling, which combines our models in the hope of producing an improved result; ensemble methods usually produce more accurate predictions than a single model would. We used a simple ensemble method: weighted voting. Using scikit-learn's VotingClassifier class, specific weights can be assigned to each classifier via the weights parameter. When weights are provided, the predicted class probabilities of each classifier are collected, multiplied by the classifier weight, and averaged.
Sample Code for voting ensembling with equal weight:
from sklearn.ensemble import VotingClassifier

############ Try ensemble voting with average ##########
ensemble_avg = VotingClassifier(estimators = [('lr', model_logit),
                                              ('xgb', model_xgb),
                                              ('extratree', model_extratree)],
                                voting = 'soft', weights = [1, 1, 1])
ensemble_avg.fit(X_train, target_train)
We tried a few different combinations of models and weights; most of them didn't outscore the single XGBoost model. Only the combination of XGB + ExtraTrees with weights 3:2 slightly improved the score.
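For reference, that best-scoring combination could be set up as a variation of the sample code above, with the two fitted estimators and a 3:2 weighting:

# XGB + ExtraTrees soft-voting ensemble weighted 3:2
ensemble_32 = VotingClassifier(estimators = [('xgb', model_xgb), ('extratree', model_extratree)],
                               voting = 'soft', weights = [3, 2])
ensemble_32.fit(X_train, target_train)
test_prob = ensemble_32.predict_proba(X_test)[:, 1]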
The table below summarizes the scores for the different models and ensembles.
Classifiers | Weight | Training Score | Test Score |
---|---|---|---|
Logistic Regression | N/A | 0.49334 | 0.49533 |
ExtraTrees | N/A | 0.46891 | 0.47071 |
XGB | N/A | 0.408851 | 0.46311 |
Logistic Regression + ExtraTrees + XGB | 1:1:1 | 0.4372 | 0.46883 |
Logistic Regression + ExtraTrees + XGB | 1:3:2 | 0.4249 | 0.46501 |
XGB + ExtraTrees | 3:2 | 0.4134 | 0.46280 |
Table 9: Scores for Different Models and Ensembles
5. Conclusion
The BNP Paribas Cardif Claims Management competition is challenging due to its anonymized variables and the large amount of missing data. During the competition, quite a few different methods were used for data cleaning, data imputation, and feature selection. Several machine learning classifiers, namely Logistic Regression, ExtraTrees, and XGBoost (Random Forest and GBM models were started but not completed), were used for training and prediction. Grid search was applied to fine-tune the parameters in order to get the best model for each classifier. We also used the model ensembling method of weighted voting, which slightly improved the score for one of the combinations.
Many Kagglers in the competition used methods similar to ours; however, some of them achieved higher scores than us and the rest of the participants. We believe that what made the difference is feature engineering. Though we applied some simple feature selection techniques such as tree-based feature importance and univariate feature selection, those were clearly not enough to make a big improvement in prediction accuracy. Therefore, for future work, we would like to look deeper into the data and do more feature engineering.
Reference
Jain, Aarshay (2016, March 1). Complete Guide to Parameter Tuning in XGBoost (with codes in Python). Retrieved from http://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/