Otto Product Classification: Kaggle Case Study

Introduction

The 4th NYCDSA class project requires students to work as a team and finish a Kaggle competition. The 2017 online bootcamp spring cohort teamed up and picked the Otto Group Product Classification Challenge.  It was one of the most popular challenges with more than 3,500 participating teams before it ended a couple of years ago. Although high leaderboard score was desirable, our primary focus was to take a hands-on learning approach to a wide variety of machine learning algorithms and gain practice using them to solve real-world problems.

Otto Group is one of the world’s biggest e-commerce companies. It sponsored the competition seeking a way to more accurately group their products into product lines for further business analysis and decision-making. The challenge was to come up with a predictive model to best classify products into their respective categories. This blog post presents several different approaches and analyzes the pros and cons of each with their respective outcomes.

Dataset

The challenge boiled down to a supervised, multinomial classification exercise. The training set provided by Otto Group consisted of about 62,000 observations (individual products). Each had 93 numeric features and a labeled categorical outcome class (product lines). In total, there were nine possible product lines. The goal was to accurately make class predictions on roughly 144,000 unlabeled products based on 93 features.

Since the features and the classes were labeled simply as feat_1, feat_2, class_1, class_2, etc., we couldn’t use any domain knowledge or interpret anything from the real world. We therefore sought a modeling approach centered around predictive accuracy, choosing models that tended to be more complex and less interpretable.

EDA

Without much real-world interpretability of any of the features, an initial exploration of the dataset was essential.

An inspection of the response variable revealed an imbalance in class membership. Class_2 was the most frequently-observed product class, and Class_1 was the least frequently-observed. This gave us a rough idea that the data was biased toward certain classes and would require some method of sampling when we fit it to the models down the road.

Response variable frequency

Response variable frequency

Next, we took a look at the feature variables. A correlation plot identified the highly correlated pairs among the 93 features. The red tiles below show the intensity of positive correlations, and the blue ones show the intensity of negative correlations. Using information gained from the plot, we could eliminate or combine two features with high correlations.

Correlation plot

Correlation plot

All 93 features were comprised of numeric values, so we also looked at their value distribution related to the predicted outcome classes. As the plot below shows, some of the features have a limited number of values and can be treated as categorical values when doing feature engineering. It might also be worth standardizing the value ranges for all features if we were to use lasso regression for feature selection.

Value distribution of the first 30 features

Value distribution of the first 30 features

Principal component analysis and resulting scree plot revealed a "cutoff point" of around 68 components. This threshold indicates that in attempting to capture the collective variability among all feature variables, a significant portion of the variability can be explained with only 68 principal components rather than the original 93 features. Although we opted to keep all features during the project, principal components after the 68th (those not contributing much to cumulative variance) could be dropped from the model as a means of dimension reduction.

PCA scree plot

PCA scree plot

Scoring

Before the model fitting process it was necessary to understand the Kaggle scoring metric for this contest, which would have bearing on the modeling approaches chosen.

Screen Shot 2017-05-04 at 5.41.02 PM

Kaggle uses multi-class logarithmic loss to evaluate classification accuracy. The use of logloss has the effect of heavily penalizing test observations where a low probability is estimated for the correct class. It also necessitates that the submission be a probability matrix, with each row containing the probability of the given product being in each of the nine classes. Given this required format, we attempted to develop methods to combine individual model predictions to a single submission probability matrix.

Models

We approached this multinomial classification problem from two major angles, regression models and tree-based models. Each of the team members tried different model types; several methods of ensembling were then attempted to combine individual model outputs into the final contest predictions.

In order to conduct our own test before submitting to Kaggle, we partitioned the 62,000 rows of training data into a training set of 70 percent and a test set of the remaining 30 percent. Through the use of the set.seed() function/parameter in many R functions, we made sure that all models were reproducible, i.e. built and tested with the exact same training and testing sets and therefore could be accurately cross-compared for performance. 

k-Nearest Neighbors

While the k-Nearest Neighbors (kNN) algorithm could be effective for some classification problems, its limitations made it poorly suited to the Otto dataset.

One obvious limitation is inherent in the kNN implementation of several R packages. Kaggle required the submission file to be a probability matrix of all nine classes for the given observations. The R packages – we used class here – only returned the predicted probability for what it predicted to be the correct class, not for the other classes. The inability to return predicted probabilities for each class made the model a less useful candidate in this competition.

In an attempt to work around the issue, we developed a process to synthesize the probabilities for all classes. We created kNN models using different values of K and combined the predicted probabilities from these models.

The attempt didn’t result in accurate results, though. The resulting Kaggle log-loss score wasn’t at all competitive. The lack of true multi-class probabilities is almost certainly the cause of the poor performance of the kNN models. With only a predicted probability of one of nine classes for each observation, there was an insufficient basis to predict probabilities well for the other eight classes. Having inadequate probability predictions for the remaining classes resulted in an uncompetitive model.

Generating kNN models was also time consuming. The time required to compute distances between each observation in the test dataset and the training dataset for all 93 features was significant, and limited the opportunity to use grid search to select an optimal value of K and an ideal distance measure.

Given more time, it might be better to use kNN in the process of feature engineering to create meta features for this competition. Instead of using kNN directly as a prediction method, it would be more appropriate to use its output as another feature that xgboost or another more competitive model could use as an input. Choosing different values of K or different distance metrics could produce multiple meta features that other models could use.

Logistic Regression

Regression methods could be used to solve classification problems as long as the response variables could be grouped into proper buckets. For this problem, we wanted to see if logistic regression would be a valid approach.

Procedurally, we broke the problem down into nine binomial regression problems. For each binomial regression problem, we predicted whether the product would fall into one class and used stepwise feature selection (AIC used here) to improve the strength of the models. We then aggregated the probabilities of the nine classes, weighted by the deviance of these nine models, into one single final probability matrix.  

Using the base R lm() function, we found this approach to be extremely time consuming. Running one binomial regression model with stepwise feature selection could take up to an hour for the training set. The multi logloss score was slightly better than kNN, but still not competitive enough. The weights assigned to the nine models seemed to have a significant influence on the accuracy of the model. We might be able to combine boosting and resampling to get better scores, but the limited computational performance of the base lm() function prompted us to look for a faster and more capable alternative.


H2o Modeling

Since high-performance machine learning platform h2o can be conveniently accessed via an R package, h2o’s machine learning methods were used for the next three models. H2o proved to be a powerful tool in reducing training time and addressing computational challenges on the large Otto training set, as compared to native R packages.

Generalized Linear Models

The h2o.glm() function mimics the generalized linear model capability of base R, with enhancement for grid searching and hyper-parameter tuning. This model was trained on the 70-percent training set with a specification of “multinomial” for error distribution. Although grid search was performed over a range of alpha (penalization type between L1 and L2 norm) and lambda (amount of coefficient shrinkage), predictive accuracy was not improved while computation time increased. Ultimately, no ridge or lasso penalization was implemented. The overall GLM strategy produced average logloss performance on the 30-percent test set.

Random Forests and Gradient Boosted Trees

H2o provides functions for both of these tree-based methods. Despite sharing many of the same tuning parameters and using similar sampling methods, random forest modeling on the Otto dataset – even at a small number of 50 trees – was computationally slow and provided only average predictive accuracy. The gradient boosted trees model, in which decision trees were created sequentially to reduce the residual errors from the previous trees, performed quite well and at a reasonable speed. This model was implemented with ntrees = 100 and the default learn rate of 0.1.

Neural Networks

Developing a neural network model using the h2o package provided fast results with moderate accuracy, but it did not match the most effective methods employed, such as extreme boosting.

The h2o package’s deeplearning function was used to construct a neural network model. The ability to compute logloss values and return predicted probabilities by class made the package suitable to provide results that could be readily submitted to Kaggle or combined with the results of other models.

The deeplearning function offers many parameters, including the number of hidden neurons, the number of layers in which neurons are configured and a choice of activation functions. Two layers of 230 hidden neurons yielded the lowest logloss value of the configurations. The activation function selected was the tanh with dropout function in order to avoid overfitting.

While the neural networks model could be built in minutes, it scored average accuracy, with logloss values in the 0.68 range. It is not clear that further tuning of the model parameters would yield significant reduction in the logloss value.


Extreme Boosting - xgboost

Combining high predictive accuracy gradient boosting without added computational efficiency, the xgboost package provided a quick and accurate method for this project, ultimately providing the best logloss value of all models attempted. Cross validation was performed to identify appropriate tree depth and avoid overfitting. Then, an xgboost model was trained and applied the test set to score the logloss value.

Numerous parameters had to be tuned to achieve better predictive accuracy. Due to time limitations, we only tested the following parameters:

  • Early Stopping Rounds
    • This function effectively stops the program from fitting additional models if the objective function has not improved in the specified number of rounds.
  • Eta (Learning Rate)
    • This number determines how much error you want to remember from previous models. Since this is an additive model, a low number corresponds to a relatively low learning speed. The model remembers a small percentage of the errors from the fitted models. A high number corresponds to a high learning speed and uses the full error values plus the results of the fitted model to predict your model. A high number could lead to overfitting very quickly. This value is between 0 and 1. We used 0.3 for this project.
  • Maximum tree depth
    • The default value was 6. We used 5 to prevent overfitting.
  • Nrounds
    • This specifies the maximum number of models you want to build in order to arrive at the best model without overfitting. Utilizing the early stopping rounds value, the value here can be huge. We used 500 for this project, but with early stopping rounds, the best model was usually achieved (meaning the logloss value stopped improving) only after about 120 models.

The multi-logloss value for the 30-percent test set was 0.51 – the best from all of the models discussed above. When we used it on the real test data for Kaggle submission, we got a score of 0.47.

Ensembling Methods

Although each model was submitted to Kaggle to test individual performance, we also aimed to combine models for improved predictive accuracy. Two methods, averaging and stacking, were used for ensembling.

Averaging

Model averaging is a strategy often employed to diversify, or generalize, model prediction. Many models are fit on a given training set and their predictions are averaged (in the classification context, a majority vote is taken) - diluting the effect of any single, overfit model's prediction on test set accuracy. This method was used to combine test set predictions from our six individual models but did not improve overall accuracy, even when attributing larger voting weights to stronger predictors like xgboost.

Stacking

Stacking was used as a method in building the xgboost and neural network models. We were interested to attempt stacking, as the method was employed by the top teams on this Kaggle competition's leaderboard. Stacking involves fitting initial, "tier 1" models and using the resulting predictions as meta-features in training subsequent models.

For this project, we used the predictions from an xgboost and neural network model as meta-features for a second-tier xgboost model. The multi-logloss value obtained from our 30-percent test set was 0.56, a worse test accuracy than the xgboost model alone.

Generally speaking, ensembling is an advanced strategy used in Kaggle contests, often for the sake of marginal gains in predictive accuracy. The small number of models and low complexity involved in both our ensemble strategies is a likely reason for limited success in this area (the winning team utilized 30 models stacked across a 3-layer architecture).

Takeaways

To conclude, the best multi-logloss value achieved from our experiments was at 0.47, using the xgboost model alone. This put us around the 1100th position on the competition leaderboard, as of the end of April, 2017.

We note the following key takeaways from this classification project, perhaps also applicable to other competitive Kaggle contests:

  1. More complex, tree-based models tended to result in the highest test classification accuracy. Although simpler, linear models (in this case, the logistic regression approach attempted) are inherently more interpretable than tree-based models, anonymization of the datasets led us to generally de-value interpretability early on in the modeling process, in favor of more complex models and more powerful predictive accuracy.
  2. High-performance packages such as h2o and xgboost can be extremely valuable in decreasing computational time. R packages caret and class were quite slow, lessening their effectiveness against the high computational demands of large datasets provided by Kaggle.

Appendix:

Summary of model performances

Model Package(s) Description Parameters used Outcome
kNN class To synthesize probabilities for multiple classes, kNN models were created for several values of K and the probabilities predicted by each model were combined. Range of values of K from K = 1 to K  = 50; Euclidean distance metric. Low predictive accuracy and high computation time.
Logistic regression (one-vs-all) car/mass Use stepwise logistic regression to build nine models each corresponding to one target class; average the models with a weight of model deviance. AIC for stepwise feature selection; used deviance for weights. Low predictive accuracy.
Multinomial regression h2o h2o.glm function with family "multinomial" Grid search performed across alpha and lambda; ultimately no regularization used. Average predictive accuracy with high computation time.
Random forest h2o h2o.randomForest function with default parameters n_trees = 50, max_splits = 20, 10 features selected at random per tree split. Grid search proved to expensive, especially at high number of trees. Average predictive accuracy with high computation time.
GBM h2o h2o.gbm function with (mostly) default paramaters ntrees = 100 and learn_rate = 0.1 Good predictive accuracy and computation times. Used on final test set to achieve 2nd best LB score.
Neural net h2o The h2o package's deeplearning function offers many parameters for neural network modeling along with high computational speeds due to h2o's ability to dedicate all of a CPU’s processing power to model computation. Used Tanh with Dropout as the activation function. Used two hidden layers of 230 neurons each. Fast computation with moderate accuracy
Extreme boosting xgboost xgboost package allowed for extreme boosting and output the best predictive value Learning rate: 0.3; maximum tree depth: 5; number of rounds: 500 Best predictive accuracy, but high computation time

 

About Authors

Thomas Kassel

Thomas Kassel

Thomas completed a B.A. at Wesleyan University, focusing on molecular neuroscience while completing additional coursework in math and economics. After transitioning from the life sciences into the field of clean technology he joined his current firm, energy efficiency...
View all posts by Thomas Kassel >
Avatar

Evan Frisch

Evan Frisch has more than a decade and a half of experience using technology and data to achieve results for organizations in the private, public, and non-profit sectors. Evan received his undergraduate degree with honors from Yale University,...
View all posts by Evan Frisch >
Efezino Erome-Utunedi

Efezino Erome-Utunedi

Efezino recently completed his MENG in Mechatronics Design at the University of British Columbia, focusing on controls engineering. He now works full-time at an engineering consulting firm while enrolled in the NYCDSA's 2017 January to May online cohort,...
View all posts by Efezino Erome-Utunedi >

Related Articles

Leave a Comment

Your email address will not be published. Required fields are marked *

No comments found.

View Posts by Categories


Our Recent Popular Posts


View Posts by Tags