Otto Product Classification: Kaggle Case Study
The 4th NYC Data Science Academy class project requires students to work as a team and finish a Kaggle competition. The 2017 online bootcamp spring cohort teamed up and picked the Otto Group Product Classification Challenge. It was one of the most popular challenges with more than 3,500 participating teams before it ended a couple of years ago. Although high leaderboard score was desirable, our primary focus was to take a hands-on learning approach to a wide variety of machine learning algorithms and gain practice using them to solve real-world problems.
Otto Group is one of the world’s biggest e-commerce companies. It sponsored the competition seeking a way to more accurately group their products into product lines for further business analysis and decision-making. The challenge was to come up with a predictive model to best classify products into their respective categories. This blog post presents several different approaches and analyzes the pros and cons of each with their respective outcomes.
The challenge boiled down to a supervised, multinomial classification exercise. The training set provided by Otto Group consisted of about 62,000 observations (individual products). Each had 93 numeric features and a labeled categorical outcome class (product lines). In total, there were nine possible product lines. The goal was to accurately make class predictions on roughly 144,000 unlabeled products based on 93 features.
Since the features and the classes were labeled simply as feat_1, feat_2, class_1, class_2, etc., we couldn’t use any domain knowledge or interpret anything from the real world. We therefore sought a modeling approach centered around predictive accuracy, choosing models that tended to be more complex and less interpretable.
Without much real-world interpretability of any of the features, an initial exploration of the dataset was essential.
An inspection of the response variable revealed an imbalance in class membership. Class_2 was the most frequently-observed product class, and Class_1 was the least frequently-observed. This gave us a rough idea that the data was biased toward certain classes and would require some method of sampling when we fit it to the models down the road.
Next, we took a look at the feature variables. A correlation plot identified the highly correlated pairs among the 93 features. The red tiles below show the intensity of positive correlations, and the blue ones show the intensity of negative correlations. Using information gained from the plot, we could eliminate or combine two features with high correlations.
All 93 features were comprised of numeric values, so we also looked at their value distribution related to the predicted outcome classes. As the plot below shows, some of the features have a limited number of values and can be treated as categorical values when doing feature engineering. It might also be worth standardizing the value ranges for all features if we were to use lasso regression for feature selection.
Principal component analysis and resulting scree plot revealed a "cutoff point" of around 68 components. This threshold indicates that in attempting to capture the collective variability among all feature variables, a significant portion of the variability can be explained with only 68 principal components rather than the original 93 features. Although we opted to keep all features during the project, principal components after the 68th (those not contributing much to cumulative variance) could be dropped from the model as a means of dimension reduction.
Before the model fitting process it was necessary to understand the Kaggle scoring metric for this contest, which would have bearing on the modeling approaches chosen.
Kaggle uses multi-class logarithmic loss to evaluate classification accuracy. The use of logloss has the effect of heavily penalizing test observations where a low probability is estimated for the correct class. It also necessitates that the submission be a probability matrix, with each row containing the probability of the given product being in each of the nine classes. Given this required format, we attempted to develop methods to combine individual model predictions to a single submission probability matrix.
We approached this multinomial classification problem from two major angles, regression models and tree-based models. Each of the team members tried different model types; several methods of ensembling were then attempted to combine individual model outputs into the final contest predictions.
In order to conduct our own test before submitting to Kaggle, we partitioned the 62,000 rows of training data into a training set of 70 percent and a test set of the remaining 30 percent. Through the use of the set.seed() function/parameter in many R functions, we made sure that all models were reproducible, i.e. built and tested with the exact same training and testing sets and therefore could be accurately cross-compared for performance.
While the k-Nearest Neighbors (kNN) algorithm could be effective for some classification problems, its limitations made it poorly suited to the Otto dataset.
One obvious limitation is inherent in the kNN implementation of several R packages. Kaggle required the submission file to be a probability matrix of all nine classes for the given observations. The R packages – we used class here – only returned the predicted probability for what it predicted to be the correct class, not for the other classes. The inability to return predicted probabilities for each class made the model a less useful candidate in this competition.
In an attempt to work around the issue, we developed a process to synthesize the probabilities for all classes. We created kNN models using different values of K and combined the predicted probabilities from these models.
The attempt didn’t result in accurate results, though. The resulting Kaggle log-loss score wasn’t at all competitive. The lack of true multi-class probabilities is almost certainly the cause of the poor performance of the kNN models. With only a predicted probability of one of nine classes for each observation, there was an insufficient basis to predict probabilities well for the other eight classes. Having inadequate probability predictions for the remaining classes resulted in an uncompetitive model.
Generating kNN models was also time consuming. The time required to compute distances between each observation in the test dataset and the training dataset for all 93 features was significant, and limited the opportunity to use grid search to select an optimal value of K and an ideal distance measure.
Given more time, it might be better to use kNN in the process of feature engineering to create meta features for this competition. Instead of using kNN directly as a prediction method, it would be more appropriate to use its output as another feature that xgboost or another more competitive model could use as an input. Choosing different values of K or different distance metrics could produce multiple meta features that other models could use.
Regression methods could be used to solve classification problems as long as the response variables could be grouped into proper buckets. For this problem, we wanted to see if logistic regression would be a valid approach.
Procedurally, we broke the problem down into nine binomial regression problems. For each binomial regression problem, we predicted whether the product would fall into one class and used stepwise feature selection (AIC used here) to improve the strength of the models. We then aggregated the probabilities of the nine classes, weighted by the deviance of these nine models, into one single final probability matrix.
Using the base R lm() function, we found this approach to be extremely time consuming. Running one binomial regression model with stepwise feature selection could take up to an hour for the training set. The multi logloss score was slightly better than kNN, but still not competitive enough. The weights assigned to the nine models seemed to have a significant influence on the accuracy of the model. We might be able to combine boosting and resampling to get better scores, but the limited computational performance of the base lm() function prompted us to look for a faster and more capable alternative.
Since high-performance machine learning platform h2o can be conveniently accessed via an R package, h2o’s machine learning methods were used for the next three models. H2o proved to be a powerful tool in reducing training time and addressing computational challenges on the large Otto training set, as compared to native R packages.
Generalized Linear Models
The h2o.glm() function mimics the generalized linear model capability of base R, with enhancement for grid searching and hyper-parameter tuning. This model was trained on the 70-percent training set with a specification of “multinomial” for error distribution. Although grid search was performed over a range of alpha (penalization type between L1 and L2 norm) and lambda (amount of coefficient shrinkage), predictive accuracy was not improved while computation time increased. Ultimately, no ridge or lasso penalization was implemented. The overall GLM strategy produced average logloss performance on the 30-percent test set.
Random Forests and Gradient Boosted Trees
H2o provides functions for both of these tree-based methods. Despite sharing many of the same tuning parameters and using similar sampling methods, random forest modeling on the Otto dataset – even at a small number of 50 trees – was computationally slow and provided only average predictive accuracy. The gradient boosted trees model, in which decision trees were created sequentially to reduce the residual errors from the previous trees, performed quite well and at a reasonable speed. This model was implemented with ntrees = 100 and the default learn rate of 0.1.
Developing a neural network model using the h2o package provided fast results with moderate accuracy, but it did not match the most effective methods employed, such as extreme boosting.
The h2o package’s deeplearning function was used to construct a neural network model. The ability to compute logloss values and return predicted probabilities by class made the package suitable to provide results that could be readily submitted to Kaggle or combined with the results of other models.
The deeplearning function offers many parameters, including the number of hidden neurons, the number of layers in which neurons are configured and a choice of activation functions. Two layers of 230 hidden neurons yielded the lowest logloss value of the configurations. The activation function selected was the tanh with dropout function in order to avoid overfitting.
While the neural networks model could be built in minutes, it scored average accuracy, with logloss values in the 0.68 range. It is not clear that further tuning of the model parameters would yield significant reduction in the logloss value.
Extreme Boosting - xgboost
Combining high predictive accuracy gradient boosting without added computational efficiency, the xgboost package provided a quick and accurate method for this project, ultimately providing the best logloss value of all models attempted. Cross validation was performed to identify appropriate tree depth and avoid overfitting. Then, an xgboost model was trained and applied the test set to score the logloss value.
Numerous parameters had to be tuned to achieve better predictive accuracy. Due to time limitations, we only tested the following parameters:
- Early Stopping Rounds
- This function effectively stops the program from fitting additional models if the objective function has not improved in the specified number of rounds.
- Eta (Learning Rate)
- This number determines how much error you want to remember from previous models. Since this is an additive model, a low number corresponds to a relatively low learning speed. The model remembers a small percentage of the errors from the fitted models. A high number corresponds to a high learning speed and uses the full error values plus the results of the fitted model to predict your model. A high number could lead to overfitting very quickly. This value is between 0 and 1. We used 0.3 for this project.
- Maximum tree depth
- The default value was 6. We used 5 to prevent overfitting.
- This specifies the maximum number of models you want to build in order to arrive at the best model without overfitting. Utilizing the early stopping rounds value, the value here can be huge. We used 500 for this project, but with early stopping rounds, the best model was usually achieved (meaning the logloss value stopped improving) only after about 120 models.
The multi-logloss value for the 30-percent test set was 0.51 – the best from all of the models discussed above. When we used it on the real test data for Kaggle submission, we got a score of 0.47.
Although each model was submitted to Kaggle to test individual performance, we also aimed to combine models for improved predictive accuracy. Two methods, averaging and stacking, were used for ensembling.
Model averaging is a strategy often employed to diversify, or generalize, model prediction. Many models are fit on a given training set and their predictions are averaged (in the classification context, a majority vote is taken) - diluting the effect of any single, overfit model's prediction on test set accuracy. This method was used to combine test set predictions from our six individual models but did not improve overall accuracy, even when attributing larger voting weights to stronger predictors like xgboost.
Stacking was used as a method in building the xgboost and neural network models. We were interested to attempt stacking, as the method was employed by the top teams on this Kaggle competition's leaderboard. Stacking involves fitting initial, "tier 1" models and using the resulting predictions as meta-features in training subsequent models.
For this project, we used the predictions from an xgboost and neural network model as meta-features for a second-tier xgboost model. The multi-logloss value obtained from our 30-percent test set was 0.56, a worse test accuracy than the xgboost model alone.
Generally speaking, ensembling is an advanced strategy used in Kaggle contests, often for the sake of marginal gains in predictive accuracy. The small number of models and low complexity involved in both our ensemble strategies is a likely reason for limited success in this area (the winning team utilized 30 models stacked across a 3-layer architecture).
To conclude, the best multi-logloss value achieved from our experiments was at 0.47, using the xgboost model alone. This put us around the 1100th position on the competition leaderboard, as of the end of April, 2017.
We note the following key takeaways from this classification project, perhaps also applicable to other competitive Kaggle contests:
- More complex, tree-based models tended to result in the highest test classification accuracy. Although simpler, linear models (in this case, the logistic regression approach attempted) are inherently more interpretable than tree-based models, anonymization of the datasets led us to generally de-value interpretability early on in the modeling process, in favor of more complex models and more powerful predictive accuracy.
- High-performance packages such as h2o and xgboost can be extremely valuable in decreasing computational time. R packages caret and class were quite slow, lessening their effectiveness against the high computational demands of large datasets provided by Kaggle.
Summary of model performances
|kNN||class||To synthesize probabilities for multiple classes, kNN models were created for several values of K and the probabilities predicted by each model were combined.||Range of values of K from K = 1 to K = 50; Euclidean distance metric.||Low predictive accuracy and high computation time.|
|Logistic regression (one-vs-all)||car/mass||Use stepwise logistic regression to build nine models each corresponding to one target class; average the models with a weight of model deviance.||AIC for stepwise feature selection; used deviance for weights.||Low predictive accuracy.|
|Multinomial regression||h2o||h2o.glm function with family "multinomial"||Grid search performed across alpha and lambda; ultimately no regularization used.||Average predictive accuracy with high computation time.|
|Random forest||h2o||h2o.randomForest function with default parameters||n_trees = 50, max_splits = 20, 10 features selected at random per tree split. Grid search proved to expensive, especially at high number of trees.||Average predictive accuracy with high computation time.|
|GBM||h2o||h2o.gbm function with (mostly) default paramaters||ntrees = 100 and learn_rate = 0.1||Good predictive accuracy and computation times. Used on final test set to achieve 2nd best LB score.|
|Neural net||h2o||The h2o package's deeplearning function offers many parameters for neural network modeling along with high computational speeds due to h2o's ability to dedicate all of a CPU’s processing power to model computation.||Used Tanh with Dropout as the activation function. Used two hidden layers of 230 neurons each.||Fast computation with moderate accuracy|
|Extreme boosting||xgboost||xgboost package allowed for extreme boosting and output the best predictive value||Learning rate: 0.3; maximum tree depth: 5; number of rounds: 500||Best predictive accuracy, but high computation time|