Walmart Kaggle: Trip Type Classification

Contributed by Joe Eckert, Brandon Schlenker, William Aiken, and Daniel Donohue. They took the NYC Data Science Academy 12-week full-time data science bootcamp program from Sep. 23 to Dec. 18, 2015. This post is based on their fourth in-class project (due after the 8th week of the program).

Introduction

Walmart uses trip type classification to segment its shoppers and their store visits in order to improve the shopping experience. Walmart's trip types are created from a combination of existing customer insights and purchase history data. The purpose of the Kaggle competition is to derive Walmart's classification labels using only the purchase data provided, which in turn helps Walmart refine its trip type classification process.

 

About the Data

  • ~ 96k store visits, segmented into 38 trip types
  • Training and testing data included >1.2 million observations with 6 features:
    • Visit Number, Weekday, UPC, Scan Count, Department Description, Fineline Number
  • Using the 6 provided features, the team was tasked with building the best model to accurately classify each visit into its proper trip type category
  • Challenges with the data
    • Each observation represented a single item rather than a complete visit
    • Observations therefore had to be grouped by visit before a trip could be classified
    • The number of unique UPCs and Fineline Numbers made dummy variables impractical - the resulting data set was too large to process
    • Instead, the Department Description was used to create dummy variables (a rough sketch of this roll-up appears below)
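
Grouping item-level rows into one row per visit with department dummies only takes a few pandas operations. The snippet below is a minimal sketch of that roll-up, not the team's exact code; it assumes the column names provided in the competition files (VisitNumber, TripType, DepartmentDescription).

import pandas as pd

items = pd.read_csv('train.csv')              # one row per scanned item

# One dummy column per department
dept_dummies = pd.get_dummies(items['DepartmentDescription'], prefix='Dept')
dept_dummies['VisitNumber'] = items['VisitNumber']

# One row per visit: how many items from each department were scanned
visit_features = dept_dummies.groupby('VisitNumber').sum()

# The trip type is constant within a visit, so any single value per visit works
visit_labels = items.groupby('VisitNumber')['TripType'].first()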

 

Model 1: Logistic Regression

The team first implemented multinomial logistic regression to determine trip type.  Standard logistic regression handles two-class problems; to extend it to 38 trip types, a separate logistic regression is fit for each class against all the others (one-vs-rest), and the process repeats until every class has been regressed against the rest.

  • Log loss score: 4.22834


import pandas as pd
import numpy as np
import scipy as sp
from sklearn.linear_model import LogisticRegression
import time

start_time = time.time()

# Read the competition data; drop training rows with a missing Fineline Number
waltrain = pd.read_csv('train.csv')
waltest = pd.read_csv('test.csv')
waltrain = waltrain[waltrain.FinelineNumber.notnull()]
# Full copies are used here; a smaller slice can be substituted while experimenting
waltrain_part = waltrain[:]
waltest_part = waltest[:]

model = LogisticRegression()
# Dummy-encode Weekday and DepartmentDescription for the training set
x = waltrain_part[['Weekday', 'DepartmentDescription']]
y = waltrain_part['TripType']
x = pd.get_dummies(x)
z = waltest_part[['Weekday', 'DepartmentDescription']]

# Append one artificial row so the test-set dummies include a department that
# otherwise appears only in training; its prediction is dropped after scoring
zend = pd.DataFrame({'Weekday': ['Sunday'],
                     'DepartmentDescription': ['HEALTH AND BEAUTY AIDS']},
                    index=[len(z)])
z = pd.concat([z, zend])
z = pd.get_dummies(z)

model.fit(x, y)
print "The model coefficients are:"
print model.coef_
print "The intercepts are:"
print model.intercept_

print "model created after %f seconds" % (time.time() - start_time)

submission = model.predict_proba(z)
submissiondf = pd.DataFrame(submission)

submissiondf.drop(len(submissiondf)-1)

# Average the item-level probabilities within each visit so the submission
# has one row per VisitNumber
dex = waltest.iloc[:, 0]
submurge = pd.concat([dex, submissiondf], axis=1)
avgmurg = submurge.groupby(submurge.VisitNumber).mean()
avgmurg.reset_index(drop=True, inplace=True)

# Rename the probability columns to the TripType_* labels Kaggle expects
avgmurg.columns = ['VisitNumber', 'TripType_3', 'TripType_4', 'TripType_5', 'TripType_6',
                   'TripType_7', 'TripType_8', 'TripType_9', 'TripType_12', 'TripType_14',
                   'TripType_15', 'TripType_18', 'TripType_19', 'TripType_20', 'TripType_21',
                   'TripType_22', 'TripType_23', 'TripType_24', 'TripType_25', 'TripType_26',
                   'TripType_27', 'TripType_28', 'TripType_29', 'TripType_30', 'TripType_31',
                   'TripType_32', 'TripType_33', 'TripType_34', 'TripType_35', 'TripType_36',
                   'TripType_37', 'TripType_38', 'TripType_39', 'TripType_40', 'TripType_41',
                   'TripType_42', 'TripType_43', 'TripType_44', 'TripType_999']

avgmurg[['VisitNumber']] = avgmurg[['VisitNumber']].astype(int)
avgmurg.to_csv('KaggleSub_04.csv', index=False)

print("finished after %f seconds" % (time.time() - start_time))

Model 2: Random Forest

For the second model the team implemented a random forest.  Random forests are a collection of decision trees, and classification is done by a 'majority vote' of those trees: for a given observation, the class predicted most frequently across the forest becomes that observation's label.
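
In scikit-learn that voting happens inside RandomForestClassifier, and the class probabilities it reports are the trees' votes averaged, which is what a log-loss submission needs. A minimal sketch, reusing the visit_features and visit_labels from the roll-up sketch earlier (hyperparameter values here are illustrative, not the team's tuned settings):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
rf.fit(visit_features, visit_labels)

# Each row: predicted probability of every trip type, averaged over the trees
proba = rf.predict_proba(visit_features)

# feature_importances_ gives a rough ranking of which features the trees rely on
print(sorted(zip(rf.feature_importances_, visit_features.columns), reverse=True)[:10])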

Engineered Features (a few of these are sketched in code after the list):

  • Total number of items per visit
  • Percentage of items purchased based on Department
  • Percentage of items purchased based on Fineline Number
  • Percentage of items purchased by UPC
  • Count of different items purchased (based on UPC)
  • Count of returned items
  • Boolean for presence of returned item
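
The sketch below is a rough pandas illustration of a few of the features listed above, not the team's exact code. It assumes the competition's column names (VisitNumber, Upc, ScanCount, DepartmentDescription) and relies on ScanCount being negative for returned items, as in the competition data.

import pandas as pd

items = pd.read_csv('train.csv')
g = items.groupby('VisitNumber')

feats = pd.DataFrame({
    # Total number of items purchased per visit (returns have negative ScanCount)
    'n_items': g['ScanCount'].apply(lambda s: s[s > 0].sum()),
    # Count of distinct products purchased (based on UPC)
    'n_unique_upc': g['Upc'].nunique(),
    # Count of returned items
    'n_returns': g['ScanCount'].apply(lambda s: -s[s < 0].sum()),
})
# Boolean for the presence of at least one returned item
feats['has_return'] = (feats['n_returns'] > 0).astype(int)

# Percentage of a visit's item rows coming from each department
dept_counts = pd.crosstab(items['VisitNumber'], items['DepartmentDescription'])
dept_pct = dept_counts.div(dept_counts.sum(axis=1), axis=0)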

 

Below you can see the progression of the performance of the random forest as adjustments were made:

[Chart: random forest log loss score at each round of feature and parameter adjustments]

  • Best log loss score: 1.22730

 

Model 3: Gradient Boosted Decision Trees

Gradient boosted trees are a supervised learning method in which a strong learner is built stagewise from a collection of decision trees, with each new tree focusing more on the observations that earlier trees misclassified.

Engineered Features:

  • Day of the week (expressed as an integer)
  • Number of purchases per visit
  • Number of returns per visit
  • Number of times each department was represented in the visit
  • Number of times each fineline number was represented in the visit

For this model the team used the XGBoost and Hyperopt Python packages.  XGBoost is a package for gradient boosted machines that is popular in Kaggle competitions for its memory efficiency and parallelizability.  Hyperopt is a package for hyperparameter optimization: it takes an objective function and minimizes it over a specified hyperparameter space.  Unfortunately, the prepared dataset was too large to keep in memory, so the team had to split the training set into two halves, train two XGBoost models, and average their results.  Not training on the whole dataset is probably what led to the higher log loss score.
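
Since the exact configuration lives in the linked repository rather than this post, the snippet below is only a hedged sketch of how XGBoost and Hyperopt fit together: Hyperopt's TPE search minimizes validation log loss over a small hyperparameter space. The variable names (X, y) and the search ranges are illustrative assumptions, not the team's settings.

from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# X, y: visit-level feature matrix and trip type labels, as engineered above
y_enc = LabelEncoder().fit_transform(y)   # XGBoost expects labels 0..n_classes-1
X_tr, X_val, y_tr, y_val = train_test_split(X, y_enc, test_size=0.2,
                                            stratify=y_enc, random_state=0)

# Illustrative search space
space = {
    'max_depth': hp.quniform('max_depth', 3, 10, 1),
    'learning_rate': hp.uniform('learning_rate', 0.01, 0.3),
    'subsample': hp.uniform('subsample', 0.5, 1.0),
}

def objective(params):
    clf = XGBClassifier(objective='multi:softprob',
                        n_estimators=200,
                        max_depth=int(params['max_depth']),
                        learning_rate=params['learning_rate'],
                        subsample=params['subsample'])
    clf.fit(X_tr, y_tr)
    loss = log_loss(y_val, clf.predict_proba(X_val), labels=clf.classes_)
    return {'loss': loss, 'status': STATUS_OK}

# TPE search for the hyperparameters that minimize validation log loss
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=25, trials=Trials())
print(best)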

  • Best log loss score: 1.48

The code for this approach to the problem can be found here.

 

Conclusion

Given the size of the data set, memory constraints limited the accuracy that could be achieved.  The best performance came from the random forest, after grid search was used for feature selection and parameter tuning.  Feature engineering was extremely important in this competition, since the rules restricted the use of external data.

About Authors

Joe Eckert

Joe is currently studying with the NYC Data Science Academy to pursue his passion for big data. Joe previously worked for 3 years at JPMorgan's Corporate Bank. He graduated in 2012 with a BA in Financial Economics from...

William Aiken

Nate Aiken graduated from City College in 2014 with a BS in Biology with a focus in Neuroscience. His experience studying vision and hearing in labs at City and Rockefeller University led him to the bootcamp. He enjoys...

Daniel Donohue

Daniel Donohue (A.B. Mathematics, M.S. Mathematics) spent the last three years as a Ph.D. student in mathematics studying topics in algebraic geometry, but decided a few short months ago that he needed a change in venue and career....

Leave a Comment

Vikrant December 19, 2015
Very Informative article. Do you what was your best LB score in kaggle using these models?
William Aiken December 17, 2015
That's a great question, we initially used a trial and error method of feature generation. If you use Random Forest in Scikit-Learn, one of the outputs that you can look at is the 'feature importance'. It tells you the relative depth of a feature used as a node in your trees. The idea behind this is that features found at the top of the tree are playing a larger role in your final predictions. There is another blog post just covering Random Forest that will give you a better idea of how we generated new features.
Minghua December 16, 2015
I am new to data mining, but I studied this case thoroughly. Very impressive article! Not to mention your team came up with 3 different approaches! Could you tell me how you guys decide which features are about to use during feature engineering? You just brainstorm or there's a methodology for feature selecting upon which your approaches are based.
