Top 6% on Kaggle Project: Coupon Purchase Prediction

Pokman Cheung
Posted on Aug 30, 2015

The copyright of the photo above belongs to the "Coupon Purchase Prediction" Kaggle competition, as posted here.

Pokman Cheung attended NYC Data Science Academy's 12-week full-time Data Science Bootcamp from June to August 2015. Previously, he received a BS in Math and Physics from MIT and a PhD in Math from Stanford. He conducted academic research for 10+ years before deciding to join the bootcamp and switch careers. Pokman is now an associate at Goldman Sachs, London.

The following post is Pokman's, drawn from his machine learning class project. 

Get Pokman's code from his GitHub: https://github.com/pokman/Kagglebootcamp

 

Kaggle Competition Task: Predict Which Coupons a Customer Will Buy

This Kaggle competition was posed by Ponpare, a Japanese company offering discount coupons for a variety of goods and services. The goal was to predict which coupons each user would purchase within a one-week period, given the following data:

(i) a collection of details of each user,

(ii) a collection of details of each coupon, and

(iii) the transactions within the preceding 51-week period.

Here is the original description on the Kaggle webpage.

 

Pre-Processing the Data

In order to analyze Ponpare's coupons, I needed to be able to represent each coupon by a vector of its features. I got started by turning all categorical variables into dummy binary variables for later analysis: replacing the type of a coupon, for instance, by a bunch of 0's and 1's collectively representing whether or not it is a coupon for restaurant, hotel, spa, etc. I then modified some of the numerical variables (namely, discount rate and discounted price) by certain transformations, so that they became more evenly distributed over their respective ranges.
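
As a rough illustration of this step, the sketch below encodes a categorical column as 0/1 dummies and rescales two numerical columns. The toy table, its column names, and the log-then-rescale transformation are all assumptions made for the example; the post does not specify which transformations were actually applied.

```r
# Toy coupon table; column names and values are hypothetical.
coupons <- data.frame(
  genre = factor(c("restaurant", "hotel", "spa", "hotel")),
  discount_rate = c(50, 80, 65, 90),            # percent off
  discounted_price = c(1000, 12000, 3500, 800)  # price after discount
)

# One 0/1 column per genre level; "- 1" drops the intercept so that
# every level gets its own indicator column.
genre_dummies <- model.matrix(~ genre - 1, data = coupons)

# Assumed transformation: log-compress the right-skewed price, then
# rescale to [0, 1]. This is only one plausible choice.
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
price_feature <- rescale01(log10(coupons$discounted_price))
rate_feature <- coupons$discount_rate / 100

# Each row is now the feature vector of one coupon.
features <- cbind(genre_dummies, rate = rate_feature, price = price_feature)
print(features)
```

Each coupon ends up as a numeric vector whose entries live on comparable scales, which matters once the features are compared with a similarity measure.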

 

Training Coupon Classifiers: A Failed Attempt

As a first attempt, I tried classifying for each user which coupons she was interested in based on her purchase history in the training period. I aimed thereby to predict which coupons she would buy within the test period.

After testing out classification methods such as logistic regression and neural networks, however, I deemed the effort a failure. The resulting models were continually problematic. I hypothesized a possible cause: no user ever purchases more than a tiny fraction of all the available coupons, so positive examples are extremely rare for every user.

 

Quantifying Coupon Similarity: A Better Approach

Based on my working hypothesis, I adopted a different approach: quantifying the similarity between any two coupons using a weighted version of cosine similarity.
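
For readers unfamiliar with the idea, here is a minimal sketch of one common definition of weighted cosine similarity, in which each feature gets a non-negative weight. The function name `weighted_cosine`, the toy vectors, and the two feature groups are hypothetical; the author's own implementation may define the measure differently.

```r
# Weighted cosine similarity between feature vectors x and y, with one
# non-negative weight per feature (the diagonal of a weight matrix W):
#   sim(x, y) = x' W y / (sqrt(x' W x) * sqrt(y' W y))
weighted_cosine <- function(x, y, w) {
  num <- sum(w * x * y)
  den <- sqrt(sum(w * x * x)) * sqrt(sum(w * y * y))
  if (den == 0) return(0)  # convention chosen here for all-zero vectors
  num / den
}

# Hypothetical layout: 3 genre dummies followed by 2 numerical features.
# One weight per feature group, repeated across that group's columns.
group_weights <- c(genre = 2, numeric = 1)
w <- rep(group_weights, times = c(3, 2))

a <- c(1, 0, 0, 0.5, 0.3)  # a restaurant coupon
b <- c(1, 0, 0, 0.6, 0.2)  # another restaurant coupon
d <- c(0, 1, 0, 0.5, 0.3)  # a hotel coupon with the same numerical features

weighted_cosine(a, b, w)  # same genre: high similarity
weighted_cosine(a, d, w)  # different genre: lower similarity
```

Raising the weight of a feature group makes agreement or disagreement on that group count for more, which is why the choice of weights drives the quality of the recommendations.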

The main question became how much weight to assign to each coupon feature. In order to find an optimal weight combination, I split the given 51-week training data into various training-validation sets, as illustrated below, and evaluated my model on them for various weight combinations.

[Table: training-validation splits]

These operations were implemented by the following R script:

# NOTE: adjust CV_START and dir
# CV_START = 2012-06-17, dir = split1
# CV_START = 2012-06-10, dir = split2
# CV_START = 2012-06-03, dir = split3


# Load all the date specific data.

load("data/coupon_list_train_en_all.RData")
load("data/coupon_visit_train.RData")
load("data/coupon_purchased_train.RData")


# Set period for validation.

week <- 60 * 60 * 24 * 7
CV_START <- as.POSIXlt("2012-06-17")
CV_END <- CV_START + week - 1


# Split training data into a smaller training set
# and a validation set.

dir = "data/split1/"

# Assumes the loaded coupon list object is named coupon_list_train.
i.train <- (coupon_list_train$DISPFROM < CV_START)
i.test <- ((coupon_list_train$DISPFROM >= CV_START) &
             (coupon_list_train$DISPFROM <= CV_END))
coupons_train <- coupon_list_train[i.train,-1]
coupons_test <- coupon_list_train[i.test,-1]
save(coupons_train, file=paste0(dir, "coupon_list_train.RData"))
save(coupons_test, file=paste0(dir, "coupon_list_test.RData"))

i.train <- (coupon_purchased_train$I_DATE < CV_START)
i.test <- ((coupon_purchased_train$I_DATE >= CV_START) &
             (coupon_purchased_train$I_DATE <= CV_END))
purchases_train <- coupon_purchased_train[i.train,-1]
purchases_test <- coupon_purchased_train[i.test,-1]
save(purchases_train, file=paste0(dir, "coupon_purchased_train.RData"))
save(purchases_test, file=paste0(dir, "coupon_purchased_test.RData"))

i.train <- (coupon_visit_train$I_DATE < CV_START)
visits_train <- coupon_visit_train[i.train,-1]
save(visits_train, file=paste0(dir, "coupon_visit_train.RData"))

The model was then applied to each training-validation set with various weight combinations:

##### This script implements our recommendation model for various
##### parameter combinations, and records their scores (MAPs).
#####
##### edit: dir, reportfile, wts
#####

dir = "data/split1/"
reportfile = "map-test12"
wts <- list(0, 2, c(8, 16, 32, 64), c(8, 16, 32, 64), 1, 0, 0, 8, 1)

# Load the Matrix package (for Diagonal), the prepared user and test
# coupon features, the script for making recommendations, and the
# script for computing MAPs

library(Matrix)
load(paste0(dir, "user_features_2.RData"))
load(paste0(dir, "testcoupon_features_2.RData"))
load(paste0(dir, "coupon_purchased_test.RData"))
source("cos-sim.R")
source("map.R")

# Implement recommendation model and record its MAP over
# a grid of parameter values

report <- NULL
pos <- rep(1, 9)
done <- FALSE

while (!done) {
  w <- do.call(c, lapply(1:9, 
                         FUN=function(i){ return(wts[[i]][pos[i]]) }))
  print(w)
  W <- as.matrix(Diagonal(x=rep(w, c(24, 13, 1, 1, 1, 4, 5, 47, 55))))
  rec <- funcCosineW(W)
  score <- round(MAP(rec, purchases_test), 6)
  report <- rbind(report, c(w, score))
  
  i <- 1
  while (pos[i] == length(wts[[i]])) {
    pos[i] <- 1
    i <- i + 1
    if (i > 9) break
  }
  if (i > 9) { done <- TRUE }
  else { pos[i] <- pos[i] + 1 }
}

# Save the parameter values and corresponding MAPs

write.csv(report, paste0(dir, reportfile, ".csv"), row.names=FALSE)
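
The counter loop above steps through every combination of candidate weights by hand. As an alternative sketch (not the author's code), the same grid can be enumerated with base R's `expand.grid`. The names `w1` to `w9` and `group_sizes` are introduced here purely for illustration; the group sizes are copied from the `rep(w, c(...))` call in the script.

```r
# Same candidate weights as in the script above.
wts <- list(0, 2, c(8, 16, 32, 64), c(8, 16, 32, 64), 1, 0, 0, 8, 1)
names(wts) <- paste0("w", seq_along(wts))

# One row per weight combination. The first varying column changes
# fastest, which matches the order produced by the counter loop.
grid <- expand.grid(wts)
nrow(grid)  # 4 values for w3 times 4 values for w4

# Expand one combination into per-feature weights, one weight for
# every column of the corresponding feature group.
group_sizes <- c(24, 13, 1, 1, 1, 4, 5, 47, 55)
w <- unlist(grid[1, ], use.names = FALSE)
feature_weights <- rep(w, times = group_sizes)
length(feature_weights)  # total number of features
```

Looping over `seq_len(nrow(grid))` would then replace the `pos` bookkeeping, at the cost of holding the whole grid in memory, which is negligible for a grid of this size.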

The table below contains a sample of various weight combinations, their scores on the validation sets, and their scores from the actual submission. (The first batch suggested that the third validation set gave poor predictions; the difference between the last batch and the rest had to do with a certain transformation of one of the numerical features.)

[Table: weight combinations, validation scores, and submission scores]
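
The scores in question are mean average precisions (MAPs). The sketch below shows one standard way to compute MAP at a cutoff of 10 recommendations per user. It is an assumption about what a function like `MAP` in `map.R` does, not a copy of it; the names `apk` and `mapk` and the toy data are hypothetical.

```r
# Average precision at k for one user. `recommended` is a ranked vector
# of coupon IDs; `purchased` holds the coupons the user actually bought.
apk <- function(recommended, purchased, k = 10) {
  purchased <- unique(purchased)
  if (length(purchased) == 0) return(0)  # no purchases: score 0
  recommended <- head(unique(recommended), k)
  hits <- recommended %in% purchased
  # Precision after each position, counted only where a hit occurs.
  precision_at_i <- cumsum(hits) / seq_along(hits)
  sum(precision_at_i[hits]) / min(length(purchased), k)
}

# Mean over users; both arguments are lists with one entry per user.
mapk <- function(rec_list, purchased_list, k = 10) {
  mean(mapply(apk, rec_list, purchased_list, MoreArgs = list(k = k)))
}

# Toy example with two users.
rec <- list(u1 = c("c1", "c2", "c3"), u2 = c("c4", "c5"))
bought <- list(u1 = c("c1", "c3"), u2 = c("c9"))
mapk(rec, bought)
```

Because the metric rewards placing purchased coupons near the top of each list, small changes in the weights can reorder recommendations and shift the score noticeably.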

As of this writing, my best model ranks among the top 6% of all submissions to Kaggle's Coupon Purchase Prediction competition.
