Top 6% on Kaggle Project: Coupon Purchase Prediction

Pokman Cheung
Posted on Aug 30, 2015

The copyright of the photo above belongs to the "Coupon Purchase Prediction" Kaggle competition, as posted here.

Pokman Cheung attended NYC Data Science Academy's 12-week full-time Data Science Bootcamp from June to August 2015. Previously, he received a BS in Math and Physics from MIT and a PhD in Math from Stanford. He had been conducting academic research for 10+ years before deciding to join the bootcamp and switch careers. Pokman is now an associate at Goldman Sachs, London.

The following post is Pokman's, drawn from his machine learning class project. 

Get Pokman's code from his Github:


Kaggle Competition Task: Predict Which Coupons a Customer Will Buy

This Kaggle competition was posed by Ponpare, a Japanese company offering discount coupons for a variety of goods and services. The goal was to predict which coupons each user would purchase within a one-week period, given the following data:

(i) a collection of details of each user,

(ii) a collection of details of each coupon, and

(iii) the transactions within the preceding 51-week period.

Here is the original description on the Kaggle webpage.


Pre-Processing the Data

In order to analyze Ponpare's coupons, I needed to be able to represent each coupon by a vector of its features. I got started by turning all categorical variables into dummy binary variables for later analysis: replacing the type of a coupon, for instance, by a bunch of 0's and 1's collectively representing whether or not it is a coupon for restaurant, hotel, spa, etc. I then modified some of the numerical variables (namely, discount rate and discounted price) by certain transformations, so that they became more evenly distributed over their respective ranges.
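As a sketch of this step (the column names below are illustrative, not the exact Kaggle schema): a categorical variable such as the coupon genre becomes a set of 0/1 dummy columns, and the heavily skewed numerical variables are transformed to spread more evenly over their ranges.

```r
# Illustrative pre-processing: dummy-encode a categorical feature and
# transform skewed numerical ones (column names are hypothetical).
coupons <- data.frame(
  GENRE_NAME     = factor(c("Restaurant", "Hotel", "Spa")),
  DISCOUNT_RATE  = c(50, 70, 90),
  DISCOUNT_PRICE = c(1000, 5000, 20000)
)

# One 0/1 column per genre (no intercept, hence the "- 1")
dummies <- model.matrix(~ GENRE_NAME - 1, data = coupons)

# Log-transform the right-skewed price so it spreads more evenly;
# rescale the discount rate to [0, 1]
coupons$DISCOUNT_PRICE <- log10(coupons$DISCOUNT_PRICE)
coupons$DISCOUNT_RATE  <- coupons$DISCOUNT_RATE / 100
```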


Training Coupon Classifiers: A Failed Attempt

As a first attempt, I tried to train, for each user, a classifier that predicted from her purchase history in the training period which coupons she was interested in, and thereby which coupons she would buy within the test period.

After testing out classification methods such as logistic regression and neural networks, however, I deemed the effort a failure: the resulting models performed poorly no matter how they were tuned. I hypothesized a likely cause: the classes were severely imbalanced, since no user ever purchases more than a tiny fraction of all the available coupons.


Quantifying Coupon Similarity: A Better Approach

Based on my working hypothesis, I adopted a different approach: quantifying the similarity between any two coupons using a weighted version of cosine similarity.
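For two feature vectors x and y with non-negative per-feature weights w, weighted cosine similarity can be sketched as follows (a toy single-pair version; the actual funcCosineW used in the scripts below operates on whole feature matrices):

```r
# Weighted cosine similarity: features with weight 0 are ignored,
# heavier-weighted features dominate the comparison.
weighted_cosine <- function(x, y, w) {
  num <- sum(w * x * y)
  den <- sqrt(sum(w * x^2)) * sqrt(sum(w * y^2))
  if (den == 0) return(0)
  num / den
}

# Identical vectors score 1, orthogonal ones 0
weighted_cosine(c(1, 0, 1), c(1, 0, 1), c(1, 1, 1))  # 1
weighted_cosine(c(1, 0, 0), c(0, 1, 0), c(1, 1, 1))  # 0
```

Setting a feature's weight to 0 drops it from the comparison entirely, which is what makes the weight combination itself a tunable part of the model.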

The main question became how much weight to assign to each coupon feature. To find an optimal weight combination, I split the given 51-week training data into several training-validation sets, as illustrated below, and evaluated the model on them for various weight combinations.


These operations were implemented in the following R script:

# NOTE: adjust CV_START and dir
# CV_START = 2012-06-17, dir = split1
# CV_START = 2012-06-10, dir = split2
# CV_START = 2012-06-03, dir = split3

# Load all the date-specific data (file paths assumed; these are the
# raw Kaggle tables saved as .RData).

load("data/coupon_list_train.RData")
load("data/coupon_purchased_train.RData")
load("data/coupon_visit_train.RData")


# Set period for validation.

week <- 60 * 60 * 24 * 7
CV_START <- as.POSIXlt("2012-06-17")
CV_END <- CV_START + week - 1

# Split training data into a smaller training set
# and a validation set.

dir = "data/split1/"

i.train <- (coupon_list_train$DISPFROM < CV_START)
i.test <- ((coupon_list_train$DISPFROM >= CV_START) &
             (coupon_list_train$DISPFROM <= CV_END))
coupons_train <- coupon_list_train[i.train, -1]
coupons_test <- coupon_list_train[i.test, -1]
save(coupons_train, file=paste0(dir, "coupon_list_train.RData"))
save(coupons_test, file=paste0(dir, "coupon_list_test.RData"))

i.train <- (coupon_purchased_train$I_DATE < CV_START)
i.test <- ((coupon_purchased_train$I_DATE >= CV_START) &
             (coupon_purchased_train$I_DATE <= CV_END))
purchases_train <- coupon_purchased_train[i.train,-1]
purchases_test <- coupon_purchased_train[i.test,-1]
save(purchases_train, file=paste0(dir, "coupon_purchased_train.RData"))
save(purchases_test, file=paste0(dir, "coupon_purchased_test.RData"))

i.train <- (coupon_visit_train$I_DATE < CV_START)
visits_train <- coupon_visit_train[i.train,-1]
save(visits_train, file=paste0(dir, "coupon_visit_train.RData"))

The model was then applied to each training-validation set with various weight combinations:

##### This script implements our recommendation model for various
##### parameter combinations, and records their scores (MAPs).
##### edit: dir, reportfile, wts

library(Matrix)  # for Diagonal()

dir = "data/split1/"
reportfile = "map-test12"
wts <- list(0, 2, c(8, 16, 32, 64), c(8, 16, 32, 64), 1, 0, 0, 8, 1)

# Load the prepared user and test coupon features, the script
# for making recommendations, and the script for computing MAPs
# (script file names below are assumed)

load(paste0(dir, "user_features_2.RData"))
load(paste0(dir, "testcoupon_features_2.RData"))
load(paste0(dir, "coupon_purchased_test.RData"))
source("recommend.R")  # defines funcCosineW()
source("map.R")        # defines MAP()

# Implement recommendation model and record its MAP over
# a grid of parameter values

report <- NULL
pos <- rep(1, 9)
done <- FALSE

while (!done) {
  w <- sapply(1:9, FUN = function(i) { wts[[i]][pos[i]] })
  W <- as.matrix(Diagonal(x = rep(w, c(24, 13, 1, 1, 1, 4, 5, 47, 55))))
  rec <- funcCosineW(W)
  score <- round(MAP(rec, purchases_test), 6)
  report <- rbind(report, c(w, score))
  # Advance to the next weight combination (odometer-style increment)
  i <- 1
  while (i <= 9 && pos[i] == length(wts[[i]])) {
    pos[i] <- 1
    i <- i + 1
  }
  if (i > 9) { done <- TRUE } else { pos[i] <- pos[i] + 1 }
}

# Save the parameter values and corresponding MAPs

write.csv(report, paste0(dir, reportfile, ".csv"), row.names=F)
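The MAP() call above computes the competition metric, mean average precision over the top 10 recommendations per user. A minimal per-user sketch (names here are illustrative, not the project's actual implementation):

```r
# Average precision at k for one user: each correctly recommended
# coupon contributes precision-at-its-rank; earlier hits count more.
apk <- function(actual, predicted, k = 10) {
  predicted <- head(predicted, k)
  hits <- 0
  score <- 0
  for (i in seq_along(predicted)) {
    if (predicted[i] %in% actual &&
        !(predicted[i] %in% predicted[seq_len(i - 1)])) {
      hits <- hits + 1
      score <- score + hits / i
    }
  }
  if (length(actual) == 0) return(0)
  score / min(length(actual), k)
}

# A hit at rank 1 scores 1; the same hit at rank 2 scores only 0.5
apk(c("a"), c("a", "b"))  # 1
apk(c("b"), c("a", "b"))  # 0.5
```

MAP is then just the mean of this quantity across all users, which is why a single poorly ranked list barely moves the leaderboard score.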

The table below contains a sample of various weight combinations, their scores on the validation sets, and their scores from the actual submission. (The first batch suggested that the third validation set was a poor predictor of test performance; the difference between the last batch and the rest came from a certain transformation of one of the numerical features.)


As of this writing, my best model ranks among the top 6% of all submissions to Kaggle's Coupon Purchase Prediction competition.
