Predicting a New User's First Travel Destination on AirBnB (Capstone Project)
Contributed by Michael Winfield and Rob Castellano. They recently graduated from the NYC Data Science Academy 12-week full-time Data Science Bootcamp, which ran from April 11th to July 1st, 2016. This post is based on their final class project, the Capstone Project, due in the 12th week of the program.
Our ultimate project goal was to complete the task assigned by the AirBnB New User Bookings Kaggle competition. That task was to predict the country of a new user’s first destination, out of 12 possible outcomes for the destination country: United States, France, Canada, Great Britain, Spain, Italy, Portugal, Netherlands, Germany, Australia, other, or no destination found (NDF). Here you can find a Shiny app that gives basic information on the possible destination countries. The code for this project can be found here.
The competition allowed up to 5 ranked predictions for each new user in the test dataset. For example: United States, France, Australia, Italy, and Portugal. The metric used to grade these predictions was normalized discounted cumulative gain (NDCG), which measures the performance of a recommendation system based on the relevance of the recommended entries and how high they are ranked. It varies from 0.0 to 1.0, with 1.0 representing the ideal ranking of the entries. This metric is commonly used in information retrieval and to evaluate the performance of web search engines.
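To make the metric concrete, here is a minimal sketch of NDCG at 5 for this setting, assuming each user has exactly one correct destination (relevance 1 for that country, 0 for every other) and that a prediction list contains no duplicates. The function names `ndcg_at_k` and `mean_ndcg` are ours for illustration and are not taken from the competition code.

```python
import math

def ndcg_at_k(predicted, actual, k=5):
    """NDCG@k for one user with a single correct country.

    predicted: countries ordered from most to least likely (no duplicates assumed).
    actual: the true destination.
    """
    dcg = 0.0
    for rank, country in enumerate(predicted[:k], start=1):
        relevance = 1 if country == actual else 0
        dcg += (2 ** relevance - 1) / math.log2(rank + 1)
    # With one relevant item, the ideal ranking puts it first, so the ideal DCG is 1.0
    # and the normalized score equals the DCG itself.
    return dcg

def mean_ndcg(all_predicted, all_actual, k=5):
    """Average NDCG@k over users."""
    scores = [ndcg_at_k(p, a, k) for p, a in zip(all_predicted, all_actual)]
    return sum(scores) / len(scores)

guess = ["US", "FR", "AU", "IT", "PT"]
print(ndcg_at_k(guess, "US"))  # correct country ranked first
print(ndcg_at_k(guess, "FR"))  # correct country ranked second
print(ndcg_at_k(guess, "ES"))  # correct country not in the list
```

A correct first guess scores 1.0, a correct second guess scores 1/log2(3), and a list that misses the true country scores 0, which is why the order of the five guesses matters and not just their membership.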
The Airbnb Kaggle dataset consisted of:
- User information: Unique ID, age, gender, web browser, avenue through which the user accessed AirBnB, country destination, timestamp of first activity, account created, and first booking.
- Browser session data: Unique ID, action type, and time elapsed.
Training set: 200,000 users, Jan 2010 to Jun 2014
Test set: 60,000 users, July 2014 to Sep 2014
Exploratory Data Analysis
- Country Destination By Itself
- Our first insight into the behavior of new users on AirBnB's website was that most users fall into one of two outcomes: No Destination Found (i.e., a non-booking) or the United States.
- If we had simply predicted NDF for every new user, our model would have been equivalent to the Kaggle competition benchmark:
- User Demographics
- Our second insight related to missingness in the age and gender variables in the dataset. Looking at each variable individually, we can see that missing values predominate:
- Looking at age and gender in combination, we can see that the missingness in age overlaps to some extent with the missingness in gender:
- We can also see that both age and gender -- and the related overlapping missingness -- appear to be related to the user's choice of country of destination:
Feature Engineering & Stacking
- Variable Selection for Feature Engineering
- We decided to use the date-time variables to engineer 3 features based on user booking behavior, specifically the pairwise differences in days between the creation of an AirBnB account, a user’s first activity on the website, and their date of first booking.
- Once we created a bookings variable and two time-lag variables, out-of-fold predictions of those three features were added to the training and test datasets through stacking. Because only out-of-fold predictions were used, each training row's stacked feature came from a model that had not seen that row, which guards against the data leakage that leads to overfit models. In Python we used the cross_val_predict function from sklearn's cross_validation module (moved to model_selection in later scikit-learn releases) to produce the out-of-fold predictions. In R, we used the savePred=T option (a partial match for savePredictions) of trainControl in the caret package to generate out-of-fold predictions.
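The feature engineering and stacking steps above can be sketched roughly as follows. This is an illustration on synthetic data, not our project code: the column names, the toy `make_users` helper, the choice of predictors, and the Random Forest settings are all assumptions. `cross_val_predict` is imported from `sklearn.model_selection`, where it lives in current scikit-learn releases.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

def make_users(n):
    """Toy user table; columns and values are hypothetical."""
    created = pd.Timestamp("2014-01-01") + pd.to_timedelta(rng.integers(0, 90, size=n), unit="D")
    active = created - pd.to_timedelta(rng.integers(0, 10, size=n), unit="D")
    return pd.DataFrame({
        "age": rng.integers(18, 70, size=n).astype(float),
        "date_account_created": created,
        "timestamp_first_active": active,
    })

train, test = make_users(200), make_users(50)

# Only training users have a first-booking date, and most never book (NaT).
booked_mask = rng.random(len(train)) < 0.4
booking_lag = pd.to_timedelta(rng.integers(0, 30, size=len(train)), unit="D")
train["date_first_booking"] = (train["date_account_created"] + booking_lag).where(booked_mask)

def add_lag_features(df):
    """Add a booking indicator and time lags in days where the dates exist."""
    out = df.copy()
    out["lag_active_to_created"] = (out["date_account_created"] - out["timestamp_first_active"]).dt.days
    if "date_first_booking" in out:
        out["booked"] = out["date_first_booking"].notna().astype(int)
        out["lag_created_to_booking"] = (out["date_first_booking"] - out["date_account_created"]).dt.days
    return out

train, test = add_lag_features(train), add_lag_features(test)

# Stack the booking indicator: out-of-fold probabilities for training rows,
# and predictions from a model fit on all training rows for test rows.
features = ["age", "lag_active_to_created"]
clf = RandomForestClassifier(n_estimators=50, random_state=0)
train["booked_oof"] = cross_val_predict(
    clf, train[features], train["booked"], cv=5, method="predict_proba"
)[:, 1]
clf.fit(train[features], train["booked"])
test["booked_oof"] = clf.predict_proba(test[features])[:, 1]
```

The same pattern would apply to the time-lag targets with a regressor in place of the classifier. The key property is that no training row's `booked_oof` value comes from a model that saw that row's label.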
Workflow & Model Selection
With respect to our workflow, we created the engineered features using Random Forests and Gradient Boosting Machines. We then decided to train and run predictive models using XGBoost and AdaBoost. We ran XGBoost models on the unstacked data. We ran XGBoost models and AdaBoost models on the stacked data.
Why did we choose boosted tree methods?
- We were committed to using boosted-tree based methods for a few reasons:
- Boosted trees can keep improving, and outperform Random Forests, after Random Forest performance plateaus.
- Tree-based methods can handle missingness very well.
- Tree-based methods can handle both categorical and continuous variables.
- We chose XGBoost and AdaBoost for our predictive models because both are closely related to Gradient Boosting Machines (GBM).
- XGBoost, among other things, is GBM with regularization. Regularization would be useful to us in determining feature importance across our stacked and unstacked models.
- AdaBoost, among other things, is GBM with an exponential loss function instead of a deviance loss function. We knew that if we had time to ensemble multiple models, combining AdaBoost and XGBoost models trained with different random seeds might make for much better predictions.
Predicting User's First Country Destination
- Optimizing our Parameters. We used grid searches with cross-validation to optimize the tuning parameters for our various models. In all cases we noticed that either NDF or USA came in as the first choice; however, other countries were selected as lower-ranked choices, because we submitted 5 ranked predictions per user, as the NDCG metric allows. Here are our results:
- Improved Kaggle ranking from #1165 to #374 with score of 0.87055
- Best parameters: learning_rate: 0.1, max_depth: 4, n_estimators: 100
- Kaggle ranking of #1030 with score of 0.86332
- Kaggle ranking of #1028 with score of 0.86445
- Variable Importance in our XGBoost models:
In both the unstacked and stacked models, missing age and missing gender are among the most important features for prediction. This is consistent with our exploratory data analysis, which appears to show associations between both (i) missing age/gender and a country of first destination and (ii) missing age/gender and NDF (the latter of which we assume contributed more to the models). Counts and sum_sec_elapsed, both derived from the users' sessions data, were also found to be important.
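For readers who want to see the mechanics of this section, the sketch below shows a cross-validated grid search followed by turning class probabilities into five ranked guesses per user. It is only an illustration: it uses scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost, synthetic data from `make_classification`, a deliberately tiny parameter grid, and accuracy as the search criterion, none of which reflect our actual setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic six-class problem; the country labels are attached arbitrarily.
countries = np.array(["NDF", "US", "other", "FR", "IT", "GB"])
X, y_idx = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_redundant=0,
    n_classes=6, n_clusters_per_class=1, random_state=0,
)
y = countries[y_idx]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Small illustrative grid; a real search would cover more values.
param_grid = {"learning_rate": [0.05, 0.1], "max_depth": [2, 4]}
search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=10, random_state=0),
    param_grid,
    cv=3,
)
search.fit(X_train, y_train)
print(search.best_params_)

# Rank classes by predicted probability and keep the five most likely per user.
proba = search.predict_proba(X_test)
order = np.argsort(-proba, axis=1)[:, :5]
top5 = search.best_estimator_.classes_[order]
print(top5[:3])
```

Each row of `top5` is one user's submission, ordered from most to least likely, which is the form the NDCG metric scores.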
As a team we:
- Performed exploratory data analysis on Airbnb new user information.
- Wrangled and munged data in Python and R.
- Used R for visualization and the creation of a Shiny App.
- Feature engineered time-lag-based variables using Python and R.
- Fit models (XGBoost/Random Forest/AdaBoost) using Python.
- Generated predictions with XGBoost that ranked #374 on Kaggle.
Recommendations to AirBnB
- Invest in collecting more demographic data to differentiate country destinations.
- A possible solution is to collect age and gender data from users who sign in using their e-mail (instead of Google or Facebook) on the same page where they enter credit card information to complete a request to book a reservation. That way, AirBnB will have demographic data on all of its users who book reservations.
- Flag users who decline to enter age and gender. In our data, missing age and gender was associated with not booking.
- Continuously collect browser session activity; such data was helpful for predictions. This data was available only for newer users.
Steps to improve our predictions:
- Optimize tuning parameters for XGBoost on the stacked dataset.
- Stack country-of-destination predictions onto the dataset as additional features to improve predictions.
- Train multiple XGBoost models (stacked or unstacked) with different random seeds and ensemble them.
- Design a strategy to deal with the imbalanced classes by either:
- Determining whether some NDFs in the training set are hosts on AirBnB on the basis of their web activity, or
- Predicting NDF versus all other classes; then US versus all other classes; and so on until all the classes are predicted individually. Then add those predictions as new features.
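As one example of these next steps, the seed-ensembling idea could look something like the sketch below. We did not implement this in the project. The sketch again uses scikit-learn's GradientBoostingClassifier on synthetic data as a stand-in for XGBoost, with row subsampling turned on so that the seed actually changes the fitted model. All settings are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit the same model with several seeds; subsample < 1 makes each fit differ.
seeds = [0, 1, 2]
probas = []
for seed in seeds:
    model = GradientBoostingClassifier(
        n_estimators=30, max_depth=3, subsample=0.8, random_state=seed
    )
    model.fit(X_train, y_train)
    probas.append(model.predict_proba(X_test))

# Average the class probabilities, then take the most likely class.
avg_proba = np.mean(probas, axis=0)
ensemble_pred = model.classes_[np.argmax(avg_proba, axis=1)]
print((ensemble_pred == y_test).mean())
```

Averaging probabilities rather than hard votes keeps the full ranking over classes, so the averaged matrix can still be sorted into five ranked guesses per user.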