Predicting a New User's First Travel Destination on AirBnB (Capstone Project)

Contributed by Michael Winfield, Rob Castellano, Yannick Kimmel. They recently graduated from the NYC Data Science Academy 12-week full-time Data Science Bootcamp program that ran from April 11th to July 1st, 2016. This post is based on their final class project, the Capstone Project, due in the 12th week of the program.

Project Goal

Our ultimate project goal was to complete the task assigned by the AirBnB New User Bookings Kaggle competition. That task was to predict the country of a new user’s first destination, out of 12 possible outcomes for the destination country: United States, France, Canada, Great Britain, Spain, Italy, Portugal, Netherlands, Germany, Australia, other, or no destination found (NDF). Here you can find a Shiny app that gives basic information on the possible destination countries. The code for this project can be found here.

Map

The competition allowed up to 5 predictions for each new user in the test dataset. For example: United States, France, Australia, Italy, and Portugal. The metric used to grade these predictions was normalized discounted cumulative gain (NDCG), which measures the performance of a recommendation system based on the relevance of the recommended entries. It varies from 0.0 to 1.0, with 1.0 representing the ideal ranking of the entities. This metric is commonly used in information retrieval and to evaluate the performance of web search engines.
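To make the metric concrete, here is a minimal sketch of NDCG@5 for a single user. It assumes binary relevance (1 for the true destination, 0 for every other country) and the common DCG form with a `2^rel - 1` gain and a `log2(position + 1)` discount. The function names are illustrative and not taken from the project code.

```python
import math

def dcg_at_k(relevances, k=5):
    """Discounted cumulative gain over the first k relevance scores."""
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(predicted, truth, k=5):
    """NDCG@k for one user who has exactly one correct destination."""
    relevances = [1 if country == truth else 0 for country in predicted[:k]]
    ideal = dcg_at_k([1], k)  # best case: the true country is ranked first
    return dcg_at_k(relevances, k) / ideal

ranked = ["US", "FR", "AU", "IT", "PT"]
print(ndcg_at_k(ranked, "US"))   # true country ranked first
print(ndcg_at_k(ranked, "FR"))   # true country ranked second
print(ndcg_at_k(ranked, "NDF"))  # true country not in the top 5
```

Under these assumptions a correct first guess scores 1.0, a correct second guess scores 1/log2(3) (about 0.63), and a miss scores 0.0, so the order of the five predictions matters as much as their content.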

The Dataset

The Airbnb Kaggle dataset consisted of:

  • User information: Unique ID, age, gender, web browser, avenue through which the user accessed AirBnB, country destination, timestamp of first activity, account created, and first booking.
  • Browser session data: Unique ID, action type, and time elapsed.

Training set: 200,000 users--Jan 2010 to Jun 2014
Test set: 60,000 users--July 2014 to Sep 2014

Exploratory Data Analysis

  1. Country Destination By Itself
    • Our first insight into the behavior of new users on AirBnB's website was that most new users fall into one of two outcomes: No Destination Found (i.e., a non-booking) or the United States. [Figure: CountryDist]
    • If we had simply predicted NDF for every new user, our model would have been equivalent to the Kaggle competition benchmark. [Figure: Benchmark]
  2. User Demographics
    • Our second insight related to missingness in the age and gender variables in the dataset. Looking at each variable individually, we can see that missing values predominate. [Figures: GenderDist, AgeDist]
    • Looking at age and gender in combination, we can see that the missingness in age overlaps to some extent with the missingness in gender. [Figure: AgeGenderPctActualMissingness]
    • We can also see that both age and gender, and the related overlapping missingness, appear to be related to the user's choice of country of destination. [Figures: CountryGenderPct, CountryAgePct]
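The missingness checks described above can be sketched in pandas as follows. This is an illustration on toy data, not the project's code: the column names and the "-unknown-" placeholder for a missing gender are assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the user table; column names and the "-unknown-" gender
# code are assumptions made for this example.
users = pd.DataFrame({
    "age": [34, np.nan, np.nan, 27, np.nan, 45],
    "gender": ["FEMALE", "-unknown-", "-unknown-", "MALE", "MALE", "-unknown-"],
    "country_destination": ["US", "NDF", "NDF", "FR", "NDF", "US"],
})

# Flag missing values explicitly (1 = missing) so they can be tabulated
# and later used as model features.
users["age_missing"] = users["age"].isna().astype(int)
users["gender_missing"] = users["gender"].eq("-unknown-").astype(int)

# Share of users with each variable missing.
print(users[["age_missing", "gender_missing"]].mean())

# Overlap between the two kinds of missingness, as a share of all users.
overlap = pd.crosstab(users["age_missing"], users["gender_missing"],
                      normalize="all")
print(overlap)

# Destination mix within each age-missingness group (each row sums to 1).
by_dest = pd.crosstab(users["age_missing"], users["country_destination"],
                      normalize="index")
print(by_dest)
```

Keeping the missingness flags as explicit columns is what allows a model to use "did not provide age" as a signal in its own right.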

Feature Engineering & Stacking

  1. Variable Selection for Feature Engineering
    • We decided to use the date-time variables to engineer 3 features based on user booking behavior, specifically the pairwise differences in days between the creation of an AirBnB account, a user’s first activity on the website, and their date of first booking. [Figures: LagBookingPct, CountryBookingsPct]
    • Once we had created a bookings variable and two time-lag variables, we added out-of-fold predictions of those three features to the training and test datasets through the process of stacking. Because the predictions were out-of-fold, each training row's prediction came from a model that had not seen that row, so we were not concerned about data leakage producing models that overfit. In Python, we used the cross_val_predict function from the cross_validation module of sklearn to produce the out-of-fold predictions. In R, we used the savePred=T option in the caret package to generate out-of-fold predictions.
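As an illustration of the time-lag features, here is a minimal pandas sketch on toy rows. The column names and the timestamp format are assumptions. It computes all three pairwise day differences plus a booking flag, since the text does not say which lags were kept.

```python
import pandas as pd

# Toy rows; column names and the timestamp format are assumptions.
users = pd.DataFrame({
    "date_account_created": ["2014-01-10", "2014-02-01", "2014-03-05"],
    "timestamp_first_active": ["20140108093000", "20140201120000",
                               "20140301080000"],
    "date_first_booking": ["2014-01-15", None, "2014-03-05"],
})

created = pd.to_datetime(users["date_account_created"])
# Drop the time of day so every lag is a whole number of days.
first_active = pd.to_datetime(users["timestamp_first_active"],
                              format="%Y%m%d%H%M%S").dt.normalize()
booked = pd.to_datetime(users["date_first_booking"])

# Booking flag: 1 if the user has a first-booking date, else 0.
users["booked"] = booked.notna().astype(int)

# Pairwise differences in days; lags involving a missing booking stay NaN.
users["lag_active_to_created"] = (created - first_active).dt.days
users["lag_created_to_booking"] = (booked - created).dt.days
users["lag_active_to_booking"] = (booked - first_active).dt.days

print(users[["booked", "lag_active_to_created",
             "lag_created_to_booking", "lag_active_to_booking"]])
```

The booking-based columns cannot be computed for users whose booking date is unknown, as it is for test-set users. That is why the post stacks model predictions of these features rather than using them directly.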
                                                                                                                                                                                                                                                                                                                                                                            

Workflow & Model Selection

With respect to our workflow, we created the engineered features using Random Forests and Gradient Boosting Machines. We then decided to train and run predictive models using XGBoost and AdaBoost. We ran XGBoost models on the unstacked data. We ran XGBoost models and AdaBoost models on the stacked data.
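The first stage of this workflow, generating out-of-fold predictions of an engineered target and appending them as a feature, might look roughly like the sketch below. The data is synthetic and the variable names are hypothetical. Note that in current scikit-learn releases `cross_val_predict` is imported from `sklearn.model_selection`; the older `sklearn.cross_validation` module has since been removed.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Synthetic stand-ins (hypothetical names): base features for training and
# test users, and a "booked" target that is known for training users only.
X_train = rng.normal(size=(300, 5))
X_test = rng.normal(size=(100, 5))
booked_train = (X_train[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# Training rows: each prediction comes from a fold model that never saw
# that row, so the new feature does not leak the row's own target.
oof_train = cross_val_predict(model, X_train, booked_train, cv=5,
                              method="predict_proba")[:, 1]

# Test rows: fit once on all training data, then predict.
model.fit(X_train, booked_train)
pred_test = model.predict_proba(X_test)[:, 1]

# Append the predictions as an extra column for the second-stage model.
X_train_stacked = np.column_stack([X_train, oof_train])
X_test_stacked = np.column_stack([X_test, pred_test])
print(X_train_stacked.shape, X_test_stacked.shape)
```

The same pattern applies to a continuous target such as a time lag. In that case one would swap in a regressor and drop the `method="predict_proba"` argument.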

Workflow

Why did we choose boosted tree methods?

  • We were committed to using boosted-tree based methods for a few reasons:
    • Boosted Trees can outperform Random Forests once the Random Forests' performance plateaus.
    • Tree-based methods can handle missingness very well.
    • Tree-based methods can handle both categorical and continuous variables.
  • We chose XGBoost and AdaBoost for our predictive models because both can be viewed as variants of Gradient Boosted Machines (GBM).
    • XGBoost, among other things, is GBM with regularization. Regularization would be useful to us in determining feature importance across our stacked and unstacked models.
    • AdaBoost, among other things, is GBM with an exponential loss function instead of a deviance loss function. We knew that if we had time to ensemble multiple models, combining AdaBoost and XGBoost models trained with different random seeds might make for much better predictions.
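To illustrate the difference between the two loss functions mentioned above, the following sketch evaluates both in the binary case. It assumes labels y in {-1, +1} and expresses each loss as a function of the margin y * f(x). The function names are illustrative.

```python
import math

def exponential_loss(margin):
    """AdaBoost-style loss; margin = y * f(x) with y in {-1, +1}."""
    return math.exp(-margin)

def binomial_deviance(margin):
    """Logistic (deviance) loss for the same margin."""
    return math.log1p(math.exp(-margin))

# Positive margin = correct prediction, negative margin = misclassified.
for margin in (2.0, 0.0, -3.0):
    print(margin,
          round(exponential_loss(margin), 3),
          round(binomial_deviance(margin), 3))
```

For a badly misclassified point (margin -3) the exponential loss is roughly 20 while the deviance is roughly 3. This is one reason the two methods can weight hard examples differently and may complement each other in an ensemble.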

Predicting a User's First Country Destination

  • Optimizing our Parameters. We used grid searches with cross-validation to optimize the tuning parameters for our various models. In all cases, either NDF or the USA came in as the first choice; other countries were selected as lower-ranked choices, since we submitted 5 predictions per observation for the NDCG metric. Here are our results:
    • Unstacked:
      • XGBoost
        • Improved Kaggle ranking from #1165 to #374 with score of 0.87055
        • Best parameters: learning_rate 0.1, max_depth 4, n_estimators: 100
    • Stacked:
      • XGBoost
        • Kaggle ranking of #1030 with score of 0.86332
      • AdaBoost
        • Kaggle ranking of #1028 with score of 0.86445
  • Variable Importance in our XGBoost models:

Unstacked

Stacked

In both the unstacked and stacked models, missing age and missing gender are among the most important features for prediction. This is consistent with our exploratory data analysis, which appears to show associations between (i) missing age/gender and a country of first destination and (ii) missing age/gender and NDF (the latter of which we assume contributed more to the models). Counts and sum_sec_elapsed, both derived from the users' sessions data, were also found to be important.
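The tuning, ranking, and importance steps in this section can be sketched end to end as follows. This is a hedged illustration on synthetic data: scikit-learn's `GradientBoostingClassifier` stands in for XGBoost, and the grid is deliberately tiny so the example runs quickly. The settings reported above (learning_rate 0.1, max_depth 4, n_estimators 100) would be one point in a fuller grid.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic multiclass data standing in for the engineered user features.
X, y = make_classification(n_samples=400, n_features=10, n_informative=6,
                           n_classes=6, n_clusters_per_class=1,
                           random_state=0)

# Small illustrative grid; a real search would cover more values.
param_grid = {"learning_rate": [0.1], "max_depth": [2, 4],
              "n_estimators": [20]}
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)

# Rank classes by predicted probability and keep the five most likely
# per row, most probable first.
best = search.best_estimator_
proba = best.predict_proba(X)
top5 = best.classes_[np.argsort(-proba, axis=1)[:, :5]]
print(top5[:3])

# Impurity-based feature importances of the tuned model (they sum to 1).
importances = best.feature_importances_
print(np.argsort(-importances)[:3])  # indices of the three top features
```

With real data, each row of `top5` would hold the five country labels to submit for that user, in descending order of predicted probability.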

Conclusions

As a team we:

  1. Performed exploratory data analysis on Airbnb new user information.
  2. Wrangled and munged data in Python and R.
  3. Used R for visualization and the creation of a Shiny App.
  4. Feature engineered time-lag-based variables using Python and R.
  5. Fit models (XGBoost/Random Forest/AdaBoost) using Python.
  6. Made predictions with XGBoost that ranked #374 on Kaggle.

Recommendations to AirBnB


  • Invest in collecting more demographic data to differentiate country destinations.
    • A possible solution is to collect age and gender data from users who sign in using their e-mail (instead of Google and Facebook) on the same page where they enter credit card information to complete a request to book a reservation. That way, AirBnB will have demographic data on all of its users who book reservations.
  • Flag users who decline to enter age and gender. There is a correlation between such users and those who do not book.
  • Continuously collect browser session activity; such data was helpful for predictions. This data was available only for newer users.

Future Directions

Steps to improve our predictions:

  • Optimize tuning parameters for XGBoost on the stacked dataset.
  • Stack country-of-destination predictions onto the dataset as features to improve predictions.
  • Train multiple XGBoost models (stacked or unstacked) with different random seeds and ensemble them.
  • Design a strategy to deal with the imbalanced classes by either:
    • Determining whether some NDFs in the training set are hosts on AirBnB on the basis of their web activity, or
    • Predicting NDF versus all other classes; then US versus all other classes; and so on until all the classes are predicted individually. Then add those predictions as new features.
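As a rough sketch of the seed-ensembling idea listed above, the example below averages class probabilities from several models that differ only in their random seed. It uses synthetic data, and scikit-learn's `GradientBoostingClassifier` again stands in for XGBoost. This was not part of the completed project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic multiclass data; the first 200 rows act as the training set.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=4, n_clusters_per_class=1,
                           random_state=0)
X_train, y_train, X_new = X[:200], y[:200], X[200:]

# subsample < 1 makes each fit depend on its seed, so the models differ.
probas = []
for seed in (0, 1, 2):
    model = GradientBoostingClassifier(n_estimators=30, subsample=0.8,
                                       random_state=seed)
    model.fit(X_train, y_train)
    probas.append(model.predict_proba(X_new))

# Average the probabilities, then rank classes from most to least likely.
avg_proba = np.mean(probas, axis=0)
ranked = model.classes_[np.argsort(-avg_proba, axis=1)]
print(ranked[:3])
```

Averaging probabilities rather than hard votes keeps the full ranking information, which matters for a rank-based metric such as NDCG.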

About Authors

Michael Winfield


Michael has a passion for finding strategic insights for businesses, managers, and organizations engaged in competitive dynamics. With a background in corporate litigation and white collar criminal defense, as well as graduate-level education in strategic management, Michael is...

Rob Castellano

Rob recently received his Ph.D. in Mathematics from Columbia. His training as a pure mathematician has given him strong quantitative skills and experience in using creative problem solving techniques. He has experience conveying abstract concepts to both experts...

Yannick Kimmel

Yannick is drawn to solving a wide range of problems - from the traditional sciences to current challenges in data science and machine learning. Yannick holds a PhD in chemical engineering from the University of Delaware, and a...
