Using Data to Predict a New User's First Travel Destination

Posted on Jul 1, 2016
The skills demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Contributed by Michael Winfield and Rob Castellano. They recently graduated from the NYC Data Science Academy 12-week full-time Data Science Bootcamp program, which took place between April 11th and July 1st, 2016. This post is based on their final class project - the Capstone Project, due in the 12th week of the program.

Project Goal

Our ultimate project goal was to complete the task assigned by the AirBnB New User Bookings Kaggle competition: to use data to predict the country of a new user's first destination, out of 12 possible outcomes for the destination country: United States, France, Canada, Great Britain, Spain, Italy, Portugal, Netherlands, Germany, Australia, other, or no destination found (NDF). Here you can find a Shiny app that gives basic information on the possible destination countries. The code for this project can be found here.


Metrics

The competition allowed up to 5 predictions for each new user in the test dataset. For example: United States, France, Australia, Italy, and Portugal. The metric used to grade these predictions was normalized discounted cumulative gain (NDCG), which measures the performance of a recommendation system based on the relevance of the recommended entries. It ranges from 0.0 to 1.0, with 1.0 representing the ideal ranking of the entries. This metric is commonly used in information retrieval and to evaluate the performance of web search engines.
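Because each user has exactly one true destination, NDCG@5 simplifies considerably: a guess has relevance 1 if it matches the true country and 0 otherwise, so a user's score is 1/log2(rank + 1) when the true country appears among the five guesses and 0 when it does not. A minimal sketch (the function name is ours, not the competition's scoring code):

```python
import numpy as np

def ndcg_at_k(ranked_predictions, truth, k=5):
    """NDCG@k for a single user: relevance is 1 for the true
    country and 0 otherwise, so the ideal DCG is always 1."""
    for i, pred in enumerate(ranked_predictions[:k]):
        if pred == truth:
            return 1.0 / np.log2(i + 2)  # discount by rank position
    return 0.0

# A correct first guess scores 1.0; the same truth ranked third
# is discounted to 1 / log2(4) = 0.5.
print(ndcg_at_k(["US", "FR", "AU"], "US"))   # 1.0
print(ndcg_at_k(["NDF", "FR", "US"], "US"))  # 0.5
```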

The Data Set

The Airbnb Kaggle dataset consisted of:

  • User information: Unique ID, age, gender, web browser, avenue through which the user accessed AirBnB, country destination, and timestamps of first activity, account creation, and first booking.
  • Browser session data: Unique ID, action type, and time elapsed.

Training set: 200,000 users--Jan 2010 to Jun 2014
Test set: 60,000 users--July 2014 to Sep 2014

Exploratory Data Analysis

  1. Country Destination By Itself

    • Our first insight into the behavior of new users on AirBnB's website was that most outcomes are split between No Destination Found (i.e., a non-booking) and the United States.
    • If we had simply predicted NDF for every new user, our model would have been equivalent to the Kaggle competition benchmark.

  2. User Demographics

    • Our second insight related to missingness in the age and gender variables in the dataset. Looking at each variable individually, we can see that missing values predominate:
      [Figure: Age distribution]

        • Looking at age and gender in combination, we can see that the missingness in age overlaps to some extent with the missingness in gender: [Figure: Overlap of missing age and gender]
        • We can also see that both age and gender -- and the related overlapping missingness -- appear to be related to the user's choice of destination country: [Figures: Country destination by gender; country destination by age]

Feature Engineering & Stacking

  1. Variable Selection for Feature Engineering

    • We decided to use the date-time variables to engineer three features based on user booking behavior, specifically the pairwise differences in days between the creation of an AirBnB account, a user's first activity on the website, and their date of first booking. [Figures: Booking lag distribution; bookings by country]
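The pairwise day differences can be sketched with pandas on a toy frame (the column names loosely follow the Kaggle dataset's date_account_created, timestamp_first_active, and date_first_booking fields, simplified here; this is illustrative, not our production code):

```python
import pandas as pd

# Toy stand-in for the Kaggle users table.
users = pd.DataFrame({
    "date_first_active":    pd.to_datetime(["2013-12-30", "2014-02-10"]),
    "date_account_created": pd.to_datetime(["2014-01-01", "2014-02-10"]),
    "date_first_booking":   pd.to_datetime(["2014-01-05", pd.NaT]),
})

# Pairwise differences in days; NaT propagates as NaN for users
# who never booked, which tree-based models can tolerate.
users["lag_active_to_created"] = (
    users["date_account_created"] - users["date_first_active"]).dt.days
users["lag_created_to_booking"] = (
    users["date_first_booking"] - users["date_account_created"]).dt.days
# The bookings indicator: did this user ever book?
users["booked"] = users["date_first_booking"].notna().astype(int)
```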
  2. Stacking Data

    • Once we created a bookings variable and two time-lag variables, out-of-fold predictions of those three features were added to the training and test datasets through the process of stacking. Because out-of-fold predictions were used, there was no concern about data leakage producing overfit models. In Python, we used the cross_val_predict function from scikit-learn's cross_validation module to produce the out-of-fold predictions. In R, we used the savePred=T option in the caret package to generate out-of-fold predictions.
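The out-of-fold mechanics look roughly like this (modern scikit-learn moved cross_val_predict from cross_validation to model_selection; the toy data and model choice here are ours, not the project's actual pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Toy stand-in for the user-feature matrix and the engineered
# bookings target.
X, y_booked = make_classification(n_samples=200, n_features=5,
                                  random_state=0)

# Each row's predicted probability comes from a fold whose model
# never saw that row, so stacking it back into the training data
# does not leak the target.
oof_booked = cross_val_predict(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y_booked, cv=5, method="predict_proba")[:, 1]
```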

Workflow & Model Selection

With respect to our workflow, we created the engineered features using Random Forests and Gradient Boosting Machines, then trained and ran predictive models using XGBoost and AdaBoost. We ran XGBoost models on the unstacked data, and both XGBoost and AdaBoost models on the stacked data.

Workflow

Why did we choose boosted tree methods?

  • We were committed to using boosted-tree based methods for a few reasons:
    • Boosted trees can outperform Random Forests, which plateau in performance as more trees are added.
    • Tree-based methods can handle missingness very well.
    • Tree-based methods can handle both categorical and continuous variables.
  • We chose XGBoost and AdaBoost for our predictive models because both are variants of Gradient Boosting Machines.
    • XGBoost, among other things, is GBM with regularization. Regularization would be useful to us in determining feature importance across our stacked and unstacked models.
    • AdaBoost, among other things, is GBM with an exponential loss function instead of a deviance loss function. We knew that if we had time to ensemble multiple models, combining AdaBoost and XGBoost models trained with different random seeds might yield much better predictions.
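The seed-ensembling idea can be sketched with scikit-learn's boosted-tree classes standing in for XGBoost (toy data; averaging class probabilities is one simple combination rule, chosen here for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

# Toy binary problem standing in for the real feature matrix.
X, y = make_classification(n_samples=300, n_features=6, random_state=1)

# Two boosted models with different seeds and loss functions;
# averaging their class probabilities is the simple ensemble
# contemplated above.
m1 = GradientBoostingClassifier(random_state=1).fit(X, y)
m2 = AdaBoostClassifier(random_state=2).fit(X, y)
avg_proba = (m1.predict_proba(X) + m2.predict_proba(X)) / 2
```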

Predicting a User's First Country Destination

  • Optimizing our parameters. We used grid searches with cross-validation to optimize the tuning parameters for our various models. In all cases, either NDF or USA came in as the first choice; however, other countries were selected as lower-ranked choices, since we submitted 5 predictions per observation for the NDCG metric. Here are our results:
    • Unstacked:
      • XGBoost
        • Improved Kaggle ranking from #1165 to #374 with score of 0.87055
        • Best parameters: learning_rate 0.1, max_depth 4, n_estimators: 100
    •  Stacked:
      • XGBoost
        • Kaggle ranking of #1030 with score of 0.86332
      • AdaBoost
        • Kaggle ranking of #1028 with score of 0.86445
  • Variable Importance in our XGBoost models:

Unstacked

Stacked

In both the unstacked and stacked models, the indicators for missing age and gender are among the most important features for prediction. This is consistent with our exploratory data analysis, which appears to show associations between (i) missing age/gender and the country of first destination and (ii) missing age/gender and NDF (the latter of which we assume contributed more to the models). Counts and sum_sec_elapsed, which come from the users' sessions data, were also found to be important.
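The five ranked guesses per user come from sorting each row of the predicted class probabilities; a sketch with scikit-learn's GBM standing in for XGBoost (toy data with 4 classes instead of 12, and a top-3 cut instead of top-5):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Toy multiclass problem standing in for the 12 destination classes.
X, y = make_classification(n_samples=300, n_features=8, n_classes=4,
                           n_informative=5, random_state=0)
gbm = GradientBoostingClassifier(random_state=0).fit(X, y)

# For each user, rank classes by predicted probability and keep the
# top candidates (top 5 in the competition; top 3 in this toy setup).
proba = gbm.predict_proba(X)
top3 = np.argsort(-proba, axis=1)[:, :3]
```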

Conclusions

As a team we:

  1. Performed exploratory data analysis on Airbnb new user information.
  2. Wrangled and munged data in Python and R.
  3. Used R for visualization and the creation of a Shiny App.
  4. Feature engineered time-lag-based variables using Python and R.
  5. Fit models (XGBoost/Random Forest/AdaBoost) using Python.
  6. Made predictions with XGBoost that ranked 374th on Kaggle.

Recommendations to AirBnB


  • Invest in collecting more demographic data to differentiate country destinations.
    • A possible solution is to collect age and gender data from users who sign in using their e-mail (instead of Google or Facebook) on the same page where they enter credit card information to complete a booking request. That way, AirBnB would have demographic data on all of its users who book reservations.
  • Flag users who decline to enter age and gender; such users are correlated with not booking.
  • Continuously collect browser session activity; such data was helpful for predictions but was available only for newer users.

Future Directions

Steps to improve our predictions:

  • Optimize tuning parameters for XGBoost on the stacked dataset.
  • Stack country of destination predictions to dataset as features to improve predictions.
  • Use multiple XGBoost models (stacked or unstacked) with different set seeds and ensemble them.
  • Design a strategy to deal with the imbalanced classes by either:
    • Determining whether some NDFs in the training set are hosts on AirBnB on the basis of their web activity, or
    • Predicting NDF versus all other classes; then US versus all other classes; and so on until all the classes are predicted individually. Then add those predictions as new features.
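The cascade in the last bullet can be approximated with one binary "this class vs. the rest" model per destination, stacking out-of-fold probabilities back in as features (logistic regression and the toy 3-class data here are stand-ins for the boosted models and the 12 destination classes):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Toy 3-class stand-in for the 12 destination classes.
X, y = make_classification(n_samples=200, n_features=5, n_classes=3,
                           n_informative=3, random_state=0)

# One binary model per class; out-of-fold probabilities become new
# stacked features, again avoiding target leakage.
new_features = np.column_stack([
    cross_val_predict(LogisticRegression(max_iter=1000),
                      X, (y == c).astype(int),
                      cv=5, method="predict_proba")[:, 1]
    for c in np.unique(y)])
X_stacked = np.hstack([X, new_features])
```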

About Authors

Michael Winfield

Michael has a passion for finding strategic insights for businesses, managers, and organizations engaged in competitive dynamics. With a background in corporate litigation and white collar criminal defense, as well as graduate-level education in strategic management, Michael is...
View all posts by Michael Winfield >

Rob Castellano

Rob recently received his Ph.D. in Mathematics from Columbia. His training as a pure mathematician has given him strong quantitative skills and experience in using creative problem solving techniques. He has experience conveying abstract concepts to both experts...
View all posts by Rob Castellano >
