Predicting Rental Interest: A Kaggle Competition

Posted on Mar 6, 2017

data-science_large-hero_620x260 (1)

RentHop Intro

Kaggle, a data science competition network recently acquired by Google, is home to many machine learning competitions of various types and difficulties. One of their more popular contests involves predicting the amount of interest (Low, Medium, or High) a particular rental listing will receive. To compete in this competition, I created an XGBoost-based model that ultimately scored .55724, good enough for first place in our cohort. Lets take a look at the process and the mistakes made along the way.

The Process

The listings contained the following information: Bathrooms, Bedrooms, Building ID, Date Created, Description (text), Address, Features, Lat/Long, Manager ID, Photos, and Price. Some basic EDA revealed that 70% of the listings received Low interest. As such, I oriented my thinking to identifying/creating features that would suggest higher interest than the baseline. First, I created some naive features, such as price divided by bedrooms, bathrooms, features, and photos. I then ran these features plus the original data set through XGBoost and received a log loss of .73. The chase was on.

In order to achieve greater progress, I felt I had to analyze the problem beyond just the given dataset. Most data science problems represent the backend of a common problem from everyday life. Often, a key to creating a working model is to approach the problem from the end user's perspective.

In this case, we are essentially examining what attributes makes a home more desirable. The lat/longs of the rental listings revealed them to be New York City apartments, primarily in Manhattan and nearby Brooklyn. In NYC, I suspected that neighborhood, features, price, distance to subway stops, distance to schools, and safety would all be key factors in a listing's interest. In the limited time available, I set out to encode these features.


First, I one-hot encoded the top 50 features, performed PCA, and reduced down to the 30 or so features that were statistically significant. I then found the lat longs for NYC neighborhoods, and created a column to display the neighborhood as a feature. I one hot encoded the addresses to show whether the listing was on an Ave or Street. By this time, the model was running a sub .60 log loss. Finally, I found an open data set provided by the New York government listing the locations of all 1000+ subway stops in NYC. I created a distance matrix to display the nearest stop to each listing, along with the distance to that stop. This improved the model, but not as much as I would have hoped. I thought more about it, and realized I was missing a key component. Living next to a subway stop is great, but the key is living next to a subway stop that contains multiple lines connecting you with the rest of the city. I decided to create a new feature: distance to subway stop divided by the number of lines at that particular stop (to amplify the "less is more" aspect of the original distance-to-subway feature). The result was a model with a  much improved .573 log loss.


Next, I took a look at density maps of NYC in general. Like most cities, the density in NYC is not linear. There are pockets of extreme density/demand that may be just a few blocks from a (comparatively) undesirable area. It would be nearly impossible to encode all of the desired areas into the model, and not time efficient considering our tight deadline. Instead, I decided to use a proxy. What is located in high traffic areas? What depends on foot traffic and high density for its business model? While looking at pictures of high density streets, the answer was obvious. Every fifth person was carrying their signature cup: Starbucks. There are 20,000 Starbucks locations on earth, and almost 500 alone in the NYC area. I created another feature showing the listing's distance to the nearest Starbucks and ran the model: Improvement, but very little. Again, I had a good idea that needed more thorough follow through. Thinking about it conceptually, I imagined a listing on the western or eastern edge of Manhattan. There could be one Starbucks relatively close, but the next nearest Starbucks would be probably twice as far away, and likely in the same direction. In a popular area, however, the second closest Starbucks would theoretically be in a different direction, and maybe just slightly farther away. I decided that the average distance of the nearest four Starbucks would probably be a good proxy for the "centralness" of a listing. Encoding that feature pushed the model all the way to .561. With a little tuning (slower learning rate, more trees), our XGBoost model achieved its top score of .55724.

The Mistakes

How much time do you have? There were many, MANY misguided features added (and subsequently removed) and attempted model blends that did not improve my score. Here are a few:

  • Lasso, Ridge, and Elastic Regression - I attempted to use these models as they are strong in aspects that XGBoost is weak, possibly making for a good pairing. In this case, that theory was false
  • Sentiment Analysis - Analyzing the description for mood was a simple (and cool) bit of code that ultimately did nothing
  • Street or Avenue - The model improved when this feature was removed, go figure
  • Image Brightness - I tried to run a basic scan of the images to see how bright they were, as a proxy for whether or not they were taken professionally. The model would have taken 30+ hours to run, crowding out all other analysis time, so I aborted.
  • Junk Listings - There were several listings with incorrect or incomplete data. I tried to encode a binary "junk" feature, but it was useless. This was probably because these listings were nearly always Low interest, which was the baseline in any case.
  • Schools - I also created a distance matrix to the nearest top 45 elementary, middle, and high schools in the area. As soon as I pulled and edited that data sheet, I should have known it wouldn't be successful. All the best schools were in the worst neighborhoods.

I also attempted to reduce the entire model using PCA, and blend it with linear/generalized linear models. None of these strategies provided any improvement. In the end, this competition reinforced the idea that thinking outside of the box is just as important as pure model engineering. After this project, I am greatly looking forward to business challenges that can be solved by advanced modeling in the future.

About Author

Mayank Shah

Financial analyst -> Educational startup founder -> Facebook Analyst -> Data Scientist
View all posts by Mayank Shah >

Related Articles

Leave a Comment

Merry May 5, 2017
Really excellent post, I surely adore this website, keep on it.

View Posts by Categories

Our Recent Popular Posts

View Posts by Tags

#python #trainwithnycdsa 2019 airbnb Alex Baransky alumni Alumni Interview Alumni Reviews Alumni Spotlight alumni story Alumnus API Application artist aws beautiful soup Best Bootcamp Best Data Science 2019 Best Data Science Bootcamp Best Data Science Bootcamp 2020 Best Ranked Big Data Book Launch Book-Signing bootcamp Bootcamp Alumni Bootcamp Prep Bundles California Cancer Research capstone Career Career Day citibike clustering Coding Course Demo Course Report D3.js data Data Analyst data science Data Science Academy Data Science Bootcamp Data science jobs Data Science Reviews Data Scientist Data Scientist Jobs data visualization Deep Learning Demo Day Discount dplyr employer networking feature engineering Finance Financial Data Science Flask gbm Get Hired ggplot2 googleVis Hadoop higgs boson Hiring hiring partner events Hiring Partners Industry Experts Instructor Blog Instructor Interview Job Job Placement Jobs Jon Krohn JP Morgan Chase Kaggle Kickstarter lasso regression Lead Data Scienctist Lead Data Scientist leaflet linear regression Logistic Regression machine learning Maps matplotlib Medical Research Meet the team meetup Networking neural network Neural networks New Courses nlp NYC NYC Data Science nyc data science academy NYC Open Data NYCDSA NYCDSA Alumni Online Online Bootcamp Online Training Open Data painter pandas Part-time Portfolio Development prediction Prework Programming PwC python Python Data Analysis python machine learning python scrapy python web scraping python webscraping Python Workshop R R Data Analysis R language R Programming R Shiny r studio R Visualization R Workshop R-bloggers random forest Ranking recommendation recommendation system regression Remote remote data science bootcamp Scrapy scrapy visualization seaborn Selenium sentiment analysis Shiny Shiny Dashboard Spark Special Special Summer Sports statistics streaming Student Interview Student Showcase SVM Switchup Tableau team TensorFlow Testimonial tf-idf Top Data Science Bootcamp twitter visualization web scraping Weekend Course What to expect word cloud word2vec XGBoost yelp