Kaggle Renthop

Posted on Mar 13, 2017

Contributed by Royce Ho and Tingting Chang. They are currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program taking place between Jan 9th and March 31st, 2017. This post is based on their third class project, Machine Learning (due in the 8th week of the program).

Royce Ho: GitHub | LinkedIn

Tingting Chang: GitHub | LinkedIn

Predicting Rental Listing Inquiries on RentHop

Finding an apartment to rent is usually a challenging task. RentHop, one of many websites trying to make the process more convenient, helps renters by using data to sort its listings by quality. To improve its methods and to better understand the needs and preferences of renters, RentHop, along with Two Sigma, hosted a competition on Kaggle to predict the number of inquiries a new listing will receive based on its features. We took the challenge to see if we could accurately predict a listing's interest level from the data provided.

Below is an overview of what we did:


Overview of project

About the Data

The participants in the competition are given two datasets: one for training, containing approximately 50,000 listings, and one for testing, with approximately 75,000 listings. The goal is to predict the interest level of each listing based on the thirteen features provided. The features include the location of the apartment, the time and date the listing was created, the description, IDs for the building and manager, and basic apartment information. The photos submitted for each listing have also been included for analysis.

[Figure: distribution of listings by interest level]

The interest levels are split into three categories: high, medium, and low. As the graph indicates, there are far more low-interest listings than medium- or high-interest ones, and the number of high-interest listings is very small, even compared to medium.


Correlation between number of bathrooms and bedrooms, and price

When exploring the data, we always need to take into account correlations between predictor variables. Our basic instincts tell us that bedrooms, bathrooms, and price might be related. After further investigation, however, we realized that this probably wouldn't affect our model too much.


Distribution of hours and interest level

One of the most interesting things we noted when learning about our data was that the hour a listing was posted seems to play a role in determining whether its interest level would be high, medium, or low. It was also interesting to note that most listings are posted during the middle of the night, which might be due to companies setting up their systems to post to rental sites automatically. A listing's proximity to the start of the workday seems to correlate with higher interest levels, which is likely due to the way RentHop posts listings on its front page: newer listings are more likely to show up there, which, in turn, leads to higher interest.

Feature Engineering

Although we were given a fair amount of information for predicting interest level, we wanted to include more to help our predictions. Because the rules of the competition do not allow outside data, we had to be a little creative in creating new features out of the old. Our strategy was to create as many new features as we could, then slowly eliminate the ones that were not useful for prediction. We broke the features down into smaller, related categories to simplify the process.

Basic Apartment Information

When looking for an apartment, every renter cares about the price and the number of rooms. The dataset doesn't specify the total number of rooms, but it does include the number of bathrooms and bedrooms. Although bedrooms and bathrooms are numerical, we thought it might be better to treat them as categorical data, so we gave ourselves the option to dummify those features. Using the bedrooms, bathrooms, and price features, we also created new features describing the ratios between all combinations of the three.
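A minimal sketch of the ratio features, assuming pandas; the toy listings and the derived column names are our own, not the competition's schema:

```python
import pandas as pd

# Hypothetical mini-frame; the real data has the competition's columns.
listings = pd.DataFrame({
    "bedrooms": [1, 2, 3],
    "bathrooms": [1, 1, 2],
    "price": [2400, 3500, 5200],
})

# Ratios between all combinations of the three basic features.
# A small epsilon guards against division by zero (e.g. studio listings).
eps = 1e-6
listings["price_per_bedroom"] = listings["price"] / (listings["bedrooms"] + eps)
listings["price_per_bathroom"] = listings["price"] / (listings["bathrooms"] + eps)
listings["bed_bath_ratio"] = listings["bedrooms"] / (listings["bathrooms"] + eps)

print(listings[["price_per_bedroom", "bed_bath_ratio"]].round(2))
```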



Location

The dataset includes several ways of describing the location of each apartment: latitude and longitude coordinates, the display address, and the street address. Using the street address, we filled in missing or incorrect coordinates and categorized the street types. We noted whether the address contained a direction (north, east, south, or west) and whether it was a street or an avenue. We wanted to get the neighborhood of each listing but were unsuccessful after hitting the query limit for Google's API. Instead, we grouped locations in two ways: first by simple latitude and longitude boxes, and second by K-Means clustering, grouping the listings into twenty-five clusters.
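The K-Means grouping can be sketched with scikit-learn; the coordinates below are simulated around Manhattan rather than taken from the actual listings:

```python
import numpy as np
from sklearn.cluster import KMeans

# Simulated latitude/longitude pairs standing in for the listings.
rng = np.random.default_rng(0)
coords = np.column_stack([
    rng.normal(40.75, 0.05, 500),   # latitude
    rng.normal(-73.97, 0.05, 500),  # longitude
])

# Group listings into twenty-five location clusters, as described above.
km = KMeans(n_clusters=25, n_init=10, random_state=0)
cluster_labels = km.fit_predict(coords)

# Each listing now carries a neighborhood-like cluster ID.
print(len(set(cluster_labels)))
```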


Description and Apartment Features

The descriptions and apartment features are text data, so they were trickier to work with. We used TF-IDF to get numeric values for the frequent words that show up in "features" and "descriptions", then took the mean for every observation. We then used LDA topic modeling to assign seven topics to every observation based on its TF-IDF values. Subsequently, we used Google's Word2vec tool to train a vector for every word in "features" and "descriptions", giving us 100 numeric columns per word, and calculated the mean over all the words in each observation. We also obtained the word and character counts for the description and the count of apartment features.
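A minimal sketch of the TF-IDF step with scikit-learn; the snippets below are invented, and the LDA and Word2vec stages follow the same fit/transform pattern:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented snippets standing in for the "features"/"description" text.
docs = [
    "hardwood floors doorman elevator",
    "no fee laundry in building hardwood floors",
    "dishwasher elevator doorman",
]

# TF-IDF over the vocabulary, then the row-wise mean per listing,
# mirroring the per-observation summary feature described above.
vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)
mean_tfidf = tfidf.mean(axis=1)  # one numeric value per listing

print(mean_tfidf.shape)
```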


Numeric vectors by Word2vec

Building and Manager IDs

We grouped the manager and building IDs by frequency: if an ID's frequency fell within a certain percentile, it became part of that group. We also had the option to dummify the IDs.
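A sketch of the frequency grouping for manager IDs; the IDs, percentile cutoffs, and group labels here are all illustrative choices, not the exact ones we used:

```python
import pandas as pd

# Hypothetical manager IDs; the real data has thousands of managers.
manager_id = pd.Series(["a"] * 50 + ["b"] * 10 + ["c"] * 3 + ["d"] * 1)

# How many listings each manager posted.
freq = manager_id.value_counts()

# Bucket a manager by the percentile its posting frequency falls into.
def frequency_group(mid):
    pct = (freq < freq[mid]).mean()  # fraction of managers posting less often
    if pct >= 0.75:
        return "high_volume"
    elif pct >= 0.25:
        return "mid_volume"
    return "low_volume"

groups = manager_id.map(frequency_group)
print(groups.value_counts())
```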


Photos

Due to time constraints and limited processing power, we decided not to focus too heavily on the photos provided. We included the number of photos for each listing and extracted basic information about each photo: the minimum, maximum, and mean of its dimensions and its brightness.

Time Listing was Created

We simply broke the listing-creation timestamp into month created, weekday created, day created, and hour created. We also had the option to categorize and dummify these new features.
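The breakdown is a one-liner per feature with pandas; the timestamp format below is assumed from the competition JSON:

```python
import pandas as pd

# "created" timestamps in the format used by the competition JSON (assumed).
created = pd.to_datetime(pd.Series(["2017-01-15 03:24:00",
                                    "2017-02-03 18:05:00"]))

# Break the timestamp into the four features described above.
time_feats = pd.DataFrame({
    "month_created": created.dt.month,
    "weekday_created": created.dt.weekday,  # 0 = Monday
    "day_created": created.dt.day,
    "hour_created": created.dt.hour,
})
print(time_feats)
```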

Modeling and Predicting

For all our models, we dealt with categorical features by dummifying them.
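The dummification step itself is a single pandas call; the toy column and prefix below are hypothetical:

```python
import pandas as pd

# Toy categorical column; in the project, bedroom counts, street types,
# location clusters, etc. were optionally dummified this way.
df = pd.DataFrame({"bedrooms_cat": ["1", "2", "1", "3"]})
dummies = pd.get_dummies(df["bedrooms_cat"], prefix="bed")
print(list(dummies.columns))
```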


Machine learning used in project

Logistic Regression

To start, we created a logistic regression model including all the numeric features, such as the numeric vectors from Word2vec, TF-IDF, and topic modeling. However, after we submitted on Kaggle, our results were shockingly bad. After more visualization, we realized that there was too much multicollinearity between some columns, such as the Word2vec vectors. When we deleted those highly correlated columns, our results improved. Many other variables were also highly correlated with each other because of the way we created our extra features, so we carefully selected the features for our model. We tuned the model using cross-validation and grid search, checking a range of parameters and testing both L1 and L2 regularization.
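The correlated-column cleanup can be sketched as follows; the data is synthetic (x2 is built to nearly duplicate x1, mimicking our redundant Word2vec columns) and the 0.95 threshold is an illustrative choice:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the correlated columns: x2 nearly duplicates x1.
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
X = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.01, size=200),
    "x3": rng.normal(size=200),
})

# Drop one column from every pair whose absolute correlation exceeds
# the threshold, keeping only the upper triangle to avoid double counting.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
X_reduced = X.drop(columns=to_drop)
print(to_drop)
```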


Highly correlated word2vec vectors

Random Forest

Although we started with logistic regression, we focused most of our time on classification trees. This was logical because this was a classification problem and our initial research showed that none of the features had strong correlations with the response variable. Since logistic regression did not appear to be the best fit for this problem, we set up a random forest using our full dataset, including columns that were similar, and used grid search and cross-validation to tune the parameters. Our results were decent, much better than those from logistic regression.
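The tuning loop can be sketched with scikit-learn; the data is synthetic and the parameter grid is a small illustrative one, not the grid we actually searched:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data standing in for the engineered listing features.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# Grid search with cross-validation, scored on multiclass log loss
# (the competition's evaluation metric).
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 200], "max_depth": [5, None]},
    scoring="neg_log_loss",
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```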

Our basic assumption about using trees was that the more predictors, the better, which is why we took an all-out approach when creating features. In reality, our results actually improved when we removed a lot of features, specifically some of the dummies of categorical variables with many categories and our Word2vec vectors for description and apartment features. Because a random forest selects a subset of predictors when choosing the next best split, having many unhelpful columns makes it harder for the forest to find useful predictors to split on.


XGBoost

When competing on Kaggle, xgboost is a must: it provides better results without being computationally expensive. We did not have the computational power or time to tune the model ourselves, so we used parameters obtained from a public kernel for this competition. We used the reduced dataset from our random forest work to obtain our best results with xgboost. When evaluating the feature importances of our model, we see many obvious features that go into deciding interest level, like the ratios of basic apartment features and location. It is surprising to see that the hour the listing was created played a significant role in determining interest levels.
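The importance readout can be sketched as below. To keep the example dependency-free we use scikit-learn's GradientBoostingClassifier as a stand-in; xgboost's scikit-learn wrapper (xgboost.XGBClassifier) exposes the same feature_importances_ attribute. The data is synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data: with shuffle=False the informative/redundant columns
# come first (indices 0-5) and the last two columns are pure noise.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, shuffle=False, random_state=0)

gbm = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)
gbm.fit(X, y)

# Rank features by importance, as in a "top features" chart.
ranking = np.argsort(gbm.feature_importances_)[::-1]
print(ranking[:3])
```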


Top 20 most important features

Ensemble Models

We built a simple ensemble of our models using weighted averages, picking the weights manually according to our scores on Kaggle. Our xgboost models were weighted the most, and the logistic regression model the least. This slightly improved our results.
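The weighted average reduces to a few lines of numpy; the probability matrices and weights below are illustrative, not our actual submissions:

```python
import numpy as np

# Hypothetical class-probability predictions (high, medium, low) from
# three models on two test listings.
xgb_pred = np.array([[0.1, 0.3, 0.6], [0.2, 0.5, 0.3]])
rf_pred = np.array([[0.2, 0.3, 0.5], [0.3, 0.4, 0.3]])
logit_pred = np.array([[0.3, 0.3, 0.4], [0.3, 0.3, 0.4]])

# Hand-picked weights: xgboost weighted most, logistic regression least.
# Weights sum to 1, so each row remains a valid probability distribution.
w = [0.6, 0.3, 0.1]
ensemble = w[0] * xgb_pred + w[1] * rf_pred + w[2] * logit_pred
print(ensemble)
```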

Future Work

Our models worked out fairly well but can always be improved. Because of our all-out approach to feature engineering, we ended up creating a lot of similar features, so we could trim the feature set further by selecting more carefully for each model. At the very least, we hope to improve our logistic regression model by reducing multicollinearity. We would also like to explore more unsupervised learning techniques, like principal component analysis, and possibly build a neural network to evaluate the photos. Once we finish selecting our features, we would like to optimize the parameters for each model; instead of the standard grid search, we would like to implement Bayesian optimization for selecting parameters. Of course, our major goal for the future is to improve our predictions.

About Author

Tingting Chang

Tingting Chang earned her master's degree in Computer Science from the George Washington University. She is a self-starter and hardworking data scientist well equipped with data analytics skills to obtain actionable insights from massive datasets without losing sight...
