Contributed by Royce Ho and Tingting Chang. They are currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program taking place between Jan 9th and March 31st, 2017. This post is based on their third class project, Machine Learning (due in the 8th week of the program).
Royce Ho: GitHub | LinkedIn
Tingting Chang: GitHub | LinkedIn
Predicting Rental Listing Inquiries on RentHop
Finding apartments for rent is usually a challenging task. RentHop, one of many websites that try to make the process more convenient, helps renters by using data to sort its listings by quality. To improve its methods and to better understand the needs and preferences of renters, RentHop, along with Two Sigma, hosted a competition on Kaggle to predict the number of inquiries a new listing will receive based on its features. We took the challenge to see if we could accurately predict a listing's interest level based on the data provided.
Below is an overview of what we did:
About the Data
The participants of the competition are given two datasets: one for training, containing approximately 50,000 listings, and one for testing, with approximately 75,000 listings. The goal is to predict the interest level of each listing based on the thirteen features provided. The features include information about the location of the apartment, the time and date the listing was created, the description, IDs relating to the building and manager, and basic apartment information. Photos submitted for each listing have also been included for analysis.
The interest levels are split into three categories: high, medium, and low. As the graph indicates, there are far more low-interest listings than medium- or high-interest ones, and the number of high-interest listings is very small, even compared to medium.
When exploring the data, we always need to take into account correlation between predictor variables. Our basic instincts tell us that bedrooms, bathrooms, and price might be related. After further investigation, however, we realized that this probably wouldn't affect our model too much.
One of the most interesting things we noted when learning about our data was that the hour when a listing was posted seems to play a role in determining whether the interest level would be high, medium, or low. It was also interesting that most listings were posted during the middle of the night, which might be due to companies setting up their systems to post to rental sites automatically. The closer a listing's posting time is to the start of the workday, the more it seems to correlate with higher interest levels, which is likely due to the way RentHop surfaces listings on its front page. Newer listings are more likely to show up on the front page, which, in turn, leads to higher interest.
Although we were given a bit of information for predicting interest level, we wanted to include more information to help our predictions. Because the rules of the competition do not allow outside data to be used, we had to be a little creative when creating new features out of the old. Our strategy was to create as many new features as we could, then slowly eliminate those that were not useful for prediction. We broke the features down into smaller, related categories to simplify the process.
Basic Apartment Information
When looking for an apartment, every renter cares about the price and number of rooms. The dataset doesn’t specify the total number of rooms but it does include the number of bathrooms and bedrooms. Although the bedrooms and bathrooms are numerical, we thought it might be better to treat the information as categorical data so we decided to have the option to dummify those features. Using the bedrooms, bathrooms, and price features, we also created new features describing the ratio between all combinations of the three.
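A minimal sketch of the ratio features described above, using a toy pandas frame with hypothetical values in place of the real training data (the column names and the +1 denominator guard are our assumptions, not taken from the authors' code):

```python
import pandas as pd

# Toy listings frame standing in for the real training data (hypothetical values).
df = pd.DataFrame({
    "bedrooms": [1, 2, 3],
    "bathrooms": [1, 1, 2],
    "price": [2400, 3500, 6000],
})

# Ratios between combinations of bedrooms, bathrooms, and price.
# Add 1 to the denominators so studios (0 bedrooms) don't divide by zero.
df["bed_bath_ratio"] = df["bedrooms"] / (df["bathrooms"] + 1)
df["price_per_bed"] = df["price"] / (df["bedrooms"] + 1)
df["price_per_bath"] = df["price"] / (df["bathrooms"] + 1)
```

The same pattern extends to every pairing of the three base columns.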
Location
The dataset includes several ways of describing the location of each apartment: latitude and longitude coordinates, the display address, and the street address. Using the street address, we filled in missing or incorrect coordinates and also categorized the street types: we noted whether the address contained a direction (north, east, south, or west) and whether it was a street or an avenue. We wanted to get the neighborhood of each listing but were unsuccessful because we hit the query limit for Google's API. Instead, we decided to group locations in two ways: first by simple latitude and longitude boxes, and second by K-Means clustering, grouping the listings into twenty-five clusters.
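The K-Means grouping can be sketched as follows, with randomly generated coordinates around NYC standing in for the real listings (the seed and coordinate ranges are ours, only the 25-cluster choice comes from the text):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical coordinates roughly spanning NYC; the real data has one row per listing.
rng = np.random.default_rng(0)
coords = np.column_stack([
    rng.uniform(40.6, 40.9, 500),    # latitude
    rng.uniform(-74.1, -73.8, 500),  # longitude
])

# Group listings into 25 location clusters, as described above; the cluster
# label then becomes a categorical feature for each listing.
km = KMeans(n_clusters=25, n_init=10, random_state=0)
labels = km.fit_predict(coords)
```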
Description and Apartment Features
The descriptions and apartment features are text data, so they were trickier to work with. We used TF-IDF to get numeric values for the frequent words that show up in "features" and "descriptions", then took the mean for every observation. We then used LDA topic modeling to get seven topics for every observation based on their TF-IDF values. Subsequently, we used Google's Word2vec tool to train on every word in "features" and "descriptions", yielding 100 numeric columns for every word, and then calculated the mean over all the words in every observation. We also obtained the word and character counts for the description and the count of apartment features.
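The first step of that pipeline, TF-IDF followed by a per-listing mean, might look like the sketch below; the toy descriptions are hypothetical, and the LDA and Word2vec steps follow the same fit-then-aggregate pattern:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy descriptions (hypothetical); the real pipeline runs on the
# "description" and "features" columns of the listings.
docs = [
    "spacious bedroom with hardwood floors",
    "no fee doorman elevator laundry",
    "renovated kitchen hardwood floors doorman",
]

tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(docs)  # sparse (n_listings, n_terms) weights

# One summary number per listing: the mean TF-IDF weight over all terms.
row_means = np.asarray(matrix.mean(axis=1)).ravel()
```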
Building and Manager IDs
We grouped the manager and building IDs by frequency: if an ID's frequency fell within a certain percentile, it was assigned to that group. We also had the option to dummify the IDs.
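One way to implement that percentile grouping, assuming a simple two-bucket split at the 75th percentile (the threshold and labels are our illustration, not the authors' exact choices):

```python
import pandas as pd

# Hypothetical manager IDs; the real data uses the manager_id / building_id columns.
ids = pd.Series(["a"] * 50 + ["b"] * 30 + ["c"] * 15 + ["d"] * 5)

# Listing count for each ID, broadcast back onto every row.
counts = ids.map(ids.value_counts())

# Bucket IDs by whether their frequency reaches a chosen percentile.
top_cut = counts.quantile(0.75)
groups = (counts >= top_cut).map({True: "high_volume", False: "low_volume"})
```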
Photos
Due to time constraints and limited processing power, we decided not to focus too much on the photos provided. We included the number of photos for each listing and tried to extract basic information about each photo. Across each listing's images, we obtained the minimum, maximum, and mean of the dimensions and the brightness.
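A sketch of those per-listing aggregates, with random grayscale arrays standing in for decoded photos (the sizes and the mean-pixel-value proxy for brightness are our assumptions):

```python
import numpy as np

# Hypothetical stand-ins for one listing's photos: (height, width) arrays
# of grayscale pixel values in [0, 255].
rng = np.random.default_rng(1)
shapes = [(480, 640), (600, 800), (300, 400)]
photos = [rng.integers(0, 256, size=s) for s in shapes]

heights = [p.shape[0] for p in photos]
brightness = [p.mean() for p in photos]  # mean pixel value as a brightness proxy

features = {
    "n_photos": len(photos),
    "height_min": min(heights),
    "height_max": max(heights),
    "height_mean": float(np.mean(heights)),
    "brightness_mean": float(np.mean(brightness)),
}
```

The same min/max/mean summary would be repeated for widths.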
Time Listing was Created
We simply broke up the listing created feature into month created, weekday created, day created, and hour created. We also had the option to categorize and dummify these new features.
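Breaking out those datetime components is a one-liner each with pandas (the timestamps below are hypothetical examples in the competition's format):

```python
import pandas as pd

# Hypothetical "created" timestamps in the listing data's format.
created = pd.Series(pd.to_datetime([
    "2016-06-24 07:54:24",
    "2016-06-12 12:19:27",
    "2016-04-17 03:26:41",
]))

# Month, day, weekday (Monday=0), and hour features, as described above.
time_feats = pd.DataFrame({
    "month_created": created.dt.month,
    "day_created": created.dt.day,
    "weekday_created": created.dt.weekday,
    "hour_created": created.dt.hour,
})
```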
Modeling and Predicting
For all our models, we dealt with categorical features by dummifying them.
To start, we created a logistic regression model. We included all the numeric features, such as the numeric vectors from Word2vec, TF-IDF, and topic modeling. However, after we submitted to Kaggle, our results were shockingly bad. After more visualization, we realized that there was too much multicollinearity among some of the columns, such as the Word2vec features. When we deleted those highly correlated columns, our results improved. Many of our other variables were also highly correlated with each other because of the way we created our extra features, so we carefully selected the features for our model. We tuned the model using cross validation and grid search, checking a range of parameters and testing both L1 and L2 regularization.
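The grid search over both penalties can be sketched as below; the synthetic data and the specific grid values are our assumptions, while the L1/L2 comparison and cross validation come from the text:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the numeric feature matrix and 3-class interest target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = rng.integers(0, 3, size=300)

# Try both penalties over a small grid of regularization strengths,
# scored by log loss (the competition metric) under cross validation.
grid = GridSearchCV(
    LogisticRegression(solver="liblinear", max_iter=1000),
    param_grid={"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1.0]},
    scoring="neg_log_loss",
    cv=3,
)
grid.fit(X, y)
```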
Although we started with logistic regression, we focused most of our time on classification trees. This was logical because this was a classification problem and our initial research showed that none of the features had strong correlations with the response variable. Since logistic regression did not appear to be the best-fitting model for this problem, we set up a random forest using our full dataset, including columns that were similar, and used grid search and cross validation to tune the parameters. Our results were decent, much better than the results obtained from logistic regression.
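A minimal version of that tuned random forest, again on synthetic stand-in data (the grid values are illustrative, not the authors' actual search space):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the feature matrix and 3-class interest target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = rng.integers(0, 3, size=300)

# Tune a random forest with grid search and cross validation, as described.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 200], "max_depth": [None, 10]},
    scoring="neg_log_loss",
    cv=3,
)
grid.fit(X, y)

# Class probabilities (what Kaggle's log-loss metric scores) and
# feature importances for later feature pruning.
proba = grid.predict_proba(X)
importances = grid.best_estimator_.feature_importances_
```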
Our basic assumption about using trees was that the more predictors, the better, which is why we went with an all-out approach when creating features. In reality, our results actually improved when we removed a lot of features, specifically some of the dummies of categorical variables with many categories and our Word2vec columns for description and apartment features. Because a random forest selects a random subset of predictors when choosing the next best split, having many unhelpful columns makes it harder for the forest to find useful predictors to split on.
When competing on Kaggle, XGBoost is a must: it provides better results without being computationally expensive. We did not have the computational power or time to tune the model, so we used parameters obtained from a public kernel for this competition. We used the reduced dataset from our random forest to obtain our best results with XGBoost. When evaluating the feature importances for our model, we see many obvious features that go into deciding interest level, like the ratios of basic apartment features and the location. It is surprising that the hour the listing was created played a significant role in determining interest levels.
We did a simple ensemble of our models using weighted averages. We manually picked the weights according to our scores on Kaggle. Our XGBoost models were weighted the most, and the logistic regression model the least. This slightly improved our results.
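The weighted average itself is just a few lines; the probabilities and weights below are made-up illustrations of the heaviest-XGBoost, lightest-logistic-regression scheme:

```python
import numpy as np

# Hypothetical class-probability predictions from three models for 4 listings.
p_xgb = np.array([[0.7, 0.2, 0.1]] * 4)
p_rf = np.array([[0.5, 0.3, 0.2]] * 4)
p_lr = np.array([[0.4, 0.4, 0.2]] * 4)

# Manually chosen weights (sum to 1): XGBoost heaviest, logistic regression lightest.
weights = {"xgb": 0.6, "rf": 0.3, "lr": 0.1}
ensemble = weights["xgb"] * p_xgb + weights["rf"] * p_rf + weights["lr"] * p_lr
```

Because the weights sum to one, each ensembled row remains a valid probability distribution.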
Our models worked out fairly well, but they can always be improved. Because of our all-out approach to feature engineering, we ended up creating a lot of similar features, so we could limit the feature set further by selecting features more carefully for each model. At the very least, we hope to improve our logistic regression model by reducing multicollinearity. We would also like to explore more unsupervised learning techniques, like principal component analysis, and possibly build a neural network to evaluate the photos. Once we finish selecting our features, we would like to better optimize the parameters for each model; instead of standard grid search, we would like to try Bayesian optimization for selecting parameters. Of course, our major goal for the future is to improve our predictions.