NYC Data Science Academy| Blog

Predicting Interest for NYC Apartment Rental Listings - A Guideline For Landlords and Agents

Carlos Salas Najera, Tom Hunter, Drace Zhan and Jake Bialer
Posted on Mar 7, 2017

The objective of the following research is to predict the number of inquiries a new listing receives in NYC based on the dataset provided by RentHop. Identifying the level of interest using multiple features for each listing would assist RentHop in the attainment of the following business targets:

  • Optimize the way RentHop handles fraud control
  • Identify potential listing quality issues more easily
  • Allow owners and agents to better understand renters' needs and preferences.

The RentHop data comprises 49,352 observations in the training dataset and 74,659 in the official test dataset, covering rental listings in New York City from April to June 2016. For each listing, there are 14 explanatory variables describing the property, as presented below:

  • 3 float type variables: number of Bathrooms, Latitude and Longitude
  • 3 integer type variables: number of Bedrooms, listing_id and listing price
  • 6 string type variables: building_id, date listing created, description, display address, street address and manager_id
  • 2 list type variables: features and photos.

RHProcess
 

The response variable, interest level, is categorical with three classes: "high interest", "medium interest" and "low interest". As visualized below, "low interest" is the most common class, accounting for almost 70% of the training set.

[Figure: distribution of interest level classes in the training set]

The primary evaluation metric is multi-class log-loss; the formula is shown below. Secondary measures were used depending on the model tested. Multi-class log-loss measures the difference between the distribution of the actual labels and the classifier's predicted probabilities. A perfect classifier that assigns probability 1 to the correct class of every observation has a log-loss of 0, while a classifier that assigns each of the k labels the same probability (1/k) has a log-loss of -log(1/k), which equals log(k).

\[
\text{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{i,j} \log(p_{i,j})
\]

Where N is the number of listings in the test set, M is the number of class labels (3 classes), log is the natural logarithm, y_{i,j} is 1 if observation i belongs to class j and 0 otherwise, and p_{i,j} is the predicted probability that observation i belongs to class j.
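As a minimal sketch of the metric, the log-loss can be computed in plain Python as follows. The function name, the clipping constant `eps` and the toy inputs are assumptions made for illustration; they are not taken from the team's code.

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Multi-class log-loss.

    y_true: list of true class indices (0 .. M-1), one per observation.
    probs:  list of per-class predicted probabilities, one row per observation.
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        # y_{i,j} is one-hot, so only the true class contributes to the inner sum.
        # Clip the probability to avoid log(0) (eps is an illustrative choice).
        p = min(max(row[label], eps), 1 - eps)
        total += math.log(p)
    return -total / len(y_true)

# Toy example: a classifier that always predicts 1/3 for each of the 3 classes.
labels = [0, 1, 2, 0]
uniform = [[1 / 3, 1 / 3, 1 / 3] for _ in labels]
uniform_loss = multiclass_log_loss(labels, uniform)  # equals log(3)
```

With three classes, the uniform guess scores log(3), which serves as a natural baseline: any useful model should score below it.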

Two things are worth considering at this early stage with regard to maximizing prediction accuracy and minimizing the generalization error:

  1. The predictive model will have to minimize the misclassification error for the "low interest" class from the outset.
  2. The predictive model will have to identify nuances between "high interest" and "medium interest" listings in order to gain an edge in prediction accuracy.

New York Apartments for Rent: A Brief Industry Overview

The team dug into the NYC apartment rental industry specifics in order to understand the competitive structure, seasonal trends and overall degree of competitiveness of this particular real estate geographic market. Unfortunately it became apparent that there were important key features that could not be analyzed due to a lack of related data in RentHop's dataset:
  1. Listing Type Information: A key competitive aspect of the NYC apartment rental market is the role of licensed real estate agents, also known as brokers, who collect commissions called broker fees. While some listings are no-fee apartments, for the majority of rentals the tenant pays the broker fee. In addition, many rental brokers show open listings, meaning they do not have exclusive rights to their inventory, which may also be in the inventory of a competing broker. Regrettably, the data provided by RentHop does not include a feature describing the type of agent who posted the listing, which would undoubtedly have added significant value to our predictive model.
  2. Seasonality: The apartment rental season normally softens toward its end (around September), with rental prices trending downward. However, the New York City rental market differs from other big cities due to a structural supply deficit that mitigates this seasonal effect, even though it does not eliminate it completely. The data provided by RentHop only spans April to June 2016, which precluded assessing the seasonal impact on listings' interest levels.
  3. Summer Season: Since the subprime crisis of 2007, personal income growth has lagged rental rates, damaging tenants' affordability. The cumulative effect built up to a state where vacancy rates rose every month from July to September of 2015, a pattern that repeated in the summer of 2016. This summer effect has become more significant of late, yet the lack of data for this particular time span prevented the team from testing it.

EDA: Highlights and New Features Engineering

The team's EDA (Exploratory Data Analysis) produced new features from the raw data in order to test specific ex-ante hypotheses. Some of these new variables proved really useful in the models, while others failed to explain differences in interest level:

Bedrooms: Intuitively, there should be some predictive value embedded in this variable for situations of apparent mispricing given a certain number of bedrooms. The number of bedrooms played an integral role in some of the other engineered features, namely as a normalizing variable in the price vs. median variable (see below).

Drilling down further, the number of bedrooms was almost evenly distributed across the three interest classes. That said, it is important to note that low interest apartments were more heavily weighted towards one-bedroom listings, while medium and high interest listings were more likely to be two-bedroom listings. However, when the number of bedrooms was included in various models, either as the raw variable or as a boolean by-product, it did not provide large gains in predictive power. Nonetheless, the number of bedrooms proved to be a good ingredient for the creation of new features.

Bedrooms

Price: Unsurprisingly, one of the most important features in the dataset was the listing price. The price distribution deviated from a Gaussian shape, with a pronounced right skew due to the structural supply deficit in the NYC apartment rental market. Moreover, the price histogram below showed high kurtosis due to several outliers in the right tail. Several transformations were tested, with price per room and natural log of price proving the best options to smooth the raw variable and attain a more normally distributed shape without losing the power to discriminate between interest level classes.

Price
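The two price transformations mentioned above can be sketched as follows. The article does not define "price per room" precisely, so the denominator used here (bedrooms plus one, so that studios with zero bedrooms do not divide by zero) is an assumption, as are the function and key names.

```python
import math

def price_features(price, bedrooms):
    """Return two smoothed versions of the raw listing price."""
    # Assumption: count one extra room so that studios (0 bedrooms) are handled.
    rooms = bedrooms + 1
    return {
        "log_price": math.log(price),    # natural log compresses the right tail
        "price_per_room": price / rooms,
    }

two_bedroom = price_features(3000, 2)
studio = price_features(2400, 0)
```

The log transform pulls in the right-tail outliers, while the per-room price puts large and small apartments on a more comparable scale.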

Photos: As the histogram below highlights, the typical RentHop listing displays between 3 and 8 pictures. The "number of photos" variable was created from the original dataset and follows a fairly normal distribution. Breaking it down by interest level validated one of the ex-ante hypotheses: listings with no pictures were more likely to belong to the "low interest" class. In fact, the share of listings classified as "low" interest is higher for those with no pictures (95.2%) than for those with at least one image (67.5%).

Photos

Given the strong effect of photos, a new variable named "no_photo" proved useful when added to linear models such as Logit and Linear Discriminant Analysis (LDA), and to Bayesian methods like Bernoulli Naive Bayes (BNB). On the other hand, "number of pictures" gave better results when applied to tree-based classification models like Random Forest or Gradient Boosting. The lesson here is that feature engineering is necessary to extract value from an original dataset; however, the marginal value added by a newly engineered variable hinges on the type of model used.
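A minimal sketch of the two photo-based variables, assuming each listing's "photos" field is a list of image URLs. The key name `num_photos` and the example URLs are hypothetical; `no_photo` is the variable name used above.

```python
def photo_features(photos):
    """Derive the count and the boolean flag from a listing's photo list."""
    n = len(photos)
    return {
        "num_photos": n,          # count, better suited to tree-based models
        "no_photo": int(n == 0),  # boolean flag, useful for Logit, LDA and BNB
    }

with_photos = photo_features(["https://example.com/a.jpg", "https://example.com/b.jpg"])
without_photos = photo_features([])
```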

Listing features: Another useful raw variable was "features", which proved important for the team in terms of both feature engineering and model performance. The word cloud below highlights the key words used most frequently in listing advertisements, with "Doorman" and "Elevator" among the most important. Three new variables were created using input from the "features" field:

  • Number of Features: simply counts the number of features per listing.
  • Number of Key Features: counts only the key words identified as most frequent in the word cloud.
  • Number of Key Features Score: EDA showed that the importance of the number of features for interest level soars once the number of key features per listing rises above five; a scoring system was therefore built on "number of key features" to exploit this threshold effect.

features1
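The three variables derived from the "features" field could be sketched as below. The article names only "Doorman" and "Elevator" as key words and does not describe the scoring rule, so the rest of the key-word list and the binary score are hypothetical placeholders.

```python
# Hypothetical key-word list: only "doorman" and "elevator" come from the
# article; the other entries are placeholders for illustration.
KEY_FEATURES = frozenset({
    "doorman", "elevator", "dishwasher", "hardwood floors",
    "laundry in building", "fitness center", "roof deck",
})

def feature_counts(features, key_features=KEY_FEATURES, threshold=5):
    """Count all features, count key features, and score the threshold effect."""
    cleaned = {f.strip().lower() for f in features}
    num_key = len(cleaned & key_features)
    return {
        "num_features": len(features),
        "num_key_features": num_key,
        # Assumed scoring rule: flag listings above the five-key-feature threshold.
        "key_features_score": int(num_key > threshold),
    }

example = feature_counts(["Doorman", "Elevator", "Cats Allowed"])
```

A graded score (for example, several buckets rather than a single flag) would be an equally plausible reading of the scoring system described above.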

5-in-1 new feature: Information on the RentHop website suggests that listing interest may be related to a variable that measures the price of an apartment relative to nearby apartments:

  1. For each listing page, RentHop compares the price of an apartment to the median price of apartments in the same neighborhood with the same number of bedrooms and bathrooms.
  2. RentHop has a map search that allows users to find an apartment using a map.

To develop a variable that captures this impact, the team decided to compare the price of an apartment to the median price of the thirty nearest apartments with the same number of bedrooms and bathrooms. Using the thirty nearest points, each apartment had its own unique neighborhood based on the listings around it. This avoids the problem with fixed neighborhoods, where apartments on either side of a boundary line, only a few blocks away from each other, would not be compared. Since this calculation was computationally expensive, the team loaded the RentHop data into MongoDB and set up a geographic index to optimize performance. This feature offered valuable information not contained in other variables, and it was one of the top features in the team's prediction models.
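The idea behind this feature can be illustrated with a brute-force sketch in plain Python. This is not the team's MongoDB implementation: it swaps the geographic index for a full pairwise scan, which is only practical on small samples. The dictionary keys, the exclusion of the listing itself from its own neighborhood, and the toy data are assumptions.

```python
import math
from statistics import median

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def price_vs_neighbor_median(listings, k=30):
    """Ratio of each price to the median price of its k nearest comparable listings."""
    ratios = []
    for i, a in enumerate(listings):
        # Comparable listings: same bedrooms and bathrooms, excluding the listing itself.
        peers = [
            (haversine_km(a["latitude"], a["longitude"], b["latitude"], b["longitude"]), b["price"])
            for j, b in enumerate(listings)
            if j != i and b["bedrooms"] == a["bedrooms"] and b["bathrooms"] == a["bathrooms"]
        ]
        peers.sort(key=lambda pair: pair[0])
        nearest_prices = [price for _, price in peers[:k]]
        # No comparable neighbors: leave the feature undefined.
        ratios.append(a["price"] / median(nearest_prices) if nearest_prices else None)
    return ratios

# Toy data (invented coordinates and prices).
toy = [
    {"latitude": 40.7300, "longitude": -73.9900, "bedrooms": 1, "bathrooms": 1.0, "price": 2000},
    {"latitude": 40.7310, "longitude": -73.9910, "bedrooms": 1, "bathrooms": 1.0, "price": 3000},
    {"latitude": 40.7320, "longitude": -73.9920, "bedrooms": 1, "bathrooms": 1.0, "price": 4000},
    {"latitude": 40.7305, "longitude": -73.9905, "bedrooms": 3, "bathrooms": 2.0, "price": 6000},
]
ratios = price_vs_neighbor_median(toy, k=2)
```

A ratio below 1 marks a listing priced under its local median. The pairwise scan is quadratic in the number of listings, which is the cost a spatial index avoids.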

Model Engineering: Validation and Ensemble

First, multiple predictive models were built, fine-tuned and tested using the raw variables and the new significant features. In this first stage, accuracy (bias) and precision (variance) were studied on a stand-alone basis. In the second stage, the most powerful models of each type were combined using Python's Brew library, a tool for ensembling and stacking predictive models in order to improve on their stand-alone predictive ability. Regrettably, the ensemble's performance was not as encouraging as initially expected, with minimal gains in accuracy and variance.

The table below summarizes the best models per model family and highlights how tree-based models proved far superior, not only on the training sets and during validation but also in the acid test on the test data. K-Nearest Neighbors and Support Vector Machines were also tested, but their disappointing results led us to exclude them from the group below and focus our efforts on more promising models:

The most important models per family are described in more detail below:
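To make the ensembling step concrete, here is a minimal sketch of one common combination rule, a weighted average of the class probabilities predicted by several models (soft voting). It does not use Brew's API, and the article does not say which combination rule the team used; the function name and the toy probabilities are assumptions.

```python
def average_probabilities(model_probs, weights=None):
    """Blend per-class probabilities from several models.

    model_probs: list (one entry per model) of probability rows,
                 each row holding one probability per class.
    weights:     optional list of model weights; defaults to equal weights.
    """
    n_models = len(model_probs)
    weights = weights or [1.0 / n_models] * n_models
    total_weight = sum(weights)
    n_rows = len(model_probs[0])
    n_classes = len(model_probs[0][0])
    blended = []
    for i in range(n_rows):
        row = [
            sum(w * probs[i][c] for w, probs in zip(weights, model_probs)) / total_weight
            for c in range(n_classes)
        ]
        blended.append(row)
    return blended

# Toy predictions for one listing from two hypothetical models
# (class order: high, medium, low).
model_a = [[0.2, 0.3, 0.5]]
model_b = [[0.4, 0.3, 0.3]]
blended = average_probabilities([model_a, model_b])
```

Averaging tends to help most when the component models make different mistakes; when they are highly correlated, gains can be minimal.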

Linear and Discriminant Analysis Models

After trials of maximum-likelihood models such as Logit and of multiple linear discriminant models, LDA emerged as the best linear modeling option. LDA seeks a linear combination of features that characterizes or separates two or more classes of objects or events. The model assumes that the predictors follow a normal distribution, with a different mean per class for each feature but the same variances and covariances across classes.

The RentHop training sample contains a few significant variables with non-normal characteristics such as excess kurtosis or significant skew. Price was one of the most significant variables in the aforementioned models, so transforming it not only enhanced its predictive ability but also smoothed its non-Gaussian shape. Last but not least, analysis of variance and correlation metrics suggests that the constraint of equal variance and covariance across classes is a sensible one for the chosen LDA predictors, as the tables below showcase:
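For reference, the assumptions described above correspond to the standard textbook formulation of LDA, sketched here. The symbols are notation introduced for this illustration: \mu_k is the mean vector of class k, \Sigma the covariance matrix shared by all classes, and \pi_k the prior probability of class k.

```latex
x \mid y = k \;\sim\; \mathcal{N}(\mu_k, \Sigma), \qquad
\delta_k(x) = x^\top \Sigma^{-1} \mu_k - \tfrac{1}{2}\, \mu_k^\top \Sigma^{-1} \mu_k + \log \pi_k
```

A listing with feature vector x is assigned to the class with the largest discriminant score \delta_k(x). Because \Sigma is shared, each score is linear in x, which is what makes the decision boundaries linear.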

Correlpng

Another family of models analyzed was Naive Bayes, namely Multinomial Naive Bayes and BNB. For the former, a series of frequency-like new predictors was generated, whilst for BNB only boolean features like "no_photo" were considered. Regrettably, the main conclusion after reviewing the results from both linear and Naive Bayes models was that their accuracy and generalization error for the problem at hand were worse than those of non-parametric tree models such as Random Forest or Gradient Boosting.

Tree Models: Random Forest, Gradient Boosting and Extreme Gradient Boosting

With respect to model selection, tree-based models proved to be the best choice for explaining and predicting RentHop interest levels. They offered the flexibility needed to deal with the number of features in the original dataset and were also able to handle the non-linear decision boundaries required. Python's sklearn library makes it easy to tune their parameters and to manage the considerable computational effort this sort of model demands.

Feature Importance according to Random Forest model: Price features lead the pack

 

Finally, the most important aspect of tree-based models was that, for the most part, they remain explainable when answering client questions. Three tree-based models were effective at predicting our response while maintaining a competitive log-loss score: random forest, gradient boosting and extreme gradient boosting (XGBoost).

Unsurprisingly, XGBoost performed best, but there were still important reasons to favor random forest. Its output was more easily interpretable and helped highlight which features were performing well, while allowing us to maintain a high level of transparency and rigor during parameter fine-tuning. Its accuracy was almost as good as the XGBoost model's, yet its results remained reproducible and could be used to answer questions from users without much technical knowledge.
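The tuning and feature-importance workflow described above can be sketched with sklearn as follows. The data here is synthetic and the column names are labels chosen for illustration, so the resulting importances say nothing about the real RentHop features; the parameter grid is likewise an assumption, not the team's actual search space.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data: 300 listings, 4 features, 3 interest classes.
rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 4))
# By construction the labels depend mostly on column 0.
y = np.digitize(X[:, 0] + 0.1 * rng.normal(size=n), bins=[-0.5, 0.5])
feature_names = ["log_price", "price_vs_median", "num_photos", "num_features"]

# Grid search scored on (negative) multi-class log-loss with 3-fold CV.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={"max_depth": [3, 6], "min_samples_leaf": [1, 5]},
    scoring="neg_log_loss",
    cv=3,
)
search.fit(X, y)

# Impurity-based importances of the best model, keyed by feature name.
importances = dict(zip(feature_names, search.best_estimator_.feature_importances_))
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
```

Scoring the search on log-loss rather than accuracy keeps the tuning aligned with the evaluation metric, and the ranked importances give a quick, explainable view of which inputs the forest relies on.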

Click here to check code in Github

About Authors

Carlos Salas Najera

Carlos is a passionate individual for investments and technology with a long/short equity analysis and portfolio management experience approach that combines his fundamental, quantitative and data science skills in order to deliver superior returns. His core strength lies...

Tom Hunter


Drace Zhan

Drace Zhan has honed the bulk of his communication skills by teaching math and reading skills to high school and college graduates since 2007. A whiz at translating abstruse concepts to easily understood terms, he entered the field...

Jake Bialer

During the past eight years, I've worked as a full-stack developer, data analyst, and journalist. I've a track record of finding unique datasets through web scraping and using them to help companies solve key business problems. My NYCDSA...
