NYC Data Science Academy| Blog

Data Analysis and Predictions of Zillow Rental Index

Hong Chan Kim, Bruce Alphenaar and Steven Lantigua
Posted on Jan 7, 2021

Link to the Code

The skills the authors demonstrated here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.

Data Science Objective

The two main objectives of this project are to determine the factors that influence the Zillow Rental Index (ZRI) and to use them to produce annual forecasts of the ZRI at the zip code level. The Zillow Observed Rent Index (ZORI), which measures changes in asking rents across the United States over time, was used as the benchmark of accuracy.

Data

The first source of data is the Federal Housing Finance Agency, which provides annual personal consumption expenditures (an inflation measure) and the house price index, a weighted repeat-sales index. The second source is the U.S. Bureau of Labor Statistics, which provides county-level unemployment and job openings data.

American Community Survey

The American Community Survey from the U.S. Census Bureau provided the demographic data, which it reports for every county in the United States. Zillow itself was the source of the home value index. The air quality index came from the U.S. Environmental Protection Agency. Below is a summary of the data sources and the final list of features.

  1. Federal Housing Finance Agency: Annual inflation and house price index
  2. U.S. Bureau of Labor Statistics: County level unemployment and job openings statistics
  3. U.S. Census Bureau - American Community Survey (ACS): County level demographic data
  4. U.S. Environmental Protection Agency: Monthly air quality index
  5. Zillow: Zillow home value index

Final List of Features

  • ZHVI (Zillow Home Value Index)
  • House Price Index (HPI)
  • Air Quality Index
  • Total Population
  • Population Density
  • Unemployment Rate
  • Education level (% Bachelors)
  • Construction Permit
  • Median Income
  • Total Households
  • Personal Consumption Expenditures (PCE)
  • Rental Vacancy Rate
  • Job Openings
  • Commute Time
  • Public vs Private Sector workers
  • Gross Rent as a Percentage of Household Income (GRAPI)
  • Gini Index (Income Inequality measure)

Feature Engineering

Gross rent as a percentage of household income (GRAPI) is a feature that was engineered to enhance the model. It is a key indicator of housing affordability and serves as a gauge for future percentage increases in rent. GRAPI was created by dividing median gross rent by median household income (after monthly interpolation).
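The GRAPI calculation can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes the annual figures are linearly interpolated to monthly points, and that rent is a monthly amount while income is annual (so income is divided by 12). The function names and sample values are hypothetical.

```python
def interpolate_monthly(annual_values):
    """Linearly interpolate a list of annual values to monthly values.

    Returns 12 points per year-to-year interval plus the final annual value.
    """
    monthly = []
    for start, end in zip(annual_values, annual_values[1:]):
        step = (end - start) / 12
        monthly.extend(start + step * m for m in range(12))
    monthly.append(annual_values[-1])
    return monthly


def grapi(median_gross_rent, median_household_income):
    """Gross rent as a percentage of household income.

    Assumption: rent is a monthly figure and income is annual,
    so income is converted to a monthly amount first.
    """
    return [
        100 * rent / (income / 12)
        for rent, income in zip(median_gross_rent, median_household_income)
    ]


# Hypothetical annual medians for one county (two consecutive years).
rent_monthly = interpolate_monthly([1200.0, 1260.0])
income_monthly = interpolate_monthly([60000.0, 61200.0])
grapi_series = grapi(rent_monthly, income_monthly)
```

With these made-up inputs, the first value in `grapi_series` is 24, meaning rent takes up 24% of monthly household income.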

Using Data to Analyze Model Fitting and ZORI Predictions

To predict ZORI, three types of machine learning models (multiple linear regression, gradient boosting regressor, and random forest regressor) were each fitted on two labels: ZORI itself and the year-over-year percentage change in ZORI. The ZORI predictions were produced by averaging the predictions from the six resulting models at the zip code level, at the county level, and at the national level.

Multiple Linear Regression

Lasso penalization was used to eliminate less important features by grid-searching through various values of the alpha hyperparameter and calculating the feature coefficients for each alpha. Variance inflation factors (VIFs) were then calculated for the remaining features to reduce multicollinearity. Finally, a multiple linear regression was fitted using the remaining features. The remaining features were:

  • ZHVI
  • Construction permits
  • Total households
  • Rental vacancy rate
  • GRAPI
  • Population density
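The VIF screening step described above can be sketched as follows. This is an illustrative implementation using NumPy only, not the authors' code, and the data are synthetic. It relies on the standard definition VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing feature j on all the other features.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return the VIF of each column of X (rows = observations).

    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j on all the other columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = float(residuals @ residuals)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r_squared))
    return vifs


# Synthetic data: the third column is nearly a copy of the first,
# so both should show a high VIF while the second stays near 1.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.05 * rng.normal(size=200)
vifs = variance_inflation_factors(np.column_stack([a, b, c]))
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic collinearity. The threshold the authors used is not stated in the post.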

The final model produced a test RMSE of $293.28, which is relatively high given that the average rent in the US is about $2,000. As a result, the prediction was also somewhat off, as shown below.

[Figure: ZORI prediction from the multiple linear regression model]
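For reference, RMSE and its size relative to the average rent can be computed as in the short sketch below. The sample rents and predictions are hypothetical. Only the $293.28 test RMSE and the roughly $2,000 average rent come from the text.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical rents and predictions, in dollars.
actual = [2000.0, 1500.0, 2500.0]
predicted = [2300.0, 1200.0, 2200.0]
error = rmse(actual, predicted)

# Reported test RMSE as a share of the ~$2k average rent cited in the text.
relative_error = 293.28 / 2000
```

Here `relative_error` comes to about 0.147, so the linear model's typical miss is close to 15% of the average rent.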

A similar process was applied to fit the year-over-year percentage change in ZORI. The final feature set was more extensive, with only the Gini index (income inequality measure) eliminated. The test RMSE was 2.51%. The prediction for the same zip code is shown below. The predicted ZORI values were calculated by multiplying the prior year's ZORI by (1 + the year-over-year percentage change predicted by the model).

[Figure: ZORI prediction from the multiple linear regression model fitted on percentage change]
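The conversion from a predicted percentage change back to a ZORI level is simple arithmetic. A minimal sketch follows, with a hypothetical function name and made-up numbers.

```python
def zori_from_pct_change(prior_year_zori, predicted_pct_change):
    """Convert a predicted year-over-year change into a ZORI level.

    predicted_pct_change is a fraction, e.g. 0.03 for +3%.
    """
    return prior_year_zori * (1 + predicted_pct_change)


# Hypothetical zip code: $2,000 last year, model predicts +3%.
predicted_zori = zori_from_pct_change(2000.0, 0.03)
```

In this example the predicted ZORI is about $2,060.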

Gradient Boosting Regressor

The model was tuned for various combinations of hyperparameters to optimize performance. The best hyperparameters are shown below: 

[Figure: best hyperparameters for the gradient boosting regressor]

The model produced a much better test RMSE than the multiple linear regression model. The $17.31 test RMSE is less than 1% of average rent in the US, and the prediction is more stable as a result. 

[Figure: ZORI prediction from the gradient boosting regressor]

The second gradient boosting regressor was fitted on percentage change in ZORI. The optimized hyperparameters were the same as above, and the test RMSE was 1.52%, much lower than that of the equivalent multiple linear regression model. 

[Figure: ZORI prediction from the gradient boosting regressor fitted on percentage change]

Random Forest Regressor

As with gradient boosting, the random forest regressor's hyperparameters were tuned to optimize performance. The model produced a test RMSE of $57.14, which falls between those of the first two model types.

[Figure: best hyperparameters for the random forest regressor]

Prediction using this model is shown below:

[Figure: ZORI prediction from the random forest regressor]

Hyperparameters changed slightly when fitting the percentage change in ZORI using a random forest regressor. The model performance was in between the first two model types. Test RMSE was 1.97%. 

Prediction using this model is shown below:

As mentioned above, predictions varied considerably with the choice of model. To smooth the result, predictions from all six models were averaged to produce the final prediction for each zip code. To remove geographical discrepancies, predictions were also made at the county level and at the national level.
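The averaging step might look like the sketch below. It assumes the three percentage-change models are first converted to dollar levels using the prior year's ZORI and that all six predictions are weighted equally. The post does not state the weighting, and the function name and numbers are hypothetical.

```python
from statistics import mean


def ensemble_zori(level_preds, pct_change_preds, prior_year_zori):
    """Average ZORI predictions from level models and pct-change models.

    level_preds: ZORI predictions in dollars, one per model.
    pct_change_preds: year-over-year changes as fractions, one per model.
    """
    converted = [prior_year_zori * (1 + pct) for pct in pct_change_preds]
    return mean(level_preds + converted)


# Hypothetical outputs for one zip code from the three model types.
level_preds = [2100.0, 2040.0, 2060.0]   # linear, boosting, forest
pct_change_preds = [0.05, 0.02, 0.03]    # same order, fitted on % change
final_prediction = ensemble_zori(level_preds, pct_change_preds, 2000.0)
```

The same function could be applied to county-level or national-level model outputs.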

Time Gradient Models

Following the successful implementation of the fitting procedures described above, three additional models that relied on the time gradient of the features and/or the target variable were explored. While preliminary, these models provide alternative ways of exploring the data that could eventually be useful for predicting price changes. The models are as follows:

  1. Snapshot of the zip code specific features to predict the linear slope of ZORI versus year
  2. The slope of the features versus year prior to 2018 to predict changes in ZORI from 2018 to 2019
  3. All zip code specific features to predict whether a particular zip code would see an increase or decrease in rental prices due to COVID. 

In the first model, the slope of ZORI versus year was determined for each zip code by linear regression. Random forest fitting was then used to predict the slope of ZORI using the fixed features from previous years. Once the slope was predicted, it could then be used to determine ZORI for subsequent years. Data were divided into training and test sets by zip code so that the test data was completely isolated from model training.
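The slope target for the first model can be obtained with an ordinary least squares fit per zip code. The sketch below shows that step and the linear extrapolation only. The random forest that predicts the slope from the features is omitted, and the ZORI history is made up.

```python
def fit_line(years, values):
    """Ordinary least squares fit of values against years.

    Returns (slope, intercept).
    """
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    sxx = sum((x - mean_x) ** 2 for x in years)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical ZORI history for one zip code.
years = [2015, 2016, 2017, 2018]
zori = [1800.0, 1850.0, 1900.0, 1950.0]
slope, intercept = fit_line(years, zori)

# Under the linear assumption, extrapolate to a later year.
zori_2019 = intercept + slope * 2019
```

Here the fitted slope is $50 per year and the extrapolated 2019 value is $2,000. Real ZORI series are rarely this linear.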

The slope predictions of the model were reasonably accurate, giving a coefficient of determination R² = 0.77. However, the ZORI predicted using the slopes had a fairly large mean error of 42%. This is most likely because the assumption that the ZORI increases linearly with time is typically incorrect. The model could be greatly improved by including higher order terms in the gradient of the time dependence.

In the second model, the slope of each feature with respect to time was determined for each zip code by linear regression for the years 2014-2018. These slopes were then used in a random forest model with the 2019 ZORI as the target variable. Once again, data were divided into training and test sets by zip code. The coefficient of determination R² = 0.42 was worse than in the previous model, though the mean error was somewhat better at 39%.
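Computing the slope of every feature for one zip code can be done in a single call, as in this sketch. It uses NumPy's `polyfit`, which fits each column separately when given a 2-D array. The two features and their values are hypothetical stand-ins for the real feature set.

```python
import numpy as np

years = np.array([2014, 2015, 2016, 2017, 2018], dtype=float)

# Hypothetical feature history for one zip code: rows are years,
# columns are features (median income, unemployment rate in %).
features = np.array([
    [50000.0, 6.0],
    [51000.0, 5.5],
    [52000.0, 5.0],
    [53000.0, 4.5],
    [54000.0, 4.0],
])

# polyfit fits each column separately when y is 2-D; row 0 of the
# result holds the slopes, row 1 the intercepts. Years are centered
# to keep the fit well conditioned.
coeffs = np.polyfit(years - years.mean(), features, deg=1)
feature_slopes = coeffs[0]
```

The resulting `feature_slopes` vector (here about +1000 per year for income and -0.5 per year for unemployment) would form one row of the input to the random forest.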

Going forward, it would be good to combine the two models so that feature slopes (including higher order gradient terms) could be used to predict ZORI slopes. Time variations in select features should anticipate time variations in the ZORI. 

The third model predicted whether a particular zip code would see an increase or decrease in ZORI due to the COVID-19 pandemic. Changes in housing prices during 2020 were caused by a number of COVID related factors. Fear of COVID spreading in crowded areas led to movement out of densely populated urban areas to less crowded suburban or rural areas.

Remote work and school meant that it was no longer necessary to live within commuting distance of the city center, while families forced to work and care for children at home needed more living space. The impact of COVID on housing prices in different counties is illustrated by the scatter plot below where the area of each county is plotted as a function of population.

The points are colored according to the time gradient of ZORI per county in 2020: blue points for decreasing ZORI, orange points for increasing ZORI and green points for flat ZORI.  Areas that saw a decrease in ZORI are located predominantly in the right side of the plot where the population density is high, while areas that saw an increase in ZORI are located in the lower left corner where the population density is low.

This agrees with the fact that in 2020 more people moved out of the city and into the country than in other years due to COVID. 

Based on this preliminary analysis, two classification models (logistic regression and a random forest classifier) were developed to predict the ZORI gradient due to COVID-19. The gradient of ZORI in each zip code was separated into two classes, with 67% having a positive gradient and 33% having a negative gradient. The coefficient of determination for the logistic regression model was R² = 0.72 with an area under the ROC curve (AUC) of 0.75, while the random forest classifier had R² = 0.76 and AUC = 0.85.
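To make the AUC figures concrete, the sketch below computes the area under the ROC curve directly from its pairwise definition. The labels and scores are hypothetical, and this is only an illustration of the metric, not the authors' evaluation code.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one
    (ties count as one half).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical zip codes: 1 = ZORI rose in 2020, 0 = ZORI fell.
labels = [1, 1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.5, 0.2]
auc = roc_auc(labels, scores)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation. Because the classes are split 67/33, AUC is a more informative summary here than raw accuracy would be.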

Interestingly, many of the important features for predicting the ZORI gradient were those that distinguished urban and rural areas. These included population density, percent of population with a bachelor's degree (higher in urban areas), commuting time, and Gini index.

Future Steps

Different models were explored for understanding the impact of various features on the Zillow rental price index. Going forward, there are a number of areas that could be explored further:

  • Apply models to more frequently updated feature data to provide a timely prediction of price increases
  • Fit the errors of the model predictions to higher order terms in the gradient of the time dependence
  • Explore correlations between gentrification and rental price increase
  • Consider change of address data from USPS to determine rental price changes
  • Focus on select geographical areas for more accurate analysis

About Authors

Hong Chan Kim

Hong is a data science fellow at New York City Data Science Academy (NYCDSA) with expected graduation date of December 2020. His domain expertise lies in the US equity market, where he spent 7 years in the hedge...

Bruce Alphenaar


Steven Lantigua

Steven Lantigua is a Data Science Fellow at NYC Data Science Academy and a recent graduate from the University of Connecticut. He hopes to leverage his background in research & advisory, where he spent the last year of...
