
Data Prediction of Zillow Rentals for Phoenix and Tampa

Yukti Kathuria, Chitra Sharathchandra, Gabriela Huelgas Morales, Juan R. Vasquez Jr. and Guillermo Ruiz
Posted on Apr 5, 2021

Motivation

US renters paid $512.5 billion in rent in 2019, according to Zillow’s data report. Phoenix led the pack with a rental market growth rate of 7.6%, followed by Tampa with a growth rate of 4%.


Figure 1: Zillow Press Release about US rental market

 

Goal

The goal of this project was to gain insights into the important factors that influence the rental values of homes and apartments in Phoenix and Tampa. This was done by exploring the Zillow Rental Index (ZRI) provided by 7Park Data and American Community Survey (ACS) data obtained from Google BigQuery. The ZRI was the target variable, which we predicted from the ACS data.

Data Exploration 


Figure 2: Average ZRI for Phoenix and Tampa from 2015 to 2017

 

During exploratory data analysis of the provided ZRI data, we found that Tampa has a much more competitive rental market than Phoenix and is also comparatively less affected by seasonality. An immediate insight drawn from Phoenix’s sellers' market is that waiting about 4 months (late winter/early spring) yields a better ROI on rental property investments.


Figure 3: ZRI Average Growth Rate for Phoenix and Tampa from 2015 to 2017

 

Seasonality can also be seen in the plot above, which displays the average ZRI growth rate for Phoenix and Tampa from 2015 to 2017. In the early autumn months, the average growth rate decreases relative to the values seen in late winter. However, to see the growth rate at a finer granularity, we have to delve into the zip code level.

Figure 4: ZRI of most populous zip code from Phoenix and Tampa

 

Looking at the most populous zip code in Phoenix and in Tampa, we found a consistent increase in ZRI from 2015 to 2017. Of the two, Tampa is the more competitive market.

Figure 5: ZRI Growth Rate of most populous zip code from Phoenix and Tampa for 2015 to 2017

 

At a granular level, comparing the average growth rates of the most populous zip codes in Phoenix and Tampa, Zip Code 85225 (located in Phoenix) showed a larger increase in growth rate than its Tampa counterpart. Both zip codes displayed an overall increase in growth rate over the years, along with seasonal trends. Exploring factors such as the effect of the school calendar on the rental index growth rate might help explain the drops and spikes during autumn and late winter/spring.

After studying the target variable (ZRI), we also went on to compare the ACS features of both cities in the years 2015-2017.   With over 50 features to study, we focused on features that seemed to identify the uniqueness of each city.  

Ways Phoenix and Tampa are different

The first features we looked at were geographic location and income per capita. Located on the coast of the Gulf of Mexico, Tampa offers many coastal areas. Phoenix, on the other hand, is a landlocked city. The income per capita in Tampa is highest near coastal areas. In Phoenix, the income per capita is highest in the northeastern suburban area. So, geographically, the two cities were significantly different, and income per capita was related to the geographic layout.

Figure 6: Phoenix Map

 

Figure 7: Tampa Map

 

Age of Population

The next feature we analyzed was the age of the population.  As seen in the graphs below, one can surmise that the Phoenix population is significantly younger than that of Tampa.  Tampa is a city for retirees and this is indicated by the higher number of people over 60.  It is surprising that in addition to having more people of age above 60, Tampa also has a significantly lower population of people under 18.  

Commute Time

The third feature we studied that showed that the cities were different was the commute time.  The data showed that the commute time in Tampa was significantly higher than in Phoenix.  Upon investigating further, we found that Tampa had a lot of traffic due to the combination of poor infrastructure and lack of good public transport. Most people drove and did not carpool.  Phoenix has short commute times because of superior infrastructure and many alternative routes to destinations and diversity in employment centers.

Ways Phoenix and Tampa are the same

There were three ACS features that the cities had in common, all related to housing. The first feature was the ratio of single-family to multi-family dwellings. The ratio in both cities is very similar, as shown in the graph below. The second feature was the ratio of vacant dwellings. This ratio was also similar in both cities.

Rent Burden

The third feature that both cities had in common was rent burden. The graph below shows that the majority of renters in both cities had a rent burden of 10%-50%.

Data Workflow 

Overview of workflow 

We wanted to work with reliable data. For this reason, we kept only those observations for which at least 70% of the feature values were present, and we applied the same 70% minimum to each feature across observations. For the remaining observations that had only a few missing values, we imputed the missing values using two strategies: i) forward filling, which uses the previous value to fill the present missing one, and ii) median value imputation. We dropped all the observations for which the label, or dependent variable, was missing.
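The filtering and imputation steps described above can be sketched in pandas as follows. This is a minimal illustration, not the project's actual code: the column names, the placeholder zip codes, the toy values, the order of the steps, and the choice to forward fill within each zip code are all assumptions.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Toy data standing in for the merged ZRI + ACS table (hypothetical columns and values).
df = pd.DataFrame({
    "zip_code": ["85001"] * 4 + ["33601"] * 4,
    "zri": [1200, 1210, 1225, 1240, 1500, nan, 1530, 1545],
    "median_age": [34, 34, 34, nan, nan, 41, 41, 42],
    "income_per_capita": [28000, 28500, 29000, 29500, 35000, 35500, 36000, 36500],
    "population": [41000, 41200, 41400, 41600, 23000, 23100, 23200, 23300],
    "vacancy_rate": [0.08, 0.08, 0.07, nan, 0.05, 0.05, nan, 0.05],
    "commute_minutes": [nan, nan, nan, 25, nan, nan, nan, 27],
})
MIN_COVERAGE = 0.70  # minimum share of non-missing values

# 1. Drop observations whose label (ZRI) is missing.
df = df.dropna(subset=["zri"])

# 2. Keep only features populated in at least 70% of the remaining rows.
features = [c for c in df.columns if c not in ("zip_code", "zri")]
features = [c for c in features if df[c].notna().mean() >= MIN_COVERAGE]

# 3. Keep only observations with at least 70% of those features populated.
row_coverage = df[features].notna().mean(axis=1)
clean = df.loc[row_coverage >= MIN_COVERAGE, ["zip_code", "zri"] + features].copy()

# 4. Forward fill within each zip code, then fall back to the column median
#    for values that have no earlier observation to copy from.
clean[features] = clean.groupby("zip_code")[features].ffill()
clean[features] = clean[features].fillna(clean[features].median())

print(clean)
```

In this toy example the sparse `commute_minutes` column and one sparse row are dropped, one vacancy value is forward filled, and one median age falls back to the column median.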

Data on Feature Engineering

Grouping

For feature engineering, our goal was to take the features we gathered from the ACS data and group several of them together. For example, the female_under_18 feature captures the information of several individual features such as females_under_5, females_5_9, and others. These were then grouped further into an age feature.

We also grouped several other features, including rent burden, commute types (public or private), income brackets, and size of dwellings. Redundant features were removed, such as an additional geoid, which was a blank field, and married households, whose information was already captured by family households. Additional features, such as the water to land ratio in each zip code of the two cities, were added. This process took us from over 240 features to just 59 of the most relevant features for the development of our models.
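As a rough sketch of this kind of grouping, the snippet below sums fine-grained age brackets into broader features. Only `female_under_18`, `females_under_5`, and `females_5_9` are named in the text; the remaining column names, the `group_columns` helper, and the counts are hypothetical.

```python
import pandas as pd

# Hypothetical ACS counts for two zip codes.
acs = pd.DataFrame({
    "females_under_5": [120, 80],
    "females_5_9": [130, 90],
    "females_10_14": [125, 85],
    "females_15_17": [75, 45],
    "males_under_5": [125, 82],
    "males_5_9": [135, 88],
    "males_10_14": [130, 90],
    "males_15_17": [80, 50],
})

# Mapping from each grouped feature to the columns it absorbs.
GROUPS = {
    "female_under_18": ["females_under_5", "females_5_9", "females_10_14", "females_15_17"],
    "male_under_18": ["males_under_5", "males_5_9", "males_10_14", "males_15_17"],
}


def group_columns(frame, groups):
    """Sum each list of columns into one new column and drop the originals."""
    out = frame.copy()
    for new_col, parts in groups.items():
        out[new_col] = out[parts].sum(axis=1)
        out = out.drop(columns=parts)
    return out


grouped = group_columns(acs, GROUPS)
# Second level of grouping: a single age feature across both sexes.
grouped["under_18"] = grouped[["female_under_18", "male_under_18"]].sum(axis=1)
print(grouped)
```

Keeping the mapping in one dictionary makes it easy to audit which raw columns were folded into which grouped feature.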

New features

We reasoned that access to water would have an impact on rental values, especially for Tampa. Thus, we used publicly available geographical data (geo US boundaries) to engineer a new feature that represented the water to land ratio of each Zip Code.
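A minimal sketch of such a feature is shown below, assuming each zip code record carries a land area and a water area and that the ratio is simply water area divided by land area. The exact definition used in the project is not stated, so the function name, the dictionary keys, and the areas are illustrative only.

```python
import math


def water_to_land_ratio(area_water_m2, area_land_m2):
    """Return water area divided by land area, or NaN when land area is not positive."""
    if area_land_m2 <= 0:
        return math.nan
    return area_water_m2 / area_land_m2


# Hypothetical areas in square meters, keyed by placeholder labels.
zip_areas = {
    "coastal_zip": {"land": 8_000_000, "water": 2_000_000},
    "inland_zip": {"land": 50_000_000, "water": 0},
}

ratios = {
    label: water_to_land_ratio(area["water"], area["land"])
    for label, area in zip_areas.items()
}
print(ratios)
```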

VIF analysis (discarded features)

After grouping the features, we still found a very high level of multicollinearity among them, as seen in the VIF graph below:

As seen in the above graph, VIF levels far exceeded the commonly used threshold of 5 to 10. In order to reduce the VIF, we removed the following features:

  1. married_households
  2. owner_occupied_housing_units_median_value
  3. owner_occupied_housing_units_lower_value_quartile
  4. less_than_college_educated
  5. amerindian_including_hispanic

After removing the features listed above, the VIF improved significantly with almost all groupings being below 9.
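For reference, the variance inflation factor of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. The sketch below computes it with plain NumPy on synthetic data; it is not the code used for the graph above, and the `vif` helper and the three synthetic columns are assumptions made for illustration.

```python
import numpy as np


def vif(X):
    """Return the variance inflation factor of each column of X."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + remaining features
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out


rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)  # nearly a copy of `a`, so strongly collinear

X = np.column_stack([a, b, c])
vif_all = vif(X)             # columns a and c get very large VIFs
vif_reduced = vif(X[:, :2])  # after dropping c, both remaining VIFs are close to 1
print(vif_all, vif_reduced)
```

Dropping one member of a collinear pair, as done with the features listed above, is what brings the remaining VIF values down.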

Cluster Analysis

We tried to analyze multicollinearity within the grouped features using cluster analysis but found this to be redundant with the VIF analysis.

Data Modeling

We started by building a battery of linear models (ridge and lasso) and tree-based models (random forest and gradient boosting). The dependent variable was the ZRI values for December 2017. As predictors, we included the following:

  1. Variables obtained from the census data.
  2. ZRI values for December 2015, that is, historical data from two years before the values we were trying to predict.

These models were implemented for the cities of Phoenix and Tampa separately. The quality of the predictions was tested against unseen data, using the root mean squared error (hereinafter ‘RMSE’) as the metric. The tree-based models provided the better predictions. Our gradient boosting model produced predictions that deviated only 0.03% from the actual values we were trying to predict.
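The sketch below shows how an RMSE can be computed and expressed as a percentage. The text does not say how the percentages were normalized, so dividing by the mean of the true values is an assumption, as are the function names and the example numbers.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def rmse_percent(y_true, y_pred):
    """RMSE as a percentage of the mean true value (assumed normalization)."""
    return 100.0 * rmse(y_true, y_pred) / float(np.mean(y_true))


# Hypothetical ZRI values for four zip codes.
zri_true = [1000, 1200, 1500, 1300]
zri_pred = [1010, 1190, 1480, 1320]

error = rmse(zri_true, zri_pred)
error_pct = rmse_percent(zri_true, zri_pred)
print(f"RMSE: {error:.2f}, relative RMSE: {error_pct:.2f}%")
```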

Motivated by the high accuracy of our models, we moved on to evaluate how well they generalize to other American cities. If they performed well even in different cities, we might have found a model that can provide important information on cities across the country.

Unfortunately, the accuracy of the models dropped significantly when applied to a different city. For example, our gradient boosting model for Phoenix predicted values for Tampa that diverged 12.97% (RMSE) from the true ZRI values. What could be the reason for this? This loss of accuracy could be explained if renters in each city value real-estate properties differently. In order to test this hypothesis, we sought to identify which particular variables influence our models the most in each city:

Figure 8: Most Important Features according to the models

As we can see in the figure above, there are substantial differences between the variables that better predict rental prices in each city. Median age appears as one of the best predictors in Tampa, though it doesn’t come up among the five most important ones in Phoenix. The fact that Tampa is a popular destination for retirees probably explains this difference.

We can also see how geographic factors are important. Tampa is a coastal city, and the water to land ratio appears as the third most important factor determining the rental price. When a property is located near the coast, the cost of renting it increases. Of course, this variable is irrelevant in Phoenix, an inland city.
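The cross-city test and the feature importance comparison can be mimicked on synthetic data. In the sketch below, two made-up "cities" have rents driven by different features, a gradient boosting model is trained on one, and its RMSE is measured on both. Everything here (the data generator, the three features, the scikit-learn estimator with default settings) is an illustrative assumption rather than the models behind Figure 8.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)


def make_city(n, driver):
    """Synthetic city whose rent depends on a single driving feature."""
    X = rng.normal(size=(n, 3))
    y = 1200 + 150 * X[:, driver] + rng.normal(scale=10, size=n)
    return X, y


def rmse(y_true, y_pred):
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


X_a, y_a = make_city(400, driver=0)  # city A: rent driven by feature 0
X_b, y_b = make_city(400, driver=1)  # city B: rent driven by feature 1

X_train, X_test, y_train, y_test = train_test_split(
    X_a, y_a, test_size=0.25, random_state=0
)
model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

rmse_same_city = rmse(y_test, model.predict(X_test))
rmse_other_city = rmse(y_b, model.predict(X_b))
top_feature = int(np.argmax(model.feature_importances_))

print(f"same city RMSE:  {rmse_same_city:.1f}")
print(f"other city RMSE: {rmse_other_city:.1f}")
print(f"most important feature index in city A: {top_feature}")
```

When the features that drive rent differ between cities, a model trained on one city relies on the wrong signal in the other, which is consistent with the drop in accuracy described above.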

The next step in our research was to determine to what extent historical data is relevant in predicting ZRI values. To this end, models were fitted using census and historical data separately. We found that census data alone can predict with high accuracy, with only 1.9% (RMSE) deviation from the actual ZRI values.

However, predictions from historical data alone provide even better results, similar to the ones obtained from our first models. At this point, we questioned whether it would be more efficient to predict using only historical data, as there would be no need to collect census data. In our next section, we explore how effective this strategy would be.

Residual analysis

Our previous models highlight the power of the historical rental values to predict future ones. As relevant as that might be, it is not very informative. To get more insightful results, we decided to use a different approach: we modeled the residuals. This was a two-step approach:

  1. We fitted a simple linear regression of 2018 ZRI values on 2016 ZRI values. This model, which we call the historical model, explained 96% of the variance in Phoenix and 94% in Tampa. Then we used the full dataset to re-train the historical model and used it to make predictions. Finally, we calculated the residuals by subtracting the predictions from the true values.
  2. We used the residuals as the dependent variable and ACS features to predict them. Since there were no obvious linear relations between the residuals and the features, we used tree-based models.

The residuals from the historical model vary in magnitude and can be positive or negative (Figure 9). Negative residuals indicate that the historical model is overpredicting the ZRI in that specific Zip Code. On the contrary, positive residuals are those Zip Codes for which the historical model underpredicts the ZRI.
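The first step and the sign convention can be illustrated with a small NumPy sketch. The ZRI values below are invented, and `np.polyfit` stands in for whatever linear regression implementation was actually used.

```python
import numpy as np

# Hypothetical ZRI values for five zip codes: earlier year and later year.
zri_past = np.array([1000.0, 1100.0, 1250.0, 1400.0, 1600.0])
zri_future = np.array([1080.0, 1150.0, 1400.0, 1480.0, 1720.0])

# Historical model: simple linear regression of future ZRI on past ZRI.
slope, intercept = np.polyfit(zri_past, zri_future, deg=1)
predicted = slope * zri_past + intercept

# Residual = true value minus prediction.
residuals = zri_future - predicted

# Share of variance explained by the historical model.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((zri_future - zri_future.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# Positive residual: the model underpredicts; negative: it overpredicts.
labels = np.where(residuals > 0, "underpredicted", "overpredicted")

for past, res, label in zip(zri_past, residuals, labels):
    print(f"past ZRI {past:.0f}: residual {res:+.1f} ({label})")
print(f"R^2 of the historical model: {r_squared:.3f}")
```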

After cross-validation and parameter tuning, the Gradient Boosting model was superior to Random Forest at predicting the residuals. Our best model explained 57% of the residuals’ variance in Phoenix and 64% of it in Tampa. The most important features for predicting rental values, in addition to historical data, were different for the two cities that we analyzed (Figure 8).
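A comparison of this kind could be set up with scikit-learn roughly as follows. The synthetic residuals, the four features, the parameter grids, and the 5-fold setting are assumptions chosen to keep the example small; they are not the grids or data used in the project, and which model wins on this toy data carries no meaning.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Synthetic stand-ins: four ACS-like features and non-linear residuals.
X = rng.normal(size=(300, 4))
residuals = 40 * np.sin(X[:, 0]) + 25 * (X[:, 1] > 0) + rng.normal(scale=5, size=300)

candidates = {
    "gradient_boosting": (
        GradientBoostingRegressor(random_state=0),
        {"n_estimators": [100, 200], "max_depth": [2, 3]},
    ),
    "random_forest": (
        RandomForestRegressor(random_state=0),
        {"n_estimators": [100, 200], "max_depth": [3, None]},
    ),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # Cross-validated grid search scored by R^2 (share of variance explained).
    search = GridSearchCV(estimator, grid, scoring="r2", cv=5)
    search.fit(X, residuals)
    results[name] = (search.best_score_, search.best_params_)

for name, (score, params) in results.items():
    print(f"{name}: mean CV R^2 = {score:.3f} with {params}")
```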

Figure 9: Variations from the historical data

 

Insights

Our analysis on the Zillow Rental Index resulted in actionable insights. First of all, we backed our intuitive observations on the importance of historical rental values to predict future ones. If you were to rely only on one feature to predict future rental values, that should be the current value without hesitation. However, those areas in which rental prices deviate from the historical ones (i.e. those that have a large residual from the historical model) proved to be very informative as to what other features impact rental values. 

We visualized the residuals and the most important features of each city on maps to aid our understanding of their relationships. In both cities, we observed that most Zip Codes were accurately predicted by the historical model (pink), as well as a few examples where the ZRI values were underpredicted (yellow) or overpredicted (purple). Surprisingly, the Zip Codes where ZRI is overpredicted by the historical model turned out to be the ones with the highest income per capita in both cities (Figures 10 and 11).

This observation uncovers an innate weakness of the ZRI as a metric because it fails to accurately describe those sectors of the dataset. In fact, Zillow has updated its metric and replaced the ZRI with ZORI to better capture the US rental market.

Zipcodes

Our analysis highlights those Zip Codes that are good investment opportunities. The Zip Codes that appear in our residuals maps in yellow are places where people are paying more in rent than expected. Thus, we recommend that investors take a good look at the possibility of acquiring properties there. Interestingly, in Phoenix, it seems that the low availability of vacant houses for sale is partially driving this deviation from the historical predictions (Figure 10).

Figure 10: Residual Analysis for most important features for Phoenix

 

In a specific region southwest of Tampa, a group of overpredicted Zip Codes is also the one with the highest median age and water to land ratio (Figure 11). 

Most of these overpredicted Zip Codes coincide with the highest income per capita in both cities. However, according to our analysis, on average, people are not willing to pay those higher rates. Consequently, investors who use the historical rental values as a target for properties in Zip Codes that are overpredicted by the historical model (purple in the residuals maps) should prepare for extended vacancy periods.

Figure 11: Residual Analysis for most important features for Tampa

 

Future Work

To take this project further, there are many directions we could pursue. One immediate next step is to explore more cities to validate our models, as well as to develop representative models for the cities and test them on a more granular level to generate insights and compare them to other cities.

In addition, it would be interesting to incorporate other data sources to improve the fidelity of our model. Sources such as crime statistics and migration data have been considered to refine our models. Furthermore, as discussed in the Insights section, we would like to explore the new Zillow index, ZORI, which is an improvement on ZRI, and see how our models change.

We also want to study some features more deeply. As was discussed in the feature engineering section, several features were grouped together to better describe the data as well as reduce multicollinearity but their effect is not clearly visible. So we want to study them in more depth to see what would be a better way to store that information in our models to improve the performance.

Moreover, we would like to perform a time series analysis and investigate the features that affect the seasonality of the rental index. We also want to explore how our models could change to account for the impact of COVID-19 on the rental market. Finally, we could create an interactive app to better display our data insights.

GitHub

About Authors

Yukti Kathuria

Yukti holds a B.S. and M.S. in Aerospace Engineering from the University of Illinois at Urbana-Champaign, and is extremely passionate about problem-solving. She finds data visualization to be one of the most interesting and insightful tools to understand...

Chitra Sharathchandra

Chitra Sharathchandra is a software engineer who is passionate about technology. Her current focus is on data science and data engineering. Chitra enjoys teaching South Indian classical music.

Gabriela Huelgas Morales

I am a Data Scientist with a Ph.D. in Biomedical Sciences. I enjoy the challenges of solving complex problems, finding meaningful relationships within the data, and providing actionable recommendations and insights. Before joining NYCDSA, I was a scientist...

Juan R. Vasquez Jr.

Juan is a recent graduate of NYC Data Science Academy where he studied dashboard creation, machine learning, and statistical analysis. His background of three years in the hospitality and commercial art industry allowed him to hone his organization...

Guillermo Ruiz

Data Science Professional and Economist with a demonstrated history of data analysis and machine learning modeling with a focus on storytelling with data. Passionate about helping companies to gather and analyze data to make more informed decisions to...
