Data Driven Supervised Models Forecasting U.S. Rent Prices

Authors: Niki | Moritz | Hao-Wei | Matthew | Oren

The skills demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

60 Second Summary

  • Supervised ML models accurately forecast monthly rent prices in 2019 for more than 1,300 zip codes across the U.S., with an average error under $100/month. 
  • While historical rent prices are the strongest predictor for future rent prices, alternative data sources provide value in capturing unique trends that improve rent price forecasts. 
  • To integrate time series data from the Zillow Rent Index and several alternative data sources into a format suitable for supervised learning, various methods were employed: data cleaning, imputation, merging based on each source's availability in the benchmark year, and feature engineering with time lags.

Data Science Background

Whether you are a renter looking for a new place to live or an investor looking to buy property, in a competitive market, it is important to be able to identify the top locations to invest in and understand how rent prices change over time. Increasingly, the fields of Real Estate and Finance are using alternative data to predict investment returns and drive more informed decision-making. Broadly defined, alternative data is data from sources not traditionally used in or related to a given vertical, such as:

  • Sentiment analysis from social media data 
  • Public/government data sources
  • Satellite imaging
  • Data scraped from the internet

The corporate partner for this project is Markerr, a commercial real estate intelligence platform aiming to help investors make data-driven decisions about a property’s potential. Their tools rely on alternative data on “local demographics, employment, spending and other critical drivers of rent” to model geospatial trends in real estate.

Data Science Objectives

This project has 2 major goals:

1) National Rental Price Forecasting

  • Develop accurate time series models of U.S. rental prices by zip code

2) Alternative Data Predictor Identification

  • Find or engineer unique features from external data sources
  • Assess the predictive value of each data source and identify the most insightful features for potential inclusion in Markerr’s internal models

Data Collection, Cleaning & Feature Engineering

Data was gathered from 6 major sources, varying in geographical and temporal stratification:

[Figure: the 6 data sources and their geographic and temporal granularity]

1) Zillow Rent Index (ZRI)

The monthly Zillow Rent Index was selected as the target variable. ZRI data for multifamily (i.e. multiple unit) rental properties in 1,861 zip codes across the US was obtained from Sept 2010 to Jan 2020. On average, each zip code was missing ZRI values for 42 of the total 113 months included in this dataset. 

Missing values were handled as follows:

  • Exclude data prior to Jan 2015 due to high missingness
  • Exclude any zip code with a gap of 6+ consecutive missing months
  • Linearly impute gaps of 5 or fewer consecutive missing months

These cleaning procedures filtered the dataset down to 1,301 zip codes.
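The three cleaning rules can be sketched in pandas as follows; the toy data and column names are hypothetical, not the actual ZRI layout:

```python
import numpy as np
import pandas as pd

def clean_zri(zri, max_gap=5, start="2015-01"):
    """Apply the three cleaning rules to a month-indexed DataFrame
    with one column per zip code."""
    zri = zri.loc[start:]  # 1) drop data prior to Jan 2015
    keep = []
    for col in zri.columns:
        na = zri[col].isna()
        # length of the longest run of consecutive missing months
        longest_gap = na.groupby((~na).cumsum()).sum().max()
        if longest_gap <= max_gap:  # 2) drop zips with 6+ month gaps
            keep.append(col)
    # 3) linearly impute the remaining (interior) gaps
    return zri[keep].interpolate(method="linear", limit_area="inside")

months = pd.date_range("2014-11-01", periods=10, freq="MS")
toy = pd.DataFrame({
    "A": np.arange(1.0, 11.0),                                    # complete
    "B": [98, 99, 100, 101, np.nan, np.nan, 104, 105, 106, 107],  # short gap
    "C": [1.0, 1.0] + [np.nan] * 6 + [1.0, 1.0],                  # long gap
}, index=months)
cleaned = clean_zri(toy)  # keeps A and B, fills B's two-month gap
```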

2) American Community Survey (ACS)

The US Census Bureau contacts 3.5 million households each year to collect housing, social, demographic, and economic information, with the aim of equipping community planners with data. For this analysis, the ACS 5-year estimate datasets released in 2014, 2015, 2016, and 2017 were obtained. The original 252 features were eventually reduced to 42 after data cleaning and feature engineering. 

Examples of ACS data cleaning & feature engineering:

  • Relabeled highly specific categories into broader bins. For example, the original dataset had over 40 features describing gender_age groups (such as ‘female_15_17_years’). These were combined into 4 age groups: 0-17, 18-39, 40-64, and 65+ years. The same binning was applied to the commute time and number of dwellings variables as well.
  • Normalized raw counts into percentages. Raw aggregates (number of Asians, number of bachelor degree holders) were normalized by dividing by the relevant total population value in the given zip code.
  • Created ratios between relevant features. For example, a new feature was created to capture rental demand and vacancy by dividing the number of renter occupied housing units by the total number of housing units.
  • Dropped features sharing similar or redundant information. Features such as ‘bachelors_degree’, ‘bachelors_degree_2’ and ‘bachelors_degree_or_higher’ were redundant.
  • Imputed values for features with minimal missing values. For example, missing values of the median building age feature were linearly imputed based on adjacent years.
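A minimal sketch of the binning, normalization, and ratio transformations above, using hypothetical ACS column names and made-up counts:

```python
import pandas as pd

# Hypothetical raw ACS columns; the real dataset has many more
acs = pd.DataFrame({
    "female_15_17_years":    [120, 80],
    "male_15_17_years":      [130, 90],
    "total_population":      [10000, 8000],
    "bachelors_degree":      [2000, 1200],
    "renter_occupied_units": [1500, 900],
    "total_housing_units":   [4000, 3000],
}, index=["10001", "94110"])  # example zip codes

# 1) Collapse narrow gender_age columns into a broad age bin
acs["age_0_17"] = acs[["female_15_17_years", "male_15_17_years"]].sum(axis=1)

# 2) Normalize raw counts into shares of the zip code's population
acs["pct_bachelors"] = acs["bachelors_degree"] / acs["total_population"]

# 3) Ratio feature capturing rental demand and vacancy
acs["renter_occupancy_ratio"] = (
    acs["renter_occupied_units"] / acs["total_housing_units"]
)
```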

3) IRS Income Tax Returns

The IRS publishes tax and income data extracted from individual income tax returns filed each year, with features such as wages and salaries, number of returns, number of personal exemptions, adjusted gross income, and interest received. Data is available for tax years from 1998 through 2019. From this dataset, a feature comparing state and local tax burdens at the zip code level to federal taxes was created: 

  • % state and local income tax = (state + local) / (state + local + federal). 
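As a worked example of that formula (the dollar amounts below are purely illustrative):

```python
def pct_state_local_tax(state, local, federal):
    """Share of income tax paid at the state and local level,
    relative to taxes paid at all levels."""
    return (state + local) / (state + local + federal)

# Illustrative amounts: $6,000 state + $2,000 local vs. $24,000 federal
example = pct_state_local_tax(state=6_000, local=2_000, federal=24_000)  # 0.25
```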

4) Google Data Trends

Google Trends were queried using the pytrends library. Given the diversity of topics and keywords in Google’s search database, and the flexibility in search term retrieval enabled by the API, 5-15 terms related to each of the following subject areas were queried:

  • Housing rental markets
  • Jobs, employment, and retirement
  • Social media and pop culture (i.e. ‘fake news’ or ‘Twitter’)
  • Gentrification 
  • Socio-political leanings
  • LGBTQ-related
  • Natural disasters
  • Travel
  • Guns

For each search term, the monthly averages of its Google search popularity in each U.S. metropolitan area from 2015-2019 were obtained. As the Google-defined metropolitan areas differ slightly from those in the Zillow dataset, a mapping between the two sets by metropolitan area was created, allowing merging of Trends data with all other data sources. All constituent zip codes within a single Zillow-defined metropolitan area received the same Google search popularity time-series per search term.

A note on data normalization: Since Google Trends reports popularity relative to the other search terms in the same query (left), it was important to normalize each search term by that term’s maximum popularity (right) over the period Jan 2015 - Dec 2019, per metro area.
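The per-term, per-metro normalization can be sketched as follows; the long-format table, its column names, and the popularity values are hypothetical:

```python
import pandas as pd

# Hypothetical long-format Trends table: one row per (metro, term, month)
trends = pd.DataFrame({
    "metro": ["NYC"] * 4 + ["SF"] * 4,
    "term":  ["rent", "rent", "eviction", "eviction"] * 2,
    "month": pd.to_datetime(["2015-01", "2015-02"] * 4),
    "popularity": [50, 100, 5, 10, 20, 40, 60, 80],
})

# Normalize each term by its own maximum within each metro area, so
# terms that were queried together become comparable across queries
trends["norm"] = (
    trends.groupby(["metro", "term"])["popularity"]
          .transform(lambda s: s / s.max())
)
```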


Example of the Google Trends queries related to rental markets.

5) Bike Sharing Services

A recent study by the National Association of Realtors and the American Public Transportation Association found that, between 2012 and 2016, median home price increases were 4 to 24 percentage points higher in areas within a half-mile of public transit services than in areas with few public transit options. In the past decade throughout the United States, bike-sharing systems have gained popularity as a convenient mode of public transit. Thus, to measure the effects of accessible transportation and an easier work commute on rental prices, features from a national bike-sharing dataset were incorporated. This data source provided information on 6,975 docked bike stations, 127 different bike-sharing systems, and 389 zip codes across 44 states. 


Data cleaning involved fixing misaligned columns, assigning zip codes to locations that were ambiguous (appearing in more than one state), and repeating yearly values at the monthly level in order to merge with the other datasets. 

Newly Engineered Features

Three new features (aggregated by zip code and month) were engineered:

  1. The total number of bikeshare stations (sum of docked station locations)
  2. The total number of bikeshare systems (count of brands)
  3. Has bike-sharing programs (a binary flag indicating presence of bike-sharing)
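The three features above can be derived with a groupby over a station-level table; the zip codes, system names, and counts below are made up:

```python
import pandas as pd

# Hypothetical station-level table: one row per docked bike station
stations = pd.DataFrame({
    "zip":    ["10001", "10001", "10001", "60601"],
    "system": ["citibike", "citibike", "blaze", "divvy"],
})

by_zip = stations.groupby("zip").agg(
    n_stations=("system", "size"),    # 1. total docked station locations
    n_systems=("system", "nunique"),  # 2. distinct bike-sharing brands
)
by_zip["has_bikeshare"] = (by_zip["n_stations"] > 0).astype(int)  # 3. flag
```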

6) Business Dynamics Statistics (BDS)

The U.S. Census Bureau releases an annual Business Dynamics Statistics dataset, which includes measures of job creation and destruction associated with the number of businesses beginning, ending or continuing operations in each county. This is a comprehensive dataset covering all private sector, non-farm U.S. businesses. 

Four features (aggregated by county and linearly imputed to the monthly level) were engineered:

  1. The total number of firms
  2. The job creation rate 
  3. The job destruction rate 
  4. The total number of startup firms (with a firm age of 0 years)

Data Merging & Data Availability

The 6 data subsets were merged after converting features to the correct geographic (zip code) and temporal (monthly) aggregations. 

  • Google Trends and BDS features were mapped from metro area and county, respectively, to the zip code level, with all constituent zip codes receiving the same values. 
  • Annually reported ACS, IRS, Bikeshare, and BDS features were converted to monthly features.
  • Where appropriate, linear imputation was applied: annual values were assumed to correspond to December of each year, and linear interpolation filled the months between consecutive Decembers. Otherwise, the annual value was repeated for each month of the year. 
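The December-anchored annual-to-monthly interpolation can be sketched with pandas; the annual values here are illustrative:

```python
import pandas as pd

# Hypothetical annual feature, assumed to describe December of each year
annual = pd.Series(
    [100.0, 112.0],
    index=pd.to_datetime(["2016-12-01", "2017-12-01"]),
)

# Upsample to month-start frequency, then fill the months between
# consecutive Decembers by linear interpolation
monthly = annual.resample("MS").asfreq().interpolate(method="linear")
```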

When merging the data, it was crucial to account for data availability at the time of prediction, because every data source is published at a different lag from its time of collection. Therefore, this analysis was conducted under the premise that the study is taking place in December 2018, with the goal of making monthly forecasts for the following year, 2019. Thus, a temporal train-test split of the data was employed: training data from Jan 2015 to Dec 2018 and test data from Jan 2019 to Dec 2019. The figure below describes the temporal alignment of the different data sources:


Time periods over which the datasets were collected are represented by blue bars, and timepoints at which the datasets were published are represented by purple triangles. For example, the most recent ACS data available in Dec 2018 was collected over 2013-2017 and published in late 2018. In contrast, Google Trends data is collected and published continuously, so Trends data as recent as Dec 2018 was used in models to forecast 2019. “Economic development data” refers to the income tax and BDS datasets.

Feature Selection

After the 6 data sources were merged, the final dataset comprised 1,301 zip codes with hundreds of features. In order to reduce the multicollinearity between independent variables, the variance inflation factor was computed for each set of features per data source. Features with a variance inflation factor greater than 5 were removed. 
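A common recipe for VIF-based pruning is sketched below with statsmodels. The post removed features with VIF > 5 but does not say whether removal was one-shot or iterative; the iterative variant here is an assumption, and the toy data is made up:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(X, threshold=5.0):
    """Iteratively drop the feature with the highest variance inflation
    factor until every remaining feature has VIF <= threshold."""
    X = X.copy()
    while X.shape[1] > 1:
        vifs = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns,
        )
        if vifs.max() <= threshold:
            break
        X = X.drop(columns=[vifs.idxmax()])
    return X

rng = np.random.default_rng(0)
a = rng.normal(size=200)
c = rng.normal(size=200)
# "a" and "b" are nearly collinear; "c" is independent
X = pd.DataFrame({"a": a, "b": a + rng.normal(scale=0.01, size=200), "c": c})
reduced = drop_high_vif(X)  # one of a/b is dropped, c survives
```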

Next, a Lasso regression model was applied to each set of features individually and with each iteration, features with coefficient values of 0 were removed. This resulted in reducing the original exogenous feature space of 200 features to 50 features (17 ACS features, 1 Bikeshare feature, 3 Economic development features and 29 Google Trends features).
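The Lasso-based selection step might look like this on toy data; the feature setup and penalty strength are illustrative, not the project's actual settings:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only features 0 and 2 drive the target in this toy setup
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.1, size=200)

# The L1 penalty shrinks uninformative coefficients to exactly zero
lasso = Lasso(alpha=0.1).fit(StandardScaler().fit_transform(X), y)
selected = np.flatnonzero(lasso.coef_ != 0)  # surviving feature indices
```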

Model Selection & Tuning

Classical Time Series Forecasting: ARIMA(X)

ARIMA(X) was selected as a baseline model since it is the classical autoregressive approach to time series analysis. One known disadvantage of this approach is that a separate model must be fit for each zip code. Another is that it cannot penalize model complexity and offers little value for feature selection. 

Setup & Tuning: A separate ARIMA(X) model was fitted to each zip code’s ZRI time series (and exogenous features, in the case of ARIMAX) from Jan 2015 to Dec 2019. Optimal orders were selected by grid search, with order (p=1, d=1, q=0) found to be optimal for a large majority (>95%) of zip codes. Seasonal orders were investigated but found to be unnecessary for a large majority of zip codes and were thus excluded.

In the order notation (p, d, q), p represents the number of lag terms, d represents the number of differences applied to the time series, and q represents the order of the moving average term.

Supervised Machine Learning: Lasso Regression & XGBoost

  1. Lasso was selected for its advantage as an interpretable linear model that provides feature importances. It’s advantageous for avoiding overfitting by removing less informative features.
  2. XGBoost was selected for its strength in capturing nonlinear relationships. It also offers feature importances as an output.

Setup & Tuning: The target variable, ZRI, was first log-transformed and then standardized for each zip code using StandardScaler; the training features were standardized per zip code in the same way. For the training set, further feature engineering, inspired by the ARIMA model, explored time lags ranging from 1 to 12 months back, as well as moving averages and moving standard deviations of the autoregressive (ZRI) and alternative data features.

Inclusion of these lagged features allowed adaptation of these supervised machine learning models to time series forecasting. 
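The lagged-feature engineering can be sketched per zip code as follows; the column names and the particular lag/window choices shown are illustrative:

```python
import pandas as pd

def add_time_lags(df, col="zri", lags=(1, 2, 3, 6, 12), windows=(3, 6)):
    """Add lagged values, moving averages, and moving standard deviations
    of `col`, turning one zip's time series into a supervised table."""
    out = df.copy()
    for k in lags:
        out[f"{col}_lag_{k}"] = out[col].shift(k)
    for w in windows:
        # shift(1) keeps the windows strictly in the past (no leakage)
        out[f"{col}_ma_{w}"] = out[col].shift(1).rolling(w).mean()
        out[f"{col}_std_{w}"] = out[col].shift(1).rolling(w).std()
    return out

toy = pd.DataFrame(
    {"zri": [float(v) for v in range(1, 25)]},
    index=pd.date_range("2015-01-01", periods=24, freq="MS"),
)
feats = add_time_lags(toy)
```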

Both Lasso and XGBoost models were tuned using grid search with 5-fold cross-validation over the training period. Inverse transforms were then applied to the predicted and actual ZRI values (exponential function, and inverse scaling) to allow for the calculation of residuals.

Modeling Approaches & Feature Sets

In order to compare forecasting accuracy and feature importances, the models were trained on 4 different datasets, based on 3 types of modeling approaches, described below: 

1) ZRI Autoregression (AR) only: Autoregression is a common technique in time series modeling and consists of forecasting future values of a target variable based purely on its past values. While simple, this method often yields high accuracy.

  • Feature set #1: AR features (past ZRI values) alone

2) AR + some exogenous features: Past values of ZRI itself and past values of any other features are used to forecast future ZRI.

  • Feature set #2: AR + ACS features to include general demographic information
  • Feature set #3: AR + all exogenous features from all 5 alternative sources 

3) Only exogenous features, no AR: Past values of the target variable are excluded from the model, and future ZRI is forecasted using only the past values of features from the other 5 datasets.

  • Feature set #4: Exogenous features only (no AR)

Model Comparison: Forecasting National Rental Prices 

Test set RMSE, in dollars/month, per model type and modeling approach/feature set:

ARIMAX is not suitable for modeling all zip codes simultaneously. Because a separate ARIMAX model was fit to each of the 1,301 zip codes, serious overfitting was observed, especially when more exogenous features were added, and the ARIMAX model’s accuracy quickly deteriorated.

The Lasso and XGBoost models successfully forecasted rent prices with an average error under $100/month. XGBoost gave the lowest and most stable root mean squared error, around $50/month, for the following feature sets: AR only, AR + ACS features, and AR + all features.

Including autoregression was instrumental to the XGBoost model’s accurate price prediction: excluding the ZRI autoregressors increased RMSE by $30.40/month. The Lasso model, on the other hand, performed slightly better when trained on all of the exogenous features alone than with autoregression (a slight difference of $1.20/month).

Data Source Evaluation: Discovery of Predictive Features

The figure above shows the absolute value of coefficients for the top 15 exogenous features in the Lasso regression model (left) and the top 15 exogenous feature importances in the XGBoost model (right). Some key takeaways:

1) Google Trends-derived features made up the majority of the top features for predicting the output variable, rent prices per zip code and month. These features outperformed more traditional data sources, such as the ACS with its demographic, education, employment and infrastructure information, in summarizing an area’s rental market.

  • “Unemployment,” ”Hurricane,” “Twitter” and “Black Lives Matter” Google searches were among the most relevant features in both Lasso and XGBoost.
  • “Percent Workforce Unemployed” and “Median Building Age” were the only 2 ACS features that were highly relevant in both models.

2) Economic features from the alternative datasets, such as “State+Local Income Tax Percent” and “Number of Startup Firms,” were more relevant features for predicting rental prices than the ACS-derived economic features (including “Income per Capita” and “Poverty Rate”).

3) Similarly, the number of “Bike Share Stations” feature was a top feature in both models, and outperformed the transportation-related ACS feature, “Percent Commute Public Transport.” 

All in all, the exogenous features derived from the alternative data sources provided additional predictive power to both the Lasso and XGBoost models.

Data Insights

In the above feature importance analysis, unemployment-related features from two sources, Google Trends and ACS, both ranked within the top 15 features. To see how the features from the two datasets relate to each other, the correlation over time between ACS unemployment rates and search popularity for “Unemployment” and “Job Opportunities” was calculated by zip code.

Both distributions (above) are approximately normal, centered at moderate to high correlation, with left skew from a small number of zip codes with negative correlations. This suggests that the information from the two sources is complementary and that Google Trends data might be a possible supplement or substitute for traditional data sources. Google Trends is able to provide the most recent information for known indicators, while traditional sources are not published immediately. In this study, the ACS and Google Trends features related to unemployment rates were strongly predictive of rental prices for several of the models.

While ACS unemployment rates are 5-year estimates available only about one year after the data was collected, the Google Trends data on “Unemployment” and “Job Opportunities” search popularity is available on a continuous basis, which is potentially hugely beneficial depending on the forecasting task at hand.

Model performance varied by population density. Shown above are average RMSEs for XGBoost for rural, suburban, and urban zip codes. Rural, suburban, and urban were defined following the US Census guidelines of 0-500, 500-1000, and 1000+ persons per square mile, respectively. The relative performance of XGBoost models trained with different feature sets was consistent across population density groups, but urban zip codes had ~25% higher error. This suggests that important factors in urban rental markets, such as income inequality, may not be fully accounted for by this analysis.
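The density grouping described above is a simple binning; how the exact boundary values (500 and 1000) were assigned is an assumption here:

```python
def density_group(persons_per_sq_mile):
    """Bin a zip code by population density, following the US Census
    thresholds quoted in the text (0-500, 500-1000, 1000+)."""
    if persons_per_sq_mile < 500:
        return "rural"
    if persons_per_sq_mile < 1000:
        return "suburban"
    return "urban"

groups = [density_group(d) for d in (120, 750, 4500)]
```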

Data Science Conclusion

Ultimately, both the Lasso and XGBoost models were successfully deployed to forecast rental prices in 2019 with mean error below $100/month. Autoregressive inputs were crucial to high performance in the XGBoost model, and it performed best on the autoregressive + ACS feature set. The Lasso model had stable performance across feature sets, achieving the highest accuracy when trained on the autoregressive + all exogenous features set. ARIMA(X), while providing good accuracy with autoregressive data, was not suitable for a single generalized model of nationwide rent prices or for the predictor identification objectives of the project.

The client appreciated the unconventional approach and exploration of alternative data sources beyond broad demographic datasets like the American Community Survey. As described above, the Google Trends features, in particular, ranked highly in feature importance.

Future Directions

Future directions for this project include exploration of Google Trends data as a replacement or complement for other traditional data sources. For example, trends in the growth of bike sharing services might be tightly correlated with searches for “bike station near me." Appropriate Google search term popularities have the potential to provide information that is similar to features from a wide range of data sources, with the added advantage of highly recent data availability. 

Thank you for reading about our work! Please check out our author pages if you would like to take a look at our other projects.

About Authors

Niki Agrawal

Niki is a data science professional with 4+ years of data analysis experience in industry (digital health tech) and computational research (neuroscience, biomedical engineering). Niki enjoys applying creative and analytical thinking to solve real world problems with data....
View all posts by Niki Agrawal >

Moritz Becker

Strategy Consultant, with a passion for creating impact from data-driven business insights. Originally from Germany, I have been working in the US as an Engagement Manager in Strategy Consulting for over 3 years. My projects at work focus...
View all posts by Moritz Becker >

Matthew Fay

Matthew is a Data Science Fellow with a BS in chemistry from UNC-Chapel Hill. After 3 years studying towards a dual MD-PhD and researching antibody engineering, he pivoted to pursue data science and analytics. He has a passion...
View all posts by Matthew Fay >


Hao-Wei

Hao-Wei is an NYC Data Science Academy Fellow with master's degrees in Communication Engineering and Mathematics from National Taiwan University, and a Ph.D. in Mathematics from the Pennsylvania State University. With a broad experience ranging from...
View all posts by Hao-Wei >
