Data Driven Supervised Models Forecasting U.S. Rent Prices
The skills demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
60 Second Summary
- Supervised ML models were applied to accurately forecast monthly rent prices in 2019 for 1000+ zip codes across the U.S., with an average error under $100/month.
- While historical rent prices are the strongest predictor for future rent prices, alternative data sources provide value in capturing unique trends that improve rent price forecasts.
- Various methods of data cleaning, imputation, merging based on data availabilities in the benchmark year, and feature engineering with time lags were employed in order to integrate time series data from the Zillow Rent Index and several alternative data sources in a format appropriate for supervised learning.
Data Science Background
Whether you are a renter looking for a new place to live or an investor looking to buy property, in a competitive market, it is important to be able to identify the top locations to invest in and understand how rent prices change over time. Increasingly, the fields of Real Estate and Finance are using alternative data to predict investment returns and drive more informed decision-making. Broadly defined, alternative data is data from sources not traditionally used in or related to a given vertical, such as:
- Sentiment analysis from social media data
- Public/government data sources
- Satellite imaging
- Data scraped from the internet
The corporate partner for this project is Markerr, a commercial real estate intelligence platform aiming to help investors make data-driven decisions about a property’s potential. Their tools rely on alternative data on “local demographics, employment, spending and other critical drivers of rent” to model geospatial trends in real estate.
Data Science Objectives
This project has 2 major goals:
1) National Rental Price Forecasting
- Develop accurate time series models of U.S. rental prices by zip code
2) Alternative Data Predictor Identification
- Find or engineer unique features from external data sources
- Assess the predictive value of each data source and identify the most insightful features for potential inclusion in Markerr’s internal models
Data Collection, Cleaning & Feature Engineering
Data was gathered from 6 major sources, varying in geographical and temporal stratification:
1) Zillow Rent Index (ZRI)
The monthly Zillow Rent Index was selected as the target variable. ZRI data for multifamily (i.e. multiple unit) rental properties in 1,861 zip codes across the US was obtained from Sept 2010 to Jan 2020. On average, each zip code was missing ZRI values for 42 of the total 113 months included in this dataset.
Missing values were handled as follows:
- Exclude data prior to Jan 2015 due to high missingness
- Exclude any zip code with a gap of 6+ consecutive months of missing values
- Apply linear imputation to fill gaps of <=5 consecutive months of missing values
These cleaning procedures filtered the dataset down to 1,301 zip codes.
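The gap-exclusion and imputation rules above can be sketched as follows. This is a minimal illustration assuming a wide DataFrame with one column per zip code and one row per month; the layout and function name are illustrative, not Zillow's actual schema.

```python
import numpy as np
import pandas as pd

def clean_zri(zri: pd.DataFrame, max_gap: int = 5) -> pd.DataFrame:
    """Drop zip codes with 6+ consecutive missing months; linearly
    impute shorter gaps. Layout (months x zip codes) is illustrative."""
    kept = {}
    for zipcode in zri.columns:
        s = zri[zipcode]
        # Length of each run of consecutive NaNs: group by the count of
        # non-missing values seen so far, which is constant within a gap.
        gap_lengths = s.isna().astype(int).groupby(s.notna().cumsum()).sum()
        if gap_lengths.max() > max_gap:
            continue  # exclude this zip code entirely
        # Fill the remaining short gaps (<= 5 months) by linear interpolation
        kept[zipcode] = s.interpolate(method="linear", limit_area="inside")
    return pd.DataFrame(kept)
```

`limit_area="inside"` restricts interpolation to interior gaps, so leading or trailing missing months are left untouched rather than extrapolated.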
2) American Community Survey (ACS)
The US Census Bureau contacts 3.5 million households each year to collect information regarding housing, social, demographic, and economic characteristics, with the aim of equipping community planners with data. In this analysis, the ACS 5-year estimate datasets released in 2014, 2015, 2016, and 2017 were obtained. The datasets originally contained 252 features, which were reduced to 42 after data cleaning and feature engineering.
Examples of ACS data cleaning & feature engineering:
- Relabeled highly specific categories into broader bins. For example, the original dataset had over 40 features describing gender_age groups (such as ‘female_15_17_years’). These were combined into 4 age groups: 0-17, 18-39, 40-64, and 65+ years. The same transformation was applied to commute time and number-of-dwellings variables as well.
- Normalized raw counts into percentages. Raw aggregates (number of Asians, number of bachelor degree holders) were normalized by dividing by the relevant total population value in the given zip code.
- Created ratios between relevant features. For example, a new feature was created to capture rental demand and vacancy by dividing the number of renter occupied housing units by the total number of housing units.
- Dropped features sharing similar or redundant information. Features such as ‘bachelors_degree’, ‘bachelors_degree_2’ and ‘bachelors_degree_or_higher’ were redundant.
- Imputed values for features with minimal missing values. For example, missing values of the median building age feature were linearly imputed based on adjacent years.
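A compact sketch of the binning, normalization, and ratio steps above. The column names (`female_0_14`, `total_population`, etc.) are hypothetical stand-ins for the ACS fields; the real tables use many more gender_age categories.

```python
import pandas as pd

# Hypothetical ACS column names for illustration only.
AGE_BINS = {
    "age_0_17":  ["female_0_14", "female_15_17", "male_0_14", "male_15_17"],
    "age_18_39": ["female_18_39", "male_18_39"],
}

def engineer_acs(acs: pd.DataFrame) -> pd.DataFrame:
    out = pd.DataFrame(index=acs.index)
    # 1) Collapse fine-grained gender_age counts into broad bins
    for bin_name, cols in AGE_BINS.items():
        out[bin_name] = acs[cols].sum(axis=1)
    # 2) Normalize raw counts to shares of the zip code's population
    for bin_name in AGE_BINS:
        out[bin_name + "_pct"] = out[bin_name] / acs["total_population"]
    # 3) Ratio feature capturing rental demand and vacancy
    out["renter_occupied_ratio"] = (
        acs["renter_occupied_units"] / acs["total_housing_units"]
    )
    return out
```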
3) IRS Income Tax Returns
From individual income tax returns filed each year, the IRS extracts tax and income data with features such as wages and salaries, number of returns, number of personal exemptions, adjusted gross income, and interest received. Data is available for tax years from 1998 through 2019. From this dataset, a feature comparing local tax rates at the zip code level to the federal tax rates was created:
- % state and local income tax = (state + local) / (state + local + federal).
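The formula translates directly into a vectorized computation; the column names here are hypothetical stand-ins for the IRS fields.

```python
import pandas as pd

def local_tax_share(irs: pd.DataFrame) -> pd.Series:
    """% state and local income tax = (state + local) / (state + local + federal).
    Column names are hypothetical stand-ins for the IRS fields."""
    state_local = irs["state_tax"] + irs["local_tax"]
    return state_local / (state_local + irs["federal_tax"])
```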
4) Google Data Trends
Google Trends were queried using the pytrends library. Given the diversity of topics and keywords in Google’s search database, and the flexibility in search term retrieval enabled by the API, 5-15 terms related to each of the following subject areas were queried:
- Housing rental markets
- Jobs, employment, and retirement
- Social media and pop culture (e.g. ‘fake news’ or ‘Twitter’)
- Socio-political leanings
- Natural disasters
For each search term, the monthly averages of its Google search popularity in each U.S. metropolitan area from 2015-2019 were obtained. As the Google-defined metropolitan areas differ slightly from those in the Zillow dataset, a mapping between the two sets by metropolitan area was created, allowing merging of Trends data with all other data sources. All constituent zip codes within a single Zillow-defined metropolitan area received the same Google search popularity time-series per search term.
A note on data normalization: since Google Trends reports popularity relative to the other search terms in a single query (of up to 5 terms), it was important to normalize each search term by that term’s maximum popularity over the time period of Jan 2015 - Dec 2019, per metro area.
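The per-term, per-metro normalization above can be sketched in one `groupby`. This assumes a long-format frame with `['metro', 'term', 'month', 'popularity']` columns, which is an illustrative schema rather than raw pytrends output.

```python
import pandas as pd

def normalize_trends(trends: pd.DataFrame) -> pd.DataFrame:
    """Rescale each search term so its peak popularity over the study
    window equals 1.0, independently per metro area."""
    # Max popularity of each (metro, term) pair, broadcast back to rows
    peak = trends.groupby(["metro", "term"])["popularity"].transform("max")
    return trends.assign(popularity_norm=trends["popularity"] / peak)
```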
5) Bike Sharing Services
A recent study by the National Association of Realtors and the American Public Transportation Association found that between 2012 and 2016, median home price increases were 4-24 percentage points higher in areas within a half-mile of public transit services than in areas with few public transit options. In the past decade throughout the United States, bike-sharing systems have gained popularity as a convenient mode of public transit. Thus, to measure the effects of accessible transportation and an easier work commute on rental prices, features from a national bike-sharing dataset were incorporated. This data source provided information on 6,975 docked bike stations and 127 different bike-sharing systems across 389 zip codes in 44 states.
Data cleaning involved fixing misaligned columns, resolving ambiguous locations (zip codes spanning more than one state), and repeating yearly values at the monthly level in order to merge with the other datasets.
Newly Engineered Features
Three new features (aggregated by zip code and month) were engineered:
- The total number of bikeshare stations (sum of docked station locations)
- The total number of bikeshare systems (count of brands)
- Has bike-sharing programs (a binary flag indicating presence of bike-sharing)
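The three aggregations above reduce to a single `groupby`. This is a sketch assuming one row per docked station with `['zipcode', 'system']` columns; the names are illustrative.

```python
import pandas as pd

def bikeshare_features(stations: pd.DataFrame) -> pd.DataFrame:
    """Aggregate docked-station records into the three per-zip features."""
    agg = stations.groupby("zipcode").agg(
        n_stations=("system", "size"),    # total docked stations
        n_systems=("system", "nunique"),  # distinct bike-sharing brands
    )
    # Any zip present in this table has at least one station; zip codes
    # absent from it would receive 0 after merging (e.g. via fillna).
    agg["has_bikeshare"] = 1
    return agg
```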
6) Business Dynamics Statistics (BDS)
The U.S. Census Bureau releases an annual Business Dynamics Statistics dataset, which includes measures of job creation and destruction associated with the number of businesses beginning, ending or continuing operations in each county. This is a comprehensive dataset covering all private sector, non-farm U.S. businesses.
Four features (aggregated by county and linearly imputed to the monthly level) were engineered:
- The total number of firms
- The job creation rate
- The job destruction rate
- The total number of startup firms (with a firm age of 0 years)
Data Merging & Data Availability
The 6 data subsets were merged after converting features to the correct geographic (zip code) and temporal (monthly) aggregations.
- Google Trends and BDS features were mapped from metro area and county, respectively, to the zip code level, with all constituent zip codes receiving the same values.
- Annually reported ACS, IRS, Bikeshare, and BDS features were converted to monthly features.
- When appropriate, linear imputation was applied: each annual value was assigned to December of its year, and linear interpolation filled the months between consecutive Decembers. Otherwise, constant values were repeated for each month of the year.
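The December-anchored interpolation above can be sketched with pandas resampling; the function name and `{year: value}` input format are illustrative.

```python
import pandas as pd

def annual_to_monthly(values: dict) -> pd.Series:
    """Expand {year: value} to a monthly series: each annual value is
    anchored at December of its year, and months between consecutive
    Decembers are filled by linear interpolation."""
    s = pd.Series(
        list(values.values()),
        index=[pd.Timestamp(year=y, month=12, day=1) for y in values],
    )
    # Upsample to month-start frequency, then fill the gaps linearly
    return s.resample("MS").asfreq().interpolate(method="linear")
```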
When merging the data, it was crucial to account for data availability at the time of prediction, because every data source is published at a different lag from its time of collection. Therefore, this analysis was conducted under the premise that the study is taking place in December 2018, with the goal of making monthly forecasts for the next year, 2019. Thus, a temporal train-test split of the data was employed: training data from Jan 2015 to Dec 2018 and test data from Jan 2019 to Dec 2019. The figure below describes the temporal alignment of the different data sources:
Time periods over which the datasets were collected are represented by blue bars, and timepoints at which the datasets were published are represented by purple triangles. For example, the most recent ACS data available in Dec 2018 was collected over 2013-2017 and published in late 2018. In contrast, Google Trends data is collected and published continuously, so Trends data as recent as Dec 2018 was used in models to forecast 2019. “Economic development data” refers to the income tax and BDS datasets.
After the 6 data sources were merged, the final dataset comprised 1,301 zip codes with hundreds of features. In order to reduce the multicollinearity between independent variables, the variance inflation factor was computed for each set of features per data source. Features with a variance inflation factor greater than 5 were removed.
Next, a Lasso regression model was applied to each feature set individually; at each iteration, features with coefficients of 0 were removed. This reduced the original exogenous feature space of 200 features to 50 features (17 ACS features, 1 Bikeshare feature, 3 Economic development features, and 29 Google Trends features).
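The two-stage reduction above can be sketched as follows. Only the VIF threshold of 5 and the zero-coefficient rule come from the text; everything else is illustrative. In particular, the VIFs are computed here via the standard identity that, for (roughly) standardized features, they are the diagonal of the inverse correlation matrix, and sklearn's `LassoCV` stands in for whatever Lasso configuration was actually used.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV

def compute_vifs(X: pd.DataFrame) -> pd.Series:
    # For standardized features, VIFs are the diagonal of the
    # inverse correlation matrix.
    corr = np.corrcoef(X.values, rowvar=False)
    return pd.Series(np.diag(np.linalg.inv(corr)), index=X.columns)

def select_features(X: pd.DataFrame, y: pd.Series, vif_cut: float = 5.0):
    """Stage 1: iteratively drop the column with the highest VIF > 5.
    Stage 2: keep only features Lasso assigns a nonzero coefficient."""
    cols = list(X.columns)
    while len(cols) > 1:
        vifs = compute_vifs(X[cols])
        if vifs.max() <= vif_cut:
            break
        cols.remove(vifs.idxmax())
    lasso = LassoCV(cv=5).fit(X[cols], y)
    return [c for c, coef in zip(cols, lasso.coef_) if coef != 0]
```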
Model Selection & Tuning
Classical Time Series Forecasting: ARIMA(X)
- ARIMA(X) was selected as a baseline model, as it is the classical autoregressive approach to time series analysis. One known disadvantage of this approach is that a separate model must be fit for each zip code. Another is that it cannot penalize model complexity and offers little value for feature selection.
Setup & Tuning: A separate ARIMA(X) model was fitted to each zip code’s ZRI time series (and exogenous features, in the case of ARIMAX) from Jan 2015 to Dec 2019. Optimal orders were selected by grid search, with (p=1, d=1, q=0) found to be optimal for a large majority (>95%) of zip codes. Seasonal orders were investigated but found unnecessary for the large majority of zip codes, and were thus excluded.
In the order notation (p, d, q), p represents the number of lag terms, d represents the number of differences applied to the time series, and q represents the order of the moving average term.
Supervised Machine Learning: Lasso Regression & XGBoost
- Lasso was selected as an interpretable linear model that provides feature importances; it also helps avoid overfitting by removing less informative features.
- XGBoost was selected for its strength in capturing nonlinear relationships. It also offers feature importances as an output.
Setup & Tuning: The target variable, the Zillow Rent Index (ZRI), was first log-transformed and then standardized for each zip code using StandardScaler; the remaining training features were standardized per zip code in the same way. For the training set, further feature engineering, inspired by the ARIMA model, explored time lags ranging from 1 to 12 months back, moving averages, and moving standard deviations of the autoregressive (ZRI) and alternative data features.
Inclusion of these lagged features allowed adaptation of these supervised machine learning models to time series forecasting.
Both Lasso and XGBoost models were tuned using grid search with 5-fold cross-validation over the training period. Inverse transforms were then applied to the predicted and actual ZRI values (exponential function, and inverse scaling) to allow for the calculation of residuals.
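The lag, moving-average, and moving-standard-deviation features described above can be sketched as follows. The specific lags and window sizes are illustrative choices, not the exact set the authors used.

```python
import pandas as pd

def add_lag_features(df: pd.DataFrame, col: str = "zri",
                     lags=(1, 2, 3, 6, 12), windows=(3, 6)) -> pd.DataFrame:
    """Turn a monthly series into a supervised-learning design matrix
    with lags, moving averages, and moving standard deviations."""
    out = df.copy()
    for k in lags:
        out[f"{col}_lag{k}"] = out[col].shift(k)
    for w in windows:
        # shift(1) so each window uses only information available
        # before the month being predicted
        out[f"{col}_ma{w}"] = out[col].shift(1).rolling(w).mean()
        out[f"{col}_std{w}"] = out[col].shift(1).rolling(w).std()
    return out
```

Rows whose lags reach before the start of the series contain NaNs and would be dropped before training.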
Modeling Approaches & Feature Sets
In order to compare forecasting accuracy and feature importances, the models were trained on 4 different datasets, based on 3 types of modeling approaches, described below:
1) ZRI Autoregression (AR) only : Autoregression is a common technique in time series modeling and consists of forecasting future values of a target variable based purely on its past values. While simple, this method often yields high accuracy.
- Feature set #1: AR features (past ZRI values) alone
2) AR + some exogenous features: Past values of ZRI itself and past values of any other features are used to forecast future ZRI.
- Feature set #2: AR + ACS features to include general demographic information
- Feature set #3: AR + all exogenous features from all 5 alternative sources
3) Only exogenous features, no AR : Past values of the target variable are excluded from the model, and future ZRI is forecasted using only the past values of features from the other 5 datasets.
- Feature set #4: Exogenous features + no AR
Model Comparison: Forecasting National Rental Prices
Test set RMSE, in dollars/month, per model type and modeling approach/feature set:
ARIMAX is not suitable for modeling all zip codes simultaneously. Because a separate ARIMAX model had to be fit to each of the 1,301 zip codes, serious overfitting was observed, especially as more exogenous features were added, and the ARIMAX model’s accuracy quickly deteriorated.
The Lasso and XGBoost models successfully forecasted rent prices with an average error under $100/month. XGBoost gave the lowest and most stable root mean squared error, around $50/month, for the following feature sets: AR only, AR + ACS features, and AR + all features.
Including autoregression was instrumental to the XGBoost model’s accurate price prediction: excluding the ZRI autoregressors increased RMSE by $30.40/month. The Lasso model, on the other hand, slightly outperformed its autoregression-only score when trained on the exogenous features alone (a difference of $1.20/month).
Data Source Evaluation: Discovery of Predictive Features
The figure above shows the absolute value of coefficients for the top 15 exogenous features in the Lasso regression model (left) and the top 15 exogenous feature importances in the XGBoost model (right). Some key takeaways:
1) Google Trends-derived features made up the majority of the top features for predicting the output variable, rent price per zip code and month. These features outperformed more traditional data sources, like the ACS with its demographic, education, employment, and infrastructure information, in summarizing an area’s rental market.
- “Unemployment,” ”Hurricane,” “Twitter” and “Black Lives Matter” Google searches were among the most relevant features in both Lasso and XGBoost.
- “Percent Workforce Unemployed” and “Median Building Age” were the only 2 ACS features that were highly relevant in both models.
2) Economic features from the alternative datasets, such as “State+Local Income Tax Percent” and “Number of Startup Firms,” were more relevant features for predicting rental prices than the ACS-derived economic features (including “Income per Capita” and “Poverty Rate”).
3) Similarly, the number of “Bike Share Stations” feature was a top feature in both models, and outperformed the transportation-related ACS feature, “Percent Commute Public Transport.”
All in all, the exogenous features derived from the alternative data sources provided additional predictive power to both the Lasso and XGBoost models.
In the feature importance analysis above, unemployment-related features from two sources, Google Trends and the ACS, both ranked highly within the top 15 features. To see how these features relate to each other, the correlation over time between ACS unemployment rates and search popularity for “Unemployment” and “Job Opportunities” was calculated by zip code.
Both distributions (above) are approximately normal, centered at moderate to high correlation, with left skew from a small number of zip codes with negative correlations. This suggests that the information from the two sources is complementary and that Google Trends data might be a possible supplement or substitute for traditional data sources. Google Trends is able to provide the most recent information for known indicators, while traditional sources are not published immediately. In this study, the ACS and Google Trends features related to unemployment rates were strongly predictive of rental prices for several of the models.
While ACS unemployment rates are 5-year estimates and are only available ~one year after the data was collected, the Google Trends data on “Unemployment” and ”Job Opportunities” search popularities were available on a continuous basis, which is potentially hugely beneficial depending on the forecasting task at hand.
Model performance varied by population density. Shown above are average RMSEs for XGBoost for rural, suburban, and urban zip codes. Rural, suburban, and urban were defined following the US Census guidelines of 0-500, 500-1000, and 1000+ persons per square mile, respectively. The relative performance of XGBoost models trained with different feature sets was consistent across population density groups, but urban zip codes had ~25% higher error. This suggests that important factors in urban rental markets, such as income inequality, may not be fully accounted for by this analysis.
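The rural/suburban/urban grouping above reduces to a single binning step; the function name is illustrative, but the thresholds are the ones cited from the US Census guidelines.

```python
import pandas as pd

def density_group(persons_per_sq_mile: pd.Series) -> pd.Series:
    """Bucket zip codes using the cited density thresholds:
    0-500 rural, 500-1000 suburban, 1000+ urban (persons/sq. mile)."""
    return pd.cut(
        persons_per_sq_mile,
        bins=[0, 500, 1000, float("inf")],
        labels=["rural", "suburban", "urban"],
        include_lowest=True,
    )
```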
Data Science Conclusion
Ultimately, both the Lasso and XGBoost models were successfully deployed to forecast rental prices in 2019 with mean error below $100/month. Autoregressive inputs were crucial to high performance in the XGBoost model, and it performed best on the autoregressive + ACS feature set. The Lasso model had stable performance across feature sets, achieving the highest accuracy when trained on the autoregressive + all exogenous features set. ARIMA(X), while providing good accuracy with autoregressive data, was not suitable for a single generalized model of nationwide rent prices or for the predictor identification objectives of the project.
The client appreciated the unconventional approach and exploration of alternative data sources beyond broad demographic datasets like the American Community Survey. As described above, the Google Trends features, in particular, ranked highly in feature importance.
Future directions for this project include exploration of Google Trends data as a replacement or complement for other traditional data sources. For example, trends in the growth of bike sharing services might be tightly correlated with searches for “bike station near me." Appropriate Google search term popularities have the potential to provide information that is similar to features from a wide range of data sources, with the added advantage of highly recent data availability.
Thank you for reading about our work! Please check out our author pages if you would like to take a look at our other projects.