Using Data to Forecast Housing Rentals
Legendary psychologist Abraham Maslow illuminated the significance of shelter in his Hierarchy of Needs, a mental model shaped as a pyramid that ranks human needs and wants, with the most basic at the base. Shelter, along with food, is a primal physiological need. Therefore, it should be no surprise that real estate is the largest asset class in the world.
Real Estate Background Data
The global real estate market was worth an estimated $280 trillion in 2020 [1]. The United States residential real estate market represents a significant fraction of this: it was worth an estimated $33.6 trillion in 2020, nearly as much as the combined GDP of the two largest economies, the U.S. and China. This value is concentrated in approximately 140.8 million housing units, of which approximately 80.68 million are owner-occupied and 43 million are occupied by renters (the rest being vacant) [2].
The vast size of the real estate market represents an opportunity for consumers and investors alike, and the current post-pandemic economic environment makes it even more alluring.
Future for Real Estate
More specifically, the burgeoning inflationary environment has ushered in a seller’s market, with Main Street consumers and Wall Street investors vying for the same properties. Homes are selling at a premium because demand exceeds supply. Hedge funds and private equity firms alike are making inroads into local neighborhoods. Blackstone recently purchased Home Partners of America for $6 billion [3]. BlackRock has raised north of $1.5 billion for its real estate debt fund [4].
However, despite being the largest asset class, real estate has fallen short when it comes to harnessing the power of data for due diligence and forecasting. In equities, you have the Dividend Discount Model, Discounted Cash Flow, comparables, etc. In real estate, comparables are used primarily to appraise homes at a given point in time; projection, however, is virtually non-existent.
While equities as an asset class are more liquid and homogeneous, real estate does not change hands as often and involves a great deal of variation. Furthermore, in equities there are bona fide indices such as the DJIA, S&P 500, Russell 3000, etc. that serve as a pulse check of the broader market. In comparison, real estate indices like Case-Shiller and MSCI are limited in their geographic coverage and do not capture variations from building to building or home to home.
The availability of real-estate data in recent years promises immense opportunities to bridge this analytic gap both in the homeowner and in the rental space. For this project we decided to focus only on the latter and exploit the publicly available data to pin down the idiosyncratic features which drive rent prices in the United States. We take a deep dive into the data and apply a range of tools and techniques to project and forecast rent prices in the United States.
Data Preprocessing and Exploratory Data Analysis
We employed four datasets for model building and analysis: the Zillow Observed Rent Index (ZORI), the American Community Survey (ACS) demographics data, the Federal Housing Finance Agency’s House Price Index (HPI), and high-school graduation scores from the U.S. Department of Education (Edu).
To begin with, we investigated the ZORI dataset. This dataset consists of 2,726 observations spread across 93 columns. The original features in this dataset include RegionID, RegionName, SizeRank, MsaName, and 89 columns containing monthly ZORI values from January 2014 through May 2021. 'RegionName' gives us the zip code of a particular region, 'SizeRank' ranks zip codes by urbanization, and 'MsaName' specifies the metropolitan statistical area for the region. ZORI is a smoothed measure of the typical observed market rate rent across a given region [5].
As part of initial data pre-processing, we consolidated the year-month columns into a long format, creating “Date” and “ZORI” columns containing the corresponding values. We then extracted the month and year from the “Date” column as separate features, resulting in 242,614 observations spread over 8 columns. After filtering out rows with missing ZORI values (~3.4% of the dataset), we obtained a clean dataset of 234,408 records. For EDA, we computed the median ZORI for each year for each zip code and investigated its trends in highly urbanized zip codes (SizeRank<50).
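The reshaping steps above can be sketched with pandas. This is a minimal illustration on a toy frame, not the project's actual code: the identifier column names follow the description above, while the year-month column labels and all values are assumptions made for the example.

```python
import pandas as pd

# Toy wide-format frame. The four identifier columns mirror the ZORI file
# described above; the month labels and rent values are made up.
wide = pd.DataFrame({
    "RegionID": [1, 2],
    "RegionName": ["10025", "77494"],
    "SizeRank": [1, 2],
    "MsaName": ["New York, NY", "Houston, TX"],
    "2014-01": [2500.0, None],
    "2014-02": [2510.0, 1400.0],
})
id_cols = ["RegionID", "RegionName", "SizeRank", "MsaName"]

# Unpivot the year-month columns into one (Date, ZORI) pair per row.
long = wide.melt(id_vars=id_cols, var_name="Date", value_name="ZORI")

# Split the date into separate Year and Month features.
long["Date"] = pd.to_datetime(long["Date"], format="%Y-%m")
long["Year"] = long["Date"].dt.year
long["Month"] = long["Date"].dt.month

# Drop rows with no observed rent index.
clean = long.dropna(subset=["ZORI"]).reset_index(drop=True)

# Median ZORI per zip code per year, as used for the EDA.
yearly_median = clean.groupby(["RegionName", "Year"], as_index=False)["ZORI"].median()
```

The long frame has the same 8 columns mentioned above: the four identifiers plus Date, ZORI, Year and Month.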
Figure 1A Data
Figure 1A plots the median ZORI by year for high-ranking New York City zip codes. It reveals that the rental index rose steadily through the years before declining in 2020. The decline may be due to COVID-19, which prompted many people to migrate from cities to suburban areas. On the other hand, the same data for high-ranking Houston zip codes (Figure 1B) reveals a consistent increase through the years and, in fact, greater increases in 2020 and 2021.
Relationship Between Seasons and Zipcodes
Further, looking at seasonality in NYC, we noted that ZORI increased from one month to the next in most years, while in 2020 a downward trend was observed from January through December. In Houston, Texas, as seen before, the rental index trended upward through the months of 2020.
The next dataset is the ACS demographics data, which is available on BigQuery as a separate file for each year from 2014 to 2018. Each file has over 33,000 observations across 253 variables. We first combined all the years, resulting in a larger dataset of 165,600 observations. We dropped 11 features with more than 60% missing observations, leaving us with 242 features. The resulting dataset was then merged with the ZORI dataset by zip code and year.
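A rough sketch of the combine, drop and merge steps is shown below. The frames, column names (`zip`, `median_income`, `sparse_col`) and values are hypothetical stand-ins for the ACS and ZORI tables. The inner join is an assumption: it discards zip-year pairs without demographic data in one step, whereas the write-up describes removing those rows after the merge.

```python
import numpy as np
import pandas as pd

# Hypothetical per-year ACS extracts keyed by zip code (toy values).
acs_by_year = {
    2014: pd.DataFrame({"zip": ["10025", "77494"],
                        "median_income": [80000, 95000],
                        "sparse_col": [np.nan, np.nan]}),
    2015: pd.DataFrame({"zip": ["10025", "77494"],
                        "median_income": [82000, 97000],
                        "sparse_col": [np.nan, 1.0]}),
}

# Stack the yearly files, tagging each row with its survey year.
acs = pd.concat(
    [df.assign(Year=year) for year, df in acs_by_year.items()],
    ignore_index=True,
)

# Drop features with more than 60% missing observations.
missing_share = acs.isna().mean()
acs = acs.loc[:, missing_share <= 0.60]

# Yearly ZORI per zip code (toy values).
zori = pd.DataFrame({"zip": ["10025", "10025", "77494"],
                     "Year": [2014, 2015, 2015],
                     "ZORI": [2500.0, 2600.0, 1450.0]})

# Inner join keeps only zip-year pairs present in both sources.
merged = zori.merge(acs, on=["zip", "Year"], how="inner")
```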
After the rows from 2019-2021 with no demographic information were removed, the resulting dataset consisted of 155,777 observations across 262 variables. The ACS dataset has several redundant variables that had to be removed, selected or engineered to reduce dimensionality.
To address the redundancy, we split the features into 10 categories based on their characteristics: occupancy, dwellings, income, education, employment, population, commute, rent, socio-economic and miscellaneous. Features inside each category were then analyzed for their correlation with ZORI using scatter plots (Figure 3) and a correlation matrix.
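The per-category correlation check could look roughly like this. The feature names, category assignments and values are hypothetical, and the sketch uses Pearson correlation, which is the pandas default; the write-up does not state which correlation measure was used.

```python
import pandas as pd

# Toy data; feature names and category assignments are hypothetical.
df = pd.DataFrame({
    "ZORI": [1200.0, 1500.0, 1800.0, 2400.0, 3000.0],
    "median_income": [40000, 52000, 61000, 85000, 110000],
    "income_per_capita": [21000, 26000, 30000, 44000, 58000],
    "commute_over_60_min": [0.12, 0.08, 0.10, 0.07, 0.09],
})
categories = {
    "income": ["median_income", "income_per_capita"],
    "commute": ["commute_over_60_min"],
}

# Pearson correlation of every feature with the target.
corr_with_zori = df.corr()["ZORI"].drop("ZORI")

# Within each category, rank features by absolute correlation with ZORI.
ranked = {
    name: corr_with_zori[cols].abs().sort_values(ascending=False)
    for name, cols in categories.items()
}
```

Ranking within a category makes it easier to keep one strong representative and drop near-duplicates.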
Detailed Analysis and Data
Further, we engineered new features that capture the information contained in the original features and removed the original ones. For example, there were at least 19 income-related features with narrow income bins, such as income less than 10,000, income between 10,000 and 14,999, income between 15,000 and 19,999, etc. We consolidated these into larger bins, for example, income from 0 to 29,999, income from 30,000 to 74,999, and so on.
In addition, to facilitate feature selection, we performed a literature review. An article published by Harvard University, whose authors examine the major driving forces of the rental economy and how they have changed over the years [6], contributed to our feature selection process. Based on the literature review, the visualizations, and the correlation matrix, we chose a set of 30 features from the ACS dataset for modeling (Table 1).
Furthermore, we filtered the dataset to remove outliers, i.e., very high-priced rentals above $6,000. Additionally, we created a second dataset that combines the chosen ACS and ZORI features with the house price index and its % annual change from HPI and the high-school graduation scores from Edu. Some high-school graduation scores were missing, and we imputed them using the previous year's median value for the corresponding region.
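The income-bin consolidation described above amounts to summing groups of narrow-bin columns. Here is a minimal sketch, assuming each column holds a household count for its bracket. The column names and values are hypothetical, and only the first two wide bins are shown.

```python
import pandas as pd

# Hypothetical narrow-bin columns (household counts per income bracket).
acs = pd.DataFrame({
    "income_less_10000": [10, 5],
    "income_10000_14999": [20, 5],
    "income_15000_19999": [30, 10],
    "income_20000_24999": [15, 10],
    "income_25000_29999": [25, 20],
    "income_30000_34999": [40, 1],
    "income_35000_39999": [30, 2],
    "income_40000_44999": [20, 3],
    "income_45000_49999": [10, 4],
    "income_50000_59999": [50, 5],
    "income_60000_74999": [50, 6],
})

# Map each wide bin to the narrow bins it absorbs.
wide_bins = {
    "income_0_29999": [
        "income_less_10000", "income_10000_14999", "income_15000_19999",
        "income_20000_24999", "income_25000_29999",
    ],
    "income_30000_74999": [
        "income_30000_34999", "income_35000_39999", "income_40000_44999",
        "income_45000_49999", "income_50000_59999", "income_60000_74999",
    ],
}

# Sum the narrow bins into each wide bin, then drop the originals.
for new_col, old_cols in wide_bins.items():
    acs[new_col] = acs[old_cols].sum(axis=1)
narrow_cols = [col for cols in wide_bins.values() for col in cols]
acs = acs.drop(columns=narrow_cols)
```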
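The outlier filter and the imputation step might be sketched as follows. The column names (`region`, `grad_score`) and values are hypothetical. The sketch reads "median values from previous year corresponding to each region" as the median score of the same region in the prior year, which is one possible interpretation rather than a confirmed detail.

```python
import numpy as np
import pandas as pd

# Toy zip-year records (hypothetical names and values).
df = pd.DataFrame({
    "zip": ["a", "b", "a", "b", "c"],
    "region": ["NY", "NY", "NY", "NY", "NY"],
    "Year": [2014, 2014, 2015, 2015, 2015],
    "ZORI": [2500.0, 7000.0, 2600.0, 3000.0, 2800.0],
    "grad_score": [80.0, 90.0, np.nan, 88.0, 86.0],
})

# Keep rentals at or below the $6,000 cutoff.
df = df[df["ZORI"] <= 6000].copy()

# Median score per region and year, re-keyed to the following year so that
# each row can look up its region's previous-year median.
prev_median = (
    df.groupby(["region", "Year"])["grad_score"].median()
      .rename("prev_year_median")
      .reset_index()
)
prev_median["Year"] += 1

# Fill missing scores from the previous-year regional median.
df = df.merge(prev_median, on=["region", "Year"], how="left")
df["grad_score"] = df["grad_score"].fillna(df["prev_year_median"])
df = df.drop(columns="prev_year_median")
```

Rows in the first available year have no previous-year median, so any gaps there would remain missing under this scheme.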
Model Fitting and Forecasting ZORI
To forecast rental indices for 1-3 years into the future, we fitted multiple linear regression, random forest and time series models on ZORI. We first split up the combined ACS and ZORI dataset into training and test sets with data for years 2014-2017 comprising a training set (~73%) and data for 2018 forming a test set (~27%).
Multiple Linear Regression (MLR) Data
For multiple linear regression, we fitted a standard scaler on the training set features and applied the same fitted transformation to the test set. We performed the same steps on the second dataset, which contains features from HPI and Edu in addition to ACS and ZORI. We then fitted linear models on both datasets separately and computed R² and RMSE scores (Table 2).
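A minimal scikit-learn sketch of the year-based split, scaling and linear fit is given below. The data is synthetic and the two feature names are hypothetical; only the workflow (train on 2014-2017, test on 2018, fit the scaler on the training set alone) follows the description above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

# Synthetic zip-year panel; feature names and coefficients are made up.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "Year": rng.integers(2014, 2019, size=n),  # 2014..2018 inclusive
    "median_income": rng.normal(70000, 15000, size=n),
    "pct_renters": rng.uniform(0.2, 0.8, size=n),
})
df["ZORI"] = (500 + 0.02 * df["median_income"] + 800 * df["pct_renters"]
              + rng.normal(0, 50, size=n))

features = ["median_income", "pct_renters"]

# Temporal split: earlier years for training, the last year for testing.
train = df[df["Year"] <= 2017]
test = df[df["Year"] == 2018]

# Fit the scaler on the training set only, then reuse it on the test set.
scaler = StandardScaler().fit(train[features])
X_train = scaler.transform(train[features])
X_test = scaler.transform(test[features])

model = LinearRegression().fit(X_train, train["ZORI"])
pred = model.predict(X_test)

r2 = r2_score(test["ZORI"], pred)
rmse = np.sqrt(mean_squared_error(test["ZORI"], pred))
```

Fitting the scaler on the training years alone keeps information from the test year out of the model.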
Data on Random Forest
For a random forest, we fitted 4 models. Prior to creating the models, we established the optimal hyperparameters through cross-validation and grid search on the training dataset.
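The hyperparameter search could be set up along these lines with scikit-learn. The parameter grid, the number of folds and the scoring metric are assumptions chosen for illustration, since the write-up does not list them, and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic training data standing in for the scaled feature matrix.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))
y_train = 3 * X_train[:, 0] + X_train[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

# Hypothetical search grid; the real grid is not given in the write-up.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
    "min_samples_leaf": [1, 5],
}

# Exhaustive grid search with 3-fold cross-validation on the training set.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)

# The best estimator is refit on the full training set by default.
best_model = search.best_estimator_
```

`search.best_params_` then holds the selected combination, and `best_model` can be scored on the held-out 2018 data.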
Models 1 and 2 (Table 3) were fitted using features chosen just from ACS and ZORI datasets. The only difference between these 2 models is that model 1 was fitted using all observations including outliers/high-priced rentals, while in model 2, we had removed rentals priced higher than $6000.
Models 3 and 4 were fitted using features from the HPI and Edu datasets in addition to ACS and ZORI. Again, the difference between models 3 and 4 lies in the presence or absence of high-priced rentals. The R² scores for the training and test sets at the best estimated hyperparameters, along with the RMSE for the 4 models, are shown in Table 3. From these results, we conclude that removing the outliers yields forecasts with lower root mean squared error.
Figure 4 below compares the 2018 forecasts from the Random Forest and MLR models against the actual ZORI values for zip code 10025 in New York. As expected, Random Forest's forecasts were more accurate than MLR's.
Data on Time Series
In contrast to linear regression models, which assume that observations are independent and identically distributed (i.i.d.), time series models work with data in which each observation is, by definition, correlated with previous observations. Time series models are used mostly in econometrics and finance, where a random variable such as GDP or an asset price is observed over time.
The most common approach to analyzing time series is univariate analysis, which exploits the relationship among past observations of a single variable to make forecasts. An example of a univariate model is ARIMA. If explanatory variables are available, multivariate analysis can be used as well; as the name suggests, it uses the interplay between multiple exogenous variables and the endogenous variable.
For the kind of data available for this model, time series analysis can be further extended to analyze multiple temporal observations for different subjects by using a cross-sectional framework.
However, for our analysis we decided to build individual time series models to forecast ZORI for each zip code separately. We selected a special type of multivariate analysis called the Vector Auto-Regression (VAR) model. The VAR model combines the univariate and multivariate approaches by modeling each variable as a linear function of its own past values and the past values of the other variables, so that all variables influence one another in a feedback loop.
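In its general textbook form, a VAR of order p can be written as below. The notation is introduced here for illustration and does not come from the original analysis.

```latex
\[
  \mathbf{y}_t = \mathbf{c} + A_1 \mathbf{y}_{t-1} + A_2 \mathbf{y}_{t-2} + \dots + A_p \mathbf{y}_{t-p} + \boldsymbol{\varepsilon}_t
\]
```

Here \(\mathbf{y}_t\) is the vector holding ZORI and the other selected variables for one zip code at time t, \(\mathbf{c}\) is a vector of intercepts, each \(A_i\) is a square coefficient matrix, and \(\boldsymbol{\varepsilon}_t\) is an error vector. The off-diagonal entries of the \(A_i\) are what let the variables feed back into one another.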
Specifically, from the data used in multiple linear regression, we selected 20 features to build the VAR model. However, that proved too ambitious, and we had to narrow the set down to the 5 most highly correlated features (obtained from the correlation matrix).
Subsequently, we fitted a VAR for each of the top-100 zip codes in the U.S., ranked by SizeRank (urbanization), and made forecasts from it. For time series EDA, histograms and scatter plots fall short because they do not capture temporal components such as trends and seasonality.
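To make the mechanics concrete, here is a hand-rolled first-order VAR fitted by ordinary least squares with NumPy. It is a simplified sketch on synthetic data with two variables and one lag, not the model used in the project. The helper names `fit_var1` and `forecast_var1` are made up for this example, and libraries such as statsmodels provide fuller VAR implementations with lag selection and diagnostics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 89 months of a two-variable system (synthetic, illustration only).
A_true = np.array([[0.6, 0.2], [0.1, 0.5]])
c_true = np.array([1.0, 0.5])
y = np.zeros((89, 2))
for t in range(1, 89):
    y[t] = c_true + A_true @ y[t - 1] + rng.normal(scale=0.1, size=2)

def fit_var1(series):
    """Estimate c and A in y_t = c + A y_{t-1} + e_t by ordinary least squares."""
    X = np.hstack([np.ones((len(series) - 1, 1)), series[:-1]])  # [1, y_{t-1}]
    Y = series[1:]
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return coef[0], coef[1:].T  # intercept vector, coefficient matrix

def forecast_var1(c, A, last_obs, steps):
    """Iterate the fitted recursion forward from the last observation."""
    out, current = [], last_obs
    for _ in range(steps):
        current = c + A @ current
        out.append(current)
    return np.array(out)

c_hat, A_hat = fit_var1(y)

# Forecast 36 months ahead, matching the 3-year horizon discussed here.
forecast = forecast_var1(c_hat, A_hat, y[-1], steps=36)
```

The number of coefficients grows with the square of the number of variables, which helps explain why a 20-feature VAR on 89 monthly observations per zip code is hard to estimate.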
Data on Time Periods
In addition, time series EDA should evaluate stationarity, which means that the distribution of the series does not change over time. An autocorrelation or partial autocorrelation plot is more helpful here: it depicts how the value at time t is correlated with the value at time t-n, where the lag n ranges from one period to many.
It goes without saying that the rental market is prone to cyclical and seasonal fluctuations. As illustrated by the diagram on the left (Stationarity, Cyclicality, and Seasonality), the troughs and crests recur at varying intervals, which is the mark of cyclicality; seasonality, by contrast, recurs at fixed intervals. This is a subtle yet profound distinction. The diagram on the right isolates the seasonal component, whose peaks are almost evenly spaced.
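The quantity behind an autocorrelation plot can be computed directly. The sketch below implements the sample autocorrelation only (not the partial autocorrelation) on a synthetic monthly series; the function name `autocorrelation` is made up for this example.

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Sample autocorrelation of x at lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    return np.array([
        np.sum(centered[lag:] * centered[:len(x) - lag]) / denom
        for lag in range(max_lag + 1)
    ])

# Synthetic monthly series with a 12-month seasonal pattern (8 full years).
months = np.arange(96)
series = 10 * np.sin(2 * np.pi * months / 12)

acf = autocorrelation(series, max_lag=24)
```

For this series the autocorrelation is strongly positive at lag 12 and strongly negative at lag 6, which is the kind of signature that reveals a yearly seasonal pattern.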
Once trends and seasonality are identified, they can be transformed through differencing. Differencing, as the name suggests, computes the differences between the current and prior observations. For our purposes, first order differencing sufficed.
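First-order differencing, and its inversion to get back to rent levels, can be sketched with pandas as follows. The rent values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic monthly rent index with an upward trend.
zori = pd.Series(
    [2500.0, 2520.0, 2545.0, 2560.0, 2590.0],
    index=pd.date_range("2014-01-01", periods=5, freq="MS"),
)

# First-order differencing: current value minus the previous value.
diffed = zori.diff().dropna()

# Invert the transform: add the cumulative differences to the first level.
restored = zori.iloc[0] + diffed.cumsum()
```

A model fitted on the differenced series forecasts month-over-month changes, so its output must be cumulated in the same way to obtain forecast rent levels.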
We use the top-ranked zip code by SizeRank, 10025, as an example to illustrate the effectiveness of the time series model relative to the linear regression models.
Time vs ZORI (Rent Index) Data
Time-Series Model Efficiency for 10025
We presented the results in both graphical and tabular format for our corporate partner, Markerr, using an R Shiny app. A map of the U.S. shows pin drops for the top-100 zip codes. Clicking a pin shows the zip code and the year-over-year rent forecast for up to 3 years. The tables go into more detail, again listing the top-100 zip codes with the corresponding monthly rents.
The results showed that rents increased gradually across the board, except for certain zip codes in Texas. In particular, for zip codes in Houston and El Paso, TX, the model forecasted a year-over-year decline in rents, in contrast to what we observed in our EDA. Notably, the historical data did not account for COVID-19. In reality, the pandemic led people to leave major cities, such as New York, for cheaper suburban locales. That phenomenon represents a shift in standard patterns that merits further study.