Using Data to Forecast the Zillow Rent Index
The skills demonstrated here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Data Problems
Texas rental prices are currently surging, surpassing pre-pandemic values, so there is benefit to identifying regions of high and low rental growth. The Zillow Observed Rent Index (ZORI) is a measure of the rental rate listings across the United States. It is gathered by monitoring rental units over time and tracking price changes, then aggregating the price changes by region. After identifying several locations of interest with varying patterns of behavior over the last few years, our group decided to focus on Texas.
Texas is the second largest state in the United States, both in terms of geographic area and population. The latter has grown 15.3% in the span of 2010-2020 with the major metropolitan areas experiencing a 19-30% growth. It also has the second largest state economy.
Supplies and Demands
Because of economic growth and demographic trends, the housing market has struggled to keep up with the demand for rental units, leading to an increase in rental prices. The time period we covered with the available ZORI data included the Covid-19 pandemic that caused drastic changes in employment and eviction rates, as well as migratory patterns across the country. We were interested in exploring the impact those events had on the rental market.
An accurate forecast model could help potential landlords and property owners who are looking for profitable investment opportunities by identifying regions of maximal future growth. We started with a naive approach to predicting high growth regions by identifying areas with increasing rental prices in recent time frames. We later used this naive approach as a benchmark for evaluating our model performance.
To aid potential investors, we were interested in including exogenous factors that could predict future prices more accurately than could be done by simply looking at past rent prices.
By using the standard autoregressive features, economic indicators, and the newly added demographic factors, we developed a model based on several years of historical information that could be used to predict rental price changes over the next year.
Data Sets
When including external data sources, we wanted to include a variety of data types at different geographic levels and time frames. The data initially included:
- Smoothed, Seasonally Adjusted Zillow Observed Rent Index
- American Community Survey
- Texas Regular All Formulations Retail Gasoline Prices
- S&P Case-Shiller
- US Ten Year Treasury Note Yield
- Texas Commission on Environmental Quality Notices of Violation
- WorldWeatherOnline Historical Weather Data
- HHSC CCL Daycare and Residential Operations Data
- University of Texas, Texas Tribune Poll
- Texas Covid-19 Data, Texas Dept. of State Health Services
- Key Economic Indicators
- New sales tax permits
- City Tax Revenue
- Unemployment rates for each city
The best indicator for a future month's rent is the previous month's rent, so historical ZORI values for each zip code were included. ZORI values were available for five Texas metropolitan regions: Austin (38 zip codes), Dallas-Fort Worth (104 zip codes), San Antonio (31 zip codes), Houston (40 zip codes), and El Paso (2 zip codes).
Features gathered from the American Community Survey provided vital population level information used to train the model. We used the 5-year averages collected from 2011 to 2019 using zip code tabulation area data tables provided by census.gov. This dataset included a wide range of related information, including demographic, lifestyle, and housing information.
Economy Data
We included several different ways of representing the state of the economy on the municipal, state, and national level. Some of these included the S&P Case Shiller Home Price Index, consumer confidence, consumer price index, unemployment rate, retail gas prices, gross value of crude oil and natural gas production, and single family building permits. We also wanted to include an overall measure of how Texans feel about their well-being gathered from polling through the University of Texas / Texas Tribune Poll.
Poll Data
The poll question we referred to was: "Compared to a year ago, would you say that you and your family are economically a lot better off, somewhat better off, about the same, somewhat worse off, or a lot worse off?" The possible responses were: "a lot better off," "somewhat better off," "about the same," "somewhat worse off," "a lot worse off," and "don't know". The responses were provided as population percentages.
In an effort to include more regional information specific to the local Texas areas, we identified data that could represent retail and business activity. To accomplish this, we included data regarding business permits, which provided information about new business openings, what type of businesses they were, and where these openings were happening. We also included city government retail tax revenue and historical weather data.
Data Cleaning
To clean and interpolate the target variable, ZORI, the dataset was placed into a wide format to separate the zip codes and identify any null values within each location. A linear interpolation was done across time for each zip code. Any remaining null values were leading nulls, which were replaced by the first non-null value in each zip code's time series. There were no more than 6 leading null values in any given zip code, and the majority of zip codes had none to begin with.
Because the raw datasets were at different geographic levels, and the time increments also varied, each set had to be standardized. We chose to do this at the zip code level and at a monthly frequency. Consequently, data provided at the county, city, metropolitan, state, or national level had to be mapped to each zip code. On the timeline, data provided at a daily frequency was aggregated to match the monthly time increments.
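The cleaning steps above can be sketched in pandas. This is a minimal illustration rather than the project's actual code: the column names (`zip_code`, `date`, `zori`) and the sample values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format ZORI sample: one row per (zip code, month).
long_df = pd.DataFrame({
    "zip_code": ["75001"] * 5 + ["78701"] * 5,
    "date": list(pd.date_range("2014-01-01", periods=5, freq="MS")) * 2,
    "zori": [np.nan, np.nan, 1000.0, np.nan, 1020.0,
             1500.0, np.nan, 1520.0, 1530.0, 1540.0],
})

# Wide format: one column per zip code, one row per month.
wide = long_df.pivot(index="date", columns="zip_code", values="zori")

# Linear interpolation across time; limit_area="inside" fills only the
# gaps that sit between two observed values.
wide = wide.interpolate(method="linear", limit_area="inside")

# Whatever is still null is a leading null; back-filling replaces it with
# the first non-null value of that zip code's series.
wide = wide.bfill()

print(wide)
```

Restricting the interpolation to interior gaps keeps the two fill rules separate, so leading nulls are never extrapolated.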
For data that was reported on a quarterly or annual basis, the values were repeated for all months that corresponded to the given quarter or year.
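A short sketch of the two frequency adjustments described above, using invented series: a daily series is averaged into months, and a quarterly series is repeated for each month of its quarter. The choice of the mean as the aggregation function is an assumption; the source does not say which aggregate was used.

```python
import pandas as pd

# Hypothetical daily series (for example a yield or a gas price).
daily = pd.Series(
    range(1, 60),
    index=pd.date_range("2021-01-01", periods=59, freq="D"),
    dtype="float64",
)
# Aggregate to one value per month (mean is an assumed choice).
monthly_from_daily = daily.resample("MS").mean()

# Hypothetical quarterly series keyed by a "YYYYQn" label.
quarterly = pd.DataFrame({
    "quarter": ["2021Q1", "2021Q2"],
    "value": [10.0, 20.0],
})

# Monthly calendar; each month is labelled with the quarter it belongs to.
months = pd.DataFrame({"date": pd.date_range("2021-01-01", periods=6, freq="MS")})
months["quarter"] = (
    months["date"].dt.year.astype(str) + "Q" + months["date"].dt.quarter.astype(str)
)

# A left merge repeats the quarterly value for all three months of the quarter.
monthly_from_quarterly = months.merge(
    quarterly, on="quarter", how="left", validate="many_to_one"
)

print(monthly_from_daily)
print(monthly_from_quarterly)
```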
The cleaned and engineered features were merged into a single dataframe in long format, with every row representing one monthly time point for one zip code. To be able to forecast a year into the future, the model needs to predict any given date based on features from the past. With a one-year forecast window, the features were shifted forward so that the most recent value of each feature was aligned with a date one year ahead. That meant that predictions for July 2022 were based on features from July 2021, the last date available for our target variable.
This was a one-year shift for all features except for the American Community Survey (ACS) data, which required a three-year shift. With this shift, 2021 economic data and 2019 ACS data are used to forecast 2022 ZORI values.
We included several autoregressive features from the ZORI data set. First, the rent index from exactly one year prior to the prediction date was added to the feature set. We then created features from the differences in rent over several intervals: a one-month difference (calculated by subtracting the previous month's rent), a 6-month difference, and a 12-month difference.
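The lagging described above might look like the following in pandas. This is a sketch under assumptions: the feature names (`unemployment`, `acs_population`) and values are placeholders, and the rows are assumed to be one per zip code per month with no missing months.

```python
import numpy as np
import pandas as pd

ECON_LAG_MONTHS = 12  # one-year shift for most features
ACS_LAG_MONTHS = 36   # three-year shift for ACS features

dates = pd.date_range("2017-08-01", periods=48, freq="MS")  # ends July 2021
df = pd.DataFrame({
    "zip_code": np.repeat(["75001", "78701"], len(dates)),
    "date": list(dates) * 2,
    "unemployment": np.tile(np.arange(48, dtype=float), 2),
    "acs_population": np.tile(np.arange(48, dtype=float) * 10, 2),
})
df = df.sort_values(["zip_code", "date"]).reset_index(drop=True)

# Shift within each zip code so that a row only sees values that were
# already known one year (or three years, for ACS) before its date.
grouped = df.groupby("zip_code")
df["unemployment_lag"] = grouped["unemployment"].shift(ECON_LAG_MONTHS)
df["acs_population_lag"] = grouped["acs_population"].shift(ACS_LAG_MONTHS)

print(df[df["date"] == pd.Timestamp("2021-07-01")])
```

Grouping by zip code before shifting prevents the tail of one zip code's series from leaking into the head of the next.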
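One way to build these autoregressive features is sketched below. It assumes that the differences are computed on the rent series first and then lagged by the 12-month forecast horizon together with the rent level; the source does not spell out that ordering, and the data is synthetic.

```python
import numpy as np
import pandas as pd

HORIZON = 12  # months between the features and the prediction date

dates = pd.date_range("2018-01-01", periods=36, freq="MS")
zori = pd.DataFrame({
    "zip_code": "75001",
    "date": dates,
    "zori": 1000.0 + 5.0 * np.arange(36),  # synthetic, steadily rising rent
})

# Rent differences over 1, 6 and 12 months, computed per zip code.
rent = zori.groupby("zip_code")["zori"]
zori["diff_1m"] = rent.diff(1)
zori["diff_6m"] = rent.diff(6)
zori["diff_12m"] = rent.diff(12)

# Lag the level and the differences by the forecast horizon so each row
# holds only information available one year before its date.
source_cols = ["zori", "diff_1m", "diff_6m", "diff_12m"]
lagged = zori.groupby("zip_code")[source_cols].shift(HORIZON).add_suffix("_lag12")
features = pd.concat([zori[["zip_code", "date"]], lagged], axis=1)

print(features.tail(3))
```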
A few other features were engineered from the raw datasets. To simplify the polling features from the several distinct categories in the polling data, one representative feature was produced by subtracting the net negative responses from the net positive responses and calling it net approval.
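As a small worked example of the net approval feature, the sketch below subtracts the combined "worse off" shares from the combined "better off" shares. The percentages and column names are invented for illustration and are not actual poll results.

```python
import pandas as pd

# Made-up response shares (percent of respondents) for two poll waves.
poll = pd.DataFrame({
    "date": pd.to_datetime(["2020-02-01", "2020-06-01"]),
    "a_lot_better": [10.0, 5.0],
    "somewhat_better": [25.0, 15.0],
    "about_same": [40.0, 35.0],
    "somewhat_worse": [15.0, 25.0],
    "a_lot_worse": [8.0, 17.0],
    "dont_know": [2.0, 3.0],
})

net_positive = poll["a_lot_better"] + poll["somewhat_better"]
net_negative = poll["somewhat_worse"] + poll["a_lot_worse"]

# Net approval: positive share minus negative share, in percentage points.
poll["net_approval"] = net_positive - net_negative

print(poll[["date", "net_approval"]])
```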
From the tax permits dataset, two features were created to represent the percentage of businesses opened by small business owners and large (LLC) businesses by identifying the type of business owner (small or large) for each entry in each zip code and dividing by total permits issued.
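The permit shares can be computed with a normalized cross-tabulation. The sketch assumes each permit has already been labelled with an `owner_type` of `small` or `large`; how that label is derived from the raw permit records is not shown here, and the rows are invented.

```python
import pandas as pd

# Hypothetical permit records, one row per new sales tax permit.
permits = pd.DataFrame({
    "zip_code": ["75001", "75001", "75001", "75001", "78701", "78701"],
    "owner_type": ["small", "small", "large", "small", "large", "large"],
})

# Share of permits by owner type within each zip code
# (each row of the result sums to 1).
shares = pd.crosstab(permits["zip_code"], permits["owner_type"], normalize="index")

print(shares)
```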
Data Science Modeling
To set up the modeling, we decided on a train window of six years from July 2014 to July 2020 because those dates represented the most available and complete data after time lagging. The test window was one year from Aug 2020 up to August 2021. The test window had to be at least the same length of time as our forecast window (a year).
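A date-based split like the one described can be sketched as follows. It assumes the training window includes July 2020 and that the test window runs from August 2020 through July 2021 (that is, up to but not including August 2021), which gives a 12-month test set.

```python
import pandas as pd

TRAIN_START = pd.Timestamp("2014-07-01")
TRAIN_END = pd.Timestamp("2020-07-01")   # inclusive (assumption)
TEST_START = pd.Timestamp("2020-08-01")
TEST_END = pd.Timestamp("2021-08-01")    # exclusive (assumption)

# Stand-in for the merged feature table: one row per month.
df = pd.DataFrame({"date": pd.date_range("2014-07-01", "2021-07-01", freq="MS")})

train = df[(df["date"] >= TRAIN_START) & (df["date"] <= TRAIN_END)]
test = df[(df["date"] >= TEST_START) & (df["date"] < TEST_END)]

print(len(train), "training months;", len(test), "test months")
```

Splitting on dates rather than at random keeps every test month strictly after the training data, which matches how the model is used for forecasting.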
Our initial baseline model was a multiple linear model with all numerical features and no engineered features or autoregressive features. In total, our model was trained on 266 raw features. It returned an r2 value of 79% and an RMSE of 106.19 on the test set.
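For readers who want to reproduce this kind of baseline, here is a minimal scikit-learn sketch on synthetic data. The three features and their coefficients are invented; only the workflow (fit a linear model, then report R² and RMSE on held-out data) mirrors the text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)

# Synthetic stand-ins for the train and test feature matrices.
X_train = rng.normal(size=(200, 3))
X_test = rng.normal(size=(50, 3))
true_coef = np.array([40.0, -15.0, 5.0])
y_train = 1200.0 + X_train @ true_coef + rng.normal(scale=10.0, size=200)
y_test = 1200.0 + X_test @ true_coef + rng.normal(scale=10.0, size=50)

baseline = LinearRegression().fit(X_train, y_train)
pred = baseline.predict(X_test)

# RMSE is in the same units as the target (here, rent index points).
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
r2 = float(r2_score(y_test, pred))

print(f"R2 = {r2:.3f}, RMSE = {rmse:.2f}")
```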
Comparison among the models
Moving forward, we compared several different models to the baseline. However, given the large number of features, we found it necessary to perform some feature reduction on the data set and check for multicollinearity issues before developing additional models. The data was split into two groups: ACS and non-ACS. We grouped similar features and iteratively removed them one by one based on Variance Inflation Factors (VIF) and the selections from early Lasso and CatBoost models. Through this we were able to reduce the ACS set from 189 features to 43 and the non-ACS set from 77 features to 13.
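The VIF part of that process can be sketched as a loop that repeatedly drops the feature with the highest VIF. This is a simplified stand-in: it ignores the feature grouping and the Lasso and CatBoost selections mentioned above, the threshold of 10 is a common rule of thumb rather than a value from the project, and the VIFs are read off the diagonal of the inverse correlation matrix.

```python
import numpy as np
import pandas as pd


def vif(frame: pd.DataFrame) -> pd.Series:
    """VIF of each column: the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(frame.to_numpy(dtype=float), rowvar=False)
    return pd.Series(np.diag(np.linalg.inv(corr)), index=frame.columns)


def drop_high_vif(frame: pd.DataFrame, threshold: float = 10.0) -> pd.DataFrame:
    """Drop the highest-VIF column one at a time until all are <= threshold."""
    kept = frame.copy()
    while kept.shape[1] > 1:
        scores = vif(kept)
        if scores.max() <= threshold:
            break
        kept = kept.drop(columns=scores.idxmax())
    return kept


rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
data = pd.DataFrame({
    "a": a,
    "b": b,
    "c": a + b + rng.normal(scale=0.01, size=500),  # nearly collinear with a and b
})

reduced = drop_high_vif(data)
print(vif(data))
print(list(reduced.columns))
```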
The two groups were then merged into a single dataframe before further feature engineering.
Feature Reduction
For each selected feature we calculated the annual percent change for each zip code. This doubled the number of features, and we implemented a second round of reduction using the same iterative process of VIF and Lasso to help select the best options. This allowed us to see if the annual changes were more predictive of future rental prices than their raw-value counterparts. This final, reduced feature set was then used for developing all additional models.
All models were able to produce an R² value of at least 94% on the test set. However, even with a high R² value, there can still be large prediction errors. Moreover, given the context of our problem (predicting/forecasting the rental market), we had less room for such errors. For this reason we chose RMSE as our primary evaluation metric. We tried several different models, including Random Forest (RMSE 58.69), CatBoost (RMSE 66.18), and XGBoost (RMSE 61.22).
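The annual percent change can be computed per zip code by dividing each value by the value 12 rows earlier. A small sketch with a made-up feature (`median_income`), assuming one row per zip code per month with no gaps:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2019-01-01", periods=24, freq="MS")
df = pd.DataFrame({
    "zip_code": "75001",
    "date": dates,
    # Invented values: flat at 100 for a year, then flat at 110.
    "median_income": np.repeat([100.0, 110.0], 12),
})
df = df.sort_values(["zip_code", "date"]).reset_index(drop=True)

# Value from the same month one year earlier, within each zip code.
previous_year = df.groupby("zip_code")["median_income"].shift(12)

# Annual percent change; the first 12 months of each zip code are NaN.
df["median_income_pct_change"] = (df["median_income"] / previous_year - 1.0) * 100.0

print(df.tail(3))
```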
Each model was tuned using cross validation and grid-search to attain optimal performance. With the reduced dataset, the non-linear models performed similarly but were unable to outperform Lasso's RMSE score of 53.56.
The choice for the test period was challenging because of the anomalous reactions to the Covid-19 pandemic in the past year and the need for a year-long forecast window. The test period for the models began in August of 2020, which coincided with the time that many pandemic restrictions were lifted, and Texas saw a dramatic change in its ZORI.
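A grid search over the Lasso penalty could be set up as below. The alpha grid, the scaling step, and the use of `TimeSeriesSplit` are assumptions made for this sketch; the source only states that cross validation and grid search were used. The data is synthetic and assumed to be ordered in time.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic, time-ordered data: only the first two features matter.
X = rng.normal(size=(240, 8))
y = 1200.0 + 50.0 * X[:, 0] - 20.0 * X[:, 1] + rng.normal(scale=5.0, size=240)

pipeline = Pipeline([
    ("scale", StandardScaler()),        # Lasso penalties assume comparable scales
    ("lasso", Lasso(max_iter=10_000)),
])
param_grid = {"lasso__alpha": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=TimeSeriesSplit(n_splits=4),     # each fold validates on later rows only
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

best_alpha = search.best_params_["lasso__alpha"]
cv_rmse = -search.best_score_           # the scorer returns negative RMSE

print(best_alpha, round(cv_rmse, 2))
```

With panel data (many zip codes per month), the rows would need to be sorted by date for a time-ordered split like this to be meaningful.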
The problem we had here was that the year lag meant that our model would be making predictions for 2021 based largely on the worst months of the pandemic in 2020.
The Lasso model had the best performance because it was the most responsive. The model did struggle with predicting the sharp increase in rental prices after the initial shock of the pandemic. The largest errors occurred in the later months of the test window when the median prediction was underpredicting actual rental prices by almost 5%.
To further evaluate the model regionally, we omitted one of the five metropolitan areas, Houston, and trained the model on the remaining four regions. Once the model was trained and developed, it was tested on Houston, resulting in a low RMSE of 38.46. With this result, we were confident the model was robust at a regional level and we could move on to forecasting.
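The hold-out-one-region check can be sketched as follows, with synthetic data standing in for the real feature table. The feature names, the alpha value, and the data-generating process are all invented; only the idea of training on four metros and scoring on Houston comes from the text.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
metros = ["Austin", "Dallas-Fort Worth", "San Antonio", "Houston", "El Paso"]
rows_per_metro = 100

df = pd.DataFrame({
    "metro": np.repeat(metros, rows_per_metro),
    "x1": rng.normal(size=rows_per_metro * len(metros)),
    "x2": rng.normal(size=rows_per_metro * len(metros)),
})
df["zori"] = 1200.0 + 60.0 * df["x1"] - 25.0 * df["x2"] + rng.normal(scale=10.0, size=len(df))

features = ["x1", "x2"]
held_out = df["metro"] == "Houston"

# Train on the four remaining metros, then score on the unseen one.
model = Lasso(alpha=0.1).fit(df.loc[~held_out, features], df.loc[~held_out, "zori"])
pred = model.predict(df.loc[held_out, features])

errors = df.loc[held_out, "zori"].to_numpy() - pred
rmse = float(np.sqrt(np.mean(errors ** 2)))

print(f"Held-out Houston RMSE: {rmse:.2f}")
```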
Using the optimized Lasso model, we forecasted rent values for each month and zip code from July 2021 to July 2022 using the selected lagged features leading up to July 2021 as inputs. Once we had rental estimates, we produced year-over-year rental percent changes for each zip code to evaluate the estimated increase or decrease in each given area. The model used a direct approach, predicting a full year in advance using the previous years' features, rather than an iterative approach where predictions would have been created one month at a time using previous predictions as inputs.
Data Evaluation
To evaluate the performance of the model forecasts, we compared model results to a naive approach for identifying investment opportunities. The naive approach selected zip codes with a 4% or greater year-over-year growth in the most recent year of the training dataset.
The zip codes selected by the tuned model as having the highest potential growth (predicted growth also exceeding 4%) outperformed those selected by the naive approach when validated on the test dataset: the median growth of the model's selections was higher than that of the naive selections. Splitting the zip codes into four sub-groups (selected by both, by the model only, by the naive approach only, or by neither), the zip codes selected by both the naive approach and our model had the highest growth. Our model is well equipped to identify up-and-coming areas: places that have not historically been growing but will experience new growth.
Our model and the naive approach both point to Dallas as a place for future growth, the greatest increase of which shows up in the suburbs to the south of the city center. The model alone predicts similar growth in a few zip codes in the far north and far west of the Dallas suburbs. The model also predicted that a clustered group of zip codes on the east side of San Antonio will see a rise in rent prices. Generally, across the four largest metros, the model predominantly predicted growth in the suburbs, moving away from city centers. This prediction makes sense due to the increase in demand for suburban homes throughout the Covid-19 pandemic.
Most of the errors the model made on the test period were due to underprediction. The errors were examined for each month in the test year, showing an increasing magnitude of error with time as the model was unable to keep up with the rapidly increasing rental rates. The model generally gave conservative predictions, with the largest errors coming from under-predicting the rent increase for some of the recommended zip codes.
This can be seen in the large errors in the far north and far west of the Dallas suburbs, which were zip codes that our model alone had recommended. One of the largest mistakes our model made was under-predicting the growth of Austin after Covid-19 restrictions were lifted. There was a 7% under-prediction for a zip code in the west of Austin.
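The comparison with the naive approach can be expressed as a simple grouping exercise. In the sketch below all growth figures are invented for illustration and are not the project's results; the four group labels are an assumed way of naming the sub-groups.

```python
import pandas as pd

THRESHOLD = 4.0  # percent year-over-year growth

# Invented growth figures (percent) for six placeholder zip codes.
growth = pd.DataFrame({
    "zip_code": ["A", "B", "C", "D", "E", "F"],
    "recent_growth": [5.0, 1.0, 6.0, 2.0, 4.5, 0.5],     # last training year
    "predicted_growth": [6.0, 5.0, 2.0, 1.0, 4.2, 3.0],  # model forecast
    "actual_growth": [8.0, 6.0, 3.0, 2.0, 7.0, 1.0],     # observed in test year
}).set_index("zip_code")

naive_pick = growth["recent_growth"] >= THRESHOLD
model_pick = growth["predicted_growth"] >= THRESHOLD

# Assign every zip code to one of four sub-groups.
growth["group"] = "neither"
growth.loc[naive_pick & ~model_pick, "group"] = "naive only"
growth.loc[~naive_pick & model_pick, "group"] = "model only"
growth.loc[naive_pick & model_pick, "group"] = "both"

# Median realized growth per sub-group.
median_by_group = growth.groupby("group")["actual_growth"].median()

print(median_by_group)
```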
After validating that the model can perform better than simply looking at historical growth, we looked at what areas our model forecasted for the greatest growth. The model forecasted that the growth in rental prices will continue into 2022, predicting the largest growth in the suburbs of each metro. The zip codes in the north of Dallas continue to be places of high expected rent growth. Dallas is forecasted to have seven of the 10 highest average rent increases. The model forecasts the cluster of zip codes on the east side of San Antonio as areas of high growth as well.
Feature Discussion
We've seen that the model is able to outperform forecasts that solely rely on previous rent growth. To investigate what non-autoregressive features most contributed to the model's predictions, we looked at the top coefficients of our model. By identifying the largest coefficients from the model, we could see what information the model used to outperform more naive approaches.
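Ranking Lasso coefficients by magnitude is straightforward once the features are on a common scale. The sketch below uses synthetic data; the feature names echo ones discussed in this post, but the coefficients, signs, and alpha value are invented and say nothing about the real model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
names = ["city_tax_revenue", "units_50_plus", "net_approval", "noise_feature"]

# Synthetic features and target with made-up effect sizes.
X = pd.DataFrame(rng.normal(size=(300, 4)), columns=names)
y = (
    1200.0
    - 40.0 * X["city_tax_revenue"]
    - 25.0 * X["units_50_plus"]
    + 10.0 * X["net_approval"]
    + rng.normal(scale=5.0, size=300)
)

# Standardize so coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(X)
model = Lasso(alpha=1.0).fit(X_scaled, y)

coefs = pd.Series(model.coef_, index=names)
# Order by absolute value, keeping the sign for interpretation.
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)

print(ranked)
```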
A majority of the top coefficients came from the American Community Survey (e.g. number of 50+ housing units, number of single family housing units, annual percent change in children), information that could have been useful to differentiate neighborhoods that tend to have rents increasing at different rates. Statewide economic features (e.g. Natural Gas Production, Texas Net Approval) were still important to determine the overall trend in rent change but could not help differentiate between individual zip codes.
The polling feature, representing consumer sentiment regarding their economic well-being, was repeatedly selected by the model over some of the more traditional economic indicators. It's possible the model selected Texas polling because it reflected a sharp reaction to the pandemic, while the S&P Case-Shiller index and the Consumer Price Index showed a steady increase through the pandemic period. Texas polling also seems to be more aligned with consumer confidence but is less noisy, which might also be a reason the model preferred it.
Another feature that emerged from the model was representative of a trend we noticed amongst features in general: suburban regions were growing at a higher rate than more central regions. Two features that acted as representatives of this pattern were the "Total Local City Tax Revenue" and the "Number of 50+ housing units" from ACS, the latter of which is representative of dense high-rise buildings and higher population density.
Data on Tax Impacts
The local city tax revenue feature was meant to track local economic development based on the assumption that a growing city should collect more tax revenue. After mapping the tax revenue data, it became clear that it was able to differentiate different neighborhoods and, more specifically, if those neighborhoods were in the city center or the suburbs.
Data on Tax Revenue
In Texas the city centers have a larger local city government to provide services and also collect the most sales tax revenue. The suburbs are smaller, more fragmented, and collect less total sales tax revenue. The model probably found that total city tax revenue can serve as a good proxy to differentiate city centers and suburbs. It has the largest coefficient for non auto-regressive features and might contribute to the lower predicted rent for city centers relative to the suburbs.
Number of 50+ Housing Units by Metro
This same tendency to penalize dense city centers is more clearly seen in the negative coefficient for the ACS feature representing the number of 50+ housing units. The logic is the same as above that the negative coefficient drops the predicted rent for the city centers relative to similar suburban zip codes. The areas with the highest density of high rise apartments all fit within zip codes that had the highest total local city tax revenue.
Conclusion
In the end, the Lasso model was the best of the models we trained. It took a conservative approach to identifying zip codes in the metropolitan areas of Texas that were likely to show the highest rental price increase over the next year. It was successful on a regional basis and was able to outperform a naive approach of simply looking at previous rental increases.
However, the model still underperforms in some regions and wasn't fully able to adapt to the anomalous behaviour throughout the pandemic. In future iterations of this project, we would want to see if we could incorporate Covid-19 or pandemic restriction information to help the model better fit the uptick in rental prices seen in 2020-2021.
Future Works
Furthermore, we believe that ensembling techniques could be used to improve the results by combining some of the advantages of both the linear and non-linear models. These next steps will be left to future work.