Forecasting the Zillow Home Value Index Using Three ML Techniques

The housing market is often volatile and difficult to predict, but machine learning offers several options for forecasting time series. Applied to Zillow’s open-access data, these methods can help investors make informed real estate decisions.

We collaborated with Haystacks.AI to develop a deployable time series forecasting product that the company can use to gain real estate insights for clients. Our goal was to create code that generates forecasts of the housing market using three different forecasting models for a given metropolitan area.

This work serves as a comparison of the three forecasting methods in terms of their implementation, ease of use, and quality of results. Using multiple forecasting models provides an idea of the consistency of the results, allowing for more informed real estate decisions.

Data

This project forecasts the Zillow Home Value Index (ZHVI), a home market value index provided by Zillow, at the level of metropolitan statistical areas. These time series have already been smoothed and do not contain a seasonal component. Below is a plot of some of the areas over the full range of the dataset, from January 2000 to November 2022. There are notable patterns throughout the time series, such as a bump in 2007 and a spike in 2022. The models created in this study were designed to forecast the ZHVI for any of these areas into the future.

Some of the models we implemented took into account exogenous variables that were chosen as realistic drivers of market value. Based on exploratory data analysis, we chose to include metropolitan area unemployment rates, the average US 30-year fixed mortgage rate, an average urban consumer price index, a measure of new single-family homes sold, and a measure of new single-family homes for sale. The datasets were sourced from FRED and the US Census Bureau. All exogenous variables were national values, except for the unemployment rates, which were at the level of metropolitan statistical areas.

Modeling

We chose three time series forecasting models: Prophet, Vector Autoregression (VAR), and Bayesian Vector Autoregression (BVAR). Prophet, developed by Facebook, uses only the history of the target variable, in this case ZHVI, to make the forecast. VAR and Bayesian VAR differ from Prophet in that they also use the relationships between ZHVI and the chosen exogenous variables to make the forecast.

Prophet

Prophet is an easy-to-implement time series forecasting model whose fit-and-predict workflow resembles that of scikit-learn. The only inputs required are the target time series and values for the hyperparameters, which can be tuned with cross-validation. The figure below plots the forecast for the ZHVI in New York, NY one year into the future. The forecast is the blue line and its uncertainty interval is the blue shading.

As you forecast further into the future, you can expect the uncertainty interval to get wider and the error to increase. Plotted below is the mean absolute percentage error (MAPE) of the New York, NY forecast relative to the length of the forecast. The MAPE increases as the forecast horizon increases; at the one-year mark, it is about 0.04 (4%).
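For reference, the error at each forecast step can be computed as in the sketch below. The helper names are hypothetical and the numbers are made up to keep the arithmetic easy to follow; they are not results from this project.

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error as a fraction (0.04 means 4%)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.mean(np.abs((actual - forecast) / actual)))

def mape_by_horizon(actuals, forecasts):
    """MAPE at each forecast step, averaged over several forecast origins.

    Both inputs have shape (n_origins, n_steps): row i holds the values for the
    forecast started at origin i, and column h holds step h + 1 of that forecast.
    """
    actuals = np.asarray(actuals, dtype=float)
    forecasts = np.asarray(forecasts, dtype=float)
    return np.mean(np.abs((actuals - forecasts) / actuals), axis=0)

# Made-up values for two forecast origins and three steps ahead.
actuals = np.array([[100.0, 200.0, 400.0],
                    [100.0, 200.0, 400.0]])
forecasts = np.array([[99.0, 196.0, 384.0],
                      [101.0, 204.0, 416.0]])

per_step = mape_by_horizon(actuals, forecasts)  # one MAPE value per step ahead
```

In this toy example the per-step errors are 1%, 2%, and 4%, the same qualitative pattern of error growing with the horizon.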

Vector Autoregression (VAR)

The VAR model is more complex than Prophet in that it incorporates the exogenous variables. The VAR model creates one equation for each variable, regressing that variable on the prior (lagged) values of every variable in the system. The model is fit to stationary time series, each equation is estimated with ordinary least squares, and forecasts are generated recursively. It allows for bidirectional relationships, meaning it determines not only how the exogenous variables impact the ZHVI but also how the ZHVI impacts each exogenous variable. This is a more realistic approach than a unidirectional relationship.
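The estimation and recursive forecasting steps can be sketched with NumPy alone. This is an illustration under stated assumptions, not the project's implementation: `fit_var` and `forecast_var` are hypothetical helpers, the data are simulated from a known two-variable system, and the input is assumed to be stationary already. In practice a library such as statsmodels provides a ready-made VAR class.

```python
import numpy as np

def fit_var(data, lags):
    """Fit a VAR(lags) by ordinary least squares.

    data has shape (n_obs, n_vars). Returns (intercept, coefs), where
    coefs[k] is the (n_vars, n_vars) coefficient matrix on lag k + 1.
    """
    data = np.asarray(data, dtype=float)
    n_obs, n_vars = data.shape
    # Each design row is [1, y_{t-1}, ..., y_{t-lags}].
    design = np.array([
        np.concatenate([[1.0]] + [data[t - k] for k in range(1, lags + 1)])
        for t in range(lags, n_obs)
    ])
    target = data[lags:]
    # A single solve fits every equation; column j of beta belongs to variable j.
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    intercept = beta[0]
    coefs = [beta[1 + k * n_vars: 1 + (k + 1) * n_vars].T for k in range(lags)]
    return intercept, coefs

def forecast_var(data, intercept, coefs, steps):
    """Forecast recursively: each predicted step is fed back in as a lag."""
    history = list(np.asarray(data, dtype=float)[-len(coefs):])
    path = []
    for _ in range(steps):
        nxt = intercept.copy()
        for k, a in enumerate(coefs, start=1):
            nxt = nxt + a @ history[-k]
        history.append(nxt)
        path.append(nxt)
    return np.array(path)

# Simulate a stationary two-variable VAR(1) with known coefficients.
rng = np.random.default_rng(0)
true_a = np.array([[0.5, 0.2],
                   [0.1, 0.4]])
series = np.zeros((500, 2))
for t in range(1, 500):
    series[t] = true_a @ series[t - 1] + rng.normal(scale=0.1, size=2)

intercept, coefs = fit_var(series, lags=1)
path = forecast_var(series, intercept, coefs, steps=12)  # 12 steps ahead
```

The off-diagonal entries of each coefficient matrix are what make the relationships bidirectional: every variable's equation includes the lags of every other variable.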

The plot below shows the VAR forecast of the ZHVI in New York, NY one year into the future. The forecast is the orange line and the confidence interval is the orange shading. This forecast had a MAPE of 0.03, similar to that of Prophet.

A benefit of VAR is that it allows for Impulse Response Function (IRF) analysis. An IRF is the output of a dynamic system when it is given a brief input signal. The figure below shows the IRFs of the exogenous variables’ effects on the ZHVI in New York, NY. The blue lines are the IRFs and the dotted lines are the confidence intervals. The unemployment rate creates a negative shock to the system: it has a negative effect on the ZHVI in the first few months, and the effect then slowly converges towards zero with time. The median consumer price index, mortgage rate, and the number of homes for sale all have slow negative influences on the ZHVI that become more negative with time. Homes sold has the same shaped curve but in the positive direction.
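To make the idea concrete, the sketch below computes impulse responses directly from VAR coefficient matrices. The coefficients are invented for illustration (they are not estimates from this project), `impulse_responses` is a hypothetical helper, and the responses are the simple non-orthogonalized kind with no confidence bands.

```python
import numpy as np

def impulse_responses(coefs, horizon):
    """Non-orthogonalized impulse responses of a VAR.

    coefs[k] is the coefficient matrix on lag k + 1. In the returned array,
    element [h, i, j] is the response of variable i, h periods after a
    one-unit shock to variable j.
    """
    n_vars = coefs[0].shape[0]
    responses = [np.eye(n_vars)]  # at h = 0 each shock hits only its own variable
    for h in range(1, horizon + 1):
        phi = np.zeros((n_vars, n_vars))
        for k, a in enumerate(coefs, start=1):
            if k <= h:
                phi += a @ responses[h - k]
        responses.append(phi)
    return np.array(responses)

# Invented VAR(1) system: variable 0 plays the role of the home value series,
# variable 1 the unemployment rate. The -0.3 entry is a made-up negative effect.
a1 = np.array([[0.5, -0.3],
               [0.0, 0.6]])

irf = impulse_responses([a1], horizon=12)
zhvi_response = irf[:, 0, 1]  # response of variable 0 to a shock in variable 1
```

With these made-up coefficients the response starts at zero, turns negative over the first few periods, and then decays towards zero, the same shape described for the unemployment rate above.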

Bayesian Vector Autoregression (BVAR)

Bayesian VAR differs from VAR in its approach to estimating the parameters. Rather than using a maximum likelihood estimation method, the parameters are estimated through posterior distributions that are generated by conditioning on the observed data. Forecasts are then made by sampling from the posterior distributions. At the model level, the Bayesian VAR includes an observation level noise term for stability of sampling. 
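The sketch below illustrates the idea on a deliberately simplified case: a single series with one lag coefficient, a Normal prior, and a known noise level, so that the posterior has a closed form. That is a swapped-in shortcut for illustration only; the project's model was built in PyMC, which samples the posterior numerically, and all values here are simulated.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated stationary series standing in for a transformed index (not real data).
true_coef, noise_sd = 0.6, 0.1
y = np.zeros(300)
for t in range(1, 300):
    y[t] = true_coef * y[t - 1] + rng.normal(scale=noise_sd)

x, target = y[:-1], y[1:]

# Prior: coef ~ Normal(0, prior_sd^2).
# Likelihood: target ~ Normal(coef * x, noise_sd^2), with noise_sd assumed known.
prior_sd = 1.0
post_precision = 1.0 / prior_sd**2 + np.sum(x**2) / noise_sd**2
post_var = 1.0 / post_precision
post_mean = post_var * np.sum(x * target) / noise_sd**2

# Draw lag coefficients from the posterior, then simulate one forecast path per draw.
n_draws, steps = 2000, 12
coef_draws = rng.normal(post_mean, np.sqrt(post_var), size=n_draws)
paths = np.empty((n_draws, steps))
last = np.full(n_draws, y[-1])
for h in range(steps):
    last = coef_draws * last + rng.normal(scale=noise_sd, size=n_draws)
    paths[:, h] = last

forecast_mean = paths.mean(axis=0)
lower, upper = np.percentile(paths, [2.5, 97.5], axis=0)  # 95% band per step
```

Each simulated path uses a different plausible coefficient, so the spread of the paths reflects both parameter uncertainty and noise, rather than a single point estimate.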

Below is a plot of the New York, NY forecast one year into the future; the green lines depict the forecast, including its uncertainty.

Below is a plot of the posterior distribution for the lag coefficient. Instead of a point estimate of the lag coefficient, BVAR produces a distribution of its possible values. A sharper peak would mean more confidence that the value is close to the peak. Here we can see that the posterior predictive mean and the observed values line up well, indicating that the model considers the data it has been conditioned on to be plausible. That's a good sign for our modeling process.

Final Thoughts

Prophet, Vector Autoregression, and Bayesian Vector Autoregression each have their own strengths and weaknesses in implementation and results. Prophet is easy to use, with an interface similar to scikit-learn's, but it is a univariate approach that is better suited to less complex time series. VAR can model realistic multivariate relationships, though its coefficients can be difficult to interpret. Bayesian VAR offers highly flexible model building and interpretable posterior distributions, but the model has to be implemented manually in PyMC and the Bayesian workflow can be tricky.

To achieve our goal of making the code into a re-runnable product with Haystacks.AI, we combined our three models into a single Jupyter notebook. A user can select a metropolitan area of interest and view the three different ZHVI forecasts along with their corresponding MAPEs. This user-friendly code provides three forecasts from which housing market decisions can be made.

About Authors

Grainne O'Neill

As a soon-to-be Ph.D. graduate with a background in mathematics and a passion for data science, I am seeking opportunities to leverage my skills and enthusiasm for solving complex problems through data-driven insights.

Sarah Beth Powell

I'm a proven project manager, with curiosity in data science and solving problems through statistics, math and coding. I have over 8 years of experience ranging from people analytics in human resources to assortment optimization in retail. With...
