Optimizing Real Estate Price Prediction with Unsupervised Learning
A project with New York City Data Science Academy and Haystacks.AI. Code available on GitHub.
Real Estate Investing and the Single Family Rental Market
Real estate investment offers distinct advantages to both individual and institutional investors, including tax benefits, cash flow, and a hedge against inflation, just to name a few. For investors interested in residential real estate, the fast-growing Single Family Rental (SFR) market has become an increasingly attractive option.
According to Roofstock, more than one-third of rental units in the U.S. are SFRs, and the total number of SFRs has jumped 25 percent since 2012. With rents and home prices showing strong year-over-year growth, investors face a unique value proposition: compared with multifamily residential properties, SFR properties can be acquired with less money down and at lower interest rates.
But amid a national housing shortage, with pent-up demand and low supply, building a strong residential real estate portfolio requires a sharp eye for smart buys—or a sophisticated predictive model. Here we focus on unsupervised learning techniques to improve the performance of a basic Automated Valuation Model (AVM).
The primary objective of this exercise was to improve price prediction by first using unsupervised learning techniques to cluster real estate listings. To this end, our workflow included the following steps, which are the focus of this article:
- Data cleaning
- Merging supplemental data
- Feature engineering
- Dimensionality reduction
- Clustering with k-means
- Clustered prediction with Random Forest
TL;DR: The analysis revealed a modest improvement in price prediction using clustered data. In particular, our smallest cluster, a dense urban cluster, demonstrated the best predictive performance. We also found several tradeoffs investors should weigh before buying in this cluster: urban density brings greater proximity to amenities, but also higher crime rates and a premium on space, with the highest price per square foot of our three clusters. Insights like these could be useful for investors and their advisors, as well as those in the Property Technology (PropTech) industry.
Data Cleaning and Exploratory Data Analysis
Starting with 31,064 property listings for the state of Georgia, sourced from various platforms (e.g., Zillow, Redfin) and provided to us by colleagues at Haystacks.AI, we first filtered for detached single-family properties. We then dropped listings with zero square footage, along with columns that were no longer relevant (like unit count) or had too many missing values (like number of half baths). After dropping outliers and imputing missing data, the clean dataset included 11,491 unique listings.
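To make these steps concrete, here is a minimal cleaning sketch in pandas. The file name and column names (property_type, sqft, unit_count, half_baths, price) are illustrative assumptions, not the exact schema of the listings data:

```python
import pandas as pd

# Hypothetical file name; column names below are illustrative assumptions
listings = pd.read_csv("ga_listings.csv")

# Keep detached single-family properties only
sfr = listings[listings["property_type"] == "Single Family Detached"].copy()

# Drop listings with zero square footage
sfr = sfr[sfr["sqft"] > 0]

# Drop columns that are irrelevant or mostly missing
sfr = sfr.drop(columns=["unit_count", "half_baths"])

# Drop extreme price outliers (a simple IQR rule, shown as one example approach)
q1, q3 = sfr["price"].quantile([0.25, 0.75])
iqr = q3 - q1
sfr = sfr[sfr["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Impute remaining missing numeric values with column medians
num_cols = sfr.select_dtypes("number").columns
sfr[num_cols] = sfr[num_cols].fillna(sfr[num_cols].median())

sfr = sfr.drop_duplicates()
print(f"{len(sfr)} unique listings after cleaning")
```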
As a next step, we added supplemental data to the original dataset to increase the number of features, on the theory that a richer feature set could improve clustering and prediction down the line. Supplemental data on schools and crime came from Haystacks.AI. To produce additional useful features, we scraped data on ‘high quality’ grocery stores (for simplicity, defined as Whole Foods Market or Trader Joe’s locations) from Superpages Classifieds, along with walk score data from walkscore.com. To reduce scraping time, we collected walk scores by zip code rather than for individual property listings. Finally, we included bank branch location data sourced from the Federal Deposit Insurance Corporation (FDIC).
Using longitude and latitude data, we computed the haversine distance to the nearest school, bank, and grocery store for each property listing. Walk score data and crime rates corresponded to zip codes, so we simply merged them on the zip code column. Altogether, the supplemental data added 20 new features to our dataset, 16 of which were detailed crime rates at the zip code level.
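The haversine formula itself is standard; the sketch below shows how such a distance computation works. The listing coordinates are downtown Atlanta, and the grocer coordinates are made up for the example:

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees.

    lat2/lon2 may be arrays, so one property can be compared against
    every point of interest at once.
    """
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Illustrative usage: distance from one listing to its nearest grocery store
listing_lat, listing_lon = 33.7490, -84.3880   # downtown Atlanta
grocer_lats = np.array([33.8034, 33.7743, 34.0754])   # made-up coordinates
grocer_lons = np.array([-84.3963, -84.3580, -84.2941])

dists = haversine_km(listing_lat, listing_lon, grocer_lats, grocer_lons)
print(f"nearest grocer: {dists.min():.2f} km")
```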
Before moving on to the modeling, we did some basic visualizations to get a sense of underlying correlations and perform a quick sanity check. Plotting price against square footage, we saw a positive correlation, with most properties falling below 4,000 square feet and $1 million.
As we might expect, the data showed a negative correlation between property price and crime. Here we’ve plotted the violent crime rate (per 1,000 residents) against price, revealing a weak negative relationship. All of this makes sense and we can rest assured there are no major outliers to skew our modeling.
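A rough sketch of these sanity-check plots, continuing from the cleaned frame above (the violent_crime_rate column name is an assumption):

```python
import matplotlib.pyplot as plt

# Two quick scatter plots: price vs. square footage and price vs. violent crime
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.scatter(sfr["sqft"], sfr["price"], s=4, alpha=0.3)
ax1.set(xlabel="Square footage", ylabel="Price ($)", title="Price vs. size")

ax2.scatter(sfr["violent_crime_rate"], sfr["price"], s=4, alpha=0.3)
ax2.set(xlabel="Violent crimes per 1,000 residents", ylabel="Price ($)",
        title="Price vs. violent crime")

plt.tight_layout()
plt.show()
```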
Dimensionality Reduction and Clustering
To reduce dimensionality prior to clustering, we explored a few different techniques. First, we used Principal Component Analysis (PCA) to reduce dimensionality among the numerical features. The results showed that the first three components explained roughly 58 percent of the variance in the dataset (good, not great).
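In sklearn, this amounts to standardizing the numeric features and inspecting the explained variance ratios. Here numeric_features is an illustrative placeholder for our list of numeric feature columns:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize first, since the features are on very different scales
X = StandardScaler().fit_transform(sfr[numeric_features])

pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)

print(pca.explained_variance_ratio_)        # share of variance per component
print(pca.explained_variance_ratio_.sum())  # roughly 0.58 in our case
```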
Next, we used Multiple Correspondence Analysis (MCA) to reduce dimensionality among categorical features. Because the categorical features included several geographic variables, such as county, city, and zip code, each with hundreds of unique values, MCA did not handle the categorical data very well: individual components returned by MCA explained less than 1 percent of the inertia (a measure comparable to explained variance in PCA). Restricting MCA to a subset of non-geographic categorical features improved the results somewhat, so we attempted to keep those features in the analysis. To do so, we brought numerical and categorical features together with a generalized PCA algorithm, a technique called Factor Analysis of Mixed Data (FAMD). For details on the FAMD calculations, please refer to the Keany article in the resources below, which we consulted for our analysis. As it turned out, the FAMD results still fell short of the original dimensionality reduction using PCA on numerical features alone. For this reason, we excluded categorical data from our clustering and prediction models.
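The core of FAMD can be sketched with sklearn alone: standardize the numeric columns, one-hot encode the categoricals and weight each indicator column by the inverse square root of its frequency, then run ordinary PCA on the combined matrix. The function below is a minimal sketch of that idea, not the exact code we ran (libraries such as prince offer ready-made MCA/FAMD implementations):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def famd_components(df, num_cols, cat_cols, n_components=3):
    """Rough FAMD-style decomposition of a mixed numeric/categorical frame."""
    # Standardize numeric columns
    X_num = StandardScaler().fit_transform(df[num_cols])

    # One-hot encode categoricals; weight each indicator by 1/sqrt(frequency)
    dummies = pd.get_dummies(df[cat_cols].astype(str))
    weights = 1.0 / np.sqrt(dummies.mean(axis=0))
    X_cat = (dummies * weights).to_numpy(dtype=float)

    # Ordinary PCA (which centers the data) on the combined matrix
    pca = PCA(n_components=n_components)
    scores = pca.fit_transform(np.hstack([X_num, X_cat]))
    return scores, pca.explained_variance_ratio_
```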
The next step was to cluster the data. While we did experiment with a few different clustering algorithms, including DBSCAN (which is well suited to geographic data), in the end we obtained the best clustering results using k-means.
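A minimal k-means sketch on the PCA scores from above, using the silhouette score as one simple way to compare candidate values of k (we settled on three clusters):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Compare a few candidate cluster counts by silhouette score
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_pca)
    print(k, round(silhouette_score(X_pca, labels), 3))

# Final model with three clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X_pca)
```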
In a scatter plot of the clusters in principal component space, the separation between the larger blue and green clusters is not very pronounced, but the purple cluster stands out quite a bit, particularly along the first principal component. Looking back at the PCA results, we could see that the first component was highly correlated with the crime data, a fact that became important when we began developing our supervised models. With more features, or perhaps a more robust clustering algorithm, we may have been able to develop better clusters. For this exercise, however, we wanted to see what we could accomplish with the features we already had. And it turned out that even these clusters improved prediction.
The Impact of Clustering on Price Prediction
To test the performance of predictive models on clustered data, we first mapped the cluster labels back to our original data (keeping only the numeric features used in PCA and clustering). We then performed a train-test split on the full dataset (setting aside 30 percent for testing) and further segmented the training and testing data by cluster. In the end, each of our three clusters had its own training and testing features, along with training and testing price data (our target). The clustered datasets ranged from several hundred to several thousand property listings each.
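Sketched in code, the splitting logic looks roughly like this, continuing from the cluster labels above (numeric_features is again an illustrative placeholder and excludes the price target):

```python
from sklearn.model_selection import train_test_split

# Map cluster labels back onto the numeric features, split once, then
# segment the split by cluster so every cluster model sees held-out data
data = sfr[numeric_features].copy()
data["cluster"] = cluster_labels
y = sfr["price"]

X_train, X_test, y_train, y_test = train_test_split(
    data, y, test_size=0.3, random_state=42
)

cluster_splits = {}
for c in sorted(data["cluster"].unique()):
    train_mask = X_train["cluster"] == c
    test_mask = X_test["cluster"] == c
    cluster_splits[c] = (
        X_train[train_mask].drop(columns="cluster"), y_train[train_mask],
        X_test[test_mask].drop(columns="cluster"), y_test[test_mask],
    )
```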
Given that crime rates proved important to principal component one, which played a significant role in separating our data into clusters, it was important to use crime data in our prediction models. However, several of the 16 columns of crime rate data were highly correlated with one another, raising concerns about multicollinearity. For this reason, it made the most sense to avoid a linear model. Since tree-based models do not assume linearly independent features, we decided to fit a Random Forest regression model to predict prices.
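A quick way to confirm this concern is to scan the crime columns for highly correlated pairs. Here crime_cols is an illustrative list of the 16 crime-rate column names, and the 0.8 threshold is an arbitrary cutoff:

```python
# Flag highly correlated pairs among the crime-rate columns
corr = sfr[crime_cols].corr().abs()
pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(crime_cols)
    for b in crime_cols[i + 1:]
    if corr.loc[a, b] > 0.8
]
for a, b, r in sorted(pairs, key=lambda t: -t[2]):
    print(f"{a} ~ {b}: r = {r:.2f}")
```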
Knowing that our objective was not to necessarily build the best predictive model, but rather to demonstrate the performance differential between clustered prediction and standard prediction on the full dataset, we did not spend much time tuning the model. All random forest models used 100 trees with a maximum depth of 4.
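A sketch of the per-cluster fitting and scoring, with the hyperparameters from the text and the splits built above:

```python
from sklearn.ensemble import RandomForestRegressor

def fit_and_score(X_tr, y_tr, X_te, y_te):
    """Fit a random forest and return R² on held-out data."""
    rf = RandomForestRegressor(n_estimators=100, max_depth=4, random_state=42)
    rf.fit(X_tr, y_tr)
    return rf.score(X_te, y_te)  # coefficient of determination (R²)

# One model per cluster, plus a 'standard' model on the full training set
cluster_scores = {
    c: fit_and_score(X_tr, y_tr, X_te, y_te)
    for c, (X_tr, y_tr, X_te, y_te) in cluster_splits.items()
}
standard_score = fit_and_score(
    X_train.drop(columns="cluster"), y_train,
    X_test.drop(columns="cluster"), y_test,
)

print(cluster_scores, standard_score)
```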
The results: Among our cluster models, coefficients of determination ranged from 0.517 to 0.637 on test data, with the best performance on the smallest cluster (which had the best cluster separation). The average test score on cluster models was 0.569, compared to 0.539 for the standard model. The difference is not large, but is reasonable considering the relatively limited data and modest results from PCA and clustering. With more data and better clustering, we could expect even greater improvement in prediction when leveraging clustering techniques.
Understanding the Clusters
Because k-means clustering relies on Euclidean distance, which is not appropriate for map coordinates, we left longitude and latitude out of the clustering features. Even so, we did include three distance measures: the distances to the nearest bank, grocer, and school for each property listing. These features, along with others representing property and neighborhood characteristics, yielded clusters that follow a fairly intuitive geographic distribution.
The large blue cluster, labeled the outer ring cluster, forms a wide ring around the Atlanta metropolitan area, extending deep into the suburbs across the state. By contrast, the small purple cluster, labeled the downtown cluster, represents the urban center of the city of Atlanta (with a few scattered properties in Savannah and elsewhere). The green cluster, labeled the inner ring cluster, has perhaps the worst separation of the three, with diffuse representation across the map that often overlaps with the blue cluster. It would appear that the green cluster captures inner ring suburbs with moderate urban density, as well as properties scattered across the metropolitan and rural areas.
The outer ring cluster is consistent with a desirable suburb, having the highest prices and lowest crime rates. This contrasts with the downtown cluster, which has relatively lower property prices and much higher crime rates. The downtown area also comes with an expected premium on space, having the highest price per square foot of the three clusters.
Unsurprisingly, downtown properties benefit from close proximity to amenities and thus have the highest average walk score. The distance to a high quality grocery store is also a fraction of what it is outside downtown. Though not displayed here, distances to the nearest school and bank are also lowest in the downtown area.
Summing it Up
Unsupervised learning methods such as clustering algorithms can often feel a bit enigmatic—there is really no one right answer. But they can also unearth invaluable insights and optimize performance when paired with more familiar supervised learning models. The goal of this exercise was to demonstrate how clustering can help us identify and understand natural groupings among real estate listings, and then leverage our clusters to improve price prediction. Despite having relied heavily on alternative data (as opposed to traditional financial metrics), we still produced reasonably insightful clusters and found modest improvement in prediction with random forests. This approach, when refined and implemented at scale, could be transformative for real estate investors and professionals in the PropTech space.
Limitations and next steps
Given more time, our team would undertake several next steps: First, the modeling would benefit from more data. Adding additional points of interest and more granular data (address-level versus zip code level) could improve clustering. The clustered AVM technique may also benefit from more robust clustering algorithms and perhaps clustering of smaller geographic units (e.g., counties). While we did cluster with DBSCAN, it is possible that other approaches like agglomerative clustering or Gaussian Mixture Models would yield better results. Lastly, we may be able to achieve even better prediction results with more finely tuned and more sophisticated supervised learning models.
Acknowledgements
I am grateful for the support and collaboration of the following people: Lauren Tomlinson, Data Science Fellow, New York City Data Science Academy; Joe Lee, Chief Data Scientist and Co-Founder, Haystacks.AI; Vivian Zhang, Chief Technology Officer, New York City Data Science Academy.
Helpful resources we consulted:
Chauhan, N.S. DBSCAN Clustering Algorithm in Machine Learning. April, 2022. https://www.kdnuggets.com/2020/04/dbscan-clustering-algorithm-machine-learning.html
Data4Help. Clustering Real Estate Data. May, 2020. https://becominghuman.ai/clustering-real-estate-data-594894e24484
Jaadi, Z. A Step-by-Step Explanation of Principal Component Analysis (PCA): Learn how to use PCA when working with large data sets. September, 2022. https://builtin.com/data-science/step-step-explanation-principal-component-analysis
Keany, E. The Ultimate Guide for Clustering Mixed Data. November, 2021. https://medium.com/analytics-vidhya/the-ultimate-guide-for-clustering-mixed-data-1eefa0b4743b
Li, S. Sharpen Your Machine Learning Skills with This Real-World Housing Market Cluster Analysis. November, 2021. https://towardsdatascience.com/sharpen-your-machine-learning-skills-with-this-real-world-housing-market-cluster-analysis-f0e6b06f6ba0