Optimizing Real Estate Price Prediction with Unsupervised Learning

Posted on Oct 12, 2022

A project with New York City Data Science Academy and Haystacks.AI. Code available on GitHub.

Real Estate Investing and the Single Family Rental Market

Real estate investment offers distinct advantages to both individual and institutional investors, including tax benefits, cash flow, and a hedge against inflation, just to name a few. For investors interested in residential real estate, the fast-growing Single Family Rental (SFR) market has become an increasingly attractive option.

According to Roofstock, more than one-third of rental units in the U.S. are SFRs, with the total number of SFRs having jumped 25 percent since 2012. As rents and home prices demonstrate strong year-over-year growth, investors face a unique value proposition. Unlike multifamily residential properties, SFR properties can be acquired with less money down and at lower interest rates.

But amid a national housing shortage, with pent-up demand and low supply, building a strong residential real estate portfolio requires a sharp eye for smart buys—or a sophisticated predictive model. Here we focus on unsupervised learning techniques to improve the performance of a basic Automated Valuation Model (AVM). 

The primary objective of this exercise was to improve price prediction by first using unsupervised learning techniques to cluster real estate listings. To this end, our workflow included the following steps, which are the focus of this article:

  • Data cleaning
  • Merging supplemental data
  • Feature engineering
  • Dimensionality reduction
  • Clustering with k-means
  • Clustered prediction with Random Forest

TL;DR: The analysis revealed a modest improvement in price prediction using clustered data. In particular, our smallest cluster, a dense urban cluster, demonstrated the best prediction performance. Investors should be aware of several tradeoffs when making decisions about investments in this cluster. Urban density brings with it relatively low property prices and close proximity to amenities, but also higher crime rates and a high price per square foot. Insights like these could be useful for investors and their advisors, as well as those in the Property Technology (PropTech) industry.

Data Cleaning and Exploratory Data Analysis

Starting with 31,064 property listings for the state of Georgia, sourced from various platforms (e.g., Zillow, Redfin) and provided to us by colleagues at Haystacks.AI, we first filtered for detached single-family properties. Then we dropped listings with zero square footage, along with columns that were no longer relevant (like unit count) or had too many missing values (like number of half baths). After dropping outliers and imputing missing data, the clean dataset included 11,491 unique listings.

As a next step, we added supplemental data to the original dataset to increase the number of features. The idea was that more features could help improve clustering and prediction down the line. Supplemental data on schools and crime came from Haystacks.AI. To produce additional useful features, we scraped data on ‘high quality’ grocery stores (for simplicity, defined as Whole Foods Market or Trader Joe’s) from Superpages Classifieds and walk score data from walkscore.com. To reduce scraping time, we collected walk scores for zip codes rather than for individual property listings. Finally, we included bank branch location data sourced from the Federal Deposit Insurance Corporation (FDIC).

Using longitude and latitude data, we computed the haversine distance to the nearest school, bank, and grocery store for each property listing. Walk score data and crime rates corresponded to zip codes and so we simply merged the data on the zip code column. Altogether, supplemental data added 20 new features to our dataset, 16 of which were detailed crime rates at the zip code level.
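The nearest-amenity distance can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names and the (latitude, longitude) array layout are assumptions, and the pairwise distance matrix it builds is only practical for datasets of moderate size.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points in decimal degrees (broadcasts)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def nearest_distance_km(listings, places):
    """For each (lat, lon) listing, return the distance to the closest (lat, lon) place."""
    listings = np.asarray(listings, dtype=float)
    places = np.asarray(places, dtype=float)
    # Shape (n_listings, n_places): every listing against every place.
    distances = haversine_km(
        listings[:, None, 0], listings[:, None, 1],
        places[None, :, 0], places[None, :, 1],
    )
    return distances.min(axis=1)

# Hypothetical coordinates, for illustration only.
listings_demo = [[33.75, -84.39], [32.08, -81.09]]
grocers_demo = [[33.75, -84.39], [34.00, -84.00]]
nearest_grocer_km = nearest_distance_km(listings_demo, grocers_demo)
```

The same helper could be called once per amenity type (school, bank, grocery store) to produce one distance column each.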

Before moving on to the modeling, we did some basic visualizations to get a sense of underlying correlations and do a quick sanity check. Plotting price against square footage, we saw a positive correlation, with most properties falling below 4,000 square feet and $1 million.

As we might expect, the data showed a negative correlation between property price and crime. Here we’ve plotted the violent crime rate (per 1,000 residents) against price, revealing a weak negative relationship. All of this makes sense and we can rest assured there are no major outliers to skew our modeling.

Dimensionality Reduction and Clustering

To reduce dimensionality prior to clustering, we explored a few different techniques. First, we used Principal Components Analysis (PCA) to reduce dimensionality among numerical features. The results showed that the first three components explained roughly 58 percent of the variance in the dataset (good, not great).
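A PCA step of this kind might look like the sketch below. It assumes scikit-learn and standardizes the features first so that no single unit (dollars, square feet, crime rates) dominates the components; the helper name and the synthetic data are illustrative only and will not reproduce the 58 percent figure.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def fit_pca(X, n_components=3, random_state=0):
    """Standardize numeric features, then project onto the leading components."""
    X_scaled = StandardScaler().fit_transform(X)
    pca = PCA(n_components=n_components, random_state=random_state)
    components = pca.fit_transform(X_scaled)
    return pca, components

# Synthetic stand-in for the numeric feature matrix.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8))
X_demo[:, 1] = X_demo[:, 0] + 0.1 * rng.normal(size=200)  # two correlated columns

pca, Z = fit_pca(X_demo)
# Cumulative share of variance explained by the first 1, 2, and 3 components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
```

The last entry of `cumulative` is the quantity reported above as the variance explained by the first three components.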

Next, we used Multiple Correspondence Analysis (MCA) to reduce dimensionality among categorical features. Because the categorical features included various geographic variables, such as county, city, and zip code, which each included hundreds of unique values, MCA did not handle the categorical data very well. Individual components returned by MCA explained less than 1 percent of the inertia (a measure comparable to explained variance with PCA). Including only a subset of non-geographic categorical features improved the MCA results a bit, so we attempted to include them. To do so we brought numerical and categorical features together with a generalized PCA algorithm, a technique called Factor Analysis of Mixed Data (FAMD). For details related to FAMD calculations, please refer to this article, which we consulted for our analysis. As it turned out, FAMD results still fell short of the original dimensionality reduction using PCA on numerical features. For this reason, we excluded categorical data from clustering and prediction models. 

The next step was to cluster the data. While we did experiment with a few different clustering algorithms, including DBSCAN (which is well suited to geographic data), in the end we obtained the best clustering results using k-means.  
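As a rough sketch of the k-means step, assuming scikit-learn and three clusters fitted on the principal components: the silhouette score used here is one common way to compare clustering results, though it is an assumption that it matches how the clusterings were actually judged.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_components(Z, n_clusters=3, random_state=0):
    """Fit k-means on component scores; return the model, labels, and silhouette."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = km.fit_predict(Z)
    return km, labels, silhouette_score(Z, labels)

# Synthetic, well-separated blobs standing in for the PCA output.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0], [10, 10, 10], [-10, 10, 0]], dtype=float)
Z_demo = np.vstack([c + rng.normal(size=(50, 3)) for c in centers])

km, labels, score = cluster_components(Z_demo)
```

A silhouette score near 1 indicates tight, well-separated clusters, while values near 0 indicate overlapping ones.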

The separation between the larger blue and green clusters is not very pronounced. But the purple cluster stands out quite a bit, particularly on the axis for principal component one. By looking back to the PCA results, we could see that the first component was highly correlated with crime data, a fact that became important when we began developing our supervised models. With more features, or perhaps a more robust clustering algorithm, we may have been able to develop better clusters. For this exercise, however, we wanted to see what we could accomplish with the features we already had. And it turned out that even these clusters improved prediction.

The Impact of Clustering on Price Prediction

To test the performance of predictive models on clustered data, we first mapped our cluster labels back to our original data (keeping only the numeric features used in PCA and clustering). Then we split the full dataset into training and test sets (setting aside 30 percent of the data for testing) and further segmented the training and test sets by cluster. In the end, each of our three clusters had its own training and test features and its own training and test prices (our target). The clustered training and test sets ranged in size from several hundred to several thousand property listings.
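The split-then-segment order described above could be sketched like this. The column names `price` and `cluster`, the helper name, and the demo data are hypothetical; only the 30 percent test share comes from the text.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def split_by_cluster(df, target="price", cluster_col="cluster",
                     test_size=0.3, random_state=0):
    """Split once on the full data, then slice both halves by cluster label."""
    train, test = train_test_split(df, test_size=test_size, random_state=random_state)
    splits = {}
    for label in sorted(df[cluster_col].unique()):
        tr = train[train[cluster_col] == label]
        te = test[test[cluster_col] == label]
        splits[label] = (
            tr.drop(columns=[target, cluster_col]),  # training features
            te.drop(columns=[target, cluster_col]),  # test features
            tr[target],                              # training prices
            te[target],                              # test prices
        )
    return train, test, splits

# Hypothetical listings table, for illustration only.
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    "sqft": rng.integers(800, 4000, size=100),
    "violent_crime_rate": rng.random(100),
    "price": rng.integers(100_000, 900_000, size=100),
    "cluster": rng.integers(0, 3, size=100),
})
train, test, splits = split_by_cluster(df_demo)
```

Splitting before segmenting means the full-data model and the cluster models are evaluated on exactly the same held-out listings.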

Given that crime rates proved important to principal component one, which played a significant role in separating our data into clusters, it was important to use crime data in our prediction models. However, among the 16 columns of crime rate data there were several that were highly correlated, raising concerns about multicollinearity. For this reason it made the most sense to not use a linear model. Since tree-based models do not assume linearly independent features, we decided to fit a Random Forest Regression model to predict prices. 

Knowing that our objective was not to necessarily build the best predictive model, but rather to demonstrate the performance differential between clustered prediction and standard prediction on the full dataset, we did not spend much time tuning the model. All random forest models used 100 trees with a maximum depth of 4. 
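A comparison along these lines might be set up as below, assuming scikit-learn's RandomForestRegressor. The 100 trees and maximum depth of 4 come from the text; the helper names and the synthetic data, in which the price relationship deliberately differs by cluster, are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_forest(X, y, random_state=0):
    # Settings stated in the text: 100 trees, maximum depth of 4.
    model = RandomForestRegressor(n_estimators=100, max_depth=4,
                                  random_state=random_state)
    return model.fit(X, y)

def compare_scores(X_train, y_train, X_test, y_test, c_train, c_test):
    """Return test R^2 for one model on all data and for one model per cluster."""
    baseline = fit_forest(X_train, y_train).score(X_test, y_test)
    per_cluster = {}
    for label in np.unique(c_train):
        tr, te = c_train == label, c_test == label
        model = fit_forest(X_train[tr], y_train[tr])
        per_cluster[int(label)] = model.score(X_test[te], y_test[te])
    return baseline, per_cluster

# Synthetic stand-in data: the feature/price relationship flips between clusters.
rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 3))
c = rng.integers(0, 2, size=n)
y = np.where(c == 0, 3.0, -3.0) * X[:, 0] + 0.1 * rng.normal(size=n)

baseline, per_cluster = compare_scores(
    X[:280], y[:280], X[280:], y[280:], c[:280], c[280:]
)
mean_cluster_score = float(np.mean(list(per_cluster.values())))
```

The `score` method returns the coefficient of determination, so `mean_cluster_score` and `baseline` correspond to the averaged cluster score and the standard-model score discussed in the results.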

The results: Among our cluster models, coefficients of determination ranged from 0.517 to 0.637 on test data, with the best performance on the smallest cluster (which had the best cluster separation). The average test score on cluster models was 0.569, compared to 0.539 for the standard model. The difference is not large, but is reasonable considering the relatively limited data and modest results from PCA and clustering. With more data and better clustering, we could expect even greater improvement in prediction when leveraging clustering techniques.

Understanding the Clusters

Because k-means clustering relies on Euclidean distance, which is not appropriate for map coordinates, we left out longitude and latitude data. Even so, we did include three distance measures: distances to the nearest bank, grocer, and school for each property listing. These features, along with others representing property and neighborhood characteristics, yielded clusters that follow a fairly intuitive geographic distribution.

The large blue cluster, labeled the outer ring cluster, forms a large ring around the Atlanta metropolitan area, extending deep into the suburbs across the state. By contrast, the small purple cluster, labeled the downtown cluster, represents the urban center of the city of Atlanta (with a few scattered properties in Savannah and elsewhere). The green cluster, labeled the inner ring cluster, has perhaps the worst separation of the three, with diffuse representation across the map, often overlapping with the blue cluster. It would appear that the green cluster captures inner ring suburbs with moderate urban density, as well as properties scattered across the metropolitan area and rural areas.

The outer ring cluster is consistent with a desirable suburb, having the highest prices and lowest crime rates. This contrasts with the downtown cluster, which has relatively lower property prices and much higher crime rates. The downtown area also comes with an expected premium on space, having the highest price per square foot of the three clusters. 

Unsurprisingly, downtown properties benefit from close proximity to amenities and thus have the highest average walk score. The distance to a high quality grocery store is also a fraction of what it is outside downtown. Though not displayed here, distances to the nearest school and bank are also lowest in the downtown area.
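Cluster profiles like these can be produced with a simple group-by. The sketch below assumes pandas and uses hypothetical column names and made-up values; note that it averages per-listing price per square foot rather than dividing total price by total area.

```python
import pandas as pd

def profile_clusters(df, cluster_col="cluster"):
    """Average every numeric feature, plus price per square foot, within each cluster."""
    out = df.assign(price_per_sqft=df["price"] / df["sqft"])
    return out.groupby(cluster_col).mean(numeric_only=True)

# Made-up listings, for illustration only.
listings_demo = pd.DataFrame({
    "cluster": ["downtown", "downtown", "outer ring", "outer ring"],
    "price": [300_000, 200_000, 500_000, 400_000],
    "sqft": [1_000, 500, 2_500, 2_000],
    "walk_score": [80, 90, 20, 30],
})
profile = profile_clusters(listings_demo)
```

Each row of `profile` summarizes one cluster, which makes it easy to compare prices, walk scores, and other features side by side.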

Summing it Up

Unsupervised learning methods such as clustering algorithms can often feel a bit enigmatic—there is really no one right answer. But they can also unearth invaluable insights and optimize performance when paired with more familiar supervised learning models. The goal of this exercise was to demonstrate how clustering can help us identify and understand natural groupings among real estate listings, and then leverage our clusters to improve price prediction. Despite having relied heavily on alternative data (as opposed to traditional financial metrics), we still produced reasonably insightful clusters and found modest improvement in prediction with random forests. This approach, when refined and implemented at scale, could be transformative for real estate investors and professionals in the PropTech space. 

Limitations and Next Steps

Given more time, our team would undertake several next steps. First, the modeling would benefit from more data. Adding more points of interest and more granular data (address level rather than zip code level) could improve clustering. The clustered AVM technique may also benefit from more robust clustering algorithms and perhaps from clustering smaller geographic units (e.g., counties). While we did try clustering with DBSCAN, it is possible that other approaches like agglomerative clustering or Gaussian Mixture Models would yield better results. Lastly, we may be able to achieve even better prediction results with more finely tuned and more sophisticated supervised learning models.


I am grateful for the support and collaboration of the following people: Lauren Tomlinson, Data Science Fellow, New York City Data Science Academy; Joe Lee, Chief Data Scientist and Co-Founder, Haystacks.AI; Vivian Zhang, Chief Technology Officer, New York City Data Science Academy.

Helpful resources we consulted:

Chauhan, N.S. DBSCAN Clustering Algorithm in Machine Learning. April, 2022. https://www.kdnuggets.com/2020/04/dbscan-clustering-algorithm-machine-learning.html 

Data4Help. Clustering Real Estate Data. May, 2020. https://becominghuman.ai/clustering-real-estate-data-594894e24484 

Jaadi, Z. A Step-by-Step Explanation of Principal Component Analysis (PCA): Learn how to use PCA when working with large data sets. September, 2022. https://builtin.com/data-science/step-step-explanation-principal-component-analysis 

Keany, E. The Ultimate Guide for Clustering Mixed Data. November, 2021. https://medium.com/analytics-vidhya/the-ultimate-guide-for-clustering-mixed-data-1eefa0b4743b

Li, S. Sharpen Your Machine Learning Skills with This Real-World Housing Market Cluster Analysis. November, 2021. https://towardsdatascience.com/sharpen-your-machine-learning-skills-with-this-real-world-housing-market-cluster-analysis-f0e6b06f6ba0



About Author

Trevor Mattos

Data scientist with an MA in Applied Economics, interested in deriving novel insights with data.