NYC Data Science Academy| Blog

Optimizing Real Estate Price Prediction with Unsupervised Learning

Trevor Mattos
Posted on Oct 12, 2022

A project with New York City Data Science Academy and Haystacks.AI. Code available on GitHub.

Real Estate Investing and the Single Family Rental Market

Real estate investment offers distinct advantages to both individual and institutional investors, including tax benefits, cash flow, and a hedge against inflation, just to name a few. For investors interested in residential real estate, the fast-growing Single Family Rental (SFR) market has become an increasingly attractive option.

According to Roofstock, more than one-third of rental units in the U.S. are SFRs, with the total number of SFRs having jumped 25 percent since 2012. As rents and home prices demonstrate strong year-over-year growth, investors face a unique value proposition. Compared with multifamily residential properties, SFR properties can be acquired with less money down and at lower interest rates.

But amid a national housing shortage, with pent-up demand and low supply, building a strong residential real estate portfolio requires a sharp eye for smart buys, or a sophisticated predictive model. Here we focus on unsupervised learning techniques to improve the performance of a basic Automated Valuation Model (AVM).

The primary objective of this exercise was to improve price prediction by first using unsupervised learning techniques to cluster real estate listings. To this end, our workflow included the following steps, which are the focus of this article:

  • Data cleaning
  • Merging supplemental data
  • Feature engineering
  • Dimensionality reduction
  • Clustering with k-means
  • Clustered prediction with Random Forest

TL;DR: The analysis revealed a modest improvement in price prediction using clustered data. In particular, our smallest cluster, which is a dense urban cluster, demonstrated the best performance with respect to prediction. We found that there are several tradeoffs investors should be aware of when making decisions about investments in this particular cluster. Urban density brings with it relatively lower property prices and greater proximity to amenities. But there are also higher crime rates and a high price per square foot. Insights like this could be useful for investors and their advisors, as well as those in the Property Technology (PropTech) industry.

Data Cleaning and Exploratory Data Analysis

Starting with 31,064 property listings for the state of Georgia, sourced from various platforms (e.g., Zillow, Redfin) and provided to us by colleagues at Haystacks.AI, we first filtered the data down to detached single-family properties. Then we dropped listings with zero square footage, along with columns that were no longer relevant (like unit count) or had too many missing values (like number of half baths). After dropping outliers and imputing missing data, the clean dataset included 11,491 unique listings.
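A minimal sketch of the cleaning steps described above, assuming pandas. The column names, the toy rows, the 50 percent missing-value threshold, and the median imputation are all illustrative assumptions rather than the project's actual choices, and outlier removal is left out for brevity.

```python
import pandas as pd

# Toy listings; every column name and value here is hypothetical.
listings = pd.DataFrame({
    "property_type": ["single_family", "condo", "single_family", "single_family"],
    "sqft": [1800, 950, 0, 2400],
    "unit_count": [1, 1, 1, 1],
    "half_baths": [None, None, 1.0, None],
    "bedrooms": [3, 2, 3, None],
    "price": [310000, 185000, 250000, 420000],
})


def clean_listings(df, max_missing_share=0.5):
    """Keep detached single-family homes, drop unusable rows and columns, impute gaps."""
    out = df[df["property_type"] == "single_family"]
    out = out[out["sqft"] > 0]               # drop listings with zero square footage
    out = out.drop(columns=["unit_count"])   # not informative once only single-family homes remain
    # Drop columns where more than max_missing_share of the values are missing.
    keep = out.columns[out.isna().mean() <= max_missing_share]
    out = out[keep].copy()
    # Fill remaining numeric gaps with the column median (one simple imputation choice).
    numeric = out.select_dtypes("number").columns
    out[numeric] = out[numeric].fillna(out[numeric].median())
    return out.reset_index(drop=True)


clean = clean_listings(listings)
print(clean)
```

The order matters here: filtering rows first means the missing-value share is computed on the listings that will actually be modeled.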

As a next step, we added supplemental data to the original dataset to increase the number of features. The idea here was that including more features could help improve clustering and prediction down the line. Supplemental data on schools and crime came from Haystacks.AI. To produce additional useful features, we scraped location data for 'high quality' grocery stores (for simplicity, we defined these as Whole Foods Market or Trader Joe's) from Superpages Classifieds and walk score data from walkscore.com. To reduce the time spent scraping walk scores, we collected them for zip codes rather than for individual property listings. Finally, we included bank branch location data sourced from the Federal Deposit Insurance Corporation (FDIC).

Using longitude and latitude data, we computed the haversine distance to the nearest school, bank, and grocery store for each property listing. Walk score data and crime rates corresponded to zip codes and so we simply merged the data on the zip code column. Altogether, supplemental data added 20 new features to our dataset, 16 of which were detailed crime rates at the zip code level.
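The distance features can be illustrated with a short sketch of the haversine formula, using only the Python standard library. The function names and the sample coordinates (approximate city centers, standing in for a listing and for points of interest) are assumptions for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (latitude, longitude) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_distance_km(listing, places):
    """Distance from one listing to the closest of several places."""
    lat, lon = listing
    return min(haversine_km(lat, lon, p_lat, p_lon) for p_lat, p_lon in places)


# Hypothetical inputs: a listing near central Atlanta and two points of interest.
listing = (33.749, -84.388)
places = [(33.7756, -84.3963), (32.0809, -81.0912)]
print(round(nearest_distance_km(listing, places), 2), "km to the nearest place")
```

For thousands of listings and points of interest, a vectorized version or a spatial index would be faster, but the per-pair calculation is the same.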

Before moving on to the modeling, we made some basic visualizations to get a sense of the underlying correlations and do a quick sanity check. Plotting price against square footage, we saw a positive correlation, with most properties falling below 4,000 square feet and $1 million.

As we might expect, the data showed a negative correlation between property price and crime. Here we've plotted the violent crime rate (per 1,000 residents) against price, revealing a weak negative relationship. All of this makes sense and we can rest assured there are no major outliers to skew our modeling.

Dimensionality Reduction and Clustering

To reduce dimensionality prior to clustering, we explored a few different techniques. First, we used Principal Component Analysis (PCA) to reduce dimensionality among numerical features. The results showed that the first three components explained roughly 58 percent of the variance in the dataset (good, not great).
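As a rough illustration of this step, the sketch below standardizes a numeric feature matrix and reports the cumulative variance explained by the first three components, assuming scikit-learn. The data is synthetic, so the printed figure will not match the 58 percent reported above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric listing features (rows are listings).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 8))
# Make two columns strongly correlated, loosely mimicking related crime rates.
features[:, 1] = 0.9 * features[:, 0] + rng.normal(scale=0.3, size=200)

# PCA is scale-sensitive, so standardize each feature before fitting.
scaled = StandardScaler().fit_transform(features)

pca = PCA(n_components=3)
components = pca.fit_transform(scaled)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print(f"Variance explained by 3 components: {cumulative[-1]:.1%}")

# Rows of pca.components_ hold the loadings, which show which original
# features drive each principal component.
print(pca.components_[0].round(2))
```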

Next, we used Multiple Correspondence Analysis (MCA) to reduce dimensionality among categorical features. Because the categorical features included various geographic variables, such as county, city, and zip code, which each included hundreds of unique values, MCA did not handle the categorical data very well. Individual components returned by MCA explained less than 1 percent of the inertia (a measure comparable to explained variance with PCA). Including only a subset of non-geographic categorical features improved the MCA results a bit, so we attempted to include them. To do so we brought numerical and categorical features together with a generalized PCA algorithm, a technique called Factor Analysis of Mixed Data (FAMD). For details related to FAMD calculations, please refer to this article, which we consulted for our analysis. As it turned out, FAMD results still fell short of the original dimensionality reduction using PCA on numerical features. For this reason, we excluded categorical data from clustering and prediction models. 

The next step was to cluster the data. While we did experiment with a few different clustering algorithms, including DBSCAN (which is well suited to geographic data), in the end we obtained the best clustering results using k-means.  

The separation between the larger blue and green clusters is not very pronounced. But the purple cluster stands out quite a bit, particularly on the axis for principal component one. By looking back to the PCA results, we could see that the first component was highly correlated with crime data, a fact that became important when we began developing our supervised models. With more features, or perhaps a more robust clustering algorithm, we may have been able to develop better clusters. For this exercise, however, we wanted to see what we could accomplish with the features we already had. And it turned out that even these clusters improved prediction.
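The clustering step might look roughly like the following, assuming scikit-learn's `KMeans` applied to the principal components. The blob data is a synthetic stand-in, and the silhouette score is just one common way to gauge separation, not necessarily the measure used in this project.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the three principal components of each listing.
components, _ = make_blobs(
    n_samples=300, centers=3, n_features=3, cluster_std=1.5, random_state=0
)

# Three clusters, matching the number used in the article.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(components)

sizes = np.bincount(labels)                    # listings per cluster
score = silhouette_score(components, labels)   # closer to 1 means better separation
print("Cluster sizes:", sizes)
print(f"Silhouette score: {score:.3f}")
```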

The Impact of Clustering on Price Prediction

To test the performance of predictive models on clustered data, we first mapped our cluster labels back to our original data (keeping only the numeric features used in PCA and clustering). Then we split our full dataset into training and test sets (setting aside 30 percent of the data for testing our models) and further segmented the training and test data by cluster. In the end, each of our three clusters had its own training and test features and its own training and test prices (our target). Depending on the cluster, the training and test sets ranged from several hundred to several thousand property listings.
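A sketch of that split-then-segment step, assuming pandas and scikit-learn. The column names, the random cluster labels, and the `split_by_cluster` helper are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic listings with a cluster label already mapped back onto each row.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "sqft": rng.integers(800, 4000, size=300),
    "violent_crime_rate": rng.uniform(0, 10, size=300),
    "cluster": rng.integers(0, 3, size=300),
})
data["price"] = 100 * data["sqft"] + rng.normal(scale=20000, size=300)

# Split the full dataset first, holding out 30 percent for testing.
train, test = train_test_split(data, test_size=0.3, random_state=0)


def split_by_cluster(frame, feature_cols, target_col="price", label_col="cluster"):
    """Return {cluster label: (features, target)} for one split."""
    return {
        label: (group[feature_cols], group[target_col])
        for label, group in frame.groupby(label_col)
    }


feature_cols = ["sqft", "violent_crime_rate"]
train_sets = split_by_cluster(train, feature_cols)
test_sets = split_by_cluster(test, feature_cols)
for label, (X_part, y_part) in train_sets.items():
    print(f"cluster {label}: {len(X_part)} training listings")
```

Splitting before segmenting keeps the same held-out listings for both the standard model and the cluster models, so their test scores are comparable.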

Given that crime rates proved important to principal component one, which played a significant role in separating our data into clusters, it was important to use crime data in our prediction models. However, among the 16 columns of crime rate data there were several that were highly correlated, raising concerns about multicollinearity. For this reason it made the most sense to not use a linear model. Since tree-based models do not assume linearly independent features, we decided to fit a Random Forest Regression model to predict prices. 

Because our objective was not necessarily to build the best predictive model, but rather to demonstrate the performance differential between clustered prediction and standard prediction on the full dataset, we did not spend much time tuning the model. All random forest models used 100 trees with a maximum depth of 4.
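The comparison can be sketched as follows, assuming scikit-learn's `RandomForestRegressor` with the settings stated above (100 trees, maximum depth 4). The data is synthetic, with a price relationship that deliberately differs by cluster, so the printed scores are illustrative only and say nothing about the project's results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: the price-per-square-foot slope depends on the cluster.
rng = np.random.default_rng(42)
n = 600
cluster = rng.integers(0, 3, size=n)
sqft = rng.uniform(800, 4000, size=n)
crime = rng.uniform(0, 10, size=n)
slope = np.array([120.0, 160.0, 260.0])[cluster]
price = slope * sqft - 5000 * crime + rng.normal(scale=30000, size=n)
X = np.column_stack([sqft, crime])

X_tr, X_te, y_tr, y_te, c_tr, c_te = train_test_split(
    X, price, cluster, test_size=0.3, random_state=0
)


def fit_and_score(X_train, y_train, X_test, y_test):
    """Fit a lightly tuned random forest and return R^2 on the test data."""
    model = RandomForestRegressor(n_estimators=100, max_depth=4, random_state=0)
    model.fit(X_train, y_train)
    return r2_score(y_test, model.predict(X_test))


# Standard model: one forest fit on all training listings.
standard_r2 = fit_and_score(X_tr, y_tr, X_te, y_te)

# Clustered models: one forest per cluster, scored on that cluster's test listings.
cluster_r2 = {
    int(k): fit_and_score(X_tr[c_tr == k], y_tr[c_tr == k], X_te[c_te == k], y_te[c_te == k])
    for k in np.unique(c_tr)
}
mean_cluster_r2 = float(np.mean(list(cluster_r2.values())))

print(f"Standard model R^2: {standard_r2:.3f}")
print(f"Mean cluster model R^2: {mean_cluster_r2:.3f}")
```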

The results: Among our cluster models, coefficients of determination ranged from 0.517 to 0.637 on test data, with the best performance on the smallest cluster (which had the best cluster separation). The average test score on cluster models was 0.569, compared to 0.539 for the standard model. The difference is not large, but is reasonable considering the relatively limited data and modest results from PCA and clustering. With more data and better clustering, we could expect even greater improvement in prediction when leveraging clustering techniques.

Understanding the Clusters

Because k-means clustering relies on Euclidean distance, which is not appropriate for map coordinates, we left out longitude and latitude data. Even so, we did include three distance measures: distances to the nearest bank, grocer, and school for each property listing. These features, along with others representing property and neighborhood characteristics, yielded clusters that follow a fairly intuitive geographic distribution.
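One way to see why Euclidean distance on raw coordinates is problematic: a one-degree step covers a different ground distance depending on its direction. The sketch below uses the standard haversine formula and an assumed reference point at roughly Atlanta's latitude.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    a = (
        sin((phi2 - phi1) / 2) ** 2
        + cos(phi1) * cos(phi2) * sin(radians(lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Assumed reference point near Atlanta's latitude.
lat, lon = 33.75, -84.0

north_south = haversine_km(lat, lon, lat + 1, lon)  # one degree of latitude
east_west = haversine_km(lat, lon, lat, lon + 1)    # one degree of longitude

# Euclidean distance on raw degrees would treat both moves as exactly 1.0,
# yet the ground distances differ noticeably.
print(f"1 degree north-south: {north_south:.1f} km")
print(f"1 degree east-west:   {east_west:.1f} km")
```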

The large blue cluster, labeled the outer ring cluster, forms a large ring around the Atlanta metropolitan area, extending deep into the suburbs across the state. By contrast, the small purple cluster, labeled the downtown cluster, represents the urban center of the city of Atlanta (with a few scattered properties in Savannah and elsewhere). The green cluster, labeled the inner ring cluster, has perhaps the worst separation of the three, with diffuse representation across the map, often overlapping with the blue cluster. It would appear that the green cluster captures inner ring suburbs with moderate urban density, as well as properties scattered across the metropolitan area and rural areas.

The outer ring cluster is consistent with a desirable suburb, having the highest prices and lowest crime rates. This contrasts with the downtown cluster, which has relatively lower property prices and much higher crime rates. The downtown area also comes with an expected premium on space, having the highest price per square foot of the three clusters. 

Unsurprisingly, downtown properties benefit from close proximity to amenities and thus have the highest average walk score. The distance to a high quality grocery store is also a fraction of what it is outside downtown. Though not displayed here, distances to the nearest school and bank are also lowest in the downtown area.
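Cluster profiles like these can be produced with a simple group-by, sketched below with pandas. The numbers are made up purely to show the mechanics and are not the project's data. The column names are also hypothetical.

```python
import pandas as pd

# Made-up listings with cluster labels already attached.
listings = pd.DataFrame({
    "cluster": ["outer_ring", "outer_ring", "inner_ring", "inner_ring", "downtown", "downtown"],
    "price": [400000, 500000, 300000, 340000, 280000, 320000],
    "sqft": [2500, 3000, 2000, 2200, 1000, 1200],
    "walk_score": [20, 30, 40, 50, 80, 90],
})
listings["price_per_sqft"] = listings["price"] / listings["sqft"]

# Average each characteristic within a cluster to compare the groups side by side.
profile = listings.groupby("cluster")[["price", "price_per_sqft", "walk_score"]].mean()
print(profile.round(1))
```

Reading across rows of `profile` makes tradeoffs visible at a glance, for example a cluster with a lower average price but a higher price per square foot.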

Summing it Up

Unsupervised learning methods such as clustering algorithms can often feel a bit enigmatic: there is really no one right answer. But they can also unearth invaluable insights and optimize performance when paired with more familiar supervised learning models. The goal of this exercise was to demonstrate how clustering can help us identify and understand natural groupings among real estate listings, and then leverage our clusters to improve price prediction. Despite having relied heavily on alternative data (as opposed to traditional financial metrics), we still produced reasonably insightful clusters and found modest improvement in prediction with random forests. This approach, when refined and implemented at scale, could be transformative for real estate investors and professionals in the PropTech space.

Limitations and Next Steps

Given more time, our team would undertake several next steps. First, the modeling would benefit from more data. Adding more points of interest and more granular data (address-level versus zip code level) could improve clustering. The clustered AVM technique may also benefit from more robust clustering algorithms and perhaps clustering of smaller geographic units (e.g., counties). While we did cluster with DBSCAN, it is possible that other approaches like agglomerative clustering or Gaussian Mixture Models would yield better results. Lastly, we may be able to achieve even better prediction results with more finely tuned and more sophisticated supervised learning models.
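As a starting point for that comparison, the sketch below tries agglomerative clustering and a Gaussian Mixture Model on the same synthetic data and scores each with the silhouette coefficient, assuming scikit-learn. Whether either would beat k-means on the real listings is an open question that this toy example cannot answer.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the reduced feature space.
X, _ = make_blobs(n_samples=300, centers=3, n_features=3, cluster_std=1.5, random_state=1)

candidates = {
    "agglomerative": AgglomerativeClustering(n_clusters=3, linkage="ward"),
    "gaussian_mixture": GaussianMixture(n_components=3, random_state=0),
}

# Both estimators expose fit_predict, so they can be scored the same way.
scores = {
    name: silhouette_score(X, model.fit_predict(X))
    for name, model in candidates.items()
}
for name, score in scores.items():
    print(f"{name}: silhouette = {score:.3f}")
```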

Acknowledgements

I am grateful for the support and collaboration of the following people: Lauren Tomlinson, Data Science Fellow, New York City Data Science Academy; Joe Lee, Chief Data Scientist and Co-Founder, Haystacks.AI; Vivian Zhang, Chief Technology Officer, New York City Data Science Academy.

Helpful resources we consulted:

Chauhan, N.S. DBSCAN Clustering Algorithm in Machine Learning. April, 2022. https://www.kdnuggets.com/2020/04/dbscan-clustering-algorithm-machine-learning.html 

Data4Help. Clustering Real Estate Data. May, 2020. https://becominghuman.ai/clustering-real-estate-data-594894e24484 

Jaadi, Z. A Step-by-Step Explanation of Principal Component Analysis (PCA): Learn how to use PCA when working with large data sets. September, 2022. https://builtin.com/data-science/step-step-explanation-principal-component-analysis 

Keany, E. The Ultimate Guide for Clustering Mixed Data. November, 2021. https://medium.com/analytics-vidhya/the-ultimate-guide-for-clustering-mixed-data-1eefa0b4743b

Li, S. Sharpen Your Machine Learning Skills with This Real-World Housing Market Cluster Analysis. November, 2021. https://towardsdatascience.com/sharpen-your-machine-learning-skills-with-this-real-world-housing-market-cluster-analysis-f0e6b06f6ba0

 

 

About Author

Trevor Mattos

Data scientist with an MA in Applied Economics, interested in deriving novel insights with data.