NYC Data Science Academy| Blog

End-to-End Machine Learning Pipeline for Real Estate Valuation & Recommendation Engine

Nawaraj Paudel, PhD
Posted on Nov 1, 2024

Overview

Real estate is one of the largest markets in the United States. The residential market alone comprises 146 million units valued at 43 trillion USD, and commercial real estate adds another 21 trillion USD. For perspective, as of November 2024, the S&P 500 has a total market cap of 45 trillion USD. Stocks, however, trade far more frequently than real estate: only 2 to 8% of properties are sold annually.

In a tight housing market with low inventory, a prop-tech intelligence system that estimates property value based on desired features provides a competitive edge for both investors and buyers. It helps determine if constructing new properties amid high interest rates and labor costs will yield desirable returns.

To capture this market potential, we built a property recommendation engine and an automated machine learning pipeline that trains four models (CatBoost, LightGBM, AdaBoost, and RandomForest) along with a blend of the four. CatBoost performs best, with a test R² of about 0.94 and a test MAPE of about 8% (Table 2).

The goal of this project was to build a scalable automated machine learning pipeline. We discuss our real estate and property valuation findings based on these models in detail below.

Data Ingestion, Exploration, & Understanding

The data set, compiled by De Cock (2011), contains 80 features and 2,930 observations: 37 numerical features and 43 categorical features, plus the target variable, sale price. Many features contain missing values, and for some of them more than 95% of the values are missing.


Figure 1: Top 10 features with the highest percentage of missing values

Figure 1 highlights the top 10 features with the highest percentage of missing values. Understanding the reasons behind these missing values requires contextual analysis. For instance, the missing values for 'PoolQC' could be attributed to properties lacking a pool, thus rendering the pool quality feature irrelevant.

All numerical features with missing values were imputed using the median, as they were missing completely at random (MCAR). For categorical features, those with less than 10% missing values were filled with the most frequent category. If more than 10% of the values were missing, a new category called 'Unknown' was created. After handling missing values, infrequent categories (those accounting for less than 12%) were merged into a new category called 'Others'.

Custom handling of these categorical features was necessary due to the extensive dataset that encompasses 80 features, including 43 categorical features with numerous subcategories. This approach ensures that, after one-hot encoding, dimensionality remains manageable. Also, each subcategory has sufficient data for the model to learn from, helping it generalize well to unseen data.
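The rules above can be sketched in a few lines of pandas. This is a minimal illustration, not the project's actual code: the helper names `clean_numeric` and `clean_categorical` are hypothetical, the toy values are made up, and the 12% cutoff is assumed to be a share of rows measured after missing values are filled.

```python
import numpy as np
import pandas as pd


def clean_numeric(col: pd.Series) -> pd.Series:
    """Median imputation for a numerical feature."""
    return col.fillna(col.median())


def clean_categorical(col: pd.Series, missing_cutoff: float = 0.10,
                      rare_cutoff: float = 0.12) -> pd.Series:
    """Fill missing values, then merge infrequent categories into 'Others'."""
    col = col.astype("object")
    if col.isna().mean() < missing_cutoff:
        # Few gaps: fill with the most frequent category.
        col = col.fillna(col.mode().iloc[0])
    else:
        # Many gaps: treat missingness as its own category.
        col = col.fillna("Unknown")
    # Share of rows per category, computed after the fill (an assumption).
    share = col.value_counts(normalize=True)
    rare = share[share < rare_cutoff].index
    return col.where(~col.isin(rare), "Others")


# Toy frame: column names follow the data set, values are invented.
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 70.0, 90.0, np.nan, 65.0, 75.0, 85.0, 70.0],
    "PoolQC": [np.nan] * 9 + ["Gd"],
    "KitchenQual": ["TA", "TA", "Gd", "Gd", "TA", "Gd", "TA", "Gd", "TA", "Ex"],
})
toy["LotFrontage"] = clean_numeric(toy["LotFrontage"])
toy["PoolQC"] = clean_categorical(toy["PoolQC"])
toy["KitchenQual"] = clean_categorical(toy["KitchenQual"])
print(toy)
```

In the toy frame, 'PoolQC' is 90% missing, so its gaps become 'Unknown', and the single 'Ex' kitchen (10% of rows) is folded into 'Others'. In a real pipeline these statistics would be learned on the training split only.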

To understand the seasonality of real estate transactions, we analyzed the houses sold by month for the time range of 2006 to 2010.


Figure 2: The number of houses sold by month from 2006 to 2010

Figure 2 illustrates that the majority of transactions occur during the summer months each year. There is a noticeable decline in sales at both the beginning and end of the year.

To assess how house prices vary with specific categorical features, such as 'Neighborhood', and to determine their significance in our modeling, we analyzed the sale prices of houses across various neighborhoods.


Figure 3: Top 10 neighborhoods with the highest median prices

Figure 3 illustrates that house prices vary significantly by neighborhood. 'StoneBr' and 'NridgHt' have the highest median prices, whereas 'Greens' and 'CollegeCr' have the lowest among the ten neighborhoods shown.

Feature Engineering & Selection

From the existing dataset, we derived four new features: house age, total square footage, number of bathrooms, and years since the last remodel. For numerical features, selection was performed using `f_regression` from the `sklearn` library, along with correlation analysis to avoid multicollinearity, keeping the variance inflation factor (VIF) within a tolerable range.
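For illustration, the four engineered features might be derived roughly as follows. The exact formulas are not given in this post, so everything here is an assumption: the raw column names ('YrSold', 'YearBuilt', 'YearRemodAdd', and so on) follow the usual naming of this data set, total square footage is taken as basement plus both floors, and a half bath is counted as 0.5.

```python
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with four derived columns (assumed definitions)."""
    out = df.copy()
    # Age of the house at sale, and years since the last remodel.
    out["HouseAge"] = out["YrSold"] - out["YearBuilt"]
    out["YrRemodAge"] = out["YrSold"] - out["YearRemodAdd"]
    # Assumed: basement area plus first and second floor areas.
    out["TotalSqFt"] = out["TotalBsmtSF"] + out["1stFlrSF"] + out["2ndFlrSF"]
    # Assumed: a half bath counts as 0.5 of a full bath.
    out["TotalBaths"] = (out["FullBath"] + out["BsmtFullBath"]
                         + 0.5 * (out["HalfBath"] + out["BsmtHalfBath"]))
    return out


# Two made-up houses.
houses = pd.DataFrame({
    "YrSold": [2008, 2010], "YearBuilt": [1998, 1960], "YearRemodAdd": [2003, 1995],
    "TotalBsmtSF": [850, 600], "1stFlrSF": [900, 1000], "2ndFlrSF": [700, 0],
    "FullBath": [2, 1], "HalfBath": [1, 0], "BsmtFullBath": [1, 0], "BsmtHalfBath": [0, 1],
})
features = add_engineered_features(houses)
print(features[["HouseAge", "YrRemodAge", "TotalSqFt", "TotalBaths"]])
```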


Figure 4: Correlation heatmap of original and engineered features

Figure 4 shows that engineered features such as total baths are highly correlated with the features from which they were derived, such as full bath and half bath. Using `f_regression`, the top 10 important features were selected. These had VIF scores below 5, indicating no serious multicollinearity, as shown in Table 1.

Table 1: Variance Inflation Factor (VIF) Analysis for Top 10 Most Important Numerical Features in House Price Prediction

| Feature | VIF Score |
|---|---|
| TotalSqFt | 3.536970 |
| HouseAge | 3.389633 |
| GarageAge | 2.955172 |
| OverallQual | 2.711003 |
| TotalBaths | 2.097282 |
| TotRmsAbvGrd | 1.958635 |
| YrRemodAge | 1.909756 |
| GarageCars | 1.893344 |
| Fireplaces | 1.401744 |
| MasVnrArea | 1.357075 |
| LotFrontage | 1.356635 |
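To make the selection step concrete, here is a small sketch on synthetic data (not the project's data or code). It ranks features with `f_regression` and computes VIF from the inverse of the correlation matrix, using the identity that VIF for feature j equals 1 / (1 - R_j²), which is the j-th diagonal entry of that inverse. The 'SqFtCopy' and 'Noise' columns are invented purely to show what a high VIF and a weak feature look like.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "TotalSqFt": rng.normal(2500, 600, n),
    "OverallQual": rng.integers(1, 11, n).astype(float),
    "Noise": rng.normal(0, 1, n),
})
# Near-duplicate of TotalSqFt, added to demonstrate a high VIF.
X["SqFtCopy"] = X["TotalSqFt"] + rng.normal(0, 50, n)
y = 60 * X["TotalSqFt"] + 15000 * X["OverallQual"] + rng.normal(0, 20000, n)

# Univariate F-test of each feature against the target.
f_scores, p_values = f_regression(X, y)
ranking = pd.Series(f_scores, index=X.columns).sort_values(ascending=False)


def vif(frame: pd.DataFrame) -> pd.Series:
    """VIF per column, read off the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(frame.to_numpy(), rowvar=False)
    return pd.Series(np.diag(np.linalg.inv(corr)), index=frame.columns)


vif_all = vif(X)                              # SqFtCopy and TotalSqFt are inflated
vif_kept = vif(X.drop(columns="SqFtCopy"))    # all close to 1 once the copy is dropped
print(ranking, vif_all, vif_kept, sep="\n\n")
```

Dropping one column of a highly correlated pair is only one way to bring VIF under the threshold of 5 used above; which column to keep is a judgment call.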

For categorical features, the association was examined using the Chi-square test and Cramer's V.


Figure 5: Cramer's V association values for features with a threshold of 0.45

Figure 5 illustrates that certain features, like 'Exterior1st', are strongly associated with 'Exterior2nd'. Other features, such as 'Neighborhood', exhibit moderate to strong associations with multiple other features. For our modeling, we selected the top four most significant categorical features: 'Neighborhood', 'FireplaceQu', 'KitchenQual', and 'BsmtExposure'.
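As a reference for how such association values can be computed, the sketch below implements the plain (uncorrected) Cramér's V from a contingency table: the chi-square statistic divided by n times (min(rows, columns) - 1), then square-rooted. The function name and toy series are illustrative, and whether the project applied a bias correction is not stated here.

```python
import numpy as np
import pandas as pd


def cramers_v(a: pd.Series, b: pd.Series) -> float:
    """Uncorrected Cramér's V between two categorical series."""
    table = pd.crosstab(a, b).to_numpy(dtype=float)
    n = table.sum()
    # Expected counts under independence.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    # Assumes both series have at least two categories, so k > 0.
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))


# Perfectly associated pair (identical labels), made-up values.
ext1 = pd.Series(["VinylSd", "MetalSd", "Wd Sdng"] * 20)
ext2 = ext1.copy()
# Independent pair: every combination occurs equally often.
left = pd.Series(["A", "A", "B", "B"] * 10)
right = pd.Series(["X", "Y", "X", "Y"] * 10)

v_same = cramers_v(ext1, ext2)
v_indep = cramers_v(left, right)
print(v_same, v_indep)
```

Values range from 0 (no association) to 1 (perfect association), so a threshold such as the 0.45 used in Figure 5 flags pairs that carry largely overlapping information.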

Streamlined Pipeline: From Data Loading to Model Hyper-tuning

We implemented a robust machine learning pipeline following industry best practices for real estate price prediction. We start with data preprocessing where we handle missing values, normalize categorical features, and engineer domain-specific features like total square footage, house age, total baths, and years since the house was remodeled. All preprocessing steps (imputation, standardization, one-hot encoding) are carefully sequenced to prevent data leakage, with parameters learned only from training data and stored in a preprocessing pipeline for future use.


Figure 6: Preprocessing pipeline for data ingestion, validation, feature engineering, encoding, and standardization
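A compact way to get the leakage-safe sequencing described above is scikit-learn's `Pipeline` and `ColumnTransformer`. The sketch below is an assumption about structure, not the project's custom transformer: it uses three illustrative columns and synthetic data, fits every statistic (medians, modes, scaling parameters, one-hot categories) on the training split, and only applies them to the test split.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["TotalSqFt", "HouseAge"]
categorical_cols = ["Neighborhood"]

preprocessor = ColumnTransformer(
    [
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), numeric_cols),
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            # Categories unseen during fit are encoded as all zeros.
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

# Synthetic stand-in for the housing data.
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "TotalSqFt": rng.normal(2500, 600, 200),
    "HouseAge": rng.integers(0, 100, 200).astype(float),
    "Neighborhood": rng.choice(["StoneBr", "NridgHt", "Greens"], 200),
})
data.loc[::25, "TotalSqFt"] = np.nan  # inject some missing values

train, test = train_test_split(data, test_size=0.25, random_state=0)
X_train = preprocessor.fit_transform(train)  # parameters learned here only
X_test = preprocessor.transform(test)        # reused, never refit
print(X_train.shape, X_test.shape)
```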

The core modeling phase leverages an ensemble of advanced algorithms (CatBoost, LightGBM, Random Forest, and AdaBoost) with cross-validation and `GridSearchCV` for hyperparameter optimization. Each model is evaluated using multiple metrics (R², Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE)) to ensure robust performance, with early stopping mechanisms preventing overfitting.
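The tuning loop can be sketched with `GridSearchCV` and several scorers at once. To keep the example dependency-free, scikit-learn's `GradientBoostingRegressor` stands in for CatBoost and LightGBM; the data is synthetic, and the grid and the seven folds are illustrative choices rather than the project's settings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic "prices": two informative features plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 180_000 + 40_000 * X[:, 0] + 20_000 * X[:, 1] + rng.normal(0, 10_000, 300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),  # stand-in model
    param_grid={"n_estimators": [50, 150], "max_depth": [2, 3]},
    scoring={
        "r2": "r2",
        "mae": "neg_mean_absolute_error",
        "mape": "neg_mean_absolute_percentage_error",
    },
    refit="r2",  # the best-R2 configuration is refit on all training data
    cv=7,
)
search.fit(X_train, y_train)

# Error scorers are negated by scikit-learn, so flip the sign back.
cv_mae = -search.cv_results_["mean_test_mae"][search.best_index_]
test_r2 = r2_score(y_test, search.predict(X_test))
print(search.best_params_, round(cv_mae), round(test_r2, 3))
```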


Figure 7: Top 20 most important features identified by hyper-tuned CatBoost regressor

The feature importance analysis in Figure 7 highlights the key price drivers: 'TotalSqFt' and 'OverallQual' together account for nearly 50% of the total importance. Persisting (pickling) both the preprocessing pipeline and the trained models enables seamless deployment, so new data can be transformed and scored in production environments. The final ensemble model combines individual model strengths to deliver accurate and reliable price predictions, achieving R² scores above 0.90 on validation data.
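Persisting a fitted preprocessing-plus-model pipeline might look like the sketch below. It round-trips through bytes with `pickle.dumps` and `pickle.loads` so the example is self-contained; in practice the bytes would be written to a file. The model and data are synthetic stand-ins.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = 200_000 + 30_000 * X[:, 0] + rng.normal(0, 5_000, 100)

# Preprocessing and model travel together as one object.
model = Pipeline([
    ("scale", StandardScaler()),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
])
model.fit(X, y)

blob = pickle.dumps(model)      # serialize the fitted pipeline
restored = pickle.loads(blob)   # later, e.g. inside a prediction service

new_rows = rng.normal(size=(5, 3))
original_pred = model.predict(new_rows)
restored_pred = restored.predict(new_rows)
print(np.allclose(original_pred, restored_pred))
```

Pickles should only be loaded from trusted sources, and ideally with the same library versions that were used to create them.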

Recommendation Engine

Our recommendation engine implements a nearest-neighbor approach to match properties based on user preferences and property characteristics. The system leverages our robust data transformation pipeline to create a rich feature space for property matching. At its core, the recommendation system uses scikit-learn's `NearestNeighbors` algorithm, which operates on transformed and normalized property features, including our engineered metrics (TotalSqFt, HouseAge, TotalBaths, YrRemodAge) and processed categorical variables.

The recommendation process begins by transforming raw property data through our custom `DataTransformer`, which handles both numerical and categorical features with careful preprocessing thresholds. When a user inputs specific filters (such as price range or neighborhood preferences), the system identifies matching properties and uses the `NearestNeighbors` algorithm to find the most similar properties based on multidimensional feature similarity. This similarity computation takes into account all transformed features, weighted appropriately through our preprocessing pipeline.

The system returns a customizable number of similar properties that are ranked by similarity score. This makes it easy for users to explore alternatives that closely match their preferences and also brings comparable properties they might have overlooked to their attention. This implementation provides a balance between accuracy and computational efficiency, enabling real-time property recommendations in a production environment.
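The filter-then-search flow described in this section can be sketched as follows. This is a simplified illustration with made-up listings: `StandardScaler` stands in for the custom `DataTransformer`, the `recommend` function and its arguments are hypothetical, and the query listing is assumed to pass the price filter.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Six invented listings, indexed by an ID.
listings = pd.DataFrame({
    "TotalSqFt": [1500, 1600, 3200, 3100, 1550, 2400],
    "HouseAge": [40, 35, 5, 8, 38, 20],
    "TotalBaths": [1.5, 2.0, 3.5, 3.0, 1.5, 2.5],
    "SalePrice": [140000, 150000, 390000, 370000, 145000, 250000],
}, index=["A", "B", "C", "D", "E", "F"])
feature_cols = ["TotalSqFt", "HouseAge", "TotalBaths"]


def recommend(query_id: str, max_price: float, k: int = 2) -> pd.DataFrame:
    """Return the k listings most similar to query_id within the price filter."""
    # Apply the user's filter first, then search the remaining pool.
    pool = listings[listings["SalePrice"] <= max_price]
    scaled = StandardScaler().fit_transform(pool[feature_cols])
    nn = NearestNeighbors(n_neighbors=min(k + 1, len(pool))).fit(scaled)
    query = scaled[[pool.index.get_loc(query_id)]]
    distances, positions = nn.kneighbors(query)
    result = pd.DataFrame({"distance": distances[0]},
                          index=pool.index[positions[0]])
    # The query is its own nearest neighbor, so drop it from the results.
    return result.drop(index=query_id, errors="ignore").head(k)


similar = recommend("A", max_price=300000)
print(similar)
```

Results come back sorted by Euclidean distance in the scaled feature space, so the first row is the closest match.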

Conclusion

Our ML pipeline demonstrates strong predictive capability in real estate price estimation, with the ensemble approach consistently achieving R² above 0.90 across different market segments. With seven-fold cross-validation, our models consistently produced strong performance metrics, with R² scores as high as 0.94, indicating good predictive ability and generalization capacity.

The practical implications of these results are significant: our models maintain a Mean Absolute Error (MAE) ranging from $12,300 to $17,900. In relative terms, the three strongest models in Table 2 have a test MAPE of roughly 8 to 9%. This level of accuracy is notable given the small data set, which is sparse for some combinations of categorical features. The final ensemble model, which assigns equal weights to all four tuned models, offers a robust and reliable solution for predicting real estate prices across various property types and market conditions. This performance, coupled with our automated pipeline's ability to handle new data, makes it a valuable tool for real estate professionals and investors.

Table 2: Hyperparameter Tuned Model Performance

| Model | Train R² | Test R² | Train MAE (USD) | Test MAE (USD) | Train MAPE | Test MAPE | Training Time |
|---|---|---|---|---|---|---|---|
| CatBoost | 0.9362 | 0.9386 | 11091.4807 | 12893.8678 | 0.075036 | 0.080996 | 2.41s |
| LightGBM | 0.9371 | 0.9280 | 9783.1770 | 13702.3802 | 0.067831 | 0.087694 | 0.12s |
| RandomForest | 0.9540 | 0.9242 | 10225.5118 | 14146.3613 | 0.067312 | 0.089740 | 1.70s |
| AdaBoost | 0.8378 | 0.8526 | 20584.1489 | 20969.9087 | 0.133344 | 0.125758 | 0.21s |
| Ensemble | 0.9364 | 0.9295 | 12071.4640 | 14215.5670 | NaN | NaN | N/A |

CatBoost was the best of our four models, leveraging its gradient boosting architecture and native handling of categorical features, but it also had the longest training time (2.41 s). It achieved the highest test R² and showed remarkable stability across validation folds. LightGBM performed slightly worse but trained much faster (0.12 s).


Figure 8: The SHAP (SHapley Additive exPlanations) summary plot for CatBoost regressor

The SHAP (SHapley Additive exPlanations) summary plot in Figure 8 shows how different features impact house price predictions from the CatBoost model. Key influential features include TotalSqFt, which has the strongest impact and can raise the predicted price by up to $60,000, and OverallQual, where higher quality significantly boosts prices. TotalBaths, along with YrRemodAge and HouseAge, also plays a crucial role, with newer or recently remodeled homes generally commanding higher prices. Moderate-impact features such as GarageCars, Fireplaces, and LotFrontage are positively associated with price, although their effects are smaller.

Additionally, categorical features like Neighborhood, BsmtExposure, FireplaceQu, and KitchenQual exhibit smaller individual impacts. Red color in the plot indicates higher feature values, while blue represents lower values, helping visualize the range of impacts on price predictions. This analysis aids in understanding the relative importance and directional impact of features on house price predictions, supporting more informed real estate decisions.

Potential Directions

The dynamics of several industries, including real estate, are changing as a result of the development of Large Language Models (LLMs). In prop-tech, smart real estate (intelligent buildings and cities), con-tech (construction startups), real estate fintech, and the collaborative economy, the emphasis is now on utilizing these new capabilities.

For these markets, offering AI/ML as Software as a Service (SaaS) can have a big impact. LLMs, for example, can be used to scan legal and property documents and extract useful information. These algorithms can forecast the risk of foreclosure and buyer preparedness after a property is listed for sale. We can evaluate property photos for damage using image analysis.

In collaboration with banks and mortgage lenders, we might offer proactive real estate services by building a database to track mortgage defaults and other relevant data. With previously unheard-of insights and efficiency, LLMs and state-of-the-art AI and ML technology will completely transform the real estate sector.

If you enjoyed reading my blog post, please follow and connect with me on LinkedIn for collaboration, networking, and more insightful content.


About Author

Nawaraj Paudel, PhD

Data Science leader with a PhD in Quantitative Modeling and close to a decade of experience driving high-impact analytics initiatives. Proven track record of leveraging machine learning, deep learning, NLP, and data engineering to optimize business performance, improve...