End-to-End Machine Learning Pipeline for Real Estate Valuation & Recommendation Engine

Nawaraj Paudel, PhD

Posted on Nov 1, 2024

Overview

Real estate is one of the largest markets in the United States. The residential market alone comprises 146 million units valued at 43 trillion USD, and commercial real estate adds another 21 trillion USD. For perspective, as of November 2024, the S&P 500 has a total market cap of 45 trillion USD. Stocks, however, trade far more frequently than real estate: only 2–8% of properties are sold each year.

In a tight housing market with low inventory, a prop-tech intelligence system that estimates property value from desired features gives both investors and buyers a competitive edge. It also helps determine whether constructing new properties amid high interest rates and labor costs will yield desirable returns.

To capture this market potential, we built a property recommendation engine and an automated machine learning pipeline that trains four models (CatBoost, LightGBM, AdaBoost, and RandomForest) along with a blend of the four. CatBoost performs best, with a test R² of about 0.94 and a test MAPE of about 8% (Table 2).

The goal of this project was to build a scalable automated machine learning pipeline. We discuss our real estate and property valuation findings based on these models in detail below.

Data Ingestion, Exploration, & Understanding

The data, collected by De Cock (2011), contains 80 features and 2,930 observations: 37 numerical features and 43 categorical features, plus the target variable, sale price. Many features contain missing values, and for some features more than 95% of the values are missing.

Figure 1 highlights the top 10 features with the highest percentage of missing values. Understanding the reasons behind these missing values requires contextual analysis. For instance, the missing values for 'PoolQC' could be attributed to properties lacking a pool, thus rendering the pool quality feature irrelevant.
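A ranking like this can be computed in a few lines of pandas. The sketch below is illustrative only: the helper name `missing_share` and the toy values are assumptions, not the code behind Figure 1.

```python
import numpy as np
import pandas as pd


def missing_share(df: pd.DataFrame, top_n: int = 10) -> pd.Series:
    """Return up to top_n columns ranked by percentage of missing values."""
    pct = df.isna().mean().mul(100)
    return pct[pct > 0].sort_values(ascending=False).head(top_n)


# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "GrLivArea": [1710, 1262, 1786, 1717],
})
top_missing = missing_share(toy)
print(top_missing)
```

Columns without any missing values are dropped from the ranking, so only features that actually need attention appear.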

All numerical features with missing values were imputed using the median, as they were treated as missing completely at random (MCAR). For categorical features, those with less than 10% missing values were filled with the most frequent category. If 10% or more of the values were missing, a new category called 'Unknown' was created. After handling missing values, infrequent categories (those accounting for less than 12% of observations) were merged into a new category called 'Others'.
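The rules above can be sketched roughly as follows. This is a minimal illustration, not the project's actual transformer: the function name `impute_and_group`, its parameter names, and the toy data are assumptions. In a real pipeline the medians, modes, and rare-category lists would be learned from the training split only.

```python
import numpy as np
import pandas as pd


def impute_and_group(df, missing_cut=0.10, rare_cut=0.12):
    """Apply the imputation and rare-category rules to a copy of df."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Numerical feature: fill gaps with the column median.
            out[col] = out[col].fillna(out[col].median())
            continue
        if out[col].isna().mean() < missing_cut:
            # Few gaps: fill with the most frequent category.
            out[col] = out[col].fillna(out[col].mode().iloc[0])
        else:
            # Many gaps: missingness becomes its own category.
            out[col] = out[col].fillna("Unknown")
        # Merge categories whose share falls below rare_cut into 'Others'.
        share = out[col].value_counts(normalize=True)
        rare = share[share < rare_cut].index
        out[col] = out[col].where(~out[col].isin(rare), "Others")
    return out


# Toy data (made up): 20 rows with different missingness patterns.
toy = pd.DataFrame({
    "LotFrontage": [float(v) for v in range(60, 79)] + [np.nan],
    "Electrical": ["SBrkr"] * 15 + ["FuseA"] * 4 + [None],
    "PoolQC": [None] * 17 + ["Gd", "Gd", "Ex"],
})
clean = impute_and_group(toy)
print(clean.tail(3))
```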

Custom handling of the categorical features was necessary because the 43 categorical features contain numerous subcategories. This approach keeps dimensionality manageable after one-hot encoding. It also ensures that each remaining subcategory has enough data for the model to learn from, which helps it generalize to unseen data.

To understand the seasonality of real estate transactions, we analyzed the houses sold by month for the time range of 2006 to 2010.

Figure 2: The number of houses sold by month from 2006 to 2010

Figure 2 illustrates that the majority of transactions occur during the summer months each year. There is a noticeable decline in sales at both the beginning and end of the year.
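A monthly count of this kind is a simple group-by. The sketch below assumes the sale year and month are stored in columns named `YrSold` and `MoSold`; both the column names and the toy rows are assumptions made for illustration.

```python
import pandas as pd

# Toy sale records (made up); real data would have one row per sold house.
sales = pd.DataFrame({
    "YrSold": [2006, 2006, 2006, 2007, 2007, 2007],
    "MoSold": [1, 6, 6, 6, 7, 12],
})

# Count transactions per calendar month, pooled over all years.
# reindex makes months with no sales show up as zero.
by_month = sales.groupby("MoSold").size().reindex(range(1, 13), fill_value=0)
peak_month = int(by_month.idxmax())
print(by_month)
```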

To assess how house prices vary with specific categorical features, such as 'Neighborhood', and to determine their significance in our modeling, we analyzed the sale prices of houses across neighborhoods.

Figure 3: Top 10 neighborhoods with the highest median prices

Figure 3 illustrates that house prices vary significantly by neighborhood. 'StoneBr' and 'NridgHt' have the highest median prices, whereas 'Greens' and 'CollegeCr' have the lowest among these ten.
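The neighborhood ranking can be reproduced with a grouped median. In this sketch the target column name `SalePrice` and all prices are assumptions chosen for illustration.

```python
import pandas as pd

# Toy listings (prices are made up).
homes = pd.DataFrame({
    "Neighborhood": ["StoneBr", "StoneBr", "NridgHt", "NridgHt", "Greens", "Greens"],
    "SalePrice": [300_000, 320_000, 290_000, 310_000, 190_000, 200_000],
})

# Median sale price per neighborhood, highest first, top 10 kept.
median_price = (
    homes.groupby("Neighborhood")["SalePrice"]
    .median()
    .sort_values(ascending=False)
    .head(10)
)
print(median_price)
```

The median is used rather than the mean so that a few very expensive homes do not dominate a neighborhood's ranking.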

Feature Engineering & Selection

From the existing dataset, we derived four new features: house age, total square footage, number of bathrooms, and years since the last remodel. For numerical features, selection was performed using `f_regression` from the `sklearn` library, along with correlation analysis to avoid multicollinearity, keeping the variance inflation factor (VIF) within a tolerable range.
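One plausible way to derive these four features is sketched below. The output names (`HouseAge`, `TotalSqFt`, `TotalBaths`, `YrRemodAge`) come from this post, but the input column names and the exact formulas (for example, counting a half bath as 0.5) are assumptions, not necessarily the definitions used in the project.

```python
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with four derived features (assumed formulas)."""
    out = df.copy()
    out["HouseAge"] = out["YrSold"] - out["YearBuilt"]
    out["YrRemodAge"] = out["YrSold"] - out["YearRemodAdd"]
    out["TotalSqFt"] = out["TotalBsmtSF"] + out["1stFlrSF"] + out["2ndFlrSF"]
    # Assumption: half baths count as 0.5 of a full bath.
    out["TotalBaths"] = (
        out["FullBath"] + out["BsmtFullBath"]
        + 0.5 * (out["HalfBath"] + out["BsmtHalfBath"])
    )
    return out


# One made-up house for illustration.
house = pd.DataFrame([{
    "YrSold": 2008, "YearBuilt": 2003, "YearRemodAdd": 2003,
    "TotalBsmtSF": 856, "1stFlrSF": 856, "2ndFlrSF": 854,
    "FullBath": 2, "HalfBath": 1, "BsmtFullBath": 1, "BsmtHalfBath": 0,
}])
features = add_engineered_features(house)
print(features[["HouseAge", "YrRemodAge", "TotalSqFt", "TotalBaths"]])
```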

Figure 4: Correlation heatmap of original and engineered features

Figure 4 shows that engineered features such as total baths are highly correlated with the features they were derived from, such as full bath and half bath. Using `f_regression`, the top 10 important features were selected. As shown in Table 1, these had VIF values lower than 5, indicating no multicollinearity issues.

Table 1: Variance Inflation Factor (VIF) Analysis for Top 10 Most Important Numerical Features in House Price Prediction

Feature	VIF Score
TotalSqFt	3.536970
HouseAge	3.389633
GarageAge	2.955172
OverallQual	2.711003
TotalBaths	2.097282
TotRmsAbvGrd	1.958635
YrRemodAge	1.909756
GarageCars	1.893344
Fireplaces	1.401744
MasVnrArea	1.357075
LotFrontage	1.356635
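The selection and VIF check can be sketched as below. `f_regression` is the scikit-learn function named above; the VIF here is computed directly from its definition, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing feature j on the remaining features. The helper `vif_table` and the synthetic data are assumptions for illustration and do not reproduce Table 1.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_regression
from sklearn.linear_model import LinearRegression


def vif_table(X: pd.DataFrame) -> pd.Series:
    """VIF per column: 1 / (1 - R^2) of that column regressed on the others."""
    scores = {}
    for col in X.columns:
        others = X.drop(columns=col)
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        scores[col] = 1.0 / (1.0 - r2)
    return pd.Series(scores).sort_values(ascending=False)


# Synthetic stand-in data: two informative features and one noise feature.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "TotalSqFt": rng.normal(2500, 600, n),
    "OverallQual": rng.integers(1, 11, n).astype(float),
    "Noise": rng.normal(0, 1, n),
})
y = 60 * X["TotalSqFt"] + 15_000 * X["OverallQual"] + rng.normal(0, 20_000, n)

# Univariate F-scores rank features by linear association with the target.
f_scores, p_values = f_regression(X, y)
ranking = pd.Series(f_scores, index=X.columns).sort_values(ascending=False)
vif = vif_table(X)
print(ranking)
print(vif)
```

Because the synthetic features are generated independently, their VIF values sit close to 1; real housing features will be higher.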

For categorical features, the association was examined using the Chi-square test and Cramer's V.

Figure 5: Cramer's V association values for features with a threshold of 0.45

Figure 5 illustrates that certain features, like 'Exterior1st', are strongly associated with 'Exterior2nd'. Other features, such as 'Neighborhood', exhibit moderate to strong associations with multiple other features. For our modeling, we selected the top four most significant categorical features: 'Neighborhood', 'FireplaceQu', 'KitchenQual', and 'BsmtExposure'.
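For reference, Cramer's V between two categorical columns can be computed from the Chi-square statistic as V = sqrt(chi2 / (n * (min(r, c) - 1))), where r and c are the numbers of categories. The sketch below uses `scipy.stats.chi2_contingency`; the helper name `cramers_v` and the toy columns are assumptions, and no bias correction is applied.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(a: pd.Series, b: pd.Series) -> float:
    """Cramer's V (uncorrected) between two categorical series."""
    table = pd.crosstab(a, b)
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k))) if k > 0 else 0.0


# Toy columns (made up): ext2 copies ext1, kitchen is independent of ext1.
ext1 = pd.Series(["VinylSd"] * 50 + ["MetalSd"] * 50)
ext2 = ext1.copy()
kitchen = pd.Series(["Gd", "TA"] * 50)

v_same = cramers_v(ext1, ext2)
v_indep = cramers_v(ext1, kitchen)
print(v_same, v_indep)
```

Values near 1 indicate that one feature almost determines the other, which is the situation described for 'Exterior1st' and 'Exterior2nd'.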

Streamlined Pipeline: From Data Loading to Model Hyper-tuning

We implemented a robust machine learning pipeline following industry best practices for real estate price prediction. We start with data preprocessing where we handle missing values, normalize categorical features, and engineer domain-specific features like total square footage, house age, total baths, and years since the house was remodeled. All preprocessing steps (imputation, standardization, one-hot encoding) are carefully sequenced to prevent data leakage, with parameters learned only from training data and stored in a preprocessing pipeline for future use.
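A leakage-safe sequencing of these steps can be sketched with scikit-learn's `Pipeline` and `ColumnTransformer`. This is an assumed arrangement for illustration, not the project's own preprocessing class; the column lists and toy rows are made up. The key point is that `fit_transform` is called on the training split only, and the test split is passed through `transform`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = ["TotalSqFt", "HouseAge"]
cat_cols = ["Neighborhood"]

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), num_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        # Categories unseen during fit are encoded as all zeros.
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), cat_cols),
])

# Toy data (made up).
toy = pd.DataFrame({
    "TotalSqFt": [1500, 2200, np.nan, 3100, 1800, 2600, 2000, 2900],
    "HouseAge": [10, 5, 30, np.nan, 45, 8, 20, 3],
    "Neighborhood": ["StoneBr", "NridgHt", "StoneBr", "Greens",
                     "NridgHt", "Greens", "StoneBr", "NridgHt"],
})
train, test = train_test_split(toy, test_size=0.25, random_state=0)

X_train = preprocess.fit_transform(train)  # parameters learned here only
X_test = preprocess.transform(test)        # reused, never refit, on new data
print(X_train.shape, X_test.shape)
```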

The core modeling phase leverages an ensemble of advanced algorithms (CatBoost, LightGBM, Random Forest, and AdaBoost) with cross-validation and `GridSearchCV` for hyperparameter optimization. Each model is evaluated using multiple metrics -- R², Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE) -- to ensure robust performance, with early stopping mechanisms preventing overfitting.
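The tuning and evaluation loop might look roughly like the sketch below. To keep it self-contained it tunes only a `RandomForestRegressor` on synthetic data; the parameter grid, the data, and the seven-fold setting are assumptions for illustration and are not the grids used for the models in this post.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    r2_score,
)
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for preprocessed features and sale prices.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(300, 3))
y = 100_000 + 150_000 * X[:, 0] + 50_000 * X[:, 1] + rng.normal(0, 5_000, 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Small illustrative grid; each candidate is scored by cross-validated R^2.
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [25, 50], "max_depth": [4, None]},
    cv=7,
    scoring="r2",
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out split.
pred = search.predict(X_test)
metrics = {
    "r2": r2_score(y_test, pred),
    "mae": mean_absolute_error(y_test, pred),
    "mape": mean_absolute_percentage_error(y_test, pred),
}
print(search.best_params_, metrics)
```

Note that scikit-learn reports MAPE as a fraction, so 0.08 corresponds to 8%.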

Figure 7: Top 20 most important features identified by hyper-tuned CatBoost regressor

The feature importance analysis in Figure 7 reveals the key price drivers: 'TotalSqFt' and 'OverallQual' together account for nearly 50% of the total importance. Persisting (pickling) both the preprocessing pipeline and the trained models enables seamless deployment, making it easy to transform new data and generate predictions in production environments. The final ensemble model combines the strengths of the individual models to deliver accurate price predictions, achieving R² scores above 0.90 on validation data.
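The persistence step can be illustrated with Python's standard `pickle` module. The sketch below round-trips a small fitted scikit-learn pipeline through bytes in memory; the model, data, and variable names are assumptions for illustration, and a real deployment would write the bytes to a file or artifact store instead.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data: one feature standing in for TotalSqFt.
rng = np.random.default_rng(0)
X = rng.uniform(500, 4000, size=(100, 1))
y = 50_000 + 80 * X[:, 0] + rng.normal(0, 5_000, 100)

# Preprocessing and model are saved together as one object.
model = Pipeline([
    ("scale", StandardScaler()),
    ("rf", RandomForestRegressor(n_estimators=20, random_state=0)),
]).fit(X, y)

blob = pickle.dumps(model)      # serialize the fitted pipeline
restored = pickle.loads(blob)   # later: load it in the serving process

new_homes = np.array([[1800.0], [3200.0]])
same = np.allclose(model.predict(new_homes), restored.predict(new_homes))
print(same)
```

Pickles should only be loaded from trusted sources, and the library versions used for loading should match those used for saving.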

Recommendation Engine

Our recommendation engine implements a nearest-neighbor approach to match properties based on user preferences and property characteristics. The system leverages our robust data transformation pipeline to create a rich feature space for property matching. At its core, the recommendation system utilizes scikit-learn's `NearestNeighbors` algorithm, which operates on transformed and normalized property features including our engineered metrics (TotalSqFt, HouseAge, TotalBaths, YrRemodAge) and processed categorical variables.

The recommendation process begins by transforming raw property data through our custom `DataTransformer`, which handles both numerical and categorical features with careful preprocessing thresholds. When a user inputs specific filters (such as price range or neighborhood preferences), the system identifies matching properties and uses the `NearestNeighbors` algorithm to find the most similar properties based on multidimensional feature similarity. This similarity computation takes into account all transformed features, weighted appropriately through our preprocessing pipeline.

The system returns a customizable number of similar properties that are ranked by similarity score. This makes it easy for users to explore alternatives that closely match their preferences and also brings comparable properties they might have overlooked to their attention. This implementation provides a balance between accuracy and computational efficiency, enabling real-time property recommendations in a production environment.
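The core of such a recommender can be sketched with scikit-learn's `NearestNeighbors`. The version below is a simplified illustration: it uses a plain `StandardScaler` instead of the custom `DataTransformer`, skips user filters, and the function name `recommend`, the listing IDs, and the feature values are all assumptions.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Toy listings indexed by a made-up ID.
listings = pd.DataFrame(
    {
        "TotalSqFt": [1500, 1550, 3200, 3100, 1480],
        "HouseAge": [10, 12, 3, 5, 40],
        "TotalBaths": [2.0, 2.0, 3.5, 3.5, 1.0],
    },
    index=["A", "B", "C", "D", "E"],
)

# Scale features so no single unit dominates the Euclidean distance.
scaler = StandardScaler().fit(listings)
nn = NearestNeighbors().fit(scaler.transform(listings))


def recommend(query_id: str, k: int = 2):
    """Return up to k (listing_id, distance) pairs most similar to query_id."""
    query = scaler.transform(listings.loc[[query_id]])
    # Ask for one extra neighbor because the query matches itself.
    dist, idx = nn.kneighbors(query, n_neighbors=k + 1)
    hits = [
        (listings.index[i], float(d))
        for d, i in zip(dist[0], idx[0])
        if listings.index[i] != query_id
    ]
    return hits[:k]


similar_to_a = recommend("A")
print(similar_to_a)
```

Results come back sorted by distance, so the first entry is the closest match; user filters such as a price range could be applied to `listings` before fitting.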

Conclusion

Our ML pipeline demonstrates strong predictive capability in real estate price estimation, with the ensemble approach consistently achieving R² above 0.90 across different market segments. Under seven-fold cross-validation, our models produced R² scores as high as 0.94, indicating good predictive ability and generalization capacity.

The practical implications of these results are significant: our models maintain a Mean Absolute Error (MAE) ranging from $12,300 to $17,900, representing approximately a 5% error margin on predictions. This level of accuracy is notable given the small dataset, which has sparse data for some combinations of categorical features. The final ensemble model, which assigns equal weights to all four tuned models, offers a robust and reliable solution for predicting real estate prices across various property types and market conditions. This performance, coupled with our automated pipeline's ability to handle new data, makes it a valuable tool for real estate professionals and investors.

Table 2: Hyperparameter Tuned Model Performance
Model	Train R2	Test R2	Train MAE	Test MAE	Train MAPE	Test MAPE	Training Time
CatBoost	0.9362	0.9386	11091.4807	12893.8678	0.075036	0.080996	2.41s
LightGBM	0.9371	0.9280	9783.1770	13702.3802	0.067831	0.087694	0.12s
RandomForest	0.9540	0.9242	10225.5118	14146.3613	0.067312	0.089740	1.70s
AdaBoost	0.8378	0.8526	20584.1489	20969.9087	0.133344	0.125758	0.21s
Ensemble	0.9364	0.9295	12071.4640	14215.5670	NaN	NaN	N/A
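An equal-weight ensemble like the one in the last row simply averages the four models' predictions. The sketch below shows that averaging together with MAE and MAPE computed from their definitions; the function names and every number in it are made up for illustration and do not come from Table 2.

```python
import numpy as np


def blend_equal(preds: dict) -> np.ndarray:
    """Average per-model predictions with equal weights."""
    return np.mean(np.vstack(list(preds.values())), axis=0)


def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute error, in the same units as the target."""
    return float(np.mean(np.abs(y_true - y_pred)))


def mape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute percentage error, returned as a fraction."""
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))


# Made-up sale prices and per-model predictions for three homes.
y_true = np.array([200_000.0, 150_000.0, 300_000.0])
preds = {
    "catboost": np.array([210_000.0, 140_000.0, 310_000.0]),
    "lightgbm": np.array([190_000.0, 150_000.0, 290_000.0]),
    "random_forest": np.array([200_000.0, 160_000.0, 300_000.0]),
    "adaboost": np.array([220_000.0, 170_000.0, 260_000.0]),
}

blended = blend_equal(preds)
blend_mae = mae(y_true, blended)
blend_mape = mape(y_true, blended)
print(blended, blend_mae, blend_mape)
```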

CatBoost was the best of our four models, leveraging its gradient boosting architecture and native handling of categorical features, but it also had the longest training time (2.41 s). It achieved the highest test R² and showed remarkable stability across validation folds. LightGBM performed slightly worse but trained much faster (0.12 s).

Figure 8: The SHAP (SHapley Additive exPlanations) summary plot for CatBoost regressor

The SHAP (SHapley Additive exPlanations) summary plot in Figure 8 shows how different features affect the CatBoost model's house price predictions. The most influential feature is TotalSqFt, which can raise a predicted price by up to $60,000, followed by OverallQual, where higher quality significantly boosts prices. TotalBaths, YrRemodAge, and HouseAge also play crucial roles, with newer or recently remodeled homes generally commanding higher prices. Moderate-impact features such as GarageCars, Fireplaces, and LotFrontage show positive relationships with price, although their effects are smaller.

Additionally, categorical features like Neighborhood, BsmtExposure, FireplaceQu, and KitchenQual exhibit smaller individual impacts. Red color in the plot indicates higher feature values, while blue represents lower values, helping visualize the range of impacts on price predictions. This analysis aids in understanding the relative importance and directional impact of features on house price predictions, supporting more informed real estate decisions.

Potential Directions

The dynamics of several industries, including real estate, are changing as a result of the development of Large Language Models (LLMs). In prop-tech, smart real estate (intelligent buildings and cities), con-tech (construction startups), real estate fintech, and the collaborative economy, the emphasis is now on utilizing these new capabilities.

For these markets, offering AI/ML as Software as a Service (SaaS) can have a big impact. LLMs, for example, can be used to scan legal and property documents and extract useful information. These algorithms can forecast the risk of foreclosure and buyer preparedness after a property is listed for sale. We can evaluate property photos for damage using image analysis.

In collaboration with banks and mortgage lenders, we might offer proactive real estate services by building a database to track mortgage defaults and other relevant data. With previously unheard-of insights and efficiency, LLMs and state-of-the-art AI and ML technology will completely transform the real estate sector.

If you enjoyed reading my blog post, please follow and connect with me on LinkedIn for collaboration, networking, and more insightful content.

Quick Links

GitHub Repository
LinkedIn Profile
Click here to watch my presentation

About Author

Nawaraj Paudel, PhD

Data Science leader with a PhD in Quantitative Modeling and close to a decade of experience driving high-impact analytics initiatives. Proven track record of leveraging machine learning, deep learning, NLP, and data engineering to optimize business performance, improve...
