Boosting Real Estate Decisions

Brian Drewes
Posted on Feb 21, 2024

How can we advise better real estate purchasing decisions in Ames, Iowa?

Introduction

Buying a house can be a life-changing decision for a first-time home buyer, or it could be just another business purchase for a house flipper. Even though these purchasing decisions predate widespread computer usage and machine learning, we can use these technologies to enhance the choices we make on substantial investments.

Dean De Cock introduced this Ames, Iowa housing data as a replacement for the traditional Boston Housing Dataset in regression classes. His original study forms the basis of the dataset, which I augmented with additional records (2,580 rows × 82 columns) to help train the ML models. I used Python in Jupyter Notebooks for most of the technical work in this study, and I present a short introduction to replicating the study in Dataiku at the end.

Data Cleansing & Pre-Processing

The data wasn't ready to use as-is because of missing and duplicated values. Several features included a considerable number of nulls, and some rows were duplicated. I cleaned the data with a few methods: dropping features with a high percentage of nulls, imputing medians, and removing duplicates. For a full overview of how I cleaned the data, view the presentation slides or my GitHub.

For certain linear models, categorical data must be converted into dummy variables so the model can interpret it correctly, with one column dropped per category to avoid multicollinearity. I also wanted to test how the different algorithms respond to dummification with and without a dropped column, as well as to ordinal label encoding. I therefore separated my testing into three data transformations: ordinal, dummified with a column dropped, and dummified with no column dropped.
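The dummification step can be sketched with pandas; the toy column names below are illustrative stand-ins, not the actual Ames schema:

```python
import pandas as pd

# Toy frame standing in for the Ames data (illustrative columns only)
df = pd.DataFrame({
    "Neighborhood": ["NAmes", "CollgCr", "NAmes", "OldTown"],
    "SalePrice": [150000, 210000, 142000, 118000],
})

# Dummify and drop one level per category to avoid multicollinearity
dummified_dropped = pd.get_dummies(df, columns=["Neighborhood"], drop_first=True)

# Dummify without dropping a column (tree-based models are not
# bothered by the multicollinearity this introduces)
dummified_full = pd.get_dummies(df, columns=["Neighborhood"])

print(sorted(dummified_dropped.columns))
print(sorted(dummified_full.columns))
```

With three neighborhood levels, the dropped version yields two dummy columns and the full version three; a linear model recovers the dropped level through the intercept.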

In addition to each data transformation, I wanted to see the influence of outliers on the future models. To that end, I separated each data transformation into three ways of filtering outliers. 

"3 * IQR" -> removes any outliers beyond 3 times the interquartile range. The typical multiplier is 1.5, but out of personal judgment I went with 3.

"All Outliers" -> the base case compared to "3 * IQR" and "Only Normal": neither filter is applied, and we use the original data with no outliers removed after the cleansing and data transformations.

"Only Normal" -> at the recommendation of Dean De Cock in his research, we keep only the houses sold under a normal sale condition. This filters out sales such as foreclosures, family deals, etc.
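The three filters can be sketched as follows (a minimal version assuming the rule is applied to SalePrice; the study may have filtered on other columns as well):

```python
import pandas as pd

# Tiny illustrative frame; the last row is an extreme sale price
df = pd.DataFrame({
    "SalePrice": [100000, 150000, 160000, 155000, 900000],
    "SaleCondition": ["Normal", "Normal", "Abnorml", "Family", "Normal"],
})

def filter_iqr(frame, col="SalePrice", k=3.0):
    """Keep rows inside [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is the classic rule."""
    q1, q3 = frame[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    return frame[(frame[col] >= q1 - k * iqr) & (frame[col] <= q3 + k * iqr)]

all_outliers = df                                  # base case: keep everything
three_iqr = filter_iqr(df, k=3.0)                  # "3 * IQR"
only_normal = df[df["SaleCondition"] == "Normal"]  # "Only Normal"
```

Each filtered copy then feeds into the same modeling pipeline, so the effect of the filter can be compared directly.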

Below is the visualization of the data transformations and outlier filtering process described above: 

Feature Engineering

Feature engineering is an important step in capturing more information from the data that could contribute to the accuracy of the models. Some of the features I added to the project are:

  • 'TotalHouseSF', a combination of 1st-floor, 2nd-floor, and basement square footage.
  • 'QualityOutdoorSF', a combination of deck and porch square footage.
  • 'TotalBathroomCount', which counts half bathrooms as 0.5 and full bathrooms as 1.0.

Additional features applied can be found in my presentation slides and GitHub.
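The three engineered features above can be sketched like this (the input column names follow the Ames dataset's conventions, but are assumptions here, and basement bathrooms are left out for simplicity):

```python
import pandas as pd

# Two illustrative rows; column names follow the Ames dataset's conventions
df = pd.DataFrame({
    "1stFlrSF": [856, 1262], "2ndFlrSF": [854, 0], "TotalBsmtSF": [856, 1262],
    "WoodDeckSF": [0, 298], "OpenPorchSF": [61, 0],
    "FullBath": [2, 2], "HalfBath": [1, 0],
})

# Combine the floor areas into a single total-square-footage feature
df["TotalHouseSF"] = df["1stFlrSF"] + df["2ndFlrSF"] + df["TotalBsmtSF"]

# Combine deck and porch areas into one outdoor-space feature
df["QualityOutdoorSF"] = df["WoodDeckSF"] + df["OpenPorchSF"]

# Full bathrooms count as 1.0, half bathrooms as 0.5
df["TotalBathroomCount"] = df["FullBath"] + 0.5 * df["HalfBath"]
```

Combining related columns like this trades a little granularity for features that are easier to reason about and explain.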

Model Building

The algorithms and regression techniques used in the study include MLR, Lasso, Ridge, ElasticNet, GradientBoosting, RandomForest, XGBoost, and CatBoost. Applying them across the previous data transformations and outlier filters, plus tuned variants, came out to more than 84 models. A quick snapshot of the scores dataframe below shows that scoring was done with 5-fold cross-validated means of R-squared and RMSE values.
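The scoring scheme, 5-fold means of R-squared and RMSE, can be sketched with scikit-learn; the data here is synthetic and Ridge stands in for any of the models above:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

# Synthetic regression data standing in for the processed Ames features
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Score one candidate model with 5-fold CV on both metrics at once
scores = cross_validate(
    Ridge(alpha=1.0), X, y, cv=5,
    scoring={"r2": "r2", "neg_rmse": "neg_root_mean_squared_error"},
)

mean_r2 = scores["test_r2"].mean()
mean_rmse = -scores["test_neg_rmse"].mean()  # sklearn reports RMSE negated
print(f"5-fold mean R²: {mean_r2:.3f}, mean RMSE: {mean_rmse:.1f}")
```

Repeating this loop over every (algorithm, transformation, outlier filter) combination is what produces a scores dataframe like the one shown.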

Hyperparameter Tuning

As you may have noticed in the previous figures, some model names include '_tuned'. To achieve higher-scoring models, I utilized hyperparameter tuning. AWS explains its function this way:

Hyperparameters directly control model structure, function, and performance. Hyperparameter tuning allows data scientists to tweak model performance for optimal results. This process is an essential part of machine learning, and choosing appropriate hyperparameter values is crucial for success.

I explored two packages for this purpose: scikit-learnโ€™s GridSearchCV and Optuna.

I ultimately found that, for XGBoost on my data, Optuna performed better than GridSearchCV, with a higher 5-fold mean R-squared, a lower RMSE, and a quicker run time. Optuna uses adaptive Bayesian search to find optimal parameters and has built-in visualizations. If you intend to use Optuna for your models, bear in mind that it is a standalone Python-only library, not part of scikit-learn.
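For contrast with Optuna's sampled search, here is a minimal GridSearchCV sketch. Scikit-learn's GradientBoostingRegressor stands in for XGBoost so the example is self-contained, and the grid values are illustrative, not the study's actual search space:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Exhaustive search: every combination is tried (2 * 2 * 2 = 8 candidates)
grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={
        "n_estimators": [100, 200],
        "max_depth": [2, 3],
        "learning_rate": [0.05, 0.1],
    },
    cv=5, scoring="r2",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

An Optuna study replaces the fixed grid with distributions (e.g. a log-uniform learning rate) and lets its Bayesian sampler decide which combinations to try next, which is why it can cover a larger space in fewer trials.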

Below is a quick view of a few visualizations Optuna provides on its hyperparameter importances:

Stacked Model

โ€œThe whole is greater than the sum of its parts.โ€ -Aristotle

I explored further methods of enhancing the predictions. A stacked ensemble combines several models so that their individual predictive strengths can be used in a single model. For this study, I took my best two performers alongside a Ridge model; the final estimator was LinearRegression.

Not only did this achieve better R-squared and RMSE scores, it also allowed me to use the strengths of my three different models in a single model. The pros of this approach include flexibility, predictive strength, and robustness. But it's not without its drawbacks: the cons of using an ensemble include low interpretability and added complexity.
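A stack along these lines can be built with scikit-learn's StackingRegressor. The base learners below are stand-ins for the study's best two performers plus Ridge, on synthetic data; the final estimator is LinearRegression as in the text:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("gbm", GradientBoostingRegressor(random_state=0)),
        ("rf", RandomForestRegressor(random_state=0)),
        ("ridge", Ridge()),
    ],
    final_estimator=LinearRegression(),  # learns how to weight base predictions
    cv=5,  # base predictions fed to the final estimator are out-of-fold
)
stack.fit(X_tr, y_tr)
print(round(stack.score(X_te, y_te), 3))  # R² on held-out data
```

The `cv=5` argument matters: the final estimator is trained on out-of-fold predictions, which keeps it from simply memorizing the base models' training-set fit.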

Best vs. Explainable Model

Above are the differences in scoring between the two models. To achieve the highest possible predictive power, I would opt for a variation of the stacked model. This would be best in a situation where I don't specifically need to speak to the importance of each individual feature, such as a user application, and am just looking for a quick housing price estimate.

For the purposes of assisting individual home buyers or house flippers, I would opt for the more explainable model. In other words, I would be willing to sacrifice some predictive accuracy for better interpretability. In that situation it would be easier to explain the importance of specific features to those who aren't well-versed in the technical aspects of data modeling. For these reasons, I chose the more explainable model for this project.

The chosen model is XGBoost, tuned with Optuna, trained on the ordinally transformed data with normal sale conditions only. Although the ordinal model performed a bit worse than others, its features are easier to explain when they are kept combined rather than split into dummies. I chose the normal-sale-condition filter both because it performed well and because it is probably in line with what a typical buyer would encounter; a family sale, foreclosure, or anything similar would be handled as a special case.

In the graph above on the left, you can see the predicted values vs. the actual values. In a home-buying advisory, I would initially lean toward the points furthest under the red dotted line, as their predicted values are lower than the actual values of the properties. Similarly, the residual plot on the right shows the residual error with respect to the red dotted line.

Feature Evaluation

To evaluate the XGBoost model, I installed and used SHapley Additive exPlanations (SHAP), a game-theoretic approach to explaining the output of any machine learning model. You can read more about the specifics of SHAP here.

In the plot above, you can see that OverallQual had the most impact on SalePrice. Overall quality is our most important feature here, and the goal would be to increase it. That leads to the next step of the analysis: identifying which features have the greatest impact on the model. In the visualizations below, values further toward the red push the predicted price up, and vice versa for the blue values.

Here is an example of using SHAP value dependence to view a plausible business case derived from the model. On the left, we see that around 2,750 square feet of unfinished basement lands around -$14,000 of value for the house, whereas a finished basement of about the same square footage adds around $21,000. That is a potential $35,000 swing in SalePrice from fixing up the basement!

For a deeper dive on specific feature relationships like this, take a look at the presentation. More are displayed in my GitHub.

Recommendations

The primary focus is on increasing overall quality, which produces the largest change in price. Additional improvements that correlate with higher sale prices include the following:

  1. Increasing total house square footage (through 1st-floor, 2nd-floor, and basement renovations)
  2. Increasing the greater living area
  3. Finishing an unfinished basement

Prices can also increase as a result of adding sought-after features. For example:

  1. Every 0.5 bathroom beyond 2 adds value, up to 5 bathrooms
  2. A fireplace is better than no fireplace, and 3 are better than 1
  3. Adding a garage, porch, or deck (over certain size thresholds) also increases sale prices

It's important to always assess the cost-benefit of the improvements you undertake with the goal of selling the home. If the improvement costs less than the model's predicted value, $907 per 100 sq ft of additions to the greater living area, great!

If it costs more, you have to answer these questions:

  • Is it part of a larger project that will add total sqft?
  • Will this increase my overall quality?
  • Am I adding a bathroom that takes me from 2 to 2.5?

Limitations

  • It is not documented how overall quality (or some of the other features) is measured.
  • The amount of data may not be optimal for some algorithms.
  • The housing data covers only Ames, Iowa for the given time period.

Future Work

  • Use the PID to pull in pictures or other information to assist in predicting; De Cock mentions that this may be possible with certain sources.
  • Replicate the work in other housing markets and compare the differences.
  • Build an application or interface for a housing agent or home buyer to view these findings interactively.

Replicating the work with Dataiku 

(I am not affiliated with Dataiku; I used a free trial to replicate my work on this project on my own.)

To show the importance of using tools to streamline tasks, I demonstrate a couple of Dataiku features that enabled me to redo the same project within a few hours. These images may be clearer in the presentation, but Dataiku enables users to visually identify null values in the data and replace them easily while maintaining an audit trail. I did half of the cleaning with Dataiku's no-code features, and I also copied the code from my notebook to demonstrate the coding potential inside Dataiku as well.

After the data was cleaned and filtered for outliers, I used Dataiku's AutoML capabilities to reinforce my results. I was pleasantly surprised to find that the individual models performed near the level of my manually tuned models. Although they had slightly lower scores, I didn't spend much time tuning or testing different ways to create new models. This reassured me that gradient boosting algorithms were the best performers.

Links / Accreditations

  • Feature Image Created with Image Generator GPT by NAIF J ALOTAIBI on ChatGPT4
  • Google Slides Deck (Last slide has sources used)
  • GitHub
  • LinkedIn

About Author

Brian Drewes

Coming from a customer-facing role at an AI/ML software company, I'm driven to understand data science challenges that organizations face. Leveraging my background in Economics, my quantitative skills and sales-honed communication, I aim to fuse these proficiencies into...
