
Predicting Housing Prices in Ames, Iowa using Machine Learning Techniques

Cheng Zhao
Posted on Oct 12, 2022

Introduction

Providing accurate valuations of home prices is an integral function of modern-day online real-estate marketplace platforms such as Zillow and Redfin. Homebuyers rely on these estimates to gain quick insight into the current market and to plan their purchases, while sellers refer to them to set expectations. For flippers and iBuyers in particular (a business Zillow exited in late 2021), accurate price estimates are likely the single most important factor in turning a profit. Accurately estimating housing prices is far from trivial, since many factors contribute to shifts in value, and to varying degrees. It therefore makes sense to use machine learning techniques to uncover the intricate relationships between the various features and home value.

The goal of this project is to take on the role of a real estate database company looking to build a machine learning model that produces the most accurate housing price predictions. The dataset from the well-known Kaggle competition on Ames housing prices is used to evaluate the performance of various machine learning models and their tunable parameters, and the best combination is used for the competition submission.

Data Cleaning and Feature Engineering

The dataset comprises around 3,000 entries with 79 features, split roughly 50/50 into a training set and a testing set. Per the competition setup, model evaluation is performed on the training set; the testing set is used only for the prediction submission.

As a first step, the training and testing sets were combined and checked for duplicate entries; none were found.

It is also important to ensure that the data types of the features are correct. All feature types are correct with the exception of "MSSubClass", which is a code identifying the type of dwelling and should therefore be a categorical feature rather than a numeric one.
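A minimal sketch of the loading, duplicate check, and type fix, assuming the Kaggle file names and a combined DataFrame called df:

```python
import pandas as pd

# Assumed file names from the Kaggle competition download
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
df = pd.concat([train.drop(columns="SalePrice"), test], ignore_index=True)

assert df.duplicated().sum() == 0  # no duplicate entries were found

# MSSubClass is a dwelling-type code, so treat it as categorical, not numeric
df["MSSubClass"] = df["MSSubClass"].astype(str)
```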

Resolving Missing Values/NAs

The following features had missing values, which were resolved as stated below; a condensed sketch of these imputations follows the list.

  • Utilities (Type of utilities available): Impute with "AllPub" since it's by far the most common value.
  • GarageYrBlt (Year garage was built): Impute with "YearBuilt" since the two almost always coincide.
  • Exterior1st & Exterior2nd (Exterior coverings on house): Impute both with "VinylSd" since it's the most common.
  • MasVnrType & MasVnrArea (Masonry veneer type and area): Where both are NA, impute with "None" and 0. For the one entry with an area but no type, impute the type as "BrkFace" since it's the most common.
  • Bsmt (Basement) variables: Two entries have no basement but NA values instead of 0; change these to 0.
  • Functional (Home functionality): Impute with "Typ" since it's the most common.
  • KitchenQual (Kitchen quality): Impute with "TA" since it's the most common.
  • GarageCars & GarageArea: Replace with 0.
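A condensed sketch of these imputations, reusing df from the earlier snippet; the fill logic is a simplification of the rules stated above:

```python
# Mode-based fills for the near-constant categoricals
df["Utilities"] = df["Utilities"].fillna("AllPub")
df["Functional"] = df["Functional"].fillna("Typ")
df["KitchenQual"] = df["KitchenQual"].fillna("TA")
df[["Exterior1st", "Exterior2nd"]] = df[["Exterior1st", "Exterior2nd"]].fillna("VinylSd")

# Garage year defaults to the house's construction year
df["GarageYrBlt"] = df["GarageYrBlt"].fillna(df["YearBuilt"])

# Masonry veneer: the one entry with an area but no type gets the most common type
has_area = df["MasVnrType"].isna() & df["MasVnrArea"].notna()
df.loc[has_area, "MasVnrType"] = "BrkFace"
df["MasVnrType"] = df["MasVnrType"].fillna("None")
df["MasVnrArea"] = df["MasVnrArea"].fillna(0)

# No-basement and no-garage NAs really mean zero
zero_cols = ["BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF", "GarageCars", "GarageArea"]
df[zero_cols] = df[zero_cols].fillna(0)
```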

LotFrontage (Linear feet of street connected to property) had a very high number of NAs (259 in the training set and 227 in the testing set). Simply replacing the NAs with 0 seems like a bad idea, especially since LotArea is never 0 and every LotConfig and LotShape has NAs, including "FR2", "FR3" and "Reg". Logically, one would expect LotFrontage to be fairly highly correlated with LotArea, so a scatter plot was studied to check whether this is indeed the case, looking only at units with LotArea <= 20000 to avoid outliers and only at LotShapes of "Reg" and "IR1", since most units are of those two types.

For LotShape = Reg the relationship appears fairly linear, while for IR1 it is much more random. A simple regression was therefore fit on all LotShape = Reg entries, and the missing values were imputed with the predicted value (a sketch follows). This introduces multicollinearity, which will be dealt with later in the project.
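A sketch of that imputation, again assuming df; the fitting subset mirrors the scatter plot's filter:

```python
from sklearn.linear_model import LinearRegression

# Fit LotFrontage ~ LotArea on regular-shaped lots with known frontage
fit_rows = df[(df["LotShape"] == "Reg") & df["LotFrontage"].notna()
              & (df["LotArea"] <= 20000)]
lr = LinearRegression().fit(fit_rows[["LotArea"]], fit_rows["LotFrontage"])

# Impute every missing LotFrontage from the fitted line
missing = df["LotFrontage"].isna()
df.loc[missing, "LotFrontage"] = lr.predict(df.loc[missing, ["LotArea"]])
```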

Feature Engineering

The following changes were made to eliminate redundancy, simplify the data where appropriate, and generate features that made more sense or were believed to better explain the target variable (a condensed sketch follows the list).

  • YearBuilt (Original construction date): The age of the house when sold makes more sense as a variable, so create AgeSold = YrSold - YearBuilt and remove YearBuilt.
  • YrSold (Year sold): Convert to nominal categorical, since the factor really being evaluated is the economy and real estate market during that year.
  • MoSold (Month sold): Same logic as YrSold, and it shouldn't be treated as an ordinal feature. Simplify MoSold to the quarter of the year and convert to nominal categorical.
  • YearRemodAdd (Remodel date): Convert to a simple indicator of whether the house was remodeled or not.
  • GrLivArea & TotalBsmtSF (Above ground living area square feet & Total square feet of basement area): GrLivArea = 1stFlrSF + 2ndFlrSF + LowQualFinSF and TotalBsmtSF = BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF, so both are redundant. Remove both features.
  • Bathrooms (Above ground and basement): To reduce dimensionality, count half bathrooms as 0.5 bathrooms and add them to the number of full bathrooms.
  • TotRmsAbvGrd (Total rooms above ground, not including bathrooms): Convert to the number of rooms that are neither bedrooms nor kitchens: create OtherRms = TotRmsAbvGrd - BedroomAbvGr - KitchenAbvGr and remove TotRmsAbvGrd.
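A condensed sketch of these transformations on df. The column names other than AgeSold and OtherRms, and the rule used for the remodel flag, are assumptions for illustration:

```python
# Age at sale replaces the raw construction year
df["AgeSold"] = df["YrSold"] - df["YearBuilt"]

# Remodel date becomes a flag (assumed rule: remodel year differs from build year)
df["Remodeled"] = (df["YearRemodAdd"] != df["YearBuilt"]).astype(int)

# Year and quarter of sale as nominal categoricals
df["QtrSold"] = ((df["MoSold"] - 1) // 3 + 1).astype(str)
df["YrSold"] = df["YrSold"].astype(str)

# Half baths count as 0.5; rooms that are neither bedrooms nor kitchens
df["Baths"] = df["FullBath"] + 0.5 * df["HalfBath"]
df["BsmtBaths"] = df["BsmtFullBath"] + 0.5 * df["BsmtHalfBath"]
df["OtherRms"] = df["TotRmsAbvGrd"] - df["BedroomAbvGr"] - df["KitchenAbvGr"]

df = df.drop(columns=["YearBuilt", "YearRemodAdd", "MoSold", "GrLivArea",
                      "TotalBsmtSF", "FullBath", "HalfBath", "BsmtFullBath",
                      "BsmtHalfBath", "TotRmsAbvGrd"])
```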

Data Exploration

Since multiple linear regression will be used as one of the models, the data exploration will be done from the perspective of checking some of the linear regression assumptions and modifying the data accordingly.

Multicollinearity

Multicollinearity inflates the standard errors of the coefficients and makes the model unstable, so multicollinearity between features needs to be checked for and resolved. The correlation matrix is plotted to see whether some highly correlated features can or should be removed (a sketch of surfacing the worst pairs follows).
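A minimal sketch of flagging the highly correlated pairs among the numeric columns of df:

```python
import numpy as np

corr = df.select_dtypes(include=np.number).corr().abs()
# Keep one triangle so each pair is reported once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
print(pairs[pairs > 0.7].sort_values(ascending=False))
```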

Feature pairs with correlation > 0.7

  • LotFrontage & LotArea: Expected, especially given that LotFrontage NAs were imputed via linear regression with LotArea as the independent variable. Remove LotFrontage, due to its many NAs and because LotArea is logically more important and more highly correlated with the target variable.
  • GarageYrBlt & AgeSold: Expected, since for the vast majority of houses the garage is built the same year as the house. Remove GarageYrBlt, since AgeSold is logically more important and more highly correlated with the target variable.
  • GarageCars & GarageArea: Expected, since garage capacity in cars is directly indicative of garage area. Remove GarageArea, since GarageCars has the higher correlation with the target variable.

Normality

The assumption that residuals are normally distributed can be violated due to non-normally distributed variables as well as the presence of outliers.

Target Variable Skew

The target SalePrice is right-skewed, so a log transformation is applied to it (see below). Other features, such as 1stFlrSF, are also not necessarily normally distributed. However, it is the residuals, not the features, that need to be normally distributed, so the model is run first and the assumption checked afterwards; if it is not met, the variables can be log-transformed then.
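The transformation itself is a one-liner, assuming train is the cleaned training split:

```python
import numpy as np
from scipy.stats import skew

print(skew(train["SalePrice"]))  # strong right skew motivates the log transform
y = np.log(train["SalePrice"])   # model the natural log of the target
```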

Outliers

Outliers are checked using the five features with the highest correlation to sale price. Suspecting that some outliers may stem from an abnormal SaleCondition, the points are colored by SaleCondition.

Rather than picking outliers out arbitrarily, they are removed systematically: for each of the top 4 ordinal features, data points below the 0.5th percentile and above the 99.5th percentile within each group are removed (a sketch follows). Doing so can remove enough points that certain feature values no longer have any entries, such as OverallQual of 1. A refinement would be to skip outlier removal for groups below a certain count, but this method is used as-is for now. It removed 59 data points, bringing the training set down to 1,401.
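A sketch of one reading of this group-wise trimming, in which the percentile cutoffs are taken over SalePrice within each group; the list of top-4 ordinal features passed in is illustrative:

```python
def trim_by_group(frame, col, target="SalePrice", lo=0.005, hi=0.995):
    """Drop rows whose target lies outside the [lo, hi] quantiles of their group."""
    bounds = frame.groupby(col)[target].quantile([lo, hi]).unstack()
    lower = frame[col].map(bounds[lo])
    upper = frame[col].map(bounds[hi])
    return frame[(frame[target] >= lower) & (frame[target] <= upper)]

# Illustrative top-4 list; the actual set came from the correlation ranking
for col in ["OverallQual", "GarageCars", "Baths", "KitchenQual"]:
    train = trim_by_group(train, col)
print(len(train))  # 1401 rows remained after removing 59 points
```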

Outliers Removed

Checking 1stFlrSF for outliers as well:

No obvious outliers appear, so no further points are removed.

Linearity

Sample scatter plots

There are too many features to check individually, and there is no good way to systematically resolve non-linearity, so the features are left as they are for now.

Preparation for Modeling

Encoding

  • Use OrdinalEncoder on the ordinal categorical features.
  • Use OneHotEncoder on the nominal categorical features, always dropping one column. After one-hot encoding, some features turned out to exist in the training set but not in the test set, and vice versa; features that are not shared were removed, leaving 198 features.
  • Tree-based models do not need one-hot encoding, so a separate set was created in which the nominal categorical features are also ordinally encoded. Both sets will be tried for the tree-based models to see which produces better results (a sketch of the encoding follows this list).
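A sketch of the encoding step, assuming train_feats and test_feats hold the engineered features for each split. The column lists and the single quality scale applied to every ordinal column are simplifying assumptions, and pd.get_dummies(drop_first=True) stands in for OneHotEncoder with a dropped column:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative column lists; the real ones cover all categorical features
ordinal_cols = ["ExterQual", "KitchenQual", "HeatingQC"]
nominal_cols = ["Neighborhood", "MSSubClass", "SaleCondition"]

# Ordinal: encode with an explicit ordering so the ranks are meaningful
quality_scale = ["Po", "Fa", "TA", "Gd", "Ex"]
enc = OrdinalEncoder(categories=[quality_scale] * len(ordinal_cols))
train_feats[ordinal_cols] = enc.fit_transform(train_feats[ordinal_cols])
test_feats[ordinal_cols] = enc.transform(test_feats[ordinal_cols])

# Nominal: one-hot with the first level dropped, done per split
train_X = pd.get_dummies(train_feats, columns=nominal_cols, drop_first=True)
test_X = pd.get_dummies(test_feats, columns=nominal_cols, drop_first=True)

# Some dummy columns exist in only one split; keep the shared ones (198 remained)
shared = train_X.columns.intersection(test_X.columns)
train_X, test_X = train_X[shared], test_X[shared]
```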

Setup K-Fold Cross Validation and Performance Metric

Set up 10-fold cross-validation on the training set, to be used by all models.

Use RMSE as the error metric for comparing model performance; a sketch of the shared helper follows.
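A minimal sketch of the shared evaluation setup (scikit-learn exposes RMSE as a negated score, hence the sign flip):

```python
from sklearn.model_selection import KFold, cross_val_score

# One CV splitter shared by every model in the comparison
cv = KFold(n_splits=10, shuffle=True, random_state=0)

def rmse_cv(model, X, y):
    """Mean 10-fold cross-validated RMSE for an estimator."""
    scores = cross_val_score(model, X, y, cv=cv,
                             scoring="neg_root_mean_squared_error")
    return -scores.mean()
```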

Machine Learning Models

The scikit-learn library is used to implement all of the machine learning models.

Multiple Linear Regression and Regularized Linear Regression

MLR requires no tuning; the model is simply fit to the data.

Since regularized linear regression penalizes all coefficients equally, the features need to be standardized/normalized. The alpha value is also tuned for each model. Ridge, Lasso, and Elastic Net all use a pipeline like the one sketched below and go through GridSearchCV.
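A sketch of that pipeline for Elastic Net, reusing cv from the evaluation setup (Ridge and Lasso just swap the model step); the alpha grid is illustrative:

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scale", StandardScaler()),
                 ("model", ElasticNet(max_iter=10000))])

grid = GridSearchCV(pipe, {"model__alpha": np.logspace(-4, 0, 50)},
                    cv=cv, scoring="neg_root_mean_squared_error")
grid.fit(train_X, y)
print(grid.best_params_, -grid.best_score_)
```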

Tree-based Models

Tree-based models do not require standardization/normalization, so no pipeline is needed, just GridSearchCV. The approach taken was to start with coarse parameter tuning followed by a finer GridSearch; an example sketch is below.
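A sketch of the coarse-then-fine search for a random forest; all grid values here are illustrative:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Pass 1: wide, sparse grid to locate the right region
coarse = GridSearchCV(RandomForestRegressor(random_state=0),
                      {"n_estimators": [100, 300, 500],
                       "max_depth": [10, 20, None]},
                      cv=cv, scoring="neg_root_mean_squared_error")
coarse.fit(train_X, y)

# Pass 2: narrow grid centered on the coarse winner
fine = GridSearchCV(RandomForestRegressor(random_state=0),
                    {"n_estimators": [250, 300, 350],
                     "max_depth": [15, 20, 25]},
                    cv=cv, scoring="neg_root_mean_squared_error")
fine.fit(train_X, y)
```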

This approach was followed for Random Forest and Gradient Boosting, trying both the one-hot-encoded set and the fully ordinally encoded set for each. Random Forest scored better with the fully ordinally encoded set, while Gradient Boosting scored better with the one-hot-encoded set.

SVM

Since kernel methods are based on distances, the features need to be standardized/normalized so that those with larger numeric ranges do not dominate. A pipeline similar to the regularized linear regression one is therefore used (sketched below).
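A sketch of the SVR pipeline; the C and epsilon grids are illustrative:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

svr_grid = GridSearchCV(
    Pipeline([("scale", StandardScaler()), ("model", SVR(kernel="rbf"))]),
    {"model__C": [1, 10, 100], "model__epsilon": [0.01, 0.05, 0.1]},
    cv=cv, scoring="neg_root_mean_squared_error")
svr_grid.fit(train_X, y)
```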

Model Comparison and Competition Submission

Comparing mean cross-validated RMSE across all tested models, Elastic Net came out lowest. Its best alpha value from GridSearchCV was ~0.0038.

For the competition submission, the tuned Elastic Net model with the optimal alpha was used to predict SalePrice on the test set, making sure to transform the test set with the same StandardScaler fitted to the training set. Since the predicted result is the natural log of SalePrice, the exponential function is applied and the result rounded to the nearest integer to arrive at the predicted price in dollars (sketched below).
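A sketch of the submission step. Because the scaler lives inside the fitted pipeline, grid.predict applies the training-set scaling to the test set automatically; test_ids is assumed to be the Id column held out earlier:

```python
import numpy as np
import pandas as pd

log_pred = grid.predict(test_X)                # predictions in log dollars
pred = np.round(np.exp(log_pred)).astype(int)  # back to whole dollars

pd.DataFrame({"Id": test_ids, "SalePrice": pred}).to_csv("submission.csv",
                                                         index=False)
```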

1304th Place out of ~4000 Submissions

Further Exploration

Feature Importance

Since Elastic Net turned out to be the best model in this case, its coefficients can be studied to discover which features impact SalePrice the most according to the model. Because the features were standardized, the coefficient magnitudes can be compared directly (see the sketch below).
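A sketch of pulling the coefficients out of the fitted pipeline:

```python
import pandas as pd

enet = grid.best_estimator_.named_steps["model"]
coefs = pd.Series(enet.coef_, index=train_X.columns)

# Ten largest coefficients by magnitude, then the count zeroed out by the penalty
print(coefs.reindex(coefs.abs().sort_values(ascending=False).index).head(10))
print((coefs == 0).sum(), "of", len(coefs), "coefficients shrunk to zero")
```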

Surface area is the biggest factor contributing to price, followed by the overall quality and the age of the house. The ten largest coefficients all make sense logically as major contributors to price. Checking how many coefficients Elastic Net shrank to 0 at the tuned alpha value, 90 of the 198 features became 0, suggesting that roughly half of the included features do not contribute significantly to the target variable.

Linear Regression Assumptions Revisited

Fit the MLR portion of the Elastic Net and check that the linear regression assumptions mentioned previously are satisfied (a diagnostic sketch follows).
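A sketch of the two diagnostic plots, using the fitted pipeline on the training split:

```python
import matplotlib.pyplot as plt
from scipy import stats

fitted = grid.predict(train_X)
resid = y - fitted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
stats.probplot(resid, plot=ax1)   # Q-Q plot: points on the line suggest normality
ax2.scatter(fitted, resid, s=8)   # fitted vs. residuals: look for a flat band
ax2.axhline(0, color="red")
plt.show()
```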

Normality

It can be seen from the plots that the residuals do appear to be normally distributed.

Heteroscedasticity

It can be seen from the plot that the residuals do appear to have relatively constant variance.

Closing Remarks

Of all the machine learning models evaluated, Elastic Net with a tuned alpha of ~0.0038 turned out to be the best performing in this case. Fitting the model to the data confirmed that factors such as square footage, overall quality and condition, age, garage size, and number of bathrooms were, as expected, among the most important contributors to price. On the other hand, many of the provided features were shown to have an insignificant effect on price. Although there is no established criterion or target for prediction accuracy, the goal of providing good estimates of housing prices can be met by using the best-performing model and parameters determined in this project.

Future Work

The author was made aware that other versions of the dataset exist that include detailed geographical information, which would enable factoring in the effect of location beyond the categorical neighborhood feature alone. Since location is generally known to be one of the most important factors in determining home price, this extra dimension would likely improve the accuracy of the model.

It is also possible to approach the project with more descriptive modeling: instead of tuning models for the most accurate predictions, the focus could shift to explaining the observed discount on larger homes in terms of $/ft², along with various other data analysis ideas and observed phenomena.

Another approach would be to reduce dimensionality first, using regularized linear regression or principal component analysis to simplify the dataset before training and evaluating the models. With fewer features, it also becomes more feasible to address some of the linear regression assumptions, such as transforming features to establish actual linear relationships.

It would also be worth trying stacking/ensembling models to see whether performance could be improved further.

 

Project on GitHub

About Author

Cheng Zhao

Certified Data Analyst/Scientist with engineering background in semiconductor and electronics packaging. A detail-oriented problem solver with a passion for analytics and utilizing machine learning techniques to gain insights from data to drive business decisions and to advance automation...