NYC Data Science Academy| Blog
Bootcamps
Lifetime Job Support Available Financing Available
Bootcamps
Data Science with Machine Learning Flagship ๐Ÿ† Data Analytics Bootcamp Artificial Intelligence Bootcamp New Release ๐ŸŽ‰
Free Lesson
Intro to Data Science New Release ๐ŸŽ‰
Find Inspiration
Find Alumni with Similar Background
Job Outlook
Occupational Outlook Graduate Outcomes Must See ๐Ÿ”ฅ
Alumni
Success Stories Testimonials Alumni Directory Alumni Exclusive Study Program
Courses
View Bundled Courses
Financing Available
Bootcamp Prep Popular ๐Ÿ”ฅ Data Science Mastery Data Science Launchpad with Python View AI Courses Generative AI for Everyone New ๐ŸŽ‰ Generative AI for Finance New ๐ŸŽ‰ Generative AI for Marketing New ๐ŸŽ‰
Bundle Up
Learn More and Save More
Combination of data science courses.
View Data Science Courses
Beginner
Introductory Python
Intermediate
Data Science Python: Data Analysis and Visualization Popular ๐Ÿ”ฅ Data Science R: Data Analysis and Visualization
Advanced
Data Science Python: Machine Learning Popular ๐Ÿ”ฅ Data Science R: Machine Learning Designing and Implementing Production MLOps New ๐ŸŽ‰ Natural Language Processing for Production (NLP) New ๐ŸŽ‰
Find Inspiration
Get Course Recommendation Must Try ๐Ÿ’Ž An Ultimate Guide to Become a Data Scientist
For Companies
For Companies
Corporate Offerings Hiring Partners Candidate Portfolio Hire Our Graduates
Students Work
Students Work
All Posts Capstone Data Visualization Machine Learning Python Projects R Projects
Tutorials
About
About
About Us Accreditation Contact Us Join Us FAQ Webinars Subscription An Ultimate Guide to
Become a Data Scientist
    Login
NYC Data Science Acedemy
Bootcamps
Courses
Students Work
About
Bootcamps
Bootcamps
Data Science with Machine Learning Flagship
Data Analytics Bootcamp
Artificial Intelligence Bootcamp New Release ๐ŸŽ‰
Free Lessons
Intro to Data Science New Release ๐ŸŽ‰
Find Inspiration
Find Alumni with Similar Background
Job Outlook
Occupational Outlook
Graduate Outcomes Must See ๐Ÿ”ฅ
Alumni
Success Stories
Testimonials
Alumni Directory
Alumni Exclusive Study Program
Courses
Bundles
financing available
View All Bundles
Bootcamp Prep
Data Science Mastery
Data Science Launchpad with Python NEW!
View AI Courses
Generative AI for Everyone
Generative AI for Finance
Generative AI for Marketing
View Data Science Courses
View All Professional Development Courses
Beginner
Introductory Python
Intermediate
Python: Data Analysis and Visualization
R: Data Analysis and Visualization
Advanced
Python: Machine Learning
R: Machine Learning
Designing and Implementing Production MLOps
Natural Language Processing for Production (NLP)
For Companies
Corporate Offerings
Hiring Partners
Candidate Portfolio
Hire Our Graduates
Students Work
All Posts
Capstone
Data Visualization
Machine Learning
Python Projects
R Projects
About
Accreditation
About Us
Contact Us
Join Us
FAQ
Webinars
Subscription
An Ultimate Guide to Become a Data Scientist
Tutorials
Data Analytics
  • Learn Pandas
  • Learn NumPy
  • Learn SciPy
  • Learn Matplotlib
Machine Learning
  • Boosting
  • Random Forest
  • Linear Regression
  • Decision Tree
  • PCA
Interview by Companies
  • JPMC
  • Google
  • Facebook
Artificial Intelligence
  • Learn Generative AI
  • Learn ChatGPT-3.5
  • Learn ChatGPT-4
  • Learn Google Bard
Coding
  • Learn Python
  • Learn SQL
  • Learn MySQL
  • Learn NoSQL
  • Learn PySpark
  • Learn PyTorch
Interview Questions
  • Python Hard
  • R Easy
  • R Hard
  • SQL Easy
  • SQL Hard
  • Python Easy
Data Science Blog > Machine Learning > Studying Data to Predict Housing Prices

Studying Data to Predict Housing Prices

Chaoran Chen, Denise Sison and David Corrigan
Posted on Jan 8, 2019
The skills the author demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.

Introduction

Background

The Ames Housing dataset was compiled by Dean De Cock for use in data science education. He proposed this dataset to data scientists as an alternative to the often cited Boston Housing dataset.

The Kaggle competition challenges (aspiring) data scientists to predict the final price of each home. This is not going to be an easy task. The data set  has 70 explanatory variables that describe different aspects of a residential home in Ames, Iowa.

Predicting house prices is a challenging endeavor as there are a multitude of factors and variables the need to be taken into account when it comes to real estate valuation. In this challenge, creativity and ingenuity is key to be able to find the important variables that can give the most accurate prediction of house prices.

Predicting housing prices is a challenge; the task involves many variables and takes some creative thinking to pinpoint the features that actually matter in order to arrive at an accurate prediction. A recent competition on Kaggle (an online data science platform) featured housing price predictions with data from Ames, Iowa.

Visit our repository for all related codes.

 

Data cleaning and processing

The Data Set

The Kaggle competition provided the data as as train.csv (with house prices), and test.csv (new observations with all the same features as train.csv, but missing the Sale Price).

We processed the train.csv dataset, performed feature engineering by a number of methods and optimized model hyperparameters using K-fold splitting, performed a stacking algorithm of different models based on train/test optimization.

We predicted the house prices for the houses listed in the test.csv file and saved it as the required format for Kaggle submission.

Missing data

  • Top columns with null values:

Studying Data to Predict Housing Prices

With first glance we noticed that a lot of features have None value. However, with further analysis it shows that most cells with โ€œNAโ€ values actually mean โ€œNoโ€.

Example : PoolQC - Pool Quality - โ€œNAโ€ means โ€œNo Poolโ€.

With this finding, we updated all meaningful โ€œNAโ€ values to โ€œNoโ€ or numeric 0

Removing Outlier

Studying Data to Predict Housing Prices

Two houses with the most SF(square feet) had average sale price. Based on the data publisher suggestion[1] we decided to remove these two outliers.

Dummify Categorical Features

  • Nominal category features:

Studying Data to Predict Housing Prices

- Dummy with removal of most common value

  • Ordinal category features:

- Digitalize

- Example: For KitchenQual (Kitchen Quality) feature, the original data values are Po(Poor), Fa (Fair), TA (Typical/Avg), Gd (Good), Ex (Excellent). We converted the values to numeric 1,2,3,4,5

Feature combination

Based on our understanding of the dataset , we have also implemented the following strategy to create some meaningful new features:

  • Create BsmtScore

- (BsmtType1 * SF1 + BsmtType2 * SF2) / (SF1 + SF2)

  • Create GarageScore (GarageQual/GarageCond/GarageFinish too correlated)

- (GarageCond + GarageQual + GarageFinish) / 3

  • Define TotalBath

- FullBath+BsmtFullBath+0.5*HalfBath+0.5*BamtHalfBath

  • Convert โ€œYear Builtโ€  to โ€œYears Sinceโ€ concept

- YearsAgoBuilt, YearsSinceRemodel, YearsSinceSale

  • Convert SaleMonth (1-12) to Seasons

- Dec-Feb = Winter, Mar-May = Spring, Jun-Aug = Summer, Sep-Nov = Autumn

Feature Removal

  • Features with less than 4 values:

- After dummy all the nominal category features, there are 158 columns. 19 features have less than 4 values in the training dataset and were removed to reduce statistical noise

  • Check Correlation of each feature. Used 0.70 as the threshold to remove features that may cause multicollinearity problem. Features removed are:
MSSubClass_90  SaleCondition_Partial MSSubClass_190 GarageCond MSSubClass_80 MSSubClass_50
PoolArea GarageCars Fireplaces MSZoning_FV RoofMatl_Tar&Grv TotRmsAbvGrd
MSSubClass_85 MSSubClass_60 MSSubClass_45 MSSubClass_120    

  • Other special features need to be removed:

- GarageYrBuilt-- Age of garage: Impossible to reconcile homes without a garage, as this is a numeric, not categorial, value.

- Misc Feature -- MiscFeature and MiscValue (value of Misc Features)

- MSSubClass -- Can be directly inferred by combination of HouseType, BldgType, and YearBuilt.  Often highly (>0.9) correlated with HouseType

Data type checking

After all data cleaning work. We are ready to move to the feature engineering work. The last thing is to make sure all data now are in numeric type.

 

Data on Feature Engineering

Three models were used for feature engineering:

  • AIC - forward selection, we got a 64-feature list
  • Lasso - lambda was decided by using 5-fold cross validation. With penalty we got a 110-feature list
  • Boost - based on Gradient Boost, we selected the most important 70-feature

In the next step, we manually compared the 3 feature lists. We kept the features which are listed as important or significant in all lists. For features not exist in all three lists, we trained the model with/without the features. Then compared the training result (RMSLE).

We finalized a 61-feature list. For details please refer to appendix table 1

 

Modeling and hyperparameter optimization:

Data was training using five models to get the prediction of the house prices:

  • Linear - K-Fold testing of multiple linear regression
  • Elasticnet - 5-fold Cross Validation was used to get the optimal alpha and rho hyperparameters balancing Ridge/Lasso regularization algorithms
  • Ridge -  5-fold Cross Validation was used for lambda hyperparameter choosing, numeric features are all standardized to avoid large number impacts of penalization
  • RandomForest -- K-Fold testing and hyperparameter optimization using the RandomForest Regressor algorithm from sklearn
  • GBoost_AIC - K-Fold hyperparameter optimization and modeling using GradientBoosting Regressor from sklearn

 

Model stacking and train/test comparison:

Stacking model, where a 24% XGBoost, 36% Gradient Boost, 20% Random Forest, and 20% linear stacked model was found to be best against the test set.

Studying Data to Predict Housing Prices

Score evaluation

Test Location RMSLE Score
Local 0.1126
Kaggle 0.1301

 

Data Optimization:

Based on the training and test result. We noticed that there is an overfitting issue.

Because during feature engineering phase, a lot of manually comparison work was conducted to optimized the training score (training RMSLE). This can be the main reason of the overfitting.

We decided to simplify the feature selection process. Instead of using 3 models we focused the feature engineering work with AIC model only. In order to get a competitive feature list, we decided to use โ€œboth selectionโ€ procedure ( Begin with either a model that only has an intercept term or a model that includes all parameter terms, then sequentially add or remove the predictor that has the most/least impact on the model, respectively).

In this way, we finalized a 71-feature list (Appendix table 2)

Data was still trained with the same five models mentioned in the modeling section. We used the same model stacking strategy. Stacked grid search: 40% gBoost, 25% Linear, 35% RandomForest.

This time, the training score was a little higher than our last round training score. However, we noticed that the test RMSLE score is much lower. The training and test score are closer to each other.

Score evaluation

Test Location RMSLE Score
Local 0.122
Kaggle 0.118

 

To be continued:

  • For feature removing, instead of randomly removing one of the highly related features (high correlation value), we could do a comparison test to see which feature may contribute better in the model.
  • For some numeric features we didnโ€™t consider the distribution of the data. We could do some further check to consider features which may not be normally distributed. According to the result, we could use log or square strategies to normalize data.

 

Reference:

[1] De Cock, D. Journal of Statistics Education, Vol. 9, #3 (2011)

 

Appendix:

Table 1: 61-feature list selected by using AIC, Lasso, Boost

TotalSF LotArea BsmtExposure Neighborhood_NridgHt BsmtQual Exterior_MetalSd MasVnrType_BrkFace
TotalBsmtSF Street_Grvl Exterior_BrkFace Neighborhood_Crawfor OverallQual ScreenPorch Neighborhood_NoRidge
GarageArea TotalBath KitchenQual SaleType_New Neighborhood_NWAmes Foundation_Slab BldgType_2fmCon
Exterior_Stucco PoolQC RoofMatl_WdShngl BldgType_Duplex HouseStyle_1.5Fin HouseStyle_SLvl GarageType_BuiltIn
GarageType_No GarageScore Neighborhood_Sawyer ExterCond Neighborhood_BrDale Exterior_CemntBd Neighborhood_Blmngtn
BsmtScore Neighborhood_StoneBr BldgType_Twnhs MSZoning_FV Neighborhood_BrkSide MoSold_Spring GarageType_Detchd
OverallCond RoofStyle_Hip BedroomAbvGr Condition_Feedr BsmtCond Neighborhood_NPkVill BldgType_TwnhsE
YearsAgoBuilt TotRmsAbvGrd SaleCondition_Abnorml MasVnrArea Functional Condition_Artery LotConfig_CulDSac
Exterior_ImStucc Fence_GdPrv HouseStyle_2Story MasVnrType_Stone HouseStyle_SFoyer    

 

Tabel 2 : 71-feature list selected by AIC both selection procedure

TotalSF GarageAge YearsSinceSale SaleType_New Neighborhood_Edwards Exterior_HdBoard
TotalBsmtSF GarageCars KitchenQual Fireplaces CentralAir_N LowQualFinSF
LotArea TotalBath Neighborhood_NoRidge ExterQual Neighborhood_BrkSide Neighborhood_Mitchel
GarageArea EnclosedPorch SaleCondition_Family MSZoning_FV LotConfig_FR2 KitchenAbvGr
LotFrontage ScreenPorch 3SsnPorch Condition_Feedr GarageFinish LandSlope_Mod
OverallQual Neighborhood_StoneBr BedroomAbvGr Condition_RRAe GarageQual MasVnrType_BrkFace
BsmtScore PoolArea SaleCondition_Abnorml LotConfig_CulDSac GarageType_CarPort MSSubClass_50
OverallCond TotRmsAbvGrd Condition_Artery Neighborhood_Veenker Electrical_FuseA Street_Grvl
YearsAgoBuilt Functional Neighborhood_NridgHt Neighborhood_Somerst Exterior_MetalSd HeatingQC
MasVnrArea BsmtExposure Neighborhood_Crawfor FireplaceQu Exterior_WdShing SaleType_ConLI
WoodDeckSF Exterior_BrkFace BsmtQual MoSold_Winter MSSubClass_70 MasVnrType_BrkCmn
Foundation_BrkTil GarageType_Basment Neighborhood_NWAmes Neighborhood_ClearCr Foundation_Slab

 

About Authors

Chaoran Chen

Problem solving oriented. Always ready for challenges. Chaoran has 7 years work experience in the data analysis, programming and tech support areas with her solid technical background. She likes to cook. She is also a big fan of...
View all posts by Chaoran Chen >

Denise Sison

Denise is currently a Data Science Fellow at New York City Data Science Academy. Denise has extensive molecular biology and clinical research experience with a focus on infectious disease epidemiology and neonatal births outcomes. Denise was educated in...
View all posts by Denise Sison >

David Corrigan

Data Scientist and Researcher. Graduate of Columbia University (PhD) and certified Data Scientist (NYC Data Science Academy, Fall 2018)
View all posts by David Corrigan >

Related Articles

Capstone
Catching Fraud in the Healthcare System
Capstone
The Convenience Factor: How Grocery Stores Impact Property Values
Capstone
Acquisition Due Dilligence Automation for Smaller Firms
Machine Learning
Pandemic Effects on the Ames Housing Market and Lifestyle
Machine Learning
The Ames Data Set: Sales Price Tackled With Diverse Models

Leave a Comment

Cancel reply

You must be logged in to post a comment.

No comments found.

View Posts by Categories

All Posts 2399 posts
AI 7 posts
AI Agent 2 posts
AI-based hotel recommendation 1 posts
AIForGood 1 posts
Alumni 60 posts
Animated Maps 1 posts
APIs 41 posts
Artificial Intelligence 2 posts
Artificial Intelligence 2 posts
AWS 13 posts
Banking 1 posts
Big Data 50 posts
Branch Analysis 1 posts
Capstone 206 posts
Career Education 7 posts
CLIP 1 posts
Community 72 posts
Congestion Zone 1 posts
Content Recommendation 1 posts
Cosine SImilarity 1 posts
Data Analysis 5 posts
Data Engineering 1 posts
Data Engineering 3 posts
Data Science 7 posts
Data Science News and Sharing 73 posts
Data Visualization 324 posts
Events 5 posts
Featured 37 posts
Function calling 1 posts
FutureTech 1 posts
Generative AI 5 posts
Hadoop 13 posts
Image Classification 1 posts
Innovation 2 posts
Kmeans Cluster 1 posts
LLM 6 posts
Machine Learning 364 posts
Marketing 1 posts
Meetup 144 posts
MLOPs 1 posts
Model Deployment 1 posts
Nagamas69 1 posts
NLP 1 posts
OpenAI 5 posts
OpenNYC Data 1 posts
pySpark 1 posts
Python 16 posts
Python 458 posts
Python data analysis 4 posts
Python Shiny 2 posts
R 404 posts
R Data Analysis 1 posts
R Shiny 560 posts
R Visualization 445 posts
RAG 1 posts
RoBERTa 1 posts
semantic rearch 2 posts
Spark 17 posts
SQL 1 posts
Streamlit 2 posts
Student Works 1687 posts
Tableau 12 posts
TensorFlow 3 posts
Traffic 1 posts
User Preference Modeling 1 posts
Vector database 2 posts
Web Scraping 483 posts
wukong138 1 posts

Our Recent Popular Posts

AI 4 AI: ChatGPT Unifies My Blog Posts
by Vinod Chugani
Dec 18, 2022
Meet Your Machine Learning Mentors: Kyle Gallatin
by Vivian Zhang
Nov 4, 2020
NICU Admissions and CCHD: Predicting Based on Data Analysis
by Paul Lee, Aron Berke, Bee Kim, Bettina Meier and Ira Villar
Jan 7, 2020

View Posts by Tags

#python #trainwithnycdsa 2019 2020 Revenue 3-points agriculture air quality airbnb airline alcohol Alex Baransky algorithm alumni Alumni Interview Alumni Reviews Alumni Spotlight alumni story Alumnus ames dataset ames housing dataset apartment rent API Application artist aws bank loans beautiful soup Best Bootcamp Best Data Science 2019 Best Data Science Bootcamp Best Data Science Bootcamp 2020 Best Ranked Big Data Book Launch Book-Signing bootcamp Bootcamp Alumni Bootcamp Prep boston safety Bundles cake recipe California Cancer Research capstone car price Career Career Day ChatGPT citibike classic cars classpass clustering Coding Course Demo Course Report covid 19 credit credit card crime frequency crops D3.js data data analysis Data Analyst data analytics data for tripadvisor reviews data science Data Science Academy Data Science Bootcamp Data science jobs Data Science Reviews Data Scientist Data Scientist Jobs data visualization database Deep Learning Demo Day Discount disney dplyr drug data e-commerce economy employee employee burnout employer networking environment feature engineering Finance Financial Data Science fitness studio Flask flight delay football gbm Get Hired ggplot2 googleVis H20 Hadoop hallmark holiday movie happiness healthcare frauds higgs boson Hiring hiring partner events Hiring Partners hotels housing housing data housing predictions housing price hy-vee Income industry Industry Experts Injuries Instructor Blog Instructor Interview insurance italki Job Job Placement Jobs Jon Krohn JP Morgan Chase Kaggle Kickstarter las vegas airport lasso regression Lead Data Scienctist Lead Data Scientist leaflet league linear regression Logistic Regression machine learning Maps market matplotlib Medical Research Meet the team meetup methal health miami beach movie music Napoli NBA netflix Networking neural network Neural networks New Courses NHL nlp NYC NYC Data Science nyc data science academy NYC Open Data nyc property NYCDSA NYCDSA Alumni Online Online Bootcamp Online Training Open Data painter pandas Part-time performance phoenix pollutants Portfolio Development precision measurement prediction Prework Programming public safety PwC python Python Data Analysis python machine learning python scrapy python web scraping python webscraping Python Workshop R R Data Analysis R language R Programming R Shiny r studio R Visualization R Workshop R-bloggers random forest Ranking recommendation recommendation system regression Remote remote data science bootcamp Scrapy scrapy visualization seaborn seafood type Selenium sentiment analysis sentiment classification Shiny Shiny Dashboard Spark Special Special Summer Sports statistics streaming Student Interview Student Showcase SVM Switchup Tableau teachers team team performance TensorFlow Testimonial tf-idf Top Data Science Bootcamp Top manufacturing companies Transfers tweets twitter videos visualization wallstreet wallstreetbets web scraping Weekend Course What to expect whiskey whiskeyadvocate wildfire word cloud word2vec XGBoost yelp youtube trending ZORI

NYC Data Science Academy

NYC Data Science Academy teaches data science, trains companies and their employees to better profit from data, excels at big data project consulting, and connects trained Data Scientists to our industry.

NYC Data Science Academy is licensed by New York State Education Department.

Get detailed curriculum information about our
amazing bootcamp!

Please enter a valid email address
Sign up completed. Thank you!

Offerings

  • HOME
  • DATA SCIENCE BOOTCAMP
  • ONLINE DATA SCIENCE BOOTCAMP
  • Professional Development Courses
  • CORPORATE OFFERINGS
  • HIRING PARTNERS
  • About

  • About Us
  • Alumni
  • Blog
  • FAQ
  • Contact Us
  • Refund Policy
  • Join Us
  • SOCIAL MEDIA

    ยฉ 2025 NYC Data Science Academy
    All rights reserved. | Site Map
    Privacy Policy | Terms of Service
    Bootcamp Application