
Using Data to Predict Housing Prices in Ames, Iowa

Lilliana Nishihira, Jennifer Ruddock, Hanbo Shao and Daniel Laufer
Posted on Apr 7, 2020

The skills the authors demonstrated here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

Introduction

Predicting housing prices from data can be useful in many ways, both for consumers and for real estate practitioners. How do you know if you’re getting a “good price” for your house? Which aspects of your home correlate most directly with its price? These are the questions we sought to answer, along with finding the relationships between various house features.

We took two different approaches to predicting sale price:

  1. Methodically analyzed each column by data type, checked for correlations and multicollinearity, and ran multiple penalized linear regression models.
  2. Leveraged various tree-based models to pick up on non-linear relationships; these models produced feature importances and performed with high accuracy.

The code can be found here.

Data Description

The dataset comes from a Kaggle competition. It contains 79 features of houses sold in Ames, Iowa between 2006 and 2010, with 1,460 houses in the training set and 1,459 houses in the test set.
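
As a quick sketch of the setup (assuming the standard Kaggle file names train.csv and test.csv), the data can be loaded with pandas:

    import pandas as pd

    # Standard Kaggle file names for the Ames competition (assumed paths)
    train = pd.read_csv("train.csv")
    test = pd.read_csv("test.csv")

    print(train.shape)  # (1460, 81): Id, SalePrice, and 79 features
    print(test.shape)   # (1459, 80): the test set has no SalePrice column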

The continuous variables mostly describe the square footage of different parts of the property, along with the age of the house and the number of rooms in different categories (bathroom, bedroom, etc.).

The ordinal categorical variables describe the quality and condition of the property overall and of the garage, the basement, and the masonry veneer of the house.

The nominal categorical variables describe things such as materials used, property type, binary features such as whether or not something is “finished” or whether there is central air, and nearby landmarks like parks or railroads. 

Exploratory Data Analysis and Feature Selection

We want to avoid the “curse of dimensionality”: as more features are added, the data becomes sparser and more spread out among them, making it increasingly difficult to draw statistically significant conclusions. Therefore, we need to find a way to choose the variables most important to the sale price and remove or combine any redundant, multicollinear, or unimportant features.

Using Data to Analyze Missingness

The above plot shows the missingness of the variables in the dataset. For most features, the missingness is due to the fact that the property did not have that feature. For example, if the house does not have a basement or garage, then all features related to the basement or garage are missing. A small number of features appear to be more randomly missing. The red vertical line at 40% indicates our cutoff used for removing variables with too many missing values. We used this metric to remove the pool quality, miscellaneous feature, alley, fence type, and fireplace quality features.
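
A minimal sketch of this cutoff, assuming the training data sits in a pandas DataFrame named train as above:

    # Fraction of missing values per column, sorted for inspection
    missing = train.isnull().mean().sort_values(ascending=False)

    # Drop any feature missing in more than 40% of observations
    too_sparse = missing[missing > 0.40].index
    train = train.drop(columns=too_sparse)
    print(list(too_sparse))  # PoolQC, MiscFeature, Alley, Fence, FireplaceQu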

Using Data to Analyze Imbalanced Classes

As shown below, we discovered that many of the categorical features were dominated by a single class, so we ran a script to check for these. If most of the dataset falls in one category, our model may have trouble picking up on the intricacies of the target variable (sale price). If at least 70% of observations fell in the same class of a given feature, we chose to omit that feature from our analysis.
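
The check itself is simple; a sketch of the script, using the 70% threshold on the most common class:

    # Flag features where a single class covers at least 70% of observations
    dominant = []
    for col in train.select_dtypes(include="object").columns:
        top_share = train[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= 0.70:
            dominant.append(col)

    train = train.drop(columns=dominant)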

Using Data to Analyze Skewness

Above are histograms of the target variable, sale price. The raw sale price is clearly right-skewed (left). To get a more normal distribution, we applied a log transform to the sale price (right), which allows for a better fit with regression techniques. The following plot shows the skewness of the variables in the dataset; the two red vertical lines mark skewness values of -1 and 1, respectively. The variables related to porch areas and square footage appear to be highly skewed, and we later reduce this skewness through feature engineering. The log transform above already greatly reduces the skewness of the target itself.
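
A sketch of both steps, using pandas’ built-in skewness and numpy’s log1p (which is safe at zero):

    import numpy as np

    # Log-transform the right-skewed target
    train["SalePrice"] = np.log1p(train["SalePrice"])

    # Skewness of the numeric predictors; |skew| > 1 flags the
    # porch-area and square-footage variables mentioned above
    skews = train.select_dtypes(include=np.number).skew()
    print(skews[skews.abs() > 1].sort_values(ascending=False))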

Using Data to Analyze Correlations and Multicollinearity

Some of the continuous variables and ordinal categorical variables are fairly highly correlated with the sale price. The above heatmap shows these correlations, with darker blue indicating a higher correlation. The sale price is the first row and column in the plot; one can see that it is most highly correlated with the overall quality of the house and the ground living area.

These two features are obviously important in determining the sale price. The correlations between the features are also shown: the number of cars the garage can fit and the area of the garage are fairly highly correlated with the sale price (0.64 and 0.62, respectively), but they are also very highly correlated with each other (0.88). This makes sense, as a bigger garage should fit more cars; one of these variables may therefore be redundant. Similarly, the ground living area is highly correlated with the number of rooms above ground level. To remove multicollinear variables, we also ran Lasso and ridge regressions on the data.
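
A sketch of the heatmap and the redundancy check, using the Kaggle column names (GarageCars, GarageArea):

    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Correlation matrix of the numeric features
    corr = train.select_dtypes(include=np.number).corr()
    sns.heatmap(corr, cmap="Blues")
    plt.show()

    # Highly correlated pairs flag redundant features
    print(corr.loc["GarageCars", "GarageArea"])  # ~0.88 in this dataset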

After the above analysis, the remaining variables were: lot area, house style, overall quality, overall condition, year built, masonry veneer area, exterior quality, heating quality, number of bedrooms above ground, kitchen quality, total rooms above ground, number of fireplaces, garage area, month sold, and year sold.

Before moving on to feature engineering, we take a closer look at the variables that have a high correlation with sale price.

The plot above shows sale price against ground living area. There is a strong positive relationship between these two variables: in general, a larger living area leads to a higher sale price. However, the two observations in the lower right corner, which have large living areas but remarkably low sale prices, are outliers, and we removed them before fitting the models.
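
A sketch of that filter, assuming the Kaggle column names and the log-transformed target from above (the 4,000-square-foot threshold is our own reading of the scatter plot, not a value stated in the post):

    import numpy as np

    # Drop the two large-but-cheap houses visible in the lower right corner
    outliers = (train["GrLivArea"] > 4000) & (train["SalePrice"] < np.log1p(300000))
    train = train[~outliers]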

The plot above shows a boxplot of sale price for each overall quality level. There is an obvious increasing trend between overall quality and sale price. On the other hand, there appears to be more variation in sale price at the higher quality levels.

The plots above indicate that the year and month of sale could make a difference in the sale price. For example, the financial crisis hit around 2008, and the sale price experienced a minor increase around the time of the crisis. The plot of monthly sale price shows that houses sold in spring tend to have a lower sale price. Because of these differences, we treat year and month as categorical variables and dummify them before fitting the models, as sketched below.
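
A sketch of that dummification with pandas:

    import pandas as pd

    # Treat sale year and month as categories, then one-hot encode them
    train[["YrSold", "MoSold"]] = train[["YrSold", "MoSold"]].astype(str)
    train = pd.get_dummies(train, columns=["YrSold", "MoSold"], drop_first=True)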

Feature Engineering and Data Imputation

The aggregated variables include the following:

“Total area” is the sum of the ground living area and the basement area. The following plot shows a positive relationship between sale price and total square footage (“total area”). In fact, the correlation between total square footage and sale price is close to 0.8, higher than that of any individual area-related variable.

“Total porch area” is the sum of the areas of any porches or decks. As above, the following plot demonstrates the positive relationship between total porch area and the target variable, sale price. The correlation is around 0.4, which is relatively low, but still higher than that of any individual porch area variable.

“Half baths” and “full baths” were combined with the basement half baths and full baths, since there was no difference in their correlations with sale price and combining them reduces the number of columns. The following plot shows the relationship between the total number of bathrooms (the newly created variable) and the sale price.

Before fitting models, the last step is to impute the variables that still have missing values. Houses with missing basement features were assumed to not have basements, and therefore to have 0 basement bathrooms and 0 basement area. Similarly, missing garage areas and masonry veneer areas were imputed as 0. Missing kitchen quality values were imputed as “typical/average”, which was both the median and the mode for that feature.
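
A sketch of the aggregation and imputation steps, using the Kaggle column names (imputing first so missing basement baths don’t propagate into the sums):

    # Missing basement/garage/veneer values mean the feature is absent: impute 0
    zero_cols = ["BsmtFullBath", "BsmtHalfBath", "TotalBsmtSF",
                 "GarageArea", "MasVnrArea"]
    train[zero_cols] = train[zero_cols].fillna(0)

    # Kitchen quality: "TA" (typical/average) is both the median and the mode
    train["KitchenQual"] = train["KitchenQual"].fillna("TA")

    # Aggregated variables
    train["TotalArea"] = train["GrLivArea"] + train["TotalBsmtSF"]
    porch_cols = ["OpenPorchSF", "EnclosedPorch", "3SsnPorch",
                  "ScreenPorch", "WoodDeckSF"]
    train["TotalPorchArea"] = train[porch_cols].sum(axis=1)
    train["FullBaths"] = train["FullBath"] + train["BsmtFullBath"]
    train["HalfBaths"] = train["HalfBath"] + train["BsmtHalfBath"]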

Models

Using Data to Analyze Elastic Net

After running a 5-fold cross-validation on the elastic net hyperparameters, we found the best fit with a penalization parameter (alpha) of 0.01 and an L1 (Lasso) ratio of 0.01. The feature coefficients are shown below. While they are not directly interpretable as in an unpenalized regression, the coefficients still indicate the importance of each feature in determining the sale price.

The plot shows the magnitude of each coefficient, with the sign given by color. Dummy variables for heating quality, kitchen quality, exterior quality, and house style are given by variables starting with “HQC”, “KQ”, “EQ”, and “HS”, respectively. We can see that the most important features are the total square footage, the year built, the overall quality, the overall condition, and the garage area. This is fairly easy to understand, as a larger house with a large garage that was built recently and is of good quality and condition should cost more.

We used the pace at which coefficients dropped to zero with increasing lambda in multiple Lasso regressions to further reduce the number of features. We then ran a grid search on the elastic net, searching over whether to use ridge or Lasso penalization (the rho hyperparameter) and how strong the penalty should be (the lambda parameter).
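
A sketch of that grid search with scikit-learn (the exact grids are ours; the post only reports the winning values):

    from sklearn.linear_model import ElasticNet
    from sklearn.model_selection import GridSearchCV

    # X: fully dummified, imputed feature matrix; y: log sale price
    X = train.drop(columns=["SalePrice"])
    y = train["SalePrice"]

    grid = GridSearchCV(
        ElasticNet(max_iter=10000),
        param_grid={"alpha": [0.001, 0.01, 0.1, 1.0],    # penalty strength
                    "l1_ratio": [0.01, 0.1, 0.5, 0.9]},  # Lasso/ridge mix
        cv=5,
        scoring="neg_root_mean_squared_error",
    )
    grid.fit(X, y)
    print(grid.best_params_)  # the post reports alpha=0.01, l1_ratio=0.01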

This gave us a training error of 0.1019 and a testing error of 0.1395.

Although the testing error is a bit higher than the training error and could point to overfitting, the process we used was very transparent and can be tweaked or changed without losing the entire methodology. We would choose this model to give to a client that doesn’t want a black-box approach. 

Using Data to Analyze Tree-Based Models

In addition to a regularized multiple linear regression, we also fit a few tree-based models: random forest, gradient boosting, and extreme gradient (XG) boosting. The feature importances plotted above are for models using the feature selection and engineering described above. The black dotted line marks a feature importance of 0.05.

Both the random forest and gradient-boosted models heavily favor house style and garage area: the random forest has only those two features with importance greater than 0.05, while the gradient-boosted model additionally puts the overall house condition just barely over 0.05. The XG-boosted model shares importance among more variables, with six features above 0.05: house style, number of half baths, garage area, total area, overall condition, and masonry veneer area.

For all these models, this raises questions: why do the random forest and gradient-boosted models prefer house style and garage area over the size of the house, and why does XG boost prefer half baths over full baths? This may be because when two variables have a high correlation or contingency, tree-based methods may choose either one with little discrimination. House style may be closely related to the size of the house; for example, a 2-story house should be bigger than a 1-story one.

While tree-based models are less interpretable than linear models, they have one significant advantage: categorical features do not necessarily have to be converted to dummy variables and can simply be label encoded. That keeps the dimensionality low and the feature space smaller, and therefore easier to fit. With this in mind, we added most of the variables back in and re-ran the grid searches, as sketched below.
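
A sketch of the label-encoding step and one of the re-run grid searches, here with XGBoost (the hyperparameter grid is illustrative, not the authors’):

    from sklearn.model_selection import GridSearchCV
    from xgboost import XGBRegressor

    # Label-encode categoricals instead of dummifying them
    X2 = train.drop(columns=["SalePrice"]).copy()
    for col in X2.select_dtypes(include="object").columns:
        X2[col] = X2[col].astype("category").cat.codes

    xgb_grid = GridSearchCV(
        XGBRegressor(),
        param_grid={"n_estimators": [500, 1000],
                    "max_depth": [3, 4, 6],
                    "learning_rate": [0.01, 0.05, 0.1]},
        cv=5,
        scoring="neg_root_mean_squared_error",
    )
    xgb_grid.fit(X2, train["SalePrice"])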

Below are the feature importances for the 30 features with the highest importance for the XG Boost model. Here, we see that random forest and gradient boost again favor two features, but this time they are different features: lot area and exterior quality. Meanwhile, the most important features for XG boost are overall condition, lot area, overall quality, the fence type, exterior quality, and garage quality. 

 

Conclusions

Below is a table of the training and test (Kaggle) errors for each model. The “v2” suffix indicates fits performed on the dataset with features added back in. The random forest model with the more aggressive feature selection has both the highest training error and the highest test error.

However, that fit also appears to have the least overfitting, with a difference in errors of 1.85%, while the rest show at least 3% overfitting. Adding more features lowered the errors in all the tree-based models. The lowest overall training and test errors came from XG boost. This model partitioned the feature importance more evenly across the variables than random forest and gradient boost did, which probably contributes to its overall better fit.

Model            MLR      RF       Grad     XG       RF-v2    Grad-v2  XG-v2
Training Error   0.1019   0.1253   0.1053   0.1024   0.1145   0.0918   0.0881
Test Error       0.1395   0.1438   0.1353   0.1342   0.1452   0.1340   0.1247
Overfitting      0.0376   0.0185   0.0300   0.0318   0.0307   0.0422   0.0366

On the reduced dataset, the multiple linear regression performed similarly to the boosted models, indicating the overall linearity of the features with respect to the log-transformed sale price. While adding more features created an overall better fit, the main advantage of the linear regression is its interpretability: it is easy to see how the features relate to sale price. For anyone looking to buy or sell a house, the most important features are overall size, quality, and condition, along with the house style, year built, lot size, and garage size.

 

