Machine Learning Project: Ames Housing Dataset

Thomas Deegan, Brandon Deniz, Hayley Caddes and John McGlynn
Posted on Sep 3, 2018

Introduction and Background

The Ames Housing Dataset was introduced by Professor Dean De Cock in 2011 as an alternative to the Boston Housing Dataset (Harrison and Rubinfeld, 1978). It contains 2,919 observations of housing sales in Ames, Iowa between 2006 and 2010. There are 23 nominal, 23 ordinal, 14 discrete, and 20 continuous features describing each house's size, quality, area, age, and other miscellaneous attributes. For this project, our objective was to apply machine learning techniques to predict the sale price of houses based on their features.

Pre-Processing

To get a better understanding of what we were working with, we started off with some exploratory data analysis. We quickly found that overall material and finish quality (OverallQual, discrete) and above-grade living area (GrLivArea, continuous) had the strongest relationships with sale price. Using a widget (shown below), we then examined the remaining features to get a sense of which were significant and how we could feed them into linear and tree-based machine learning algorithms. In the end, 53 of the original features were kept in some fashion.
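
As a rough companion to that exploration, the sketch below ranks the numeric columns by their correlation with sale price. The file path and variable names are our assumptions here, not the original analysis code.

```python
import pandas as pd

# Minimal EDA sketch (file path and variable names are assumptions).
train = pd.read_csv("train.csv")

# Rank numeric features by their correlation with SalePrice.
corr = (train.select_dtypes("number")
             .corr()["SalePrice"]
             .drop("SalePrice")
             .sort_values(ascending=False))
print(corr.head(10))  # OverallQual and GrLivArea come out on top
```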

Most categorical features were handled with one-hot encoding, with minority classes below a certain number of observations excluded. Certain discrete features were converted to binaries when we found their presence to be more impactful than their frequency (e.g., number of fireplaces → is there a fireplace?). Given the range of some of the continuous features, we found it useful to apply log transformations where appropriate, such as to each house's lot size (LotArea). Lastly, there were some special cases, like the self-explanatory YearBuilt feature (figure below): we found no meaningful relationship with sale price for values prior to 1950, so we clipped the feature at that minimum.
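
A condensed sketch of these transformations is shown below. The rare-level threshold, column choices, and helper name are illustrative assumptions rather than our exact code.

```python
import numpy as np
import pandas as pd

def preprocess(df, min_count=10):
    """Illustrative version of the transformations described above."""
    out = df.copy()

    # Discrete -> binary where presence matters more than frequency.
    out["HasFireplace"] = (out["Fireplaces"] > 0).astype(int)

    # Log-transform skewed continuous features such as lot size.
    out["LogLotArea"] = np.log1p(out["LotArea"])

    # Clip YearBuilt at 1950, since earlier values showed no relationship with price.
    out["YearBuilt"] = out["YearBuilt"].clip(lower=1950)

    # One-hot encode categoricals, then drop dummy columns for rare levels.
    dummies = pd.get_dummies(out.select_dtypes("object"))
    dummies = dummies.loc[:, dummies.sum() >= min_count]
    return pd.concat([out.select_dtypes("number"), dummies], axis=1)
```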

After this initial round of pre-processing and the decision to remove two outliers (houses with above-grade living area over 4,000 sq. ft. but sale prices under $200k), we were well on our way to building our first set of models!
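
Applied to the hypothetical train frame from the earlier sketch, that outlier filter is only a couple of lines:

```python
# Drop the two large houses that sold unusually cheaply (thresholds from the text above).
mask = (train["GrLivArea"] > 4000) & (train["SalePrice"] < 200000)
train = train[~mask]
```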

Linear Models

This problem lends itself well to linear regression. In fact, a simple regression line between above-grade square feet and sale price explains 54% of the variance in sale price! This model produces a cross-validation error of 0.273 in terms of Root Mean Squared Logarithmic Error (RMSLE).

As a quick side note, we chose RMSLE for model evaluation in order to match the scoring metric of the Kaggle competition. RMSLE puts prediction errors on cheap and expensive houses on the same percentage footing, so we are not incentivized to build a model that predicts better in absolute terms for expensive homes than for cheaper ones. Practically, it also makes our cross-validation results a more accurate indicator of Kaggle scores.
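
For reference, the metric and the one-variable baseline can be reproduced with scikit-learn roughly as follows; the helper names are ours, and train is the hypothetical frame from the pre-processing sketches.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

def rmsle(y_true, y_pred):
    # Root mean squared error on log(1 + y); the max() guards against
    # negative predictions before taking the log.
    return np.sqrt(np.mean((np.log1p(np.maximum(y_pred, 0)) - np.log1p(y_true)) ** 2))

rmsle_scorer = make_scorer(rmsle, greater_is_better=False)

# One-variable baseline: above-grade living area vs. sale price.
X = train[["GrLivArea"]]
y = train["SalePrice"]
scores = cross_val_score(LinearRegression(), X, y, scoring=rmsle_scorer, cv=5)
print(-scores.mean())  # in the neighborhood of the 0.273 quoted above
```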

While we can obviously do better than a one-variable model, this simplistic case highlights an issue that we will need to account for in linear regression. Inspecting the residual plot, we can see a classic case of 'fanning' residuals. This violates one of the key assumptions of linear models: that the error term has constant variance (homoscedasticity).

The underlying issue is the skewed, non-normal distribution of both sale price and above-grade square feet.

Applying the Box-Cox transformation to both variables results in much more normal distributions, and a regression on the transformed variables produces a much better-behaved residual plot. Surprisingly, this improvement is not associated with a reduction of model error in the case of simple linear regression. However, the Box-Cox transformation does result in a massive reduction in error when employed for multiple linear regression models.
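
A sketch of the transformation with SciPy is shown below (variable names are ours). scipy.stats.boxcox picks the lambda by maximum likelihood, and predictions made on the transformed scale are mapped back to dollars before scoring.

```python
from scipy.special import inv_boxcox
from scipy.stats import boxcox

# boxcox returns the transformed values plus the fitted lambda,
# which is needed to invert the transform later.
price_bc, lam_price = boxcox(train["SalePrice"])
area_bc, lam_area = boxcox(train["GrLivArea"])

# After fitting on the transformed scale, map predictions back to dollars.
price_back = inv_boxcox(price_bc, lam_price)
```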

Of course, we can start to improve our linear predictions by incorporating the influence of additional explanatory variables. There are strong (and obvious) relationships between some of the explanatory variables, such as above-grade square feet and above-grade rooms, or lot area and lot frontage. Regularization techniques will be critical for controlling this multicollinearity.

Indeed, elastic-net regularization reduces cross-validation RMSLE from 0.251 to 0.118. Elastic net, ridge, and lasso all performed equally well in cross-validation.
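
One way to fit such a model with scikit-learn is sketched below, assuming X_full is the fully pre-processed feature matrix and y the sale prices: standardize the features, then let ElasticNetCV choose the penalty strength and the L1/L2 mix by cross-validation.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Fit on log sale price so that squared error lines up with the RMSLE objective.
enet = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5, max_iter=5000),
)
enet.fit(X_full, np.log1p(y))
```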

The variable coefficients from elastic net confirmed our insights from exploratory data analysis. House size and quality seem to be the most important variables for determining sale price.

Tree-Based Models

Going beyond linear regression, we next tried fitting our data to tree-based models. The simple decision tree below, with a maximum depth of 3, gives an idea of the features on which our models could split and the breakdown of how our data might be divided:

We initially fed our data to a straightforward decision tree regressor from the scikit-learn Python package. The feature importance plot is shown below:
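
Both of these examples can be reproduced in a few lines; the depth, random seed, and variable names in the sketch are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# A shallow tree like the one pictured above.
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(X_full, np.log1p(y))

# Data behind the feature importance plot: which columns the tree split on.
importances = (pd.Series(tree.feature_importances_, index=X_full.columns)
                 .sort_values(ascending=False))
print(importances.head(10))
```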

While these last two are almost trivial examples, they help us get a sense of the features that might be important for this class of models, as well as the hyperparameters we can tune for our tree-based learners. So far, we see that the features we hypothesized to be important, from our exploratory data analysis and feature engineering, are in fact significant. We do not spend much time on the untuned decision tree model, even though it resulted in a training RMSLE of 0.141 and a cross-validation RMSLE of 0.181; these scores are better than we expected and are most likely the result of our extensive feature engineering. Nevertheless, we move on to a random forest model, where we can tune hyperparameters for the number of trees, the maximum tree depth, the maximum number of features considered at a split, the minimum number of samples required to make a split, and the minimum number of samples required at a leaf. The feature importance plot for our random forest is shown below:
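
A grid search over those hyperparameters might look like the sketch below; the grid values are illustrative rather than the exact grid we used, and rmsle_scorer is the scorer defined earlier.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [200, 500],        # number of trees
    "max_depth": [None, 10, 20],       # maximum tree depth
    "max_features": ["sqrt", 0.5],     # features considered at each split
    "min_samples_split": [2, 5, 10],   # samples required to make a split
    "min_samples_leaf": [1, 2, 4],     # samples required at a leaf
}
rf_search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring=rmsle_scorer,
    cv=5,
    n_jobs=-1,
)
rf_search.fit(X_full, y)
print(rf_search.best_params_, -rf_search.best_score_)
```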

The tuned random forest resulted in a training RMSLE of 0.122 and a cross-validation RMSLE of 0.131. This score is much better than the single decision tree, in large part because a random forest reduces the variance (overfitting) one sees when working with only one tree. However, a random forest has higher bias than a single decision tree. This is because each tree is trained on only part of the training data (bootstrapping), so naturally, higher bias occurs in each tree. Additionally, the random forest algorithm limits the number of features considered at each split, which limits the variables available to explain the data and induces further bias. In an attempt to lower the bias, we next look to a gradient boosted tree-based model; the plots below show the difference in feature importance between the tuned and untuned versions of this model.

Even with the untuned model, the cross-validation RMSLE is 0.116 (training RMSLE 0.037), which is already better than the random forest. This is because boosting reduces the model's bias: the model focuses on observations it predicted poorly and tries to model them better in the next iteration, which reduces bias (underfitting). One can see the importance of tuning hyperparameters, though, when looking at the important features from the untuned model. Some potentially collinear features, such as LotArea vs. LotFrontage and GarageYrBlt vs. YearBuilt, show up, while some features shown to be important in our previous models, such as OverallQual and OverallCond, are absent. Yet by sequentially tuning the number of trees, the max depth, the min samples for a split, and the min samples for a leaf, and then increasing the number of trees while simultaneously decreasing the learning rate, we see fewer "redundant" features and more relevant features (OverallCond, OverallQual, Functional) with high importance in our model. The tuned model is the best individual model we trained, with a training RMSLE of 0.082 and a cross-validation RMSLE of 0.112. Another visualization of the difference between the tuned and untuned models is shown below.
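
For illustration, a gradient boosting machine tuned along those lines might end up configured like the sketch below; the specific values are assumptions for the final round, where the learning rate is lowered and the number of trees increased.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Final-stage settings after sequential tuning: many shallow trees, small learning rate.
gbm = GradientBoostingRegressor(
    n_estimators=3000,
    learning_rate=0.01,
    max_depth=4,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=0,
)
gbm.fit(X_full, np.log1p(y))
```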

Stacked Models

Despite being pleased with the performance of our standalone models, we thought it would be interesting to use this opportunity to explore some ensembling methods, hoping that by combining several strong standalone models we could produce a meta-model that is a better overall predictor.

Up to this point, our highest-performing models, as judged by the Kaggle Public Leaderboard (PL), were an ElasticNet regression model (PL RMSLE 0.121), a tuned gradient boosting machine (PL RMSLE 0.122), and a tuned random forest model (PL RMSLE 0.145). Our initial approach was to average the predictions from our three top models, giving each equal weight. Interestingly, our score did not improve. However, we did see an improvement once we dropped the weakest link (the random forest) and averaged only the ElasticNet and the gradient boosting machine. This resulted in our strongest model up to that point (PL RMSLE 0.118). We attribute this increase in performance to the greater diversity of the ensemble once the second tree-based model was dropped.
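
The blending step itself is just a weighted average of predictions. Here is a sketch, assuming X_test is the pre-processed Kaggle test set and the models were fit on log sale price:

```python
import numpy as np

# Equal-weight average of the two most diverse strong models.
pred_enet = np.expm1(enet.predict(X_test))
pred_gbm = np.expm1(gbm.predict(X_test))
blend = 0.5 * pred_enet + 0.5 * pred_gbm
```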

For our final model, we decided to explore stacking. We included each of our three top standalone models (ElasticNet, gradient boosting machine, and random forest) as base learners and chose a linear regression as our meta-model.
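
One way to wire up that stack with scikit-learn is sketched below (not necessarily how we implemented it at the time): StackingRegressor generates out-of-fold predictions from the three base learners and fits the linear meta-model on them.

```python
import numpy as np
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression

stack = StackingRegressor(
    estimators=[
        ("enet", enet),                      # regularized linear pipeline from above
        ("gbm", gbm),                        # tuned gradient boosting machine
        ("rf", rf_search.best_estimator_),   # tuned random forest
    ],
    final_estimator=LinearRegression(),
    cv=5,
)
stack.fit(X_full, np.log1p(y))
```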

As judged by the Public Leaderboard, this resulted in our top model! The RMSLE was 0.117, which, as of the time of publication, placed in the top 15% of submissions.

Conclusion

Our largest takeaway from working with the Ames Housing Dataset was the value of careful, thoughtful feature engineering. We attribute the strong performance of our models to the time we put into this phase. If we were to continue working with this dataset, we would explore the applicability and effectiveness of principal component analysis and multiple correspondence analysis for reducing dimensionality. It would also be interesting to explore different, strategic tuning parameters in our stacked models.

About Authors

Thomas Deegan

Graduate Student in Computer Science at The University of Chicago

Brandon Deniz


Hayley Caddes


John McGlynn

John McGlynn is a data scientist and a strategic leader. As a data scientist, he is skilled at Python, R, SQL, data visualization, and machine learning. As a strategist, he is skilled at using communication, collaboration, and design...
