
Using Data to Predict the Housing Market in Ames, Iowa

Justin L. Ng, Nelson Lam, Yuqin Xu and Maomao Yi
Posted on Sep 2, 2019

The skills the authors demonstrated here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

The data show that housing is the fount of middle-class wealth accumulation. Owning a house is not only a dream for millions of Americans but also one of the most expensive purchases they will ever make. From these aspirations arises one of the most powerful engines of the modern economy: the US housing market, whose collective fluctuations in price augur times good and bad.

Applying machine learning to housing data is therefore a worthwhile task. We do so here with the intent of predicting, as a prototypical exercise, sale prices in the housing market of Ames, Iowa.

The code for this project can be found here.

 

Dataset

Our dataset describes the sales of 2,919 residential properties in Ames, Iowa from 2006 to 2010. Our exercise is twofold: to build a model with high predictive accuracy of sales prices; and to determine which, among the 79 explanatory variables given, have the most predictive power. By splitting our dataset in half, we obtained 1,460 observations on which to tune our model and 1,459 observations on which to test its accuracy.

 

Data Exploration

As we scanned our variables for outlying observations, we identified an egregious violation of the generally linear relationship between sale price and square footage.

Two of the largest houses were sold well below price, likely reflecting "special relationships" between sellers and buyers (friends and family, perhaps). We removed these outliers to avoid skewing our predictions.
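
A minimal sketch of this filtering step, assuming the Kaggle training file has been loaded into a pandas DataFrame; the thresholds shown are illustrative placeholders, not our exact cut-offs:

import pandas as pd

# Load the Kaggle Ames training data (file name assumed).
train = pd.read_csv("train.csv")

# Drop unusually large houses that sold well below the trend line.
outliers = train[(train["GrLivArea"] > 4000) & (train["SalePrice"] < 300000)]
train = train.drop(outliers.index)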

Of especial concern are the null values in our data. There are 35 columns with missing data, with four different types of missingness. 

Much of that missingness, in fact, reflects non-existence. That a house lacks a fireplace or a swimming pool is distinct from the usual notion of missingness, in which existing values go unrecorded due to error. Twenty-four columns record such non-existent values, which we imputed with 'None' or 0 depending on the column's format.
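
A sketch of this imputation, continuing from the DataFrame above; the column lists are an example subset, not our full selection:

# Categorical columns where NA means the amenity does not exist.
none_cols = ["FireplaceQu", "PoolQC", "Fence", "MiscFeature", "GarageType"]
# Numeric columns where NA means a size or count of zero.
zero_cols = ["GarageArea", "GarageCars", "TotalBsmtSF", "MasVnrArea"]

for col in none_cols:
    train[col] = train[col].fillna("None")
for col in zero_cols:
    train[col] = train[col].fillna(0)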

We imputed several categorical features via the mode, as the distribution of existing values in each of those features was dominated by a single element. Other categorical features, whose values were more evenly distributed, were imputed randomly in proportion to the observed frequencies.
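
The two strategies might look as follows; the columns named are illustrative examples rather than our full list:

import numpy as np

# Mode imputation for a feature dominated by a single value.
train["Electrical"] = train["Electrical"].fillna(train["Electrical"].mode()[0])

# Random imputation in proportion to the observed value frequencies.
def impute_proportional(s, rng=np.random.default_rng(0)):
    probs = s.value_counts(normalize=True)      # distribution of observed values
    fill = rng.choice(probs.index.to_numpy(), size=s.isna().sum(), p=probs.values)
    s = s.copy()
    s.loc[s.isna()] = fill
    return s

train["MSZoning"] = impute_proportional(train["MSZoning"])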

Lot frontage, the strip of land adjacent to public roads, is likely to scale with the size of a property in its given neighborhood. We imputed its missing values with the medians of the associated neighborhoods.
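
In pandas, this neighborhood-level imputation is a single grouped transform (a sketch, continuing from the DataFrame above):

# Fill missing lot frontage with the median frontage of the same neighborhood.
train["LotFrontage"] = train.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median())
)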

We felt it reasonably safe to assume any existing garages with missing construction dates were built in the same years as their original houses.

Last, upon examining the Utilities column, we found only one observation in the training set that deviated from the modal value, and none at all in the test set. We dropped this uninformative feature from our dataset.
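
These last two steps are simple one-liners; a sketch:

# Garages with no recorded construction date are assumed built with the house.
train["GarageYrBlt"] = train["GarageYrBlt"].fillna(train["YearBuilt"])

# Utilities is nearly constant, so it carries no predictive signal.
train = train.drop(columns=["Utilities"])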

 

Feature Engineering

Although several features exist to list the square footage of various parts of the house, no single column expresses total square footage. This seemed a major oversight, and so we engineered such a feature.
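
A sketch of the engineered feature; the exact components summed here (basement plus first and second floors) are an assumption:

# Total square footage as the sum of basement, first-floor, and second-floor areas.
train["TotalSF"] = train["TotalBsmtSF"] + train["1stFlrSF"] + train["2ndFlrSF"]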

A cursory glance shows that the new feature correlates more strongly with sale price than any of its component features.

In addition to total square footage, we added an annual inflation index to normalise house prices from different years.

A few numerical features are suspected to be ordinal categories rather than true quantities; YrSold, for instance, since some years, given the housing cycle, offer more attractive prices at which to buy or sell. To prepare these features for encoding, we converted them to string values.

We label-encoded our ordinal features into numeric rankings of [0, 1, 2, ...]. Categorical features without intrinsic ordering were instead one-hot encoded. Although this expanded our feature space considerably, to 221 features in total, our variables needed to be coded numerically in order to proceed with modeling.
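
A sketch of the two encoding strategies; the ordinal mapping and column subset are illustrative:

# Ordinal features: map quality categories to integer ranks.
qual_map = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
for col in ["ExterQual", "KitchenQual", "BsmtQual"]:     # example subset
    train[col] = train[col].map(qual_map)

# Remaining categoricals without intrinsic order: one-hot encode.
train = pd.get_dummies(train)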

We standardised our dataset to improve numeric stability and enable comparisons across multiple explanatory features.

Last, we identified all features with a distributional skewness greater than 1.5, as computed by the Fisher-Pearson coefficient of skewness. To maximise the predictive power of our model, we normalised the distributions of these features with Box-Cox transformations.
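
A sketch of the skew correction using scipy; the boxcox1p variant (a Box-Cox transform of 1 + x, which tolerates zero values) and the fixed lambda of 0.15 are assumptions rather than our exact settings:

import numpy as np
from scipy.stats import skew                 # Fisher-Pearson skewness by default
from scipy.special import boxcox1p

numeric_cols = train.select_dtypes(include=[np.number]).columns.drop("SalePrice")
skewness = train[numeric_cols].apply(lambda s: skew(s.dropna()))
skewed = skewness[skewness > 1.5].index

for col in skewed:
    train[col] = boxcox1p(train[col], 0.15)  # illustrative lambda value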

 

Modelling

We tuned a variety of models on our dataset, selecting for each the hyper-parameters that gave the lowest cross-validated root mean-squared error (RMSE); the scores reported below fall between 0 and 1.
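
A minimal helper for computing such a score with scikit-learn might look like the following; the log-transformed target is an assumption, consistent with the Kaggle metric mentioned later:

from sklearn.model_selection import KFold, cross_val_score

def cv_rmse(model, X, y, folds=5):
    """Cross-validated RMSE; y is assumed to be log1p(SalePrice)."""
    kf = KFold(n_splits=folds, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y,
                             scoring="neg_root_mean_squared_error", cv=kf)
    return -scores.mean()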

1) Elastic-Net

Elastic-net is a form of regularised linear regression that naturally selects for significant features. It relies on an α-parameter, which directly penalises estimator coefficients, and an L1-ratio parameter, which determines the balance between eliminating non-significant features and merely shrinking coefficient sizes. Tuning elastic-net gave us an RMSE of 0.1110.
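
A sketch of such a tuning run with scikit-learn; the grid values are illustrative, not our final choices:

from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

enet_search = GridSearchCV(
    ElasticNet(max_iter=10000),
    param_grid={"alpha": [0.0005, 0.001, 0.005, 0.01],   # penalty strength
                "l1_ratio": [0.1, 0.5, 0.9]},            # L1 vs L2 balance
    scoring="neg_root_mean_squared_error",
    cv=5,
)
# enet_search.fit(X, y)   # X: encoded feature matrix, y: log sale price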

This is a relatively good result and evidence of how strongly our predictions depend on linear relationships. The central issue with linear regression, however, is its inability to capture the characteristics of other, nonlinear features.

2) Kernel Ridge Regression

We next turned to kernel ridge regression (KRR). The idea here is to employ a flexible set of nonlinear prediction functions, modulated by a penalty term to avoid overfitting. The kernel essentially defines a higher-dimensional inner-product space in which linear relationships are easier to identify within otherwise nonlinear neighborhoods.

A choice of polynomial kernel allows us to generalise beyond purely linear space while still retaining the linear option. After substantial tuning, however, we found no improvement over elastic-net, with an RMSE of 0.1115.
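
A sketch of the estimator, assuming scikit-learn's KernelRidge; the hyper-parameter values are placeholders:

from sklearn.kernel_ridge import KernelRidge

# Polynomial kernel of degree 2; alpha is the ridge penalty,
# coef0 is the kernel's independent term.
krr = KernelRidge(kernel="polynomial", degree=2, alpha=0.6, coef0=2.5)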

3) Gradient Boosting

We next resorted to non-parametric modeling. Gradient boosted decision trees (GBDT) are a powerful class of models that, by iterating over sequential subtrees to minimise a loss function, sort through the hierarchy of feature space to learn linear, nonlinear, and interaction effects.

Stochastic gradient boosting adds the element of subsampling the training data, stochastically and without replacement, when fitting the base learner trees. The benefit of the stochastic process over normal gradient boosting is a reduction in variance, as the subtrees are decorrelated from one another. Tuning GBDTs is considerably more involved than for our previous models. Nevertheless, we found improvement, with an RMSE of 0.1086.
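
A sketch of a stochastic GBDT with scikit-learn; setting subsample below 1.0 is what makes the boosting stochastic, and the remaining values are illustrative:

from sklearn.ensemble import GradientBoostingRegressor

gbdt = GradientBoostingRegressor(
    n_estimators=3000,      # number of sequential subtrees
    learning_rate=0.05,
    max_depth=4,
    subsample=0.8,          # fraction of rows drawn (without replacement) per tree
)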

4) XGBoost

Taking our progress with decision trees a step further, we implemented XGBoost, which introduces a regularising γ-parameter to control the complexity of tree partitioning. The higher the γ, the larger the minimum loss reduction required to split the tree at a leaf node.
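
A sketch of the estimator with the xgboost package; the parameter values are illustrative:

import xgboost as xgb

xgb_model = xgb.XGBRegressor(
    n_estimators=2000,
    learning_rate=0.05,
    max_depth=3,
    gamma=0.05,             # minimum loss reduction required to split a leaf
    subsample=0.8,
    colsample_bytree=0.8,
)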

XGBoost gave an RMSE of 0.0487. This much superior result raised our suspicions, prompting us to apply the model to the test set, where it produced an RMSE of 0.1156. The large gap in error made a clear case of severe overfitting.

Given the breadth of hyperparameter space to search through and the high computational costs of doing so, we most assuredly did not find the optimal parameters for XGBoost. As we proceeded to our final step, in light of time constraints, we chose to put improvements on XGBoost aside for future consideration. 

5) Model Ensembling

Model ensembling is a powerful technique for improving prediction. Much as how GBDTs iteratively improve on an ensemble of base learner trees, so can we stack multiple base models to achieve superior performance.

There are a few schools of thought on ensembling. The general theme is to incorporate the predictions of the base models as additional features into the training set, onto which a stacking model will fit.

This meta-model weighs the predictions in order to highlight the strengths of the respective base models and smooth out their weaknesses. A diverse selection of base models, sufficiently decorrelated as to capture linear, hierarchical, and nonlinear relationships in our data, is therefore desirable.

Which predictions to include as new meta-features is a matter of method. Test predictions generated during the cross-validation process may be included; more thoroughly, so may test predictions from base models refit on the entire training dataset.

We tune our meta-model for its optimal hyper-parameters by fitting over the complete set of meta-data, using the same cross-validation folds as those that generated our meta-features. In theory this should introduce some overfitting through target leakage, since the meta-features are themselves derived from target values in the corresponding folds; in practice, the effect is small enough to be negligible.

At the last, our meta-model makes a final test prediction. 
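
A minimal sketch of this stacking procedure, assuming numpy feature matrices and unfitted scikit-learn-style base models; it uses the out-of-fold variant of meta-feature generation and averages the fold-wise test predictions:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Lasso

def stack(base_models, meta_model, X, y, X_test, folds=5):
    kf = KFold(n_splits=folds, shuffle=True, random_state=42)
    meta_train = np.zeros((X.shape[0], len(base_models)))
    meta_test = np.zeros((X_test.shape[0], len(base_models)))

    for j, model in enumerate(base_models):
        fold_test_preds = np.zeros((X_test.shape[0], folds))
        for i, (tr_idx, val_idx) in enumerate(kf.split(X)):
            model.fit(X[tr_idx], y[tr_idx])
            meta_train[val_idx, j] = model.predict(X[val_idx])  # out-of-fold predictions
            fold_test_preds[:, i] = model.predict(X_test)
        meta_test[:, j] = fold_test_preds.mean(axis=1)          # averaged test predictions

    meta_model.fit(meta_train, y)      # meta-model fits on the base predictions
    return meta_model.predict(meta_test)

# Example usage (base models and the Lasso alpha are placeholders):
# final_pred = stack([enet, krr, gbdt], Lasso(alpha=0.0005), X, y, X_test)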

Results & Feature Importance

Lasso regression was our choice of meta-model, with elastic-net, KRR, and stochastic GBDT as its base models. Altogether we achieved an improved train RMSE of 0.0689 and a test RMSE of 0.1086.

In a final gambit, we averaged the results of our stacked meta-model with those from XGBoost, to test whether naively weighted diversification would improve our predictions. The weights we gave each model were proportioned to their test RMSEs.

This did, indeed, result in our best official prediction score, which Kaggle computes on a slightly different error metric, the root mean-squared logarithmic error (RMSLE): 0.1202.

The top ten features from our model are as follows:

There is some redundancy in our rankings, as GrLivArea and TotalBsmtSF roughly approximate to TotalSF. On the whole, there are few surprises.

Sale prices are most heavily influenced by overall perceived quality and total square footage. The more attractive and well-kept the exterior facade, the greater the likely sale price. The year in which a house was built may determine its construction style or the likelihood that it remains well-maintained. That latter issue of timely maintenance correlates strongly with any recent remodeling effort. And, of course, amenities such as garage space and the number of full bathrooms matter to the typical buyer.

We recommend that any aspiring real estate agent target properties that maximally exploit these variables, at least to the extent that each neighborhood permits.

 

Future Work

We exercised a variety of statistical learning techniques, examined their efficacies, and designed a stacked model to boost predictive performance. Still, there are a few areas we can think of for improvement. 

The selection and tuning of a meta-model and its constituents is more art than science. We could experiment with a greater breadth of algorithms to ensemble a more diverse set of base models, each chosen for its ability to capture decorrelated characteristics of our data.

Similarly, we might experiment with the choice of meta-model.

One non-standard model of interest is CatBoost, which has a reputation for convenient handling of, and highly accurate predictions on, categorical data. Significantly, it implements ordered boosting, training multiple models on ordered permutations of each stochastic subsample of the data in order to reduce bias in the gradient boosting procedure. This additional modeling effort results in much larger computational training costs, but acceptably so for a dataset as small as ours.

Our reliance on tree ensembles such as stochastic GBDT, XGBoost, and the aforementioned CatBoost results in enormous hyper-parameter search spaces. Even with intelligently iterative tuning, we cannot be sure of finding the best set of hyper-parameters using basic grid and randomised search techniques.

We should, therefore, implement Bayesian optimisation (BO). BO starts with a naive prior, our surrogate function, and updates it with successive rounds of locally-optimised sampling, building an increasingly accurate statistical picture of the search space. By maximising an acquisition function, which directs us to the next-best sampling location given the updated surrogate, we weigh exploration of regions the surrogate knows little about against exploitation of regions it already predicts to be good. In this way, BO tries to find the best hyper-parameters in as few steps as possible.
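
A sketch of what this could look like with the scikit-optimize package, tuning a GBDT over an illustrative search space; the cv_rmse helper is the one sketched earlier:

from skopt import gp_minimize
from skopt.space import Real, Integer
from sklearn.ensemble import GradientBoostingRegressor

space = [Real(0.01, 0.3, name="learning_rate"),
         Integer(2, 6, name="max_depth"),
         Real(0.5, 1.0, name="subsample")]

def objective(params):
    lr, depth, sub = params
    model = GradientBoostingRegressor(
        n_estimators=500, learning_rate=lr, max_depth=depth, subsample=sub)
    return cv_rmse(model, X, y)        # cross-validated RMSE to be minimised

# result = gp_minimize(objective, space, n_calls=50, random_state=42)
# result.x holds the best hyper-parameters found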

Our reliance on one-hot encoding severely dilutes the informative value of our encoded categories. Our models would benefit from alternative encoding strategies that limit this expansion of the feature space. With such an encoding, one would expect neighborhood, for instance, to emerge among the top determinants of sale price.

Last, and perhaps of greatest importance, are feature interactions. We suspect there are significant features, strongly correlated with sale price, still to be engineered, but doing so requires domain knowledge or corresponding data from outside sources. Not being domain experts ourselves, we would need remarkable time and effort to discover them.

About Authors

Justin L. Ng

The author is an enthusiast of data-driven decision-making. He is a graduate of Rice University, where he studied engineering and economics, and of the Collegiate School.
View all posts by Justin L. Ng >

Nelson Lam

View all posts by Nelson Lam >

Yuqin Xu

View all posts by Yuqin Xu >

Maomao Yi

View all posts by Maomao Yi >
