
Predicting Iowa House Prices using Supervised Machine Learning Algorithms

Daniel Park, Dimitri Liakhovitski, Gwendolyn Fernandez and Henry Crosby
Posted on Nov 17, 2017

Authors: Daniel Park, Dimitri Liakhovitski, Gwen Fernandez, & Henry Crosby

NYC Data Science Bootcamp, November 2017


Project Background & Objectives

The goal of the project was to explore the data set and predict housing prices in Ames, Iowa using various supervised machine learning techniques, as part of a data science competition hosted by Kaggle.

The data set contained 79 explanatory variables describing almost every aspect of residential homes in Ames. It provided a rich opportunity for feature engineering and advanced regression techniques.

Our team worked on the project from the second half of October through the beginning of November 2017.

Our Team's Journey

Workflow:
  1. Understanding and exploring the data
  2. Brainstorming and first round of feature engineering
  3. Building individual predictive models
  4. Building model ensembles, testing on Kaggle Leader Board
  5. Second round of feature engineering – a different approach; shorter turnaround
  6. Building ensembles again, testing Kaggle Leader Board

 

1. Understanding and Exploring the Data

We created our GitHub repository and fired up Jupyter notebooks. We chose to complete the project entirely in Python.

Initially, we worked individually to familiarize ourselves with the data and gather insights from exploratory data analysis (EDA) using pandas and matplotlib. We focused on the following:

  • Identifying variables that are highly correlated with Sale Price (our target variable);
  • Creating meaningful categories of predictors;
  • Feature engineering opportunities, e.g., meaningful combinations of predictors or interactions between some of them; and
  • Reducing multicollinearity.
Below are a couple of examples of data visualizations that were part of our EDA.
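As an illustration of the first two bullets, here is a minimal sketch of that kind of EDA, assuming the Kaggle train.csv has been loaded into a pandas DataFrame named train:

    import pandas as pd
    import matplotlib.pyplot as plt

    train = pd.read_csv('train.csv')

    # Numeric predictors most correlated with SalePrice (our target)
    corr_with_target = train.select_dtypes('number').corr()['SalePrice'].sort_values(ascending=False)
    print(corr_with_target.head(10))

    # Quick visual check of one strong predictor against SalePrice
    train.plot.scatter(x='GrLivArea', y='SalePrice', alpha=0.4)
    plt.show()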

2. Brainstorming and First Round of Feature Engineering

After the initial exploration, we came together to discuss our findings. We logged all observations in a shared Google Sheet, which proved valuable in organizing our findings and then selecting the appropriate action to take on each feature.

For predictors that we considered important but that had missing values, we imputed the missing values using MICE (Multivariate Imputation by Chained Equations).
We also identified outliers by first standardizing all predictors and then calculating, for each observation, its total Euclidean distance to its 40 closest "neighbor" observations. We removed 13 observations for which this metric was extremely large, to avoid biasing our model.
After many hours of transforming, combining, imputing, and chasing down errors, we had a clean data set ready for predictive modeling. It had 104 predictors.

 

3. Building Individual Predictive Models

From there, our team finalized the first modeling-ready data set. After that, we divided up the work of trying to predict house prices using the following machine learning algorithms:

  • Regularized Linear Models
  • Random Forest
  • Gradient Boosting Machines (Tree-Based)
  • Support Vector Machines and Linear Models with Kernel Trick

 

4. Building Model Ensembles, Testing on Kaggle Leader Board

After working individually for a while, we came together to look at and learn from each other's models, which was a great learning experience. Each team member brought a deeper understanding of his or her model and showed the group a few new technical tricks discovered along the way.

To ensemble the models, we averaged the individual models' predictions and also tried stacking the individual predictions using a meta-model. We were pretty happy with our score in cross-validation and on the Kaggle Leader Board; the scoring metric is root mean squared log error (sale price is logged so that errors at the low and high ends of the price spectrum count more proportionally). Our ensembles came in around 0.125 on the Leader Board after scoring around 0.115 in cross-validation on the training set. That 0.125 on the Leader Board translates to roughly a 13.5% error in dollar terms: good, but not quite as good as we had hoped!
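As a refresher, here is a minimal sketch of the competition metric (root mean squared log error), assuming y_true and y_pred are arrays of actual and predicted sale prices:

    import numpy as np

    def rmsle(y_true, y_pred):
        """Root mean squared log error between actual and predicted sale prices."""
        return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))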

 

5. Second Round of Feature Engineering – A Different Approach; Shorter Turnaround

Sensing that some information had been lost in the variables we originally dropped, we decided to take a risk and ignore multicollinearity (!). We also label-encoded the categorical variables in the hope of helping the tree-based models, a departure from our first round, when we had built dummy variables for categorical variables.
Overall it was a more parsimonious treatment of the features. However, this time we ended up with about 180 predictors.

 

6. Building Ensembles Again, Testing on the Kaggle Leader Board

To our surprise (and some dismay), we found that our predictive techniques performed better on the new data set! Despite the fact that we had ignored multicollinearity, our models seemed able to deal successfully with the collinearity of the house features.

After some trial and error ensembling, we arrived at an ensemble that balanced a few of our best individual models. These models appeared to balance each other in regions of housing prices where the residuals (the errors) of the model increased, usually at the low and high tails of the housing prices in Ames.

In the end, we achieved a top 8.9% position on the Kaggle Leader Board with a root mean squared log error of 0.115. This also happened to be the best in our class 🙂

 

The above was a high-level look at our workflow. Here is how the soup was made:

First Approach to Feature Engineering

We decided to omit variables that had virtually no variance, such as:

  • Street: Type of road access to property – as 99.6% of houses had paved road access
  • Utilities: Type of utilities available – as 99.9% of houses had all public utilities
  • RoofMatl: Roof material – as 98.2% had 'Standard (Composite) Shingles'
  • Heating: Type of heating – as 97.8% had gas heating
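A minimal sketch of that near-zero-variance screen, assuming the raw training data sits in a pandas DataFrame named train; the 97.5% cutoff is an illustrative assumption:

    # Drop columns whose single most frequent level dominates the column
    dominant_share = train.apply(lambda col: col.value_counts(normalize=True, dropna=False).iloc[0])
    low_variance_cols = dominant_share[dominant_share > 0.975].index.tolist()
    train = train.drop(columns=low_variance_cols)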

For each 'object' (categorical) column we looked at its meaning and level frequencies.

We built dummies for the higher-incidence levels, either manually or using pd.get_dummies().

We transformed 'object' columns that had quantitative meaning using our judgement: a "theory-driven" transformation of quality variables to scores, that is, making up a sensible score. For example: excellent kitchen quality = 3, good = 2, average = 1, no kitchen = 0.
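Minimal sketches of both steps; the column names and the score mapping are illustrative assumptions based on the Ames data dictionary:

    import pandas as pd

    # Dummy-code a truly categorical column
    train = pd.get_dummies(train, columns=['Foundation'], drop_first=True)

    # 'Theory-driven' ordinal scores for kitchen quality (Ex/Gd/TA/Fa/Po levels in the raw data)
    kitchen_scores = {'Ex': 3, 'Gd': 2, 'TA': 1, 'Fa': 1, 'Po': 1}
    train['kitchen_quality'] = train['KitchenQual'].map(kitchen_scores).fillna(0)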

We excluded some numeric variables that were statistically and conceptually redundant.

We combined some highly correlated numeric variables into one and used only the new, combined variable in our models.

For example, we combined the following highly correlated quality-related variables into one new variable, 'average_quality', and then omitted those four from our models:

  • 'exterior_quality',
  • 'heating_quality',
  • 'kitchen_quality',
  • 'OverallQual'
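A minimal sketch of that combination, assuming the four score columns above already exist in train:

    quality_cols = ['exterior_quality', 'heating_quality', 'kitchen_quality', 'OverallQual']
    train['average_quality'] = train[quality_cols].mean(axis=1)
    train = train.drop(columns=quality_cols)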

To reduce multicollinearity, we regressed one variable onto several others and kept only its residual.

For example, we regressed the age of the house onto the brick & tile foundation dummy, garage presence, number of full baths, and the exterior vinyl siding dummy, because age was highly correlated with all of those variables.
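A minimal sketch of that residualization step; the column names are illustrative assumptions:

    from sklearn.linear_model import LinearRegression

    age_correlates = ['foundation_brick_tile', 'has_garage', 'FullBath', 'exterior_vinyl']
    lm = LinearRegression().fit(train[age_correlates], train['house_age'])

    # Keep only the part of age that the other predictors cannot explain
    train['house_age_residual'] = train['house_age'] - lm.predict(train[age_correlates])
    train = train.drop(columns=['house_age'])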

 

We kept updating and studying the correlation matrix, striving to reduce predictor multicollinearity.

We imputed missing values using MICE from the 'fancyimpute' package:

  • The 'fancyimpute' package is a Python translation of the R package 'mice', the best library for missing-data imputation.
  • It is a bit tricky to install 'fancyimpute' on Windows.
  • MICE stands for Multiple Imputation by Chained Equations.
  • As our imputation method, we used MICE.
  • Details on MICE: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/
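fancyimpute's interface has changed over the years, so rather than guess at its current API, here is a comparable chained-equations imputation sketch using scikit-learn's IterativeImputer on the numeric predictors:

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates the import below)
    from sklearn.impute import IterativeImputer

    num_cols = train.select_dtypes(include=[np.number]).columns
    imputer = IterativeImputer(max_iter=10, random_state=0)
    train[num_cols] = imputer.fit_transform(train[num_cols])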

 

Finally, we transformed the target variable and some predictors, and scaled (standardized) ALL predictors:

We used the log of the dependent variable, SalePrice, for modeling.

We transformed several numeric predictors using the Box-Cox transformation to make them less skewed.

 

We scaled (standardized) all predictors using StandardScaler.
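A minimal sketch of those three steps, assuming the numeric predictors are non-negative; the skewness cutoff of 0.75 is an illustrative assumption:

    import numpy as np
    from scipy.stats import boxcox, skew
    from sklearn.preprocessing import StandardScaler

    y = np.log(train['SalePrice'])                 # log-transform the target
    X = train.drop(columns=['SalePrice'])

    num_cols = X.select_dtypes(include=[np.number]).columns
    skewed = [c for c in num_cols if abs(skew(X[c])) > 0.75]
    for c in skewed:
        X[c], _ = boxcox(X[c] + 1)                 # shift by 1 so values are strictly positive

    X[num_cols] = StandardScaler().fit_transform(X[num_cols])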

 

Identifying and eliminating outliers: Outliers in the training set could bias our model!

  • We decided to identify them and get rid of them before fitting our models.
  • We built a function that helps identify outliers in TRAIN using scipy.spatial.distance:
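A minimal sketch of that idea (sum of distances to the 40 nearest standardized neighbors, then dropping the most extreme rows), assuming a standardized numeric matrix X_scaled with one row per training observation:

    import numpy as np
    from scipy.spatial.distance import cdist

    def neighbor_distance_score(X_scaled, k=40):
        """For each row, sum the Euclidean distances to its k nearest neighbors."""
        d = cdist(X_scaled, X_scaled)              # pairwise Euclidean distances
        d_sorted = np.sort(d, axis=1)[:, 1:k + 1]  # skip column 0, the distance to itself
        return d_sorted.sum(axis=1)

    scores = neighbor_distance_score(X_scaled)
    outlier_idx = np.argsort(scores)[-13:]         # the 13 most isolated observations
    X_clean = np.delete(X_scaled, outlier_idx, axis=0)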

 

Second Approach to Feature Engineering

We decided to leverage 'label encoding' to recode all truly categorical variables:

  • Categorical variables that were truly numeric in meaning were recoded the same way as before, using the 'theory-driven' transformation
  • All truly categorical variables were recoded using label encoding
  • As a result, our second data set contained no dummy variables
  • Outliers: the presence of many label-encoded variables might bias our distance calculations, so we simply excluded 2 outliers with very large living area and very low sale price
  • We tried something new this time: we ignored multicollinearity!
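A minimal sketch of the label-encoding pass, assuming the truly categorical columns are still of dtype 'object' in train:

    from sklearn.preprocessing import LabelEncoder

    cat_cols = train.select_dtypes(include=['object']).columns
    for col in cat_cols:
        train[col] = LabelEncoder().fit_transform(train[col].astype(str))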

 

Prediction!

Workflow:

  1. Run several regression models โ€“ separately;
  2. Use Cross-Validated Grid Search to tune hyperparameters for each model and compare scores;
  3. Submit predictions to Kaggle to check models' performance on the Public Leaderboard;
  4. Combine individual models using several different ensembling methods.
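A minimal sketch of steps 1-2 for one model; the grid values are illustrative assumptions, and X, y are the prepared predictors and the logged SalePrice:

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import ElasticNet

    param_grid = {'alpha': [0.0005, 0.001, 0.01, 0.1], 'l1_ratio': [0.2, 0.5, 0.8]}
    grid = GridSearchCV(ElasticNet(max_iter=10000), param_grid,
                        scoring='neg_mean_squared_error', cv=5)
    grid.fit(X, y)

    # With y = log(SalePrice), the RMSE below approximates the competition's RMSLE
    print(grid.best_params_, np.sqrt(-grid.best_score_))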

Supervised Learning Methods Used:

  • Support Vector Regression (sklearn.svm.SVR)
  • Random Forests Regression (sklearn.ensemble.RandomForestRegressor)
  • Gradient Boosted Regression (sklearn.ensemble.GradientBoostingRegressor)
  • Linear Models:
    • Ridge, Lasso, Elastic Net (sklearn.linear_model.Ridge, sklearn.linear_model.Lasso, sklearn.linear_model.ElasticNet)
  • Kernel Ridge Regression (sklearn.kernel_ridge.KernelRidge)

Individual models: Hyper-Parameters and Score on Grid Search + CV:

Ensembling methods used:

  1. Simple averaging: we used the best tuned hyperparameters to fit the individual models and averaged their predictions.
  2. Stacking: we used the same hyperparameters but trained a meta-model (either a lasso or a gradient boosting machine) to produce the final predictions.
  3. Due to the poor performance of our tree models, we also averaged the predictions of just SVR and Elastic Net.
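A minimal sketch of the averaging and stacking ideas, assuming svr, elastic_net, and gbm are already-tuned (but not yet fitted) estimators and X_test holds the prepared test predictors; the Lasso alpha is an illustrative assumption:

    import numpy as np
    from sklearn.model_selection import cross_val_predict
    from sklearn.linear_model import Lasso

    base_models = [svr, elastic_net, gbm]

    # 1. Simple averaging of the individual models' test predictions
    test_preds = np.column_stack([m.fit(X, y).predict(X_test) for m in base_models])
    avg_pred = test_preds.mean(axis=1)

    # 2. Stacking: a meta-model learns how to weight out-of-fold base predictions
    oof_preds = np.column_stack([cross_val_predict(m, X, y, cv=5) for m in base_models])
    meta = Lasso(alpha=0.0005).fit(oof_preds, y)
    stack_pred = meta.predict(test_preds)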

Ensemble Prediction Results

Woohoo! Top 9%, 0.115 error, good enough for us!

 

Lessons learned

  • HAVE FUN! We had a great time working together, and it was a thrill as it all came together at the end
  • Wrap it up at the right moment: learn to recognize diminishing returns at each stage of the project
  • Feature engineering is tough: there are so many ways to cut the data that it is hard to know a priori which one is best
  • Prediction ≠ Interpretation: overemphasizing prediction lessens the need to understand the data and the models
  • Is collinearity bad? Eliminating it at the cost of losing features might not help, even for linear models... or does it?
  • Cross-Validation or the Kaggle Leader Board? The eternal question.
  • Our individual linear models actually performed best for us on LB
  • Our averaged model predictions performed the best on CV, close to best on LB
  • Our fancy ensemble with various meta-models didn't do quite as well

 

Thank you for reading!

We appreciate your time and any inquiries you might have (especially about new opportunities). Visit the GitHub project of Henry, Gwen, Dimitri, and Daniel to see the code.


As a bonus, see below for a conceptual and sklearn-based cheat sheet on the various models used above:

Regression Model and Hyper-Parameters 'Cheat Sheet':

Linear regression with regularization:

  • Ridge and Lasso are linear models with regularization penalties: L2 and L1, respectively
    • Alpha (or lambda) tunes the magnitude of the regularization penalty
    • L2 and L1 are the form of the penalty: the Euclidean or "block-wise" distance from the intercept-only model
  • From https://www.r-bloggers.com/kickin-it-with-elastic-net-regression/ :
    • "Because some of the coefficients shrink to zero, the lasso doubles as a crackerjack feature selection technique in addition to a solid shrinkage method. This property gives it a leg up on ridge regression. On the other hand, the lasso will occasionally achieve poor results when there's a high degree of collinearity in the features and ridge regression will perform better. Further, the L1 norm is underdetermined when the number of predictors exceeds the number of observations while ridge regression can handle this."
  • Elastic Net Regression blends the penalty of both regressions with L1/L2 weighting. An elastic net with full L1 weighting is Lasso and full L2 is Ridge.
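A minimal sketch of the three penalties in scikit-learn; the alpha values are illustrative assumptions:

    from sklearn.linear_model import Ridge, Lasso, ElasticNet

    ridge = Ridge(alpha=10.0)                        # pure L2 penalty
    lasso = Lasso(alpha=0.0005)                      # pure L1 penalty; can zero out coefficients
    enet = ElasticNet(alpha=0.0005, l1_ratio=0.5)    # blend: l1_ratio=1 is Lasso, l1_ratio=0 is Ridge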

Random Forest:

  • n_estimators: the number of trees to build before taking the maximum vote or averaging the predictions. A larger number of trees lets the model perform better, but takes longer to run
  • max_features: dictates the max number of features RF can try within each split
    • Increasing max_features generally improves the RF model because it allows for a larger number of options to be considered for each tree, but this also increases the chance of overfitting
  • max_depth: depth of tree = length of longest path from root node to leaf along a single decision tree. Minimizing the depth of individual trees within RF will fight overfitting.
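A minimal sketch with the hyperparameters discussed above; the values are illustrative assumptions:

    from sklearn.ensemble import RandomForestRegressor

    rf = RandomForestRegressor(n_estimators=500,     # more trees: better but slower
                               max_features='sqrt',  # features considered at each split
                               max_depth=12,         # cap tree depth to fight overfitting
                               random_state=0)
    rf.fit(X, y)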

Support Vector Machines (SVM):

  • Kernel type - the type of kernel used. We tried:
    • 'poly' - polynomial kernel
    • 'rbf' - radial basis function kernel = radial kernel
  • 'degree' (for the polynomial kernel only) - the degree of the polynomial (linear = 1, quadratic = 2, etc.)
  • 'gamma' (for the radial kernel only): the hyperparameter in the radial kernel equation (and other nonlinear kernels)
  • 'C' - helps determine the threshold of tolerable violations to the margin and hyperplane - it's like an "error" budget.
    • As C increases, the margin gets larger, variance decreases, and bias increases
    • Important: for regression we can't 'misclassify', so the goal is instead to maximize the size of the margin while including as many points as possible within the margin
  • 'epsilon' - the error term measured from any point to the hyperplane; it's a measure of whether or not a given point is inside the margin. By setting epsilon we control how much error we allow the model per point. A very small epsilon leads to extreme overfitting.
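A minimal sketch of a radial-kernel SVR with these knobs; the values are illustrative assumptions:

    from sklearn.svm import SVR

    svr = SVR(kernel='rbf',    # radial basis function kernel
              C=10.0,          # budget for tolerable margin violations
              gamma=0.001,     # width parameter of the radial kernel
              epsilon=0.05)    # half-width of the no-penalty tube around the hyperplane
    svr.fit(X, y)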

Gradient Boosting Machines:

  • We used trees, so all parameters that apply to tree depth and the number of features used at each split also apply here
  • 'n_estimators' – the total number of trees we want to build
  • 'learning_rate' – the shrinkage factor, i.e., the degree to which we shrink each successive tree's predictions. Small values mean the learning happens very slowly (large shrinkage). Large values mean we shrink each tree's predictions less, so the trees 'learn' faster; however, that could lead to missing the optimum we are looking for.
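A minimal sketch showing the n_estimators / learning_rate trade-off; the values are illustrative assumptions:

    from sklearn.ensemble import GradientBoostingRegressor

    gbm = GradientBoostingRegressor(n_estimators=3000,   # many trees...
                                    learning_rate=0.01,  # ...each one heavily shrunk
                                    max_depth=3,
                                    max_features='sqrt',
                                    random_state=0)
    gbm.fit(X, y)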

Kernel Ridge Regression

  • Kernel ridge regression (KRR) combines Ridge Regression (linear least squares with l2-norm regularization) with the kernel trick - http://scikit-learn.org/stable/modules/kernel_ridge.html
  • KRR is similar to Support Vector Machines except that the loss function is different: KRR uses squared error loss for all terms while SVM only cares about loss within a margin
  • Scikit-learn allows you to tune:
    • Alpha, as with ridge regression, scales the cost on the coefficients, i.e., the penalty for departing from the mean model
    • Kernel is the kernel trick used to map the features onto a different space, e.g., polynomial or radial
    • Gamma is another tuning parameter that goes into the kernel
    • Degree applies to the polynomial kernel and denotes the degree of the polynomial
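A minimal sketch of a polynomial-kernel KRR; the values are illustrative assumptions:

    from sklearn.kernel_ridge import KernelRidge

    krr = KernelRidge(alpha=0.5,             # L2 penalty on the coefficients
                      kernel='polynomial',   # kernel trick: map features to a polynomial space
                      degree=2,
                      coef0=2.5)             # offset term inside the polynomial kernel
    krr.fit(X, y)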

About Authors

Daniel Park

View all posts by Daniel Park >

Dimitri Liakhovitski

I am a Data Scientist at GfK, one of the largest market research providers in the world. I am passionate about applying my expertise in Data Science, Statistics, Machine Learning, and Applied Psychology to real world problems. I...
View all posts by Dimitri Liakhovitski >

Gwendolyn Fernandez

View all posts by Gwendolyn Fernandez >

Henry Crosby

View all posts by Henry Crosby >
