
Using Data to Predict House Prices in Ames, Iowa

Quoc Nguyen, Sam Collier, Sashank Gummella and George Alster
Posted on Jun 2, 2019

The skills the authors demonstrated here can be learned through taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

GitHub

Introduction

The aim of this project was to use data to predict house prices in Ames, Iowa as part of a Kaggle competition, and to understand the attributes that contribute to the price of an individual house. A multitude of factors go into the price of a house, and pinpointing their effects is crucial in the real estate industry. Our approach was to pair-program as a whole group of four, and to spend time investigating many different models rather than shortlisting and tuning a smaller number.

Exploratory Data Analysis

Our first task was to prepare the data for modelling. We began with the form of the target variable (the sale price), which was highly skewed. Skew can lower the performance of machine learning models (e.g. by violating homoscedasticity in linear regression), so we used a Box-Cox transformation to normalize it. The optimal lambda returned by the Box-Cox function was very close to zero, so for interpretability we used exactly zero (i.e. we took the natural logarithm). This vastly improved the normality of the target, as can be seen from the quantile plots and histograms below.
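
A minimal sketch of this step, assuming the Kaggle training file is loaded into a pandas DataFrame (the file path and variable names are illustrative):

```python
import numpy as np
import pandas as pd
from scipy import stats

train = pd.read_csv("train.csv")  # Kaggle Ames training set (hypothetical path)

# Box-Cox needs strictly positive values; sale prices satisfy this.
# With the default lmbda=None, boxcox also returns the maximum-likelihood lambda.
_, lam = stats.boxcox(train["SalePrice"])
print(f"optimal lambda: {lam:.3f}")  # very close to 0 for this data

# lambda = 0 corresponds to the natural logarithm, which we use instead
# for interpretability.
y = np.log(train["SalePrice"])
```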

[Figures: Q-Q plots and histograms of SalePrice before and after the log transform, and the missingness heatmap referenced below]

Missingness

There were three types of missingness in the data. The first was data missing completely at random, such as the columns with only one (or in fact no) dark blue bar in the above heatmap. The heatmap also showed data that was clearly missing at random (e.g. attributes pertaining to the basement). To deal with these we mostly used random imputation, as the proportion of NaNs in any one column was never very large.

We did, however, work on a column-by-column basis, so if a column had a clear majority value we imputed with that value instead. The final type of missingness we encountered was NaNs that really meant "None" (e.g. NaN was used to mean no pool in the PoolQC column). For most of these columns we were able to cross-check against another column to work out whether the value was truly missing, except for the Fence column, which we had to drop instead.
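
A hedged sketch of that column-by-column strategy, using the Ames column names; the helper functions are our own illustrations, not the exact code used:

```python
import numpy as np

rng = np.random.default_rng(0)

def impute_random(df, col):
    """Fill NaNs by sampling from the column's observed values."""
    mask = df[col].isna()
    df.loc[mask, col] = rng.choice(df.loc[~mask, col], size=mask.sum())

def impute_majority(df, col):
    """Fill NaNs with the column's clear majority value."""
    df[col] = df[col].fillna(df[col].mode()[0])

# NaNs that really mean "None": cross-check against a related column,
# e.g. PoolQC should be "None" whenever PoolArea is zero.
no_pool = train["PoolArea"] == 0
train.loc[no_pool, "PoolQC"] = train.loc[no_pool, "PoolQC"].fillna("None")
```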

Feature Engineering

After completing the EDA, the next step was feature engineering. This dataset contains a large number of features, especially after the categorical features are dummified. Our team's aim was therefore to reduce the feature count as much as possible, so as to avoid the curse of dimensionality and multicollinearity effects without losing any important information.

First immediate feature selection

The first, immediate feature selection step was to drop a few categorical columns with extremely low variance, as they did not offer much information to help the predictions. One example was the Utilities feature, which had the same value for all but one of its observations.
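
One way to express that rule (the 99% threshold is our illustrative choice):

```python
# Drop near-constant categorical columns such as Utilities, where all
# but one observation share the same value.
low_variance = [
    col for col in train.select_dtypes("object").columns
    if train[col].value_counts(normalize=True).iloc[0] > 0.99
]
train = train.drop(columns=low_variance)
```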

Next step

The next step was to examine the correlations between each quantitative variable and the target, SalePrice, and drop any features with particularly low correlation. In this instance we dropped MiscVal, which represents the value of the miscellaneous features of the house. The figure is also useful because it highlights the important, highly correlated features, such as the overall quality of the property and the square footage above ground.
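
A sketch of that filter; the cutoff value is our illustrative assumption:

```python
# Correlation of each numeric feature with the target.
corr = train.corr(numeric_only=True)["SalePrice"].drop("SalePrice")

# Drop features with particularly low absolute correlation, e.g. MiscVal.
low_corr = corr[corr.abs() < 0.05].index
train = train.drop(columns=low_corr)
```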

[Figure: correlations of the quantitative features with SalePrice]

Further investigation

After some further investigation of the features in the dataset, we decided that several of them were very similar and could be combined. This decision was based on the team's logical reasoning and was also influenced by the multicollinearity plot showing the R-squared value for each feature, discussed later. The equations below show the four features created as combinations of existing features.

[Figure: equations defining the four combined features]
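
The exact equations live in the image above; a common formulation of these Ames combinations, offered here only as an assumption of what they look like, is:

```python
df = train  # shorthand; the same derivations apply to the test set

df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]
df["TotalBath"] = (df["FullBath"] + 0.5 * df["HalfBath"]
                   + df["BsmtFullBath"] + 0.5 * df["BsmtHalfBath"])
df["TotalPorchSF"] = (df["OpenPorchSF"] + df["EnclosedPorch"]
                      + df["3SsnPorch"] + df["ScreenPorch"])
df["HouseAge"] = df["YrSold"] - df["YearBuilt"]
```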

As a result, we were able to convert 15 different features into four whilst maintaining the same key information. With these four features created, the first pass of feature engineering was complete, and what remained was our base feature set. From this base set we then developed three more-selective feature sets, in an attempt to improve the predictions, or at least maintain the prediction accuracy with fewer features. These three optimal feature sets can be found in the table below.

[Table: the three optimal feature sets]

Each of these feature sets was generated in a different manner. To create optimal feature set 1, we plotted the lasso coefficient paths, showing how each feature's coefficient changes as lambda increases. From this graph an 'optimal' value of lambda was manually selected, and all features whose coefficients were non-zero at that lambda were kept. The second set was created in a similar manner, except that instead of selecting lambda manually, a grid search was used to find the optimal value of the penalty; again, all features with non-zero coefficients at that value were kept.
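
A sketch of both selection routes, assuming a fully preprocessed feature matrix `X` (a DataFrame) and the log target `y`; note that scikit-learn calls the penalty strength `alpha`:

```python
import numpy as np
from sklearn.linear_model import Lasso, lasso_path
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(X)

# Set 1: inspect the coefficient paths, pick a lambda by eye, and keep
# the features still non-zero at that point.
alphas, coefs, _ = lasso_path(X_std, y)

# Set 2: let a grid search choose the penalty instead.
grid = GridSearchCV(Lasso(max_iter=10000),
                    {"alpha": np.logspace(-4, 0, 50)},
                    cv=5, scoring="r2").fit(X_std, y)
set2 = X.columns[grid.best_estimator_.coef_ != 0]
```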

The final set

The final set was developed by bootstrapping the lasso regression model. One thousand bootstrap samples were run, and the means and standard deviations of the resulting coefficients were used to build approximate 95% confidence intervals for each coefficient. If the interval for a coefficient did not cross zero, its corresponding feature was kept for the optimal feature set. We note from the above table that the only three attributes consistent across all three sets are the year remodelled, the overall quality, and the total number of bathrooms.
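
A sketch of that bootstrap procedure, reusing the penalty `best_alpha` found by the grid search above (an assumed variable name):

```python
import numpy as np
from sklearn.linear_model import Lasso

n, boot = len(y), []
for _ in range(1000):
    idx = np.random.randint(0, n, n)  # resample rows with replacement
    boot.append(Lasso(alpha=best_alpha, max_iter=10000)
                .fit(X_std[idx], y.iloc[idx]).coef_)
boot = np.array(boot)

# Approximate 95% confidence intervals from the bootstrap distribution.
lo = boot.mean(axis=0) - 1.96 * boot.std(axis=0)
hi = boot.mean(axis=0) + 1.96 * boot.std(axis=0)
set3 = X.columns[(lo > 0) | (hi < 0)]  # interval does not cross zero
```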

As briefly mentioned earlier, a multiple linear regression model was run and the following graph generated to test for multicollinearity amongst the features in the dataset. The figure below shows the result after columns were dropped. The initial version of the graph displayed a high R-squared value of 0.8 for the garage cars and garage area attributes, so we dropped GarageCars, which significantly reduced the R-squared value of GarageArea. Overall Quality and TotalSF (total square footage) still have relatively large collinearity values, but as they are so important to the prediction we did not drop them.

[Figure: per-feature R-squared multicollinearity values after dropping columns]
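
The per-feature numbers in this plot can be reproduced by regressing each feature on all of the others, which is the same quantity that underlies the variance inflation factor (a sketch):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def collinearity_r2(X):
    """R-squared of each feature regressed on all the other features."""
    scores = {}
    for col in X.columns:
        others = X.drop(columns=col)
        scores[col] = LinearRegression().fit(others, X[col]).score(others, X[col])
    return pd.Series(scores).sort_values(ascending=False)

print(collinearity_r2(X).head())  # GarageCars/GarageArea stood out at ~0.8
```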

Modelling - Linear Models

We began our modelling by exploring all the linear model options: multiple linear regression, Huber regression, ridge regression, lasso regression, and elastic-net regression. Before training, we double-checked that the assumptions of linear regression were met: linearity, normality, constant variance, independent errors, and minimal multicollinearity. The previous section already covered reducing multicollinearity and transforming the data for linearity; below are a few visuals showing that the other assumptions hold.

Q-Q plot

[Figures: Q-Q plot of residuals and residual plot]

The Q-Q plot compares the distribution of the residuals against a theoretical normal distribution, and the near-straight line indicates fairly good normality. Also shown is a residual plot, which shows the spread of residuals across the target values; the even distribution of positive and negative residuals indicates constant variance. Lastly, as a check for independent errors, we computed the Durbin-Watson statistic, which came out very close to 2 on its 0-4 scale, confirming that the errors are independent.
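
A sketch of that last check, assuming a fitted linear model (`model` is a hypothetical name):

```python
from statsmodels.stats.stattools import durbin_watson

# Residuals of a fitted linear model (X and y as defined above).
residuals = y - model.predict(X)
print(durbin_watson(residuals))  # ~2.0 indicates independent errors
```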

Modelling Procedure

As part of our modelling procedure, we took a few different routes and used a few different measurements to evaluate how well each model fitted the data. In one instance we performed 5-fold cross-validation and averaged the R-squared over all folds. We also split the training data into an 80%-20% train-test split so that we could evaluate performance on unseen data whilst still knowing the true target values. Lastly, we wrote a function to calculate the adjusted R-squared: since adding features always increases the regular R-squared, the adjusted version is a better measure for comparing the different optimal feature sets.
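
A sketch of those three measurements (variable names follow the earlier blocks; `model` stands for any scikit-learn regressor):

```python
from sklearn.model_selection import cross_val_score, train_test_split

def adjusted_r2(r2, n, p):
    """Penalise R-squared for the number of features p, given n samples."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

cv_r2 = cross_val_score(model, X_train, y_train, cv=5, scoring="r2").mean()
adj = adjusted_r2(cv_r2, n=len(X_train), p=X_train.shape[1])
```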

[Figure: model evaluation results]

Multiple linear regression model

We first looked at a baseline multiple linear regression model. It performed well, but we recognized a high possibility of multicollinearity given the multitude of features even after our feature selection process. As a result, we switched our focus to penalized models: ridge, lasso, and elastic-net. These models penalize large coefficients so as to lower the effect of multicollinearity, increasing bias minimally whilst largely decreasing model variance. We also used a Huber model, which is known for handling outliers. But because the dataset was so large with minimal outliers, and because the Huber model effectively uses the median instead of the mean, it performed the worst out of all the linear models.
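
A sketch of the model line-up (the hyperparameter values shown are placeholders, not the tuned ones):

```python
from sklearn.linear_model import (LinearRegression, Ridge, Lasso,
                                  ElasticNet, HuberRegressor)

models = {
    "ols":   LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.001, max_iter=10000),
    "enet":  ElasticNet(alpha=0.001, l1_ratio=0.5, max_iter=10000),
    "huber": HuberRegressor(max_iter=1000),
}
for name, m in models.items():
    m.fit(X_train, y_train)
    print(name, m.score(X_train, y_train), m.score(X_test, y_test))
```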

The elastic-net model performed the best out of the penalized regression models. Most of the models also seemed to slightly overfit the data when we performed the train-test split. As part of our optimal feature selection, we utilized the lasso model's results to determine the features with the greatest predictive power. As shown in the graphs below, we ran all linear models again using the different optimal feature sets.

[Figure: linear model scores across the feature sets]

From the results above, our base feature set and optimal feature set 2 performed the best across all models except Huber, which we have already discounted as a poor fit here. Comparing the scores alone, our feature selection did not do much to improve model performance, but it did improve interpretability: optimal feature set 2 achieves almost exactly the same score as the base feature set using only half the number of variables. Again, R-squared will always increase with the number of features used.

Modelling - Non-Linear Models

We next explored non-linear models and how well they performed on the housing prices. Below we present four different scores for each of six different models, comparing how the mean 5-fold CV score, the adjusted R-squared, and the train and test scores vary between the non-linear models and our best linear model.

[Figure: non-linear model scores compared with the best linear model]

Random forest model

The random forest model performed the best on the train set; gradient boosting, however, yielded the lowest Root Mean Square Error (RMSE). The non-linear models overfit the data significantly: the graph shows the train score exceeding the test score for all of them. The tree-based models tended to perform better than the linear models, with gradient boosting doing particularly well. Below we examine the feature importances of the random forest model:

[Figure: random forest feature importances]
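
A sketch of how such an importance ranking is produced (hyperparameters illustrative):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=500, random_state=42)
rf.fit(X_train, y_train)

importances = (pd.Series(rf.feature_importances_, index=X_train.columns)
                 .sort_values(ascending=False))
print(importances.head(10))  # OverallQual and TotalSF dominate
```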

It was interesting to see that the important features largely mirrored the results of the lasso regression feature selection; OverallQual (overall quality) and TotalSF (total square footage) were the most important features. However, it was a surprise to see MonthSold and YearSold among the features above, since they had shown little correlation with SalePrice in the correlation heatmap.

Conclusion and Future Work

We found that fitting a gradient boosting regressor to a vecstack of all of our models, both linear and non-linear, achieved the best score (lowest RMSE) of 0.12909. Our first model, elastic-net, achieved 0.13730 with optimal hyperparameters alpha (lambda) = 0.00016 and l1_ratio (rho) = 0.384.
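
A sketch of the stacking step using the vecstack package; the first-level list reuses the `models` dictionary from above, extended in practice with the non-linear models (details assumed):

```python
import numpy as np
from vecstack import stacking
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# First level: out-of-fold predictions from every model, linear and non-linear.
S_train, S_test = stacking(list(models.values()),
                           X_train, y_train, X_test,
                           regression=True, n_folds=5,
                           shuffle=True, random_state=42)

# Second level: a gradient boosting regressor fitted on the stacked features.
meta = GradientBoostingRegressor().fit(S_train, y_train)
rmse = np.sqrt(mean_squared_error(y_test, meta.predict(S_test)))
```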

For future work, we identified the need for better statistical feature and model selection. The Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and Recursive Feature Elimination with Cross-Validation (RFECV) would all be useful in building better models through improved feature selection.

We will also revisit our missing-data imputation and try k-nearest neighbors (KNN) imputation instead, to analyze its effect on the results. In addition, we are curious what the results would be if we used a label encoder for our nominal categorical variables rather than one-hot encoding, as this tends to improve tree-based models' performance. Finally, we will also return to do further feature engineering, including outlier removal and feature scaling.
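
A sketch of the planned change using scikit-learn's KNNImputer (numeric columns only, since the distance metric needs a numeric space):

```python
from sklearn.impute import KNNImputer

num_cols = train.select_dtypes("number").columns
train[num_cols] = KNNImputer(n_neighbors=5).fit_transform(train[num_cols])
```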

Please follow along as we look to improve our housing price prediction modelling, and explore our GitHub page to see our Python code.

