
The Price is Right: Predicting Home Sale Prices

Cameron Wisdom, Jonathan Presley, Aydan Ellman and Kyle Greeley
Posted on Jun 14, 2019


The skills we demoed here can be learned through taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

Predicting home sale prices in Ames, Iowa using machine learning

An introduction to the data

"Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad."

Kaggle.com

As one begins a data science journey, it is nearly a rite of passage to become familiar with the Boston Housing Data Set, which has long been used to train and test students on regression concepts; it often serves as the final assignment in the regression segment of statistical methods courses. As time passes, however, the Boston data has become less representative of the variables that matter in contemporary housing markets.

The most obvious sign of this is the steep increase in housing prices in the three decades since the Boston Housing data was released. The dataset is also limited in size and scope, and it lacks the additional information that would be useful for feature engineering.

In response to the need for a more contemporary dataset that could introduce a new generation of data scientists to a greater variety of machine learning techniques, Dr. Dean De Cock of Truman State University compiled the Ames, Iowa Housing Data Set. The data set contains 2,930 observations and a large number of explanatory variables (23 nominal, 23 ordinal, 14 discrete, and 20 continuous) involved in assessing home values.

The data serves as the basis for one of the more notable competitions on Kaggle and has produced thousands of individual kernels, with a wide variety of methodologies used to predict the sale price of homes in Ames from the 81 variables provided.

Exploring and analyzing the data

Looking for Outliers

Let's get a quick look at sale price in relation to above-grade living area (GrLivArea). On the left, we can see two observations with high GrLivArea values and low SalePrice values. These points do not follow the general linear trend we see in the rest of the data, so we will remove them from the dataset in the anticipation that doing so will let us build a stronger prediction model later on. The scatterplot on the left shows the dataset before the outliers were removed, and the plot on the right shows the dataset after it has been cleaned.
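
As a rough sketch of that cleaning step (assuming the Kaggle training file has been read into a pandas DataFrame named train, and using commonly cited GrLivArea/SalePrice thresholds rather than our exact cutoffs):

    import pandas as pd

    # Load the Kaggle training data (file name is an assumption)
    train = pd.read_csv("train.csv")

    # Drop the two large-but-cheap homes; the exact thresholds are an
    # assumption based on the scatterplot described above
    outliers = train[(train["GrLivArea"] > 4000) & (train["SalePrice"] < 300000)].index
    train = train.drop(outliers)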

A look at the target variable: Sale Price

Let's start with a simple histogram of the Sale Price variable. From this histogram we can see that it deviates from the normal distribution, has an appreciable positive skew, and displays peakedness (i.e., high kurtosis).

Unaltered Sale Price fitted onto the probability plot of a normal distribution

In order to properly analyze, process, and utilize the Sale Price data, we need to transform it to better represent a normal distribution. In this case, we will apply a log transformation (specifically log1p, i.e. log(1 + x)) to the data. The figures below reflect the distribution and probability plots after log transformation of the Sale Price variable.
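
A minimal sketch of that transformation, assuming the cleaned train DataFrame from above; the plotting calls simply recreate the histogram and probability plot:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    # Log-transform the target: log1p(x) = log(1 + x)
    train["SalePrice"] = np.log1p(train["SalePrice"])

    # Re-draw the histogram and the normal probability plot
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
    train["SalePrice"].hist(bins=50, ax=ax1)
    stats.probplot(train["SalePrice"], plot=ax2)
    plt.show()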

The skew now appears to be fixed and the data appears to be more normally distributed.

Feature analysis and engineering

Now that our target variable is transformed and approximately normally distributed, we can examine our predictor variables to discover any missingness or collinearity, and engineer new variables that better represent our predictors without suffering from the curse of dimensionality.

Looking at correlation between the predictors and target

This correlation plot shows the relationships among a selection of variables, including the predictors (x) and the target (y). We can quickly visualize relationships between some predictor variables, such as "GarageYrBlt" and "YearBuilt". These two variables appear to be highly positively correlated, showing up as a lighter, nearly white square (>0.75 correlation) in the matrix.

We are also interested in any variables that might be highly negatively correlated with one another, such as "EnclosedPorch" and "YearBuilt". These variables appear as dark squares (roughly -0.50 correlation) in the correlation plot.

We should also take note of the correlation between SalePrice and the selected predictor variables shown in the correlation plot. This may give us a quick indication as to what variables may be important when we begin to put our prediction models together.
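
A correlation heatmap like the one described can be produced with a few lines of seaborn; this is a sketch against the train DataFrame from above, with illustrative styling choices:

    import seaborn as sns
    import matplotlib.pyplot as plt

    # Correlation matrix of the numeric columns, drawn as a heatmap
    corr = train.select_dtypes(include="number").corr()
    plt.figure(figsize=(12, 10))
    sns.heatmap(corr, vmax=0.9, square=True, cmap="viridis")
    plt.show()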

Searching for missing values in our predictor variables

We did a thorough search of the dataframe to discover which variables, if any, were missing values for any observation, and calculated what percentage of the observations for each such variable were missing.

There are several features (PoolQC, MiscFeature, Alley, Fence, and FireplaceQu) where at least 50% of the observations for that variable include missing data. We may consider removing these variables altogether as we progress through the feature engineering process, but for now we will impute the missing values.
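
That missingness table can be computed along these lines (a sketch against the train DataFrame):

    # Percentage of missing values per column, sorted from most to least missing
    missing_pct = train.isnull().mean().sort_values(ascending=False) * 100
    print(missing_pct[missing_pct > 0])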

Imputing missing values

We found that many of the missing values for categorical variables were not, in fact, missing. Rather, "None" was one of the response categories for that feature, usually because the house did not have the feature in question, such as a basement or garage. For those variables, such as BsmtQual and GarageType, we imputed the value "None" so that it was recognized as a string rather than a missing value.

A similar issue with missing house features led us to impute zero for numerical variables such as GarageArea and TotalBsmtSF. Lastly, we imputed the mode for categorical variables such as Electrical and Utilities, because the large majority of those values were a single response and there were very few missing values for those variables.
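
A condensed sketch of those imputation rules; the column lists shown here are illustrative subsets, not the full set we handled:

    # Categorical features where a missing value means the house lacks the feature
    none_cols = ["PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu",
                 "BsmtQual", "GarageType"]
    for col in none_cols:
        train[col] = train[col].fillna("None")

    # Numeric counterparts of those features: a missing value really means zero
    for col in ["GarageArea", "GarageCars", "TotalBsmtSF"]:
        train[col] = train[col].fillna(0)

    # Features with only a handful of missing values: fall back to the mode
    for col in ["Electrical", "Utilities"]:
        train[col] = train[col].fillna(train[col].mode()[0])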

Engineering new features

To handle any potential multicollinearity, we combined several sets of similar predictor variables by summing them into single aggregate features. For example, we combined the square-footage variables for each observation into one variable labeled "TotalSF", and we combined the counts of the various bathrooms into one variable labeled "TotalBath".
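
In code, those combined features look roughly like this; the 0.5 weighting on half baths is an assumption rather than necessarily the exact formula we used:

    # Combine related square-footage and bathroom-count columns into single features
    train["TotalSF"] = train["TotalBsmtSF"] + train["1stFlrSF"] + train["2ndFlrSF"]
    train["TotalBath"] = (train["FullBath"] + 0.5 * train["HalfBath"]
                          + train["BsmtFullBath"] + 0.5 * train["BsmtHalfBath"])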

Looking for skewness in predictor variables

A quick look at the skewness of our features shows that 59 numerical variables are highly skewed and will require a Box-Cox transformation.

We will use the scipy function boxcox1p, which computes the Box-Cox transformation of 1 + x. Note that setting λ = 0 is equivalent to the log1p transform used above for the target variable.
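
A sketch of that step; the 0.75 skewness cutoff and λ = 0.15 are common choices for this dataset, not necessarily the exact values we used:

    from scipy.stats import skew
    from scipy.special import boxcox1p

    # Measure skewness of the numeric predictors (excluding the target)
    numeric_feats = train.drop(columns=["SalePrice"]).select_dtypes(include="number").columns
    skewness = train[numeric_feats].apply(lambda x: skew(x.dropna()))
    skewed_feats = skewness[abs(skewness) > 0.75].index

    # Apply the Box-Cox(1 + x) transform to the highly skewed features
    lam = 0.15
    for feat in skewed_feats:
        train[feat] = boxcox1p(train[feat], lam)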

The dataset is now clean and ready for prediction!

Applying machine learning models

Once we had a clean dataset on which to deploy various models, our team branched out and took individual approaches before coming back together to build the best overall model from the sum of our parts. Below, each author briefly discusses his or her model selection and application process.

Kyle

I used a feature engineering approach similar to my teammates', with one addition: I decided to try sklearn's Recursive Feature Elimination (RFE). RFE first trains on the initial set of features and assigns them weights, then prunes the least important features from the current set.

After pruning, RFE ranks the features by importance, and I used the features that ranked 10 or higher. With those ranked features, I tested Multiple Linear Regression and Ridge Regression as my models. They did not perform as well as expected, so I will need to continue feature engineering.
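
A sketch of how RFE can be set up in sklearn, assuming X is the cleaned, dummified predictor matrix and y is the log-transformed SalePrice; the estimator and rank cutoff shown are illustrative:

    import pandas as pd
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LinearRegression

    X = pd.get_dummies(train.drop(columns=["SalePrice"]))
    y = train["SalePrice"]

    # Recursively fit the model, weight the features, and prune the weakest
    rfe = RFE(estimator=LinearRegression(), step=1)
    rfe.fit(X, y)

    # ranking_ assigns 1 to retained features and larger numbers to pruned ones;
    # keeping everything ranked 10 or better mirrors the cutoff described above
    selected = X.columns[rfe.ranking_ <= 10]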

Jonathan

Additional feature engineering note: in addition to the steps described above (combining variables that describe similar aspects of the house), I converted ordinal variables with string values into numerical values so that I could include them in a correlation matrix. Before and after dummification, I dropped variables whose correlation with SalePrice was lower than that of the Id column, on the understanding that such variables correlate with the target no better than random noise and would not provide useful information for predicting SalePrice.

Models: since the correlation matrix showed signs of multicollinearity in the data, I knew that penalized regression would help mitigate these violations of the assumptions of multiple linear regression. Also, having observed other submissions with worse outcomes using tree-based models, I was fairly confident that penalized regression would give the best predictive performance. I therefore performed Ridge and Lasso regressions, with similar results.

For Ridge regression, I tested 10,000 different alpha values between 10^(-5) and 30; the best alpha, where the test RMSE was closest to the train RMSE (a sign of neither overfitting nor underfitting), turned out to be 14.4434. The model score was 93.14% and the resulting Kaggle score (RMSE) was 0.12223.

Lasso regression gave similar results: after testing 1,000 values of alpha between 10^(-4) and 10^(-3), the best alpha was 4.360E-4, the model score was 93.01%, and the resulting Kaggle score was 0.12239. The Ridge model was therefore slightly better than the Lasso model in this case.
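
Those alpha searches can be reproduced roughly with sklearn's built-in cross-validated estimators; this is a sketch of the ranges described above, with X and y standing in for the cleaned design matrix and log-transformed target:

    import numpy as np
    from sklearn.linear_model import RidgeCV, LassoCV

    ridge_alphas = np.logspace(-5, np.log10(30), 1000)
    lasso_alphas = np.logspace(-4, -3, 1000)

    # Cross-validated searches over each alpha grid
    ridge = RidgeCV(alphas=ridge_alphas).fit(X, y)
    lasso = LassoCV(alphas=lasso_alphas, max_iter=50000).fit(X, y)

    print("Best Ridge alpha:", ridge.alpha_)
    print("Best Lasso alpha:", lasso.alpha_)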

The better score of 0.12223 gave us a rank of 1328 out of 4790 competitors, placing us within the top 28 percent.

Aydan

After narrowing down the continuous numerical features in our feature selection and engineering phase, I performed a principal component analysis (PCA) with the aim of creating new features that I could add back into my dataset and use in the final model. PCA produced 5 new features that together accounted for over 80% of the explained variance in the data. Taking these 5 new features, along with the continuous and categorical variables from the previous steps, I moved on to building a model.
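
A sketch of that PCA step, assuming X_continuous holds the continuous predictors and X is the wider feature matrix; both names are placeholders:

    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Standardize the continuous predictors, then keep the 5 components
    # that together explain over 80% of the variance
    X_scaled = StandardScaler().fit_transform(X_continuous)
    pca = PCA(n_components=5)
    components = pca.fit_transform(X_scaled)
    print("Explained variance:", pca.explained_variance_ratio_.sum())

    # Add the components back to the feature set as new columns
    pc_cols = ["PC1", "PC2", "PC3", "PC4", "PC5"]
    X = X.join(pd.DataFrame(components, columns=pc_cols, index=X_continuous.index))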

Because the new features were sure to introduce multicollinearity into the data, I used linear regression with an elastic net penalty term and a random forest regressor, both of which are more robust to correlated features. Using a grid search to tune the hyperparameters, I found the optimal values of lambda and rho for the linear regression, and the optimal number of trees, minimum samples per split, and minimum samples per leaf for the random forest regressor.

To ensure the hyperparameters were sufficiently optimized, I started the grid search with a broad range of values and narrowed the search as results came back. Optimizing for R², mean squared error, and mean absolute error, I was able to find the best model for minimizing each of the loss functions.
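
A sketch of that two-model grid search; sklearn's ElasticNet exposes the penalty strength and mixing ratio as alpha and l1_ratio rather than lambda and rho, and the grid values shown are illustrative:

    from sklearn.linear_model import ElasticNet
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    # Coarse first pass; the grids were narrowed in later passes
    enet_search = GridSearchCV(
        ElasticNet(max_iter=50000),
        {"alpha": [0.0005, 0.001, 0.01, 0.1], "l1_ratio": [0.1, 0.5, 0.9]},
        scoring="neg_mean_squared_error", cv=5).fit(X, y)

    rf_search = GridSearchCV(
        RandomForestRegressor(random_state=0),
        {"n_estimators": [200, 500], "min_samples_split": [2, 5],
         "min_samples_leaf": [1, 2, 4]},
        scoring="neg_mean_squared_error", cv=5).fit(X, y)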

Not surprisingly, the linear regression performed better than the random forest model for each loss function. This could have a few different explanations, but the most likely is that the relationship between house prices and the explanatory features is approximately linear, so linear regression captures it best. The predictions submitted to Kaggle using the linear regressor and the random forest model resulted in RMSEs of 0.14 and 0.20, respectively.

Cam

I applied base models including ElasticNet, Lasso, Kernel Ridge, and Gradient Boosting from the scikit-learn package, along with XGBoost and LightGBM. The models were made more robust to outliers by applying sklearn's RobustScaler() within make_pipeline(); by doing so, I hoped to reduce the sensitivity of each model to potential outliers. After evaluating the cross-validation RMSLE errors for each model, the results are as follows (a sketch of the pipeline setup appears after the list):

  • Lasso: 0.1129
  • ElasticNet: 0.1115
  • Kernel Ridge: 0.1166
  • Gradient Boosting: 0.1192
  • XGBoost: 0.1161
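
Here is a sketch of that pipeline setup for two of the models; the hyperparameter values are placeholders, and because the target is log-transformed, the cross-validated RMSE is equivalent to an RMSLE on the original prices:

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import RobustScaler
    from sklearn.linear_model import Lasso, ElasticNet
    from sklearn.model_selection import cross_val_score

    # RobustScaler runs inside every cross-validation fold via the pipeline
    models = {
        "Lasso": make_pipeline(RobustScaler(), Lasso(alpha=0.0005)),
        "ElasticNet": make_pipeline(RobustScaler(), ElasticNet(alpha=0.0005, l1_ratio=0.9)),
    }
    for name, model in models.items():
        mse = -cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5)
        print(name, np.sqrt(mse).mean())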

Conclusion and Future Work

While the models tested above proved fairly accurate, their performance kept us just outside the top 20% of submissions in the active Kaggle competition. Many of the higher-ranked submissions appear to use stacked models and variations of random forest regressors, which seem to perform well on Kaggle overall.

Our future work to improve the model will include building robust stacked models. One such approach is to build a new class that takes multiple models as arguments and averages their predictions. Follow our GitHub to review our Python code, and check back periodically to view updates to our research!
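
As an illustration of that averaging idea (a minimal sketch, not our final implementation), the class below clones and fits each model passed to it, then averages their predictions:

    import numpy as np
    from sklearn.base import BaseEstimator, RegressorMixin, clone

    class AveragingModels(BaseEstimator, RegressorMixin):
        """Fit several regressors and average their predictions."""

        def __init__(self, models):
            self.models = models

        def fit(self, X, y):
            self.fitted_ = [clone(m).fit(X, y) for m in self.models]
            return self

        def predict(self, X):
            preds = np.column_stack([m.predict(X) for m in self.fitted_])
            return preds.mean(axis=1)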

About Authors

Cameron Wisdom

Cameron Wisdom is a Data Science Fellow with nearly a decade of operations management and business leadership experience. He fell in love with data as a tool to make better business decisions, and is a champion of data-driven...
View all posts by Cameron Wisdom >

Jonathan Presley

Jonathan Presley, MPH is a data scientist fellow dedicated to efficiency, community wellness, and service. With nine years of combined experience in scientific research, biotechnology, education, and healthcare, he has developed a variety of skills in research, policy,...
View all posts by Jonathan Presley >

Aydan Ellman

Data Science Fellow and Cornell University graduate with demonstrated experience in business development and the application of data analysis to drive business performance.
View all posts by Aydan Ellman >

Kyle Greeley

Kyle Greeley has a B.S. degree in Kinesiology from Springfield College. While teaching mathematics in Texas and NYC, Kyle discovered his real passion for understanding data. Through NYC Data Science Academy, Kyle has become a machine learning enthusiast...
View all posts by Kyle Greeley >
