
Machine Learning: Housing Sale Price Prediction

Qinghua Li
Posted on Sep 25, 2019
The skills I demoed here can be learned through taking the Data Science with Machine Learning bootcamp with NYC Data Science Academy.

Introduction

The purpose of this project is to build a model that accurately predicts the sale price of a house. The dataset comes from the Kaggle House Price Prediction Competition, which uses the Ames Housing dataset compiled by Dean De Cock for data science education. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges students to predict the final price of each home.

The primary goals of this project are:

  • Exploratory data analysis
  • Data cleaning
  • Creative feature engineering
  • Advanced regression models applied to the training data to accurately predict the sale price on the test data set

The Data Set

The training set contains 1460 rows and 81 columns: 80 features and one target (SalePrice). The test set contains 1459 rows and 80 columns.

There are 19 variables in the training data set with missing entries; the magnitude of the missingness is shown in Figure 1. The test data set has 33 variables with missing entries, with a distribution similar to the training set.

Figure 1: Magnitude of missingness in the training set

In this analysis, I focus on the distribution of the target and the relationship between the features and the price. Figure 2 shows the histogram of SalePrice, our target variable. It is not normally distributed but skewed to the right. It needs to be rescaled in order to get an accurate prediction from any ML model.

Figure 2: SalePrice distribution
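As a quick numerical check of this skew, a minimal sketch (assuming the Kaggle train.csv has been loaded into a pandas DataFrame named train):

    import pandas as pd

    # Load the Kaggle training data (file path assumed)
    train = pd.read_csv('train.csv')

    # A right-skewed variable has skewness well above 0; SalePrice
    # comes out around 1.9, matching the shape seen in Figure 2
    print(train['SalePrice'].skew())

    # Histogram corresponding to Figure 2
    train['SalePrice'].hist(bins=50)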

The heatmap in Figure 3 shows that SalePrice is highly correlated with OverallQual and GrLivArea. This is promising for developing an accurate model.

Figure 3: Heatmap of SalePrice and the other numerical variables

To dive deeper into the analysis, I plotted a couple of charts, Figures 4 and 5. As Figure 4 shows, there seem to be a few outliers that need to be addressed. Figure 5 confirms the strong positive correlation between OverallQual and SalePrice.

Figure 4: Scatter plot of SalePrice vs. GrLivArea

Figure 5: Bar plot of SalePrice vs. OverallQual

Data Processing

Missingness Imputation

Most of the missing entries are due to the house lacking a feature: for example, a garage, basement, pool, fence, or alley. These can safely be replaced with 0 for numerical variables or 'None' for categorical ones. Other variables, such as LotFrontage and MasVnrArea, I fill with the median value of the house's neighborhood.
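A minimal sketch of this imputation logic, assuming train and test have been concatenated into a frame named all_data (the column lists are illustrative, not exhaustive):

    import pandas as pd

    # Work on the combined train + test frame
    all_data = pd.concat([train.drop(columns='SalePrice'), test])

    # NA here means the house simply lacks the feature
    none_cols = ['Alley', 'Fence', 'FireplaceQu', 'GarageType', 'GarageFinish',
                 'GarageQual', 'GarageCond', 'BsmtQual', 'BsmtCond',
                 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']
    zero_cols = ['GarageArea', 'GarageCars', 'BsmtFinSF1', 'BsmtFinSF2',
                 'TotalBsmtSF', 'MasVnrArea']

    all_data[none_cols] = all_data[none_cols].fillna('None')
    all_data[zero_cols] = all_data[zero_cols].fillna(0)

    # LotFrontage: median of the house's neighborhood
    all_data['LotFrontage'] = (all_data.groupby('Neighborhood')['LotFrontage']
                               .transform(lambda s: s.fillna(s.median())))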

Skewness

To address the skewness that some of the variables present, I apply a transformation to de-skew them so that they approximate a normal distribution. Both the log transform and the Box-Cox transform work well; I chose the latter.

Figures 6 and 7 show the probability plot of GrLivArea before and after the transform. Before the transform, the probability plot curves away from the reference line; after the transform, it is close to linear.

Figure 6: Probability plot of GrLivArea before transform

Figure 7: Probability plot of GrLivArea after Box-Cox transform

I performed a log transformation on SalePrice. The resulting histogram is close to a normal distribution.

Figure 8: SalePrice distribution after log transform
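A sketch of both transforms, using scipy's boxcox1p (the +1 variant tolerates zero-valued entries) for the skewed features and numpy's log1p for the target; the 0.75 skew threshold and the fixed lambda of 0.15 are assumptions, not values reported in this post:

    import numpy as np
    from scipy.stats import skew
    from scipy.special import boxcox1p

    numeric_cols = all_data.select_dtypes(include=[np.number]).columns

    # Pick out numeric features with noticeable skew (threshold assumed)
    skews = all_data[numeric_cols].apply(lambda s: skew(s.dropna()))
    skewed = skews[skews.abs() > 0.75].index

    # Box-Cox transform each skewed feature (lambda assumed)
    for col in skewed:
        all_data[col] = boxcox1p(all_data[col], 0.15)

    # Log-transform the target, as in Figure 8
    y = np.log1p(train['SalePrice'])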

Outliers

Outliers have a great impact on the models: they can skew the predictions and diminish accuracy. To address this, I used the zscore function from scipy.stats to score all entries of the numerical variables most correlated with SalePrice (GrLivArea, OverallQual) and filtered out data points with high absolute z-scores (> 3.5). This process identified four outliers, shown in the next table.

I ran the analysis on two groups: the first containing only the two most extreme outliers, one from each end of the z-score list (rows 1298 and 533); the second containing all four outliers (rows 1298, 523, 1100, and 533).
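A minimal sketch of the z-score filter, assuming the training frame is named train:

    import numpy as np
    from scipy.stats import zscore

    # Score the features most correlated with SalePrice
    cols = ['GrLivArea', 'OverallQual']
    z = np.abs(zscore(train[cols]))

    # Rows where any scored feature exceeds the 3.5 threshold
    outlier_idx = train.index[(z > 3.5).any(axis=1)]
    print(outlier_idx)   # the four rows listed above

    # Case 2 drops all four; Case 1 would drop only the two extremes
    train_clean = train.drop(outlier_idx)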

Feature Engineering

1. Drop Features

Street, Utilities, and Id are mostly irrelevant to sale price, so it is safe to drop them from the data set. PoolQC has too much missingness, so it is safe to drop it as well.

2. Convert to Categorical Features

Some features that are numerical in the original dataset are better treated as categorical, so the model can more easily pick up their effects; left in the numerical group, they would most likely be buried. The following features were converted from numerical to categorical:

  • KitchenAbvGr
  • TotRmsAbvGrd
  • BedroomAbvGr
  • GarageCars
  • HalfBath
  • BsmtHalfBath
  • FullBath
  • BsmtFullBath
  • Fireplaces

After the conversion, the final data set has 49 categorical and 27 numerical features.

3. New and Combined Features

I implemented a feature_engineering function to create new features: combining all the floor-space features into a single feature, TotalSF; combining all the bathroom features into a single feature, TotalBath; and adding a few indicator features such as HaveFirePlace and HaveSecondFloor.
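A sketch of such a feature_engineering function, assuming the bath and fireplace counts are taken from the original numeric columns (the exact source columns are assumptions based on the standard Ames column names):

    def feature_engineering(df):
        # Combine related columns and add simple indicator features
        df = df.copy()
        # Total floor space: basement + first + second floor
        df['TotalSF'] = df['TotalBsmtSF'] + df['1stFlrSF'] + df['2ndFlrSF']
        # Total bathrooms, counting half baths as 0.5
        df['TotalBath'] = (df['FullBath'] + 0.5 * df['HalfBath']
                           + df['BsmtFullBath'] + 0.5 * df['BsmtHalfBath'])
        # Binary indicator features
        df['HaveFirePlace'] = (df['Fireplaces'] > 0).astype(int)
        df['HaveSecondFloor'] = (df['2ndFlrSF'] > 0).astype(int)
        return df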

4. Dummify Categorical Features

Categorical data needs to be converted to numerical data with pandas get_dummies before it can be used to train any linear regression model. After the conversion, the dataset's categorical features expanded from 49 to 309 columns. I then used the drop-most technique, removing one dummy column per categorical variable (49 columns in total). This produced a training set of shape (1458, 287) for the two-outlier group and (1456, 287) for the four-outlier group.
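Reading "drop-most" as dropping the dummy column for the most frequent level of each feature (an assumption; pandas' built-in drop_first drops the first level instead), a sketch:

    import pandas as pd

    cat_cols = all_data.select_dtypes(include=['object']).columns

    # One dummy column per level of each categorical feature
    dummies = pd.get_dummies(all_data, columns=cat_cols)

    # Drop the most frequent level of each feature: 49 columns in total
    drop_cols = [c + '_' + str(all_data[c].mode()[0]) for c in cat_cols]
    dummies = dummies.drop(columns=drop_cols, errors='ignore')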

Modeling

Ridge Regression

Lasso Regression

ElasticNet Regression

The linear regression models used here are Ridge, Lasso, and ElasticNet. Data for these models needs to be prepared by dummifying the categorical variables. The training data was then split 80%/20%: 80% was used for training the models and 20% for validating them.

Hyper-parameter tuning for the linear models is simple; Ridge and Lasso each have only one tuning parameter. ElasticNet has one extra parameter, the l1_ratio, so tuning it takes more time than Ridge or Lasso.

The grid search function GridSearchCV from sklearn comes in handy here, allowing me to specify the steps and range of each parameter and set the number of folds to 5. The grid search splits the 80% training data further into 5 parts, training on 4 and validating on 1, rotating through the folds. It then refits the best model on the whole 80% training set; this final model is used for the validation check.
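A minimal sketch of this tuning loop for the Ridge model (the alpha grid shown is an assumption):

    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV, train_test_split

    # The 80%/20% split described above
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # 5-fold grid search over the single Ridge hyper-parameter
    params = {'alpha': [0.1, 1, 5, 10, 20, 50]}
    grid = GridSearchCV(Ridge(), params, cv=5,
                        scoring='neg_mean_squared_error')
    grid.fit(X_train, y_train)   # refits the best model on all of X_train
    print(grid.best_params_, grid.score(X_test, y_test))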

Support Vector Machine

Random Forest

Gradient Boosting

XGBoost

These are non-linear models and do not require dummification of the training data. However, the categorical data (usually string-typed) needs to be label encoded with sklearn's LabelEncoder. The training data then went through the same process, split into 80%/20% training and test sets. These models have many more hyper-parameters, and tuning them is more complicated.
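A sketch of the label-encoding step:

    from sklearn.preprocessing import LabelEncoder

    # Map each categorical column's string levels to integer codes
    for col in all_data.select_dtypes(include=['object']).columns:
        all_data[col] = LabelEncoder().fit_transform(all_data[col].astype(str))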

I grouped the parameters into 2-4 groups, each with 1-3 parameters, and tuned one group at a time: starting with coarser steps over a larger range, then a narrower range with finer steps. If a parameter landed at a range boundary, I expanded the range and re-ran the search. Once all the parameters in one group were optimally selected, I moved to the next group and repeated the same steps. This process is very tedious and time consuming.
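A sketch of this staged tuning for the Gradient Boosting model; the particular groupings and grids are illustrative assumptions:

    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import GridSearchCV

    base = dict(n_estimators=500, random_state=42)

    # Group 1: tree-shape parameters, coarse grid first
    g1 = GridSearchCV(GradientBoostingRegressor(**base),
                      {'max_depth': [2, 3, 4, 6],
                       'min_samples_leaf': [1, 5, 15]},
                      cv=5, scoring='neg_mean_squared_error')
    g1.fit(X_train, y_train)
    base.update(g1.best_params_)   # freeze group 1 at its best values

    # Group 2: learning rate and subsampling, holding group 1 fixed
    g2 = GridSearchCV(GradientBoostingRegressor(**base),
                      {'learning_rate': [0.01, 0.05, 0.1],
                       'subsample': [0.6, 0.8, 1.0]},
                      cv=5, scoring='neg_mean_squared_error')
    g2.fit(X_train, y_train)
    base.update(g2.best_params_)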

Once all the parameters were tuned for each model, I applied the 20% test dataset to the model to check for overfitting.

Blending Models

Using a weighted average of all the models above, I can create a blended model. The weights were tuned using a top-1% Kaggle submission as a reference, with root mean squared error as the loss function. A sketch of the code follows.
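The original snippet did not survive extraction; below is a minimal sketch of the idea, with the per-model predictions (ridge_pred, lasso_pred, etc.) and the reference values assumed, and scipy's SLSQP optimizer standing in for whatever search the original code used:

    import numpy as np
    from scipy.optimize import minimize

    # One column of test-set predictions per model (names assumed)
    preds = np.column_stack([ridge_pred, lasso_pred, enet_pred,
                             svr_pred, rf_pred, gbm_pred, xgb_pred])
    # target: the top-1% submission values used as the reference

    def rmse_loss(w):
        # Root mean squared error of the weighted blend vs. the reference
        return np.sqrt(np.mean((preds @ w - target) ** 2))

    n = preds.shape[1]
    res = minimize(rmse_loss, np.full(n, 1 / n),
                   bounds=[(0, 1)] * n,
                   constraints={'type': 'eq', 'fun': lambda w: w.sum() - 1})
    blend_weights = res.x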

Results

Models are compared using normalized mean squared error (MSE). The following tables show the normalized MSE for all the models with 2 and 4 outliers removed. All models benefit from outlier removal, and Case 2 (4 outliers removed) performs distinctly better than Case 1 (2 outliers removed).

The linear models are more robust than the non-linear models, consistently producing better results. Fancy feature engineering, such as creating new or combined features, may not be necessary; it did not make a big change to the models.

Case 1: MSE with 2 outliers removed

Case 2: MSE with 4 outliers removed

Case 3: MSE with 4 outliers removed + feature engineering

Feature Importance

Figure 9 shows the top 20 positive/negative coefficients from the Ridge model. Figure 10 shows the important features extracted by the Gradient Boosting model. We can see that living space, overall quality, and year built have a positive impact on sale price, while bad zoning, the absence of air conditioning, and poor condition status have a negative impact.

Figure 9: Linear model coefficients


Figure 10: Gradient Boosting feature importance

Based on the MSE values of all the models, I created a combined model using weighted factors. By tweaking the factor on each model and comparing the MSE against a top-1% submission in the Kaggle competition, I found an optimal combination for each outlier group. The following table shows my Kaggle scores. The best score I obtained is 0.11697, which placed me within the top 16.2% of participants in the competition.


Summary

Linear models such as Ridge, Lasso, and ElasticNet performed best for prediction here, and they are simple to implement and easy to tune. But they are easily influenced by outliers.

Tree-based models such as Random Forest, Gradient Boosting, and XGBoost are less sensitive to outliers, but they present a challenge with overfitting.

A blended model built from a weighted average can mitigate both problems and produce a better result, but finding the right weights is challenging.

Data preprocessing steps such as the Box-Cox transform and outlier removal are critical for training an accurate model.

Future Work

Instead of using a top-1% submission to tune the weights of my blended model, I should investigate model stacking: ensembling the same set of models and learning the blend weights with the familiar grid search, using k-fold cross-validation on the training set.
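A sketch of the stacking idea with sklearn's StackingRegressor, which trains a meta-learner on out-of-fold predictions; the estimator lineup and parameters are assumptions:

    from sklearn.ensemble import RandomForestRegressor, StackingRegressor
    from sklearn.linear_model import Lasso, Ridge

    stack = StackingRegressor(
        estimators=[('ridge', Ridge(alpha=10)),
                    ('lasso', Lasso(alpha=0.0005)),
                    ('rf', RandomForestRegressor(n_estimators=500))],
        final_estimator=Ridge(),   # meta-learner on out-of-fold predictions
        cv=5)
    stack.fit(X_train, y_train)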

The source code is available on GitHub.

About Author

Qinghua Li

Qinghua Li has an MS degree in electrical engineering and specializes in mobility network system integration and testing. She has many years of experience in 3G/4G wireless cell-site, circuit-board-level, and system-level integration and testing, automation, data...