
House Prices: Advanced Regression Techniques

Joseph C. Fritch and Stella Kim
Posted on Mar 12, 2019

Introduction

For most people, buying a house is the single largest investment decision they will make. Prospective buyers typically require years of saving to generate a down payment and are then committed to making monthly mortgage payments over a thirty-year period. Moreover, taxes, insurance, electricity, heating, cooling, and home maintenance are additional costs associated with home ownership. For potential buyers, it is advantageous to know what drives a house's value and what a prospective house is worth. Similarly, existing homeowners looking to sell and maximize their investment would benefit greatly in their decision-making process if they could predict the price of their home prior to listing it.

The project explores advanced regression techniques to predict housing prices in Ames, Iowa. The Ames Housing dataset was compiled for use in data science education and can be found on Kaggle. The data includes 79 features describing 1,460 homes. While house price prediction is the main focus, the authors also aim to answer the following questions:

What features are most correlated with housing prices?

What features are least correlated with housing prices?

How does feature engineering impact model performance?

Which machine learning models perform best at predicting price?

Various imputation methods are employed to address missingness. Transformation and standardization are used to increase model performance. Five-fold cross-validation is used to tune hyperparameters for the various models.

All machine learning methods are applied in Python using the scikit-learn package. R is used for supplemental statistical analysis and visualization.

The data can be found at: https://www.kaggle.com/c/house-prices-advanced-regression-techniques#description


All code can be found here: https://github.com/Joseph-C-Fritch/HousingPrice_ML_Project

Initial Data Analysis

One of the first and most important steps of data analysis is to look at the data. From the counts alone, we can already see that there are missing values in the dataset. We can also see that the means and standard deviations vary widely across variables.

Next, we take a look at variables with high skew and kurtosis, which can distort the results of our prediction model if not addressed appropriately.
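
As a rough illustration of this first pass, a minimal sketch (assuming the Kaggle training file has been downloaded as train.csv) might look like the following:

```python
import pandas as pd

train = pd.read_csv("train.csv")

# Summary statistics: counts reveal missing values, and the spread of
# means and standard deviations shows how differently scaled the features are.
print(train.describe().T[["count", "mean", "std"]])

# Numeric features ranked by skewness and kurtosis.
numeric = train.select_dtypes(include="number")
stats = pd.DataFrame({"skew": numeric.skew(), "kurtosis": numeric.kurt()})
print(stats.sort_values("skew", ascending=False).head(10))
```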

Missingness & Imputation

Next, we can take a look at missingness in the dataset. PoolQC, MiscFeature, Alley, and Fence are missing in 80% of cases; FireplaceQu is missing in 49% of cases; LotFrontage is missing in 17% of cases; GarageFinish, GarageQual, GarageCond, GarageYrBlt, and GarageType are missing in 5% of cases; and 23 variables are missing in fewer than 5% of cases.

In most cases, we were able to fill in the missing value with "NA"/"0" when the feature was not available for the particular home, or with the most common value. In other cases, we took a more targeted approach and filled in values based on neighborhood, or based on existing quality and condition information (especially for the Garage and Basement variables).
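
The exact fills differ by column, but the general pattern can be sketched as follows (a simplified illustration rather than the authors' full imputation code; the neighborhood-based median fill for LotFrontage is one example of the targeted approach):

```python
# Features where NaN simply means the attribute is absent: fill with a sentinel.
for col in ["PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu"]:
    train[col] = train[col].fillna("None")

# Numeric counterparts of absent features: fill with 0.
train["GarageYrBlt"] = train["GarageYrBlt"].fillna(0)

# Low-missingness categoricals: fill with the most common value.
train["Electrical"] = train["Electrical"].fillna(train["Electrical"].mode()[0])

# Targeted fill: impute LotFrontage with the median of its neighborhood.
train["LotFrontage"] = train.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median())
)
```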

Outliers

We also looked at outliers in the dataset to check for any observations that would clearly distort our model. Looking at the relationship between total square footage (TotalSF) and Sale Price, we can see that there are two outliers, and the removal of these two observations increases the correlation between total square footage and sale price from 0.78 to 0.83.
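
A sketch of this check is shown below. The cutoff values used to flag the two outliers are illustrative rather than taken from the authors' code, and TotalSF is assumed to be the sum of first-floor, second-floor, and basement square footage:

```python
train["TotalSF"] = train["1stFlrSF"] + train["2ndFlrSF"] + train["TotalBsmtSF"]
print(train["TotalSF"].corr(train["SalePrice"]))   # roughly 0.78 before removal

# Flag very large homes that sold for unusually little (illustrative thresholds).
mask = (train["TotalSF"] > 7500) & (train["SalePrice"] < 300000)
train = train[~mask]
print(train["TotalSF"].corr(train["SalePrice"]))   # roughly 0.83 after removal
```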

Transformation and Standardization

Among the numerical features, we chose to transform LotFrontage, TotalSF, and SalePrice using either log or Box-Cox transformations. Here, we can clearly see that our target variable, Sale Price, is heavily skewed to the right, so we apply a simple log transformation so that it more closely follows a Gaussian distribution.

After transforming our skewed variables and removing outliers, we then standardized our numerical features, which is essential for our models to accurately determine the size and importance of regression coefficients.
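
A minimal sketch of these two steps is below; the Box-Cox lambda of 0.15 is an illustrative choice, not necessarily the value the authors used:

```python
import numpy as np
from scipy.special import boxcox1p
from sklearn.preprocessing import StandardScaler

# Target: log transform to reduce right skew.
train["SalePrice"] = np.log1p(train["SalePrice"])

# Skewed predictors: Box-Cox transform (illustrative lambda of 0.15).
for col in ["LotFrontage", "TotalSF"]:
    train[col] = boxcox1p(train[col], 0.15)

# Standardize numeric predictors so coefficient magnitudes are comparable.
numeric_cols = train.select_dtypes(include="number").columns.drop("SalePrice")
train[numeric_cols] = StandardScaler().fit_transform(train[numeric_cols])
```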

Feature Engineering

Many categorical features consist of subcategories containing fewer than twenty-five observations. To avoid overfitting, sparse categorical variables are consolidated. For example, the feature MSSubClass includes sixteen subclasses, six of which contain fewer than twenty-five observations. A histogram of MSSubClass is illustrated below. These sparse categories are combined into a single subclass referred to as "other". A similar procedure is conducted on other categorical features that meet the defined sparseness criterion.
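
In code, the consolidation can be sketched roughly as follows (a simplified version of the idea, applied here only to MSSubClass):

```python
# Collapse MSSubClass levels with fewer than 25 observations into "Other".
train["MSSubClass"] = train["MSSubClass"].astype(str)
counts = train["MSSubClass"].value_counts()
sparse_levels = counts[counts < 25].index
train["MSSubClass"] = train["MSSubClass"].replace(
    {level: "Other" for level in sparse_levels}
)
```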

Ten categorical features describe quality and condition ratings for various house attributes. For example, basement quality and condition are represented with a rating of poor, fair, average, good, or excellent. Such ordinal categorical variables are quantified with a score from one to five, retaining their order and avoiding dummification. A graphical representation of this relationship is illustrated below.
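
One plausible mapping, using the dataset's Po/Fa/TA/Gd/Ex codes for poor through excellent, might look like this (the list of columns is illustrative rather than the authors' exact ten):

```python
quality_map = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
ordinal_cols = ["ExterQual", "ExterCond", "BsmtQual", "BsmtCond",
                "KitchenQual", "GarageQual", "GarageCond"]
for col in ordinal_cols:
    # Missing ratings (e.g. no basement or garage) are scored as 0.
    train[col] = train[col].map(quality_map).fillna(0)
```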

To increase predictor strength while simultaneously reducing complexity and multicollinearity, interaction features are introduced. A scoring system is developed that combines quality and condition ratings with square footage. For instance, basement quality, basement condition, and total basement square footage are multiplied together to create a new feature, basement score. A similar procedure is carried out on the exterior, garage, kitchen, and overall features. A summary of notable interaction features is shown below.
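
Assuming the quality and condition ratings have already been converted to 1-5 scores as above, the interaction features can be sketched like this (the new column names are illustrative):

```python
# Multiply quality, condition, and square footage into single "score" features.
train["BsmtScore"]    = train["BsmtQual"] * train["BsmtCond"] * train["TotalBsmtSF"]
train["GarageScore"]  = train["GarageQual"] * train["GarageCond"] * train["GarageArea"]
train["OverallScore"] = train["OverallQual"] * train["OverallCond"] * train["TotalSF"]
```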

The following correlation heat maps illustrate multicollinearity and predictor strength before and after interaction feature creation. It can be seen that feature engineering has reduced multicollinearity between predictors while increasing the predictors' correlation with house sale price.

Modeling

Lasso

Regularized regression is explored using the Lasso method. The l1 penalty is applied, and hyperparameter tuning of lambda is conducted using five-fold cross-validation. Scikit-learn's built-in Lasso and LassoCV are used. The optimal alpha is approximately 0.01. The plot below illustrates beta coefficients as a function of the hyperparameter alpha.

As expected, as alpha increases, many of the coefficients are forced to zero according to the l1 penalty. Feature selection is carried out by Lasso, and the resulting features are shown in the bar chart above. A total of 49 features are selected, and the newly created interaction features provide amplified predicting power, as they constitute four of the top five coefficients by absolute magnitude. The resulting cross-validated RMSE is 0.13666.
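
A sketch of the Lasso fit, assuming X is the fully numeric (dummified and standardized) design matrix and y is the log-transformed sale price; the alpha grid is illustrative:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_score

X = train.drop(columns="SalePrice")
y = train["SalePrice"]

lasso = LassoCV(alphas=np.logspace(-4, 1, 100), cv=5).fit(X, y)
print(lasso.alpha_)            # roughly 0.01 in the authors' run

# Cross-validated RMSE on the log scale.
rmse = np.sqrt(-cross_val_score(lasso, X, y, cv=5,
                                scoring="neg_mean_squared_error"))
print(rmse.mean())
```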

Elastic Net

Regularized regression is also explored using the Elastic Net method. The l1 and l2 penalties are applied, and hyperparameter tuning of lambda and rho is conducted using five-fold cross-validation. Scikit-learn's built-in ElasticNet and ElasticNetCV are used. The optimal alpha is approximately 0.026 and the optimal rho is 0.25. The plot below illustrates beta coefficients as a function of the hyperparameter alpha.

As alpha increases, some of the coefficients are forced to zero by the l1 penalty, but many more merely shrink toward zero under the l2 penalty. Therefore, more features are selected in the process. The top twenty resulting features are shown in the bar chart above. A total of 71 features are selected, and the newly created interaction features again provide amplified predicting power, as they constitute four of the top five coefficients by absolute magnitude. The resulting cross-validated RMSE is 0.13904.
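
The Elastic Net fit follows the same pattern; note that scikit-learn calls rho `l1_ratio` (the grids below are illustrative):

```python
from sklearn.linear_model import ElasticNetCV

enet = ElasticNetCV(l1_ratio=[0.1, 0.25, 0.5, 0.75, 0.9],
                    alphas=np.logspace(-4, 1, 100), cv=5).fit(X, y)
print(enet.alpha_, enet.l1_ratio_)   # roughly 0.026 and 0.25 in the authors' run
```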

Extreme Gradient Boosting

Extreme Gradient Boosting is used as an alternative, tree-based modeling method. A grid search with five-fold cross-validation is used to tune multiple hyperparameters. XGBoost and scikit-learn's built-in GridSearchCV are used. The optimal tuning parameters are highlighted below.

The top twenty features of importance are shown in the bar chart above. The features selected differ slightly from those selected by Lasso and Elastic Net, but again the newly created interaction features provide amplified predicting power, as they constitute four of the top six features of importance. The resulting cross-validated RMSE is 0.13336.
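
A sketch of the grid search; the parameter grid shown is illustrative, not the authors' exact search space:

```python
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [500, 1000, 2000],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.01, 0.05, 0.1],
}
xgb_search = GridSearchCV(XGBRegressor(objective="reg:squarederror"),
                          param_grid, cv=5,
                          scoring="neg_mean_squared_error").fit(X, y)
print(xgb_search.best_params_)
```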

Feature Engineering - Modifications

A second iteration of feature engineering is conducted. There are heavy correlations between GarageArea and GarageCars and between 1stFlrSF and TotalBsmtSF. We can combine aspects of these features in order to reduce some multicollinearity and create features with a higher correlation with Sale Price without losing any information. In this iteration, the scoring system used previously is removed and sparse categorical variables are not consolidated.

To reduce some of this multicollinearity, we reduced our square footage information into one variable, TotalSF, by combining 1stFlrSF, 2ndFlrSF, and TotalBsmtSF. Similarly, the number of bathrooms is combined into one variable, Bath. All porch square footage is summed into one variable, PorchSF. GarageArea and GarageCars are multiplied together to make one variable that has a stronger correlation with Sale Price.
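
These combinations can be sketched as follows (the half-bath weighting of 0.5 is a common convention and an assumption here, as is the name of each new column):

```python
train["TotalSF"] = train["1stFlrSF"] + train["2ndFlrSF"] + train["TotalBsmtSF"]
train["Bath"] = (train["FullBath"] + 0.5 * train["HalfBath"]
                 + train["BsmtFullBath"] + 0.5 * train["BsmtHalfBath"])
train["PorchSF"] = (train["OpenPorchSF"] + train["EnclosedPorch"]
                    + train["3SsnPorch"] + train["ScreenPorch"])
train["Garage"] = train["GarageArea"] * train["GarageCars"]
```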

After combining correlated features, filling in missing values, transforming, and standardizing, we can see that multicollinearity among the numerical variables is decreased, while correlation with Sale Price remains high.

Ridge Regression

Ridge regression is another form of regularized regression (https://www.statisticshowto.datasciencecentral.com/regularization/) that uses a penalty term to reduce the complexity of a model. The L2 regularization term adds a penalty equal to the sum of the squares of the beta coefficients, thereby forcing the coefficients to shrink towards zero. While the coefficients are not actually forced to zero, ridge regression helps combat the issue of multicollinearity (http://www.stat.cmu.edu/~larry/=stat401/lecture17.pdf) between different features (which may result in inaccurate estimates and increased standard errors of coefficients) by reducing the variance of the model (although increasing the bias).

To optimize the hyperparameter alpha (or lambda, depending on the terminology), we used five-fold cross-validation over the range 1e-6 to 1e6, which returned an optimal value of 18.73; our best Kaggle RMSE with ridge regression alone was 0.12112.
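
A sketch of the ridge fit over the stated alpha range:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

ridge = RidgeCV(alphas=np.logspace(-6, 6, 100), cv=5).fit(X, y)
print(ridge.alpha_)            # roughly 18.73 in the authors' run
```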

We can see that our residuals are normally distributed, but there are some clear outliers that can be removed. However, removal of these outliers tended to result in a worse Kaggle RMSE, which indicates that there is pertinent information in at least some of these observations that should not be removed.

We can see that our ridge regression model performs quite well on our training data, and there is considerable overlap among the actual sale prices in the training data, the predicted training sale prices, and the predicted test sale prices.

Support Vector Regression

We also used Support Vector Regression (https://www.saedsayad.com/support_vector_machine_reg.htm) with a radial basis function (RBF) kernel and five-fold cross-validation to optimize the hyperparameters epsilon, cost, and gamma. Much like any other regression technique, the goal of support vector regression is to minimize error. Gamma is similar to K in K-nearest neighbors, with a small value of gamma taking global information into account and a large value of gamma taking local information into account. The cost parameter is similar to lambda in the previously mentioned Ridge, Lasso, and Elastic Net models and dictates the influence or size of the error term. The epsilon parameter controls the epsilon-insensitive region, which defines the margin of tolerance within which observations incur no penalty. Hyperparameter optimization gave an optimal epsilon of 0.1, cost of 10, and gamma of 0.001, resulting in a Kaggle score of 0.12079.
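
A sketch of the SVR grid search (scikit-learn calls the cost parameter C; the grids below are illustrative):

```python
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

param_grid = {
    "C": [1, 10, 100],
    "gamma": [0.0001, 0.001, 0.01],
    "epsilon": [0.01, 0.05, 0.1],
}
svr_search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5,
                          scoring="neg_mean_squared_error").fit(X, y)
print(svr_search.best_params_)   # C=10, gamma=0.001, epsilon=0.1 per the text
```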

Gradient Boosting Regression

Gradient Boosting Regression is an ensemble model that builds successive decision trees. While each individual tree is a weak learner (it does not perform well on test sets on its own), successive trees attempt to correct for these shortcomings by fitting to the residuals of the previous tree. Using five-fold cross-validation and a learning rate of 0.05, we found our optimal number of estimators to be 2000 (the total number of boosting stages) and our optimal maximum tree depth to be 2, giving a Kaggle score of 0.12348.
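
With the tuned values reported above, the fit reduces to something like:

```python
from sklearn.ensemble import GradientBoostingRegressor

gbr = GradientBoostingRegressor(n_estimators=2000, learning_rate=0.05,
                                max_depth=2).fit(X, y)
```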

Gradient boosting regression selected 152 features as important in the prediction model, with TotalSF, KitchenQual, OverallQual, and Bath (unsurprisingly) among the top features.

Conclusion

Finally, we generated a model based on the average prediction (https://en.wikipedia.org/wiki/Ensemble_averaging_(machine_learning)) of the Ridge, Lasso, Elastic Net, and Gradient Boosting Regression models, figuring that the errors made by each individual model might be mitigated when averaged, and found that this model had the best Kaggle RMSE of 0.11973.
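
A sketch of the unweighted average, assuming the fitted models from the previous sections and a prepared test design matrix X_test; predictions are back-transformed from the log scale before submission:

```python
import numpy as np

preds_log = np.column_stack([
    ridge.predict(X_test),
    lasso.predict(X_test),
    enet.predict(X_test),
    gbr.predict(X_test),
]).mean(axis=1)

final_pred = np.expm1(preds_log)   # undo the log1p applied to SalePrice
```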

In general, ridge regression seemed to outperform the other models when feature engineering was applied, while support vector regression gave us our best individual score when all features were input without engineering.

Some ways that we could create a more accurate predictive model are (1) to choose feature engineering more carefully to remove more multicollinearity and create stronger predictor variables, (2) to identify outliers and be mindful when deciding to remove them, (3) to tune more hyperparameters (e.g., testing a polynomial kernel for SVR), and (4) to try stacking, ensembling, and weighting models in various ways.

About Authors

Joseph C. Fritch

Data Scientist and Control Systems Engineer with 5 years of experience in the energy analysis and building automation space. Interests include machine learning and its applications in controlling dynamic systems.
View all posts by Joseph C. Fritch >

Stella Kim

Stella Kim is a data scientist with 4 years of experience using R, a Master's in Biotechnology, and PhD experience in Cancer Biology and Computational Genomics. Proficient in R, Python, and SQL. Passionate about data analytics, visualization, machine...
View all posts by Stella Kim >
