
USING DATA ALGORITHMS TO PREDICT HOUSING PRICES

Guillermo Ruiz
Posted on Apr 9, 2021
The skills I demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

INTRODUCTION  

The aim of the project is to find the algorithm that best predicts housing prices in Ames, a growing city in Iowa that has experienced recent real estate development. This project intends to provide insight that can inform decision making in real estate companies operating in this market.

In the following, I describe the steps taken to arrive at a functional algorithm and its possible business applications, which include helping real estate companies to:

  • Better price their housing assets, by providing a benchmark that encapsulates the information available in the market.
  • Identify the types of housing, and the relevant features, that will be most profitable to build in future real estate developments in the area.

DATA

The data is the Ames housing dataset provided by Kaggle. It consists of 1,460 observations and 79 features describing houses in the area.

DATA PRE-PROCESSING

The data needs some treatment before we can implement machine learning models.

Splitting the dataframe

First, we split the data frame into training and testing sets. Every subsequent step in the project is applied separately to each of them. This way, we make sure that the performance we measure generalizes to data the model has not been trained on.
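A minimal sketch of how this split might look with scikit-learn, assuming the Kaggle training file has been loaded into a pandas DataFrame (the file name, split ratio, and random seed are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Kaggle Ames data (file name is an assumption)
ames = pd.read_csv("train.csv")

# Separate the target from the features
X = ames.drop(columns="SalePrice")
y = ames["SalePrice"]

# Hold out 20% of the observations as a testing set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```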

Imputing missing data

Now we check whether there is missing data in either of our datasets. We find 5,568 missing values in the training set, concentrated in 1,168 different rows. To check how evenly missing values are distributed across features, we plot the proportion of missing values per variable:

Figure 1: Proportion of missing values in the training data set.
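A plot like Figure 1 can be produced with a short pandas/matplotlib sketch such as the following (variable names follow the earlier split):

```python
import matplotlib.pyplot as plt

# Proportion of missing values per feature, keeping only features with any
missing_share = X_train.isnull().mean().sort_values(ascending=False)
missing_share = missing_share[missing_share > 0]

missing_share.plot.barh(figsize=(8, 10))
plt.xlabel("Proportion of missing values")
plt.title("Missing values per feature (training set)")
plt.tight_layout()
plt.show()
```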


We can see that, although the total number of missing values is high, they are concentrated in a few variables, each with a high proportion of missing values. When imputing, we followed these steps (a code sketch follows the list):

  1. Analyze the variables individually and consider whether some of the missing values carry a meaning of their own. For example, the missing values in the feature PoolQC most likely indicate that those houses have no pool, rather than any other issue with the data. The same can be said of the garage and basement variables, given that all of these features have missing values in the same observations, probably indicating that those houses have no basement or garage. This was later confirmed by checking the variable descriptions available on Kaggle. For all these variables, 'None' is imputed.
  2. For numeric variables, the median is imputed. In the case of LotFrontage, the data is first grouped by neighborhood, as we expect the linear feet of street connected to a property to be more similar within a neighborhood.
  3. For the remaining categorical variable (MasVnrType), we impute the mode.
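The sketch below illustrates these rules on the training set. The exact list of 'None' columns is an assumption based on the Kaggle data dictionary, and the same learned medians and modes would later be reused for the testing set.

```python
# Categorical features where a missing value means the amenity is absent
# (column list is an assumption based on the Kaggle data dictionary)
none_cols = ["PoolQC", "Alley", "Fence", "FireplaceQu", "MiscFeature",
             "GarageType", "GarageFinish", "GarageQual", "GarageCond",
             "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2"]
X_train[none_cols] = X_train[none_cols].fillna("None")

# LotFrontage: impute the median within each neighborhood
X_train["LotFrontage"] = X_train.groupby("Neighborhood")["LotFrontage"] \
    .transform(lambda s: s.fillna(s.median()))

# Remaining numeric features: impute the overall median
num_cols = X_train.select_dtypes(include="number").columns
X_train[num_cols] = X_train[num_cols].fillna(X_train[num_cols].median())

# MasVnrType: impute the mode
X_train["MasVnrType"] = X_train["MasVnrType"].fillna(X_train["MasVnrType"].mode()[0])
```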

Imputation with k-nearest neighbors (KNN) was considered but ultimately discarded. This method of imputation depends on the feature space formed by the data: distances between points are only meaningful for numeric variables. Because we also have categorical variables, there are two options:

  • Impute using only the numeric features: this can lead to unrealistic imputations if the categorical features play an important role.
  • Impute using all variables after dummifying the categorical ones: this greatly increases the dimensionality of the feature space, again jeopardizing an effective KNN imputation.

Now we check the testing set. There are 1,397 missing values, concentrated in 292 rows. Again, we plot the proportion of missing values in each variable:

Figure 2: Proportion of missing values in testing data set.


As we would expect, the distribution of missing values is almost identical to the training set. The imputation rules followed here were mostly the same as for the training set. However, some differences apply:

  • The imputation values were learned on the training set and then applied to the testing set. This means that the medians and modes imputed in the testing set were calculated from the training set.
  • Electrical is a feature with only one missing value. By chance, it fell into the testing set when the split was performed. The mode from the training set is imputed.

Feature engineering

A correlation matrix was used to try to identify multicollinearity between the variables (see Figure 3).

Figure 3: Correlation matrix before feature engineering.
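A matrix like the one in Figure 3 can be drawn with seaborn; this is a sketch rather than the exact figure code:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations between the numeric features
corr = X_train.select_dtypes(include="number").corr()

plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Correlation matrix (numeric features)")
plt.show()
```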


New Variables

In an attempt to reduce multicollinearity, we created new variables that encapsulate the information of several others (a code sketch follows the list). They consist of the following:

  1. Id: this feature is just an index, so any correlation with the dependent variable or any other variable is spurious. Therefore, it is dropped.
  2. TotalSF: created by adding up the variables that together make up the total square footage of each house.
  3. Total_Bathrooms: the total number of bathrooms in each house.
  4. Total_porch_sf: the total square footage of porch space in each house.
  5. YrBltAndRemod: the number of years since the house was built or remodeled.
  6. haspool: dummy variable reflecting whether there is a pool.
  7. hasgarage: dummy variable reflecting whether there is a garage.
  8. has2ndfloor: dummy variable reflecting whether there is a second floor.
  9. hasbsmt: dummy variable reflecting whether there is a basement.
  10. hasfireplace: dummy variable reflecting whether there is a fireplace.
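A sketch of these constructions is shown below. The exact columns combined into each new feature are assumptions based on the Kaggle data dictionary (for example, which porch columns enter Total_porch_sf), so they may differ in detail from the original implementation.

```python
def engineer_features(df):
    """Add the combined features described above (column choices are assumptions)."""
    df = df.drop(columns="Id", errors="ignore")

    df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]
    df["Total_Bathrooms"] = (df["FullBath"] + 0.5 * df["HalfBath"]
                             + df["BsmtFullBath"] + 0.5 * df["BsmtHalfBath"])
    df["Total_porch_sf"] = (df["OpenPorchSF"] + df["EnclosedPorch"]
                            + df["3SsnPorch"] + df["ScreenPorch"] + df["WoodDeckSF"])
    # Years since the house was built or last remodeled, whichever is more recent
    df["YrBltAndRemod"] = df["YrSold"] - df[["YearBuilt", "YearRemodAdd"]].max(axis=1)

    df["haspool"] = (df["PoolArea"] > 0).astype(int)
    df["hasgarage"] = (df["GarageArea"] > 0).astype(int)
    df["has2ndfloor"] = (df["2ndFlrSF"] > 0).astype(int)
    df["hasbsmt"] = (df["TotalBsmtSF"] > 0).astype(int)
    df["hasfireplace"] = (df["Fireplaces"] > 0).astype(int)
    return df

X_train = engineer_features(X_train)
X_test = engineer_features(X_test)
```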

This feature engineering reduces multicollinearity between the explanatory variables. Given the large number of features, reducing multicollinearity to zero is not realistic. The current levels of multicollinearity are visualized in the correlation matrix in Figure 4.

Figure 4: Correlation matrix after feature engineering.

Outlier detection

We start by plotting the dependent variable, the house price ('SalePrice'), against the total square footage ('TotalSF'). As Figure 5 shows, some observations can be visually identified as possible outliers. However, we adopt a more systematic approach, implementing Cook's distance and also checking which observations fall outside the interquartile range (IQR).

Figure 5: Scatter plot of total square footage and housing prices.

First, Cook's distance is computed. The output can be seen in Figure 6. The observations it singles out are collected in a list.

Figure 6: Cook's distance plot.

We also try to identify outliers by checking whether they fall outside the interquartile range (IQR). These outliers are added to a different list. Afterwards, we drop only the outliers that appear in both lists. The reason we follow this approach is that Cook's distance assumes an ordinary least squares regression in order to identify outliers. As we also plan to implement other linear models (lasso and ridge in particular) and non-linear models, we decided to also use the IQR. This approach allows for a more conservative treatment of outliers.

This is important because we don't always want to remove outliers; they are eliminated only when they reduce the predictive performance of the model. Following this double check, we limit the observations we drop to the clearest outliers: the houses that are most set apart and thus least likely to be reproduced in later developments in Ames.
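The sketch below shows one way to combine the two criteria, using statsmodels for Cook's distance and the usual 1.5 × IQR rule. The 4/n cut-off for Cook's distance and the choice of TotalSF for the IQR check are assumptions made for illustration.

```python
import numpy as np
import statsmodels.api as sm

# Cook's distance from an OLS fit of price on the numeric features
numeric_train = X_train.select_dtypes(include="number")
ols = sm.OLS(y_train, sm.add_constant(numeric_train)).fit()
cooks_d = ols.get_influence().cooks_distance[0]
cook_outliers = set(numeric_train.index[cooks_d > 4 / len(numeric_train)])

# Observations outside 1.5 * IQR on TotalSF (column choice is an assumption)
q1, q3 = X_train["TotalSF"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (X_train["TotalSF"] < q1 - 1.5 * iqr) | (X_train["TotalSF"] > q3 + 1.5 * iqr)
iqr_outliers = set(X_train.index[mask])

# Drop only the observations flagged by both methods
to_drop = list(cook_outliers & iqr_outliers)
X_train = X_train.drop(index=to_drop)
y_train = y_train.drop(index=to_drop)
```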

Distribution of numerical variables

We checked the distribution of our continuous variables (you can see a sample of them in figure 7).

Figure 7: Distribution of a sample of numeric variables before transformation.

Numeric variables are right skewed, which may affect the performance of the linear models. In order to avoid this, we apply a Yeo-Johnson transformation. There are three reasons why the Yeo-Johnson transformation is chosen:

  1. Many of our features have zero values and the Box-Cox transformation demands strictly positive values.
  2. The Yeo-Johnson transformation can be fit on the training data and then applied to the testing data. This preserves the rigor we need when later evaluating our model realistically.
  3. The Yeo-Johnson transformation also scales our numerical variables and centers them around zero.

We can see the results of applying this transformation in figure 8 below:

Figure 8: Distribution of a sample of numeric variables after transformation.
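scikit-learn's PowerTransformer implements the Yeo-Johnson transformation and, with standardize=True, also centers and scales the result; fitting on the training set only and then transforming the testing set keeps the later evaluation honest. A minimal sketch (in practice one might exclude the 0/1 indicator features):

```python
from sklearn.preprocessing import PowerTransformer

num_cols = X_train.select_dtypes(include="number").columns

# Fit on the training set only, then apply the same parameters to the testing set
pt = PowerTransformer(method="yeo-johnson", standardize=True)
X_train[num_cols] = pt.fit_transform(X_train[num_cols])
X_test[num_cols] = pt.transform(X_test[num_cols])
```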

Dummifying the categorical variables

The categorical variables are dummified in order to be able to implement the linear models.

Target variable's transformation

The distribution of the dependent variable is right-skewed, which can hurt the performance of the linear models. That is why we perform a logarithmic transformation on the dependent variable. We then check that it is approximately normally distributed in both the training and testing datasets.

Figure 9: Logarithmic transformation of the training set's target variable (after transformation on the right).

Figure 10: Logarithmic transformation of the testing set's target variable (after transformation on the right).
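Both the dummification and the target transformation can be sketched as follows. Aligning the dummified training and testing sets guards against categories that appear in only one of them; the use of log1p rather than log is an assumption.

```python
import numpy as np
import pandas as pd

# One-hot encode the categorical features
X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(X_test)

# Keep exactly the training set's columns in the testing set
X_train, X_test = X_train.align(X_test, join="left", axis=1, fill_value=0)

# Log-transform the target to reduce its right skew
y_train_log = np.log1p(y_train)
y_test_log = np.log1p(y_test)
```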

MACHINE LEARNING MODELS

Multiple linear regression model

We start by trying a multiple linear regression model. R-squared is 0.9569 for the training set and 0.9112 for the testing set. The divergence between these scores may be due to overfitting, that is, the model performing better on the data it has been trained on than on the unseen data in the testing set.

The assumptions of the linear model are checked below:

Figure 11: Residuals plot (mlr) Figure 12: Q-Q plot (mlr).

As we can see, the model suffers from a certain degree of heteroskedasticity, and the residuals are not strictly normally distributed. The residuals, however, appear to be independent, which is confirmed by a Durbin-Watson test[1].

We also know from the correlation matrices that our model suffers from a certain degree of multicollinearity. The fact that the normality assumption does not hold should make us skeptical of the reliability of the model. Part of the reason this happens, and will also happen in the other linear models, is the number of zeros in variables that were not fully normalized by the Yeo-Johnson transformation. On account of that, we compare the predictions of the model against the prices in our testing dataset, which provides an empirical check. We try three different measures (computed as sketched after the list):

  1. Mean squared error (MSE)   =  0.01493
  2. Mean absolute error (MAE)  =  0.08406
  3. Root mean squared error (RMSE)  =  0.12222
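A sketch of the model fit and of how these measures can be computed is shown below; whether the original metrics were calculated on the log scale of the target is an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from statsmodels.stats.stattools import durbin_watson

lm = LinearRegression().fit(X_train, y_train_log)
print("Train R^2:", lm.score(X_train, y_train_log))
print("Test R^2:", lm.score(X_test, y_test_log))

pred = lm.predict(X_test)
mse = mean_squared_error(y_test_log, pred)
print("MSE:", mse)
print("MAE:", mean_absolute_error(y_test_log, pred))
print("RMSE:", np.sqrt(mse))

# Durbin-Watson statistic on the residuals (values near 2 suggest independence)
print("Durbin-Watson:", durbin_watson(y_test_log - pred))
```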

As explained before, our multiple linear regression model suffers from multicollinearity. To address this problem, we move on to regression models with regularization properties. We implement cross-validated lasso and ridge models, which also help correct for possible overfitting.

Lasso regression

The cross-validated lasso regression has a training R-squared of 0.92946 and a testing set score of 0.92196. The divergence between both scores has decreased in comparison to the multiple linear regression, suggesting that the latter may have suffered from overfitting.
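A cross-validated lasso can be fit with scikit-learn's LassoCV; the alpha grid, fold count, and iteration limit below are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV

lasso = LassoCV(alphas=np.logspace(-4, 1, 50), cv=5, max_iter=10000)
lasso.fit(X_train, y_train_log)

print("Selected alpha:", lasso.alpha_)
print("Train R^2:", lasso.score(X_train, y_train_log))
print("Test R^2:", lasso.score(X_test, y_test_log))
```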

Again, the model suffers from some heteroskedasticity, and the residuals are not exactly normally distributed. The Durbin-Watson[2] test suggests again that the residuals are independent.

Figure 13: Residuals plot (lasso) Figure 14: Q-Q plot (lasso).

We now evaluate the accuracy of the predictions:

  1. Mean squared error (MSE)   = 0.01313
  2. Mean absolute error (MAE)  = 0.07634
  3. Root mean squared error (RMSE)  = 0.11460

Ridge regression

The cross-validated ridge regression has a training R-squared of 0.92946 and a testing set score of 0.92196.
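The ridge counterpart can be fit with RidgeCV; the penalty grid below is an assumption.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

ridge = RidgeCV(alphas=np.logspace(-3, 3, 100))
ridge.fit(X_train, y_train_log)

print("Selected alpha:", ridge.alpha_)
print("Train R^2:", ridge.score(X_train, y_train_log))
print("Test R^2:", ridge.score(X_test, y_test_log))
```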

As with lasso, the model suffers from some heteroskedasticity, and the residuals are not exactly normally distributed. The Durbin-Watson[3] test suggests again that the residuals are independent.

Figure 15: Residuals plot (ridge) Figure 16: Q-Q plot (ridge).

 

 

The evaluation of the predictions is as follows:

  1. Mean squared error (MSE)   = 0.01252
  2. Mean absolute error (MAE)  = 0.07323
  3. Root mean squared error (RMSE)  = 0.11190

Gradient boosting

In addition to the linear models implemented above, we also try non-linear, tree-based models. There may be non-linear relationships in our data that tree models are able to capture, perhaps offering better performance than the linear models tried above. We start by implementing gradient boosting.
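A sketch of a cross-validated gradient boosting model, using GridSearchCV; the hyperparameter grid is an assumption for illustration.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

gbm = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid={"n_estimators": [500, 1000],
                "learning_rate": [0.01, 0.05],
                "max_depth": [3, 4]},
    cv=5,
)
gbm.fit(X_train, y_train_log)

print("Train R^2:", gbm.score(X_train, y_train_log))
print("Test R^2:", gbm.score(X_test, y_test_log))
```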

We also cross-validate our gradient boosting model in order to avoid overfitting. The R-squared is 0.93337 in the training set and 0.88656 in the testing set. The evaluation of the predictions is as follows:

  1. Mean squared error (MSE)   =  0.01909
  2. Mean absolute error (MAE)  =  0.08601
  3. Root mean squared error (RMSE)  =  0.13817

Random forest

Random forest is the other non-linear model that we try (a fitting sketch follows the evaluation below). The R-squared is 0.98281 in the training set and 0.86796 in the testing set. The evaluation of the predictions is as follows:

  1. The mean squared error (MSE) is 0.02151
  2. The root mean squared error (RMSE) is 0.14666
  3. The mean absolute error (MAE) is 0.09125
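A sketch of the random forest fit, again with a cross-validated grid search; the grid values are assumptions.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rf = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [300, 600],
                "max_features": ["sqrt", 0.5]},
    cv=5,
)
rf.fit(X_train, y_train_log)

print("Train R^2:", rf.score(X_train, y_train_log))
print("Test R^2:", rf.score(X_test, y_test_log))
```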

CONCLUSIONS

The model that performs best is the ridge regression. Its R-squared on the testing set is the highest, which implies that this model explains the most variance in our data. Its predictions are also the best according to the evaluation metrics we selected. They suggest that our model can predict housing prices with a margin of error of about 10%, which is the industry standard.

This information can help inform decision-making at the business level. As stated above, it can provide insight into the pricing of real estate assets simply by plugging a house's characteristics into the model and letting it return a price. In addition, it can provide information on which features of a new house are more valuable to potential customers.

We found the following insights about the effects of different features on housing prices, which can be useful for planning purchases or renovations for buyers, sellers, and developers:

  1. The variable with the largest effect on price is total square footage.
  2. The second most influential feature is OverallQual, which reflects overall material and finish quality. These are expected results.
  3. Proximity to the main road or railroad is also within the top 5 features. This can be important in the case of new developments.
  4. There are 3 features reflecting different neighborhoods among the top and bottom 5 features affecting price. This speaks to the importance of where new developments take place. Crawfor appears to be the neighborhood that most positively affects prices. This does not mean that it is the neighborhood with the most expensive housing, just that building in that location has the largest positive estimated effect.
  5. A garage condition of 'fair' has little impact on the price of the house (fifth from the bottom). On the other hand, garages ranked as 'average', which is just slightly higher quality than 'fair', have a significantly larger effect on the price. This suggests that upgrading low-quality garages to average quality may be a worthwhile investment.

Another insight comes from the fact that an unfinished basement has little effect on the price of houses. When remodeling houses, bear in mind that there is not much ROI for finished basements.

Sources Used

The relevant code and data for this project, including web scraping and analysis, can be found here.


[1] Durbin-Watson statistic: 1.8999

[2] Durbin-Watson statistic: 1.924

[3] Durbin-Watson statistic: 1.883
