Studying Data to Predict Housing Prices in Ames

Arthur Yu, Will Thurston, Jason Lai and Soomin Park
Posted on Sep 4, 2018
The skills the authors demonstrated here can be learned through taking the Data Science with Machine Learning bootcamp with NYC Data Science Academy.

Introduction

     The Ames Housing data set supplies sale price information for close to 3,000 homes in Ames, IA, along with some 79 features describing each property. Such a feature-rich dataset provides an excellent opportunity to apply machine learning techniques to predict the sale price of houses. As such, the main goal of this project is to explore various predictive models and gain a better understanding of the mechanisms behind them. In the course of finding the model that gives us the most accurate results, we hope to acquire deeper insight into the different models we used, the dataset itself, and the process as a whole.

Data Exploration and Cleaning

     As a first step toward finding a suitable model relating the sale price of houses to the other variables, we explored the data to get a better sense of the different features present and how to handle them. One of the first things we noticed was that the histogram of the target variable, sale price, is clearly not normally distributed.


Distribution of Sale Price

     The plot has a distinct rightward skew. Such a skew is not surprising: sale prices are bounded below by zero but not above, so a relatively small number of expensive homes stretches the distribution's right tail. Given this, if we want to fit a linear model to sale price, we should log-transform the target first so that it is approximately normally distributed.
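
As a concrete illustration, here is a minimal sketch of that transformation, assuming the training data is loaded from a CSV with a SalePrice column (the file name is hypothetical):

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

# Hypothetical file name; any load that yields a 'SalePrice' column works
train = pd.read_csv("train.csv")
print(f"Skewness before: {skew(train['SalePrice']):.2f}")  # distinctly positive

# log1p compresses the long right tail toward a roughly normal shape
log_price = np.log1p(train["SalePrice"])
print(f"Skewness after:  {skew(log_price):.2f}")  # close to 0
```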

Correlation Between Various Features and Sale Price

   Next, we investigated the correlation between the various features and sale price. We plotted the top ten features with a correlation of 0.5 or higher with sale price and noticed that many of them were highly correlated not only with sale price but also with each other. This is something to keep in mind as we do feature selection and engineering later on.


Correlation Matrix for Top Ten Features with Sale Price
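
A heatmap like the one above can be reproduced along these lines. This is a sketch continuing from the previous snippet, not the authors' exact plotting code:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Rank numeric features by absolute correlation with SalePrice and keep
# the ten strongest (plus SalePrice itself) for the heatmap
corr = train.select_dtypes(include="number").corr()
top = corr["SalePrice"].abs().sort_values(ascending=False).head(11).index

sns.heatmap(train[top].corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation Matrix: Top Ten Features vs. Sale Price")
plt.tight_layout()
plt.show()
```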

     Continuing our investigation of the dataset, we looked for missing values and worked out how to deal with them. The first figure shows the percentage of each column that is filled (not null), and the second shows the actual number of missing values in each column. As we can see, several columns have quite a lot of missing values. In general, we classified them based on what the missing value represents and imputed them accordingly.

[Figures: column fill percentages and missing-value counts]

Observations

The majority of the missing data correspond to "No such feature," so we imputed 'None' or 0 for them. With the other missing values, we took a more granular approach. For the missing numerical features, we grouped the data by Neighborhood and imputed the neighborhood mean or median, whichever seemed more appropriate. This is reasonable, as we expect houses in the same neighborhood to share similar features.

For the missing categorical features, we grouped the data similarly but imputed the neighborhood mode instead. These steps took care of the majority of the missing values. A few special cases were handled individually: "Garage Year Built" was imputed with the year the house itself was built, and kitchen quality was scaled according to the overall house quality rather than the neighborhood average.
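
The imputation strategy can be expressed compactly with pandas groupby/transform. The column subsets below are illustrative (real Ames column names, but not necessarily the exact lists we used), and the mode imputation assumes each neighborhood has at least one non-null value:

```python
# NA here means "no such feature": impute 'None' (illustrative subset)
for col in ["PoolQC", "Fence", "FireplaceQu"]:
    train[col] = train[col].fillna("None")

# Missing numeric feature: impute the neighborhood median
train["LotFrontage"] = (train.groupby("Neighborhood")["LotFrontage"]
                             .transform(lambda s: s.fillna(s.median())))

# Missing categorical feature: impute the neighborhood mode
train["MSZoning"] = (train.groupby("Neighborhood")["MSZoning"]
                          .transform(lambda s: s.fillna(s.mode()[0])))

# Special case: missing garage year defaults to the house's build year
train["GarageYrBlt"] = train["GarageYrBlt"].fillna(train["YearBuilt"])
```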

Feature Engineering

     For feature engineering, we began with the simplification of features. Because many features are related to each other, and are highly correlated as shown in the previous chart, we condensed multiple columns into single features. Altogether, we built five new features (a code sketch follows below):

  1. Total Bathrooms: the sum of above-ground full and half baths and basement full and half baths.
  2. House Age: the difference between Year Sold and Year Remodeled.
  3. Remodeled: a binary value indicating whether the remodel year differs from the year the house was built.
  4. Is New: a binary value indicating whether Year Sold equals Year Built.
  5. Neighborhood Wealth: a categorical value (1-4) grouping houses by their neighborhoods' median wealth.

New Engineered Features
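
A sketch of these five features using the standard Ames column names; the simple sums and the quartile binning for Neighborhood Wealth are our assumptions where the post does not pin down exact formulas:

```python
import pandas as pd

# 1. Total Bathrooms: above-ground plus basement full and half baths
train["TotalBath"] = (train["FullBath"] + train["HalfBath"]
                      + train["BsmtFullBath"] + train["BsmtHalfBath"])

# 2. House Age: years between remodel and sale
train["HouseAge"] = train["YrSold"] - train["YearRemodAdd"]

# 3. Remodeled: 1 if the remodel year differs from the build year
train["Remodeled"] = (train["YearRemodAdd"] != train["YearBuilt"]).astype(int)

# 4. Is New: 1 if the house sold in the year it was built
train["IsNew"] = (train["YrSold"] == train["YearBuilt"]).astype(int)

# 5. Neighborhood Wealth: quartiles (1-4) of neighborhood median sale price
nbhd_median = train.groupby("Neighborhood")["SalePrice"].median()
wealth = pd.qcut(nbhd_median, 4, labels=[1, 2, 3, 4])
train["NbhdWealth"] = train["Neighborhood"].map(wealth).astype(int)
```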

Data Transforming / Scaling

     After noting that several variables, such as Ground Living Area, showed a mostly linear relationship to sale price, we decided to use linear models to fit the dataset. We were unsure whether this would be the best method, but we wanted to give it a try. Linear models require us to scale the data and convert categorical values to dummy variables. For this we used Scikit-Learn's StandardScaler to scale the data (subtract the mean and divide by the standard deviation) and Pandas' get_dummies to one-hot encode our categorical features. As mentioned earlier, we also performed a log transformation on the target variable to normalize it.
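
In code, the preprocessing amounts to a few lines; a sketch continuing from the snippets above, assuming missing values have already been imputed:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# One-hot encode the categorical columns
X = pd.get_dummies(train.drop(columns=["SalePrice"]))

# Standardize every column: subtract the mean, divide by the std
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)

# Log-transformed target
y = np.log1p(train["SalePrice"])
```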

Modeling

Below is a list of methods we used:

  1. Linear Models
    1. Ridge Regression
    2. Lasso Regression
    3. ElasticNet Regression
    4. Support Vector Machine
  2. Tree based
    1. Random Forest Regression

The results are summarized in the table below. A detailed discussion of each model follows.

Results Summary

Linear Models: Regularized Linear Regressions

     The linear models we tried were regularized models: Ridge, Lasso, and ElasticNet regressions. Based on the generally linear trend between the target and the predictors mentioned earlier, we expected linear models to work well with the data. Since the data had more than 200 features and we had no exact way to rank them by importance for predicting the house price, ordinary linear regression would have been difficult to use. We decided to go directly to regularized regression models, which let us select meaningful features, mitigate overfitting, and overcome multicollinearity at the same time.

     Because Lasso and Ridge regressions constrain the size of the coefficients associated with each variable, and those constraints depend on each variable's magnitude, the standardization mentioned earlier was necessary. Second, we removed the outliers that fell significantly far from the linear relationship between the target and some of the main predictors, as shown in the plots below. Although the outlier removal might have caused some information loss, comparing the results from before and after removal showed that it did improve the performance of the models.

Results

The results of the Ridge, Lasso, and ElasticNet models, with the hyperparameters used, are shown below. The hyperparameters, λ and the L1 ratio, were optimized by grid searches using the LassoCV/RidgeCV/ElasticNetCV (K=10) functions from the Scikit-Learn package. For model evaluation, 10-fold cross-validation was used for each model.

Linear Model Summary
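
A sketch of the tuning and evaluation, with an illustrative alpha grid rather than our exact search space, and assuming the X_scaled and y from the preprocessing snippet:

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV
from sklearn.model_selection import cross_val_score

alphas = np.logspace(-4, 1, 50)  # illustrative search range

lasso = LassoCV(alphas=alphas, cv=10).fit(X_scaled, y)
ridge = RidgeCV(alphas=alphas, cv=10).fit(X_scaled, y)
enet = ElasticNetCV(alphas=alphas, l1_ratio=[0.1, 0.5, 0.9, 1.0],
                    cv=10).fit(X_scaled, y)

# 10-fold CV RMSE on the log-price scale for each tuned model
for name, model in [("Lasso", lasso), ("Ridge", ridge), ("ElasticNet", enet)]:
    mse = -cross_val_score(model, X_scaled, y, cv=10,
                           scoring="neg_mean_squared_error").mean()
    print(f"{name}: alpha={model.alpha_:.4f}, CV RMSE={np.sqrt(mse):.4f}")
```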

     In the plots comparing our predictions from the Ridge/Lasso models to the original target, all the models agreed quite well. All of them achieved an R² of around 0.92 and an RMSE of less than 0.12, the best Kaggle leaderboard scores among the models we ran. As we expected, the log-transformed sale price showed a fairly linear relationship with the predictors. The Ridge model received the best Kaggle leaderboard score, but the other models performed similarly.

Coefficient Plots

   We also used the Lasso method to generate the coefficient path plots below, which show the importance of the different variables: the later a variable's coefficient shrinks to zero as the penalty grows, the more it affects the target. The two variables that most influence the model are Total Square Ft. and Overall Quality. Other important variables are Ground Living Area, Year Built, and Overall Condition.
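
A path plot of this kind can be regenerated with scikit-learn's lasso_path; a sketch, assuming the scaled design matrix from earlier:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path

# Each curve traces one feature's coefficient as the penalty relaxes;
# features that stay nonzero under the strongest penalties matter most.
alphas, coefs, _ = lasso_path(X_scaled.values, y.values)

plt.plot(np.log10(alphas), coefs.T)
plt.xlabel("log10(alpha)")
plt.ylabel("Coefficient value")
plt.title("Lasso Coefficient Paths")
plt.show()
```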

     Another view of feature importance is shown below: the top 20 variables ranked by the magnitude of their coefficients in our best Lasso model. The same variables, Total Square Ft., Overall Quality, etc., affect the sale price the most. Note that in general the size of a coefficient is not an indicator of feature importance, but since we scaled all our variables, we can use coefficient magnitude as such a measure more readily.

Magnitude of Coefficients for Lasso Fit
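
A bar chart along these lines can be produced directly from the fitted LassoCV model of the earlier snippet; a sketch, not our exact plotting code:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Rank features by absolute coefficient and keep the 20 largest
coef = pd.Series(lasso.coef_, index=X_scaled.columns)
top20 = coef.reindex(coef.abs().sort_values(ascending=False).head(20).index)

top20.sort_values().plot(kind="barh", figsize=(6, 8))
plt.title("Top 20 Lasso Coefficients by Magnitude")
plt.tight_layout()
plt.show()
```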

Support Vector Machine Regression

     After trying the Ridge/Lasso based linear models, we tried SVM-based regression to see whether a different model, tuned with methods such as grid search and cross-validation (CV), could give results just as good or better. For comparison, we first tested SVR without parameter tuning, then obtained results with parameter tuning. We recorded the RMSLE benchmark from a 5-fold CV on the training set, the Kaggle leaderboard score, and the computational time of each configuration. Below is a table summary:

Results     

The results we obtained through SVR showed several key points. First, the choice of kernel in the SVR model plays a critical role in all three statistics. The linear kernel produced an extremely small RMSLE even before parameter tuning, indicative of severe overfitting, while the Gaussian kernel had a relatively large RMSLE but actually achieved a better Kaggle score than the linear kernel.

     The next trend we observed was that SVR training time generally increased by at least an order of magnitude when we used GridSearchCV for parameter tuning. This suggests that in projects involving larger datasets, one is advised to first run the model without parameter tuning as a benchmark, since the relative performance of the different kernels before tuning corresponds well with their performance after tuning. For example, the RBF (Gaussian) kernel achieved the best Kaggle leaderboard result both before and after tuning; conversely, the poly and sigmoid kernels performed poorly both before and after tuning.

     The last conclusion we can draw is that the 5-fold CV benchmark on the training set is a good indicator of each kernel's Kaggle performance. If a model performs well under the 5-fold CV benchmark, it is likely to perform well on the test set as well.
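
The benchmark-then-tune workflow described in this section might look like the following sketch; the parameter grid is illustrative, not our exact search space, and because the target is log-transformed the CV RMSE approximates RMSLE:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV, cross_val_score

# Untuned benchmark: one default SVR per kernel, scored by 5-fold CV RMSE
for kernel in ["linear", "rbf", "poly", "sigmoid"]:
    mse = -cross_val_score(SVR(kernel=kernel), X_scaled, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"{kernel}: untuned CV RMSE = {np.sqrt(mse):.4f}")

# Tuned RBF kernel via grid search (illustrative grid)
grid = GridSearchCV(SVR(kernel="rbf"),
                    param_grid={"C": [1, 10, 100],
                                "gamma": ["scale", 0.001, 0.01],
                                "epsilon": [0.01, 0.05, 0.1]},
                    scoring="neg_mean_squared_error", cv=5)
grid.fit(X_scaled, y)
print(grid.best_params_, np.sqrt(-grid.best_score_))
```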

Random Forest

     Due to the high number of categorical features, we felt the next best course of action was to train a Random Forest model, given its inherent resilience to non-scaled and categorical features. Even though it may not be the most time-efficient process, we implemented a grid search cross-validation method to tune for the best hyperparameters. We started with a fairly coarse grid search over large gaps in the parameters and ended with a very fine search to home in on the best parameters.

     To test the usefulness of these hyperparameters, we also fit a base random forest estimator using just 10 trees and otherwise default settings. The base estimator achieved an accuracy of 99.20% with an average error rate of 0.0954; our tuned model achieved an accuracy of 99.26% with an average error rate of 0.0888. With only a 0.06% increase in model accuracy, in most cases the tuning would not have been worth the time, especially for large datasets. This shows that our hyperparameter optimization process was not as efficient as it could be.
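
A sketch of the base-versus-tuned comparison; the coarse grid is illustrative, and we score here with CV RMSE on the log target rather than the accuracy metric quoted above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score

# Base estimator: 10 trees, everything else left at defaults
base = RandomForestRegressor(n_estimators=10, random_state=42)
mse = -cross_val_score(base, X_scaled, y, cv=5,
                       scoring="neg_mean_squared_error").mean()
print(f"Base RF CV RMSE: {np.sqrt(mse):.4f}")

# Coarse grid search; a finer grid would then narrow around best_params_
coarse = GridSearchCV(RandomForestRegressor(random_state=42),
                      param_grid={"n_estimators": [100, 300, 500],
                                  "max_depth": [10, 30, None],
                                  "max_features": ["sqrt", 0.3, 0.5]},
                      scoring="neg_mean_squared_error", cv=5)
coarse.fit(X_scaled, y)
print(coarse.best_params_, np.sqrt(-coarse.best_score_))
```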

Looking Forward / Summary

     As we completed our analysis of the dataset, we thought of ways to improve our model. One idea we discussed but did not have time to implement was to perform some sort of classification before modeling. We could add our own classes or groupings as variables and check feature importance to see whether and how our models changed based on these new variables. Such classification could even be done with unsupervised methods like clustering, discovering hidden groupings within the data and using them as new variables. Finally, we could have used ensemble methods to combine our models to obtain the best results.

     In conclusion, this is a basic analysis of the dataset using relatively rudimentary modeling techniques. Given the relative simplicity of the data, despite the large number of features, it is not surprising that we obtained the best results with our linear models. With more time, and now with a greater understanding of other modeling approaches, we feel that a much more in-depth analysis and subsequent modeling process could be done.

About Authors

Arthur Yu

I have a PhD in Physics from UC Irvine, following my B.A. degree in Physics and Mathematics from NYU. My main interests in data science are understanding and improving machine learning models and algorithms. I hope to learn...
View all posts by Arthur Yu >

Will Thurston

Will is currently a student at New York City Data Science Academy. He graduated from Rochester Institute of Technology with a BS in Computer Security in 2016. He then spent the following year as a Network Engineer gaining...
View all posts by Will Thurston >

Jason Lai

I graduated from NYU with a bachelor's degree in Math.
View all posts by Jason Lai >

Soomin Park

View all posts by Soomin Park >

