Data-driven Predictions of House Prices in Ames, Iowa

Posted on Mar 5, 2022

The skills the authors demonstrated here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

Background

Jones Inc. is a hypothetical real estate agency and development company that wants a model to predict housing prices in Ames, Iowa so it knows what it can ask for before putting a house on the market. The data used is the House Prices: Advanced Regression Techniques dataset from Kaggle.

This company:

  • Sells homes for homeowners.
  • Buys homes, makes improvements, and sells them for a profit.
  • Builds new homes.

Based on the data, the population of Ames is growing: it was 58,965 in 2010 and is 67,029 in 2022. That is a big difference. The rate of home ownership is almost 41%. There is a lot of opportunity for this company to make money.

The model I built not only predicts sale price but also provides information about the factors driving it.

Some key features influencing housing prices in Ames, Iowa:

  • Above grade (ground) living area in square feet
  • Quality of the overall material and finish of the house
  • Kitchen quality
  • House age
  • Overall condition of the house.

Kitchen quality is important. If a house was built in 1950 and the kitchen was never redone, it is probably worth remodeling it before putting the house on the market.

I am going to walk you through the process that I went through to build the model for this company.

Outliers in the data

The first thing I did was remove outliers from the data because they can seriously distort linear models.

A point on the lower left of the graph below, where the sale price is extremely low, was removed, as were the two points on the extreme right where above-grade living area exceeds 4,000 square feet.

Four points on the right of the graph below where lot area is extremely high were excluded.

The point on the extreme right of the graph below, where lot frontage is above 300 feet, was removed.
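A minimal sketch of how these outliers might be filtered with pandas, assuming the Kaggle train.csv file; the exact cutoffs for "extremely low" sale price and "extremely high" lot area are assumptions, not values from the original analysis:

```python
import pandas as pd

# Load the Kaggle training data (file name assumed)
df = pd.read_csv("train.csv")

# Drop the low sale price point and the two points with
# above-grade living area over 4,000 sq ft (price threshold assumed)
df = df[~((df["SalePrice"] < 40000) | (df["GrLivArea"] > 4000))]

# Drop the four points with extremely large lot areas (threshold assumed)
df = df[df["LotArea"] <= 100000]

# Drop the point with lot frontage above 300 ft (NaN frontage is kept)
df = df[~(df["LotFrontage"] > 300)]
```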

Additional observations dropped

Properties not zoned as residential were excluded.

Two neighborhoods were dropped:

  • GrnHill is a private senior citizen community and is not listed as a neighborhood in the data description.
  • Landmrk is not listed as a neighborhood in the data description or online.

Categorical features

  • Categorical features with missing values were imputed with No or None, as suggested by the data description. For example, a null value for Fence means No Fence.
  • Some categorical features were actually ordinal; I recoded these as integers. Example: Kitchen Quality.
  • Dummies were created for the linear models.
  • After doing all of this, I split the data into training (70%) and test (30%) sets, as sketched below.
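A sketch of the categorical preprocessing and the 70/30 split, assuming df is the DataFrame from the outlier step; the list of imputed columns and the quality mapping are illustrative, not the full set used in the analysis:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Null categorical values mean the feature is absent (e.g., no fence)
for col in ["Fence", "Alley", "MiscFeature", "PoolQC", "FireplaceQu"]:
    df[col] = df[col].fillna("None")

# Recode ordinal quality categories as integers (e.g., Kitchen Quality)
qual_map = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
df["KitchenQual"] = df["KitchenQual"].map(qual_map)

# Dummies for the linear models
df_linear = pd.get_dummies(df, drop_first=True)

# 70% training / 30% test split
X = df_linear.drop(columns="SalePrice")
y = df_linear["SalePrice"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```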

Transformation of sale price data

The log of sale price was used instead of the actual price because doing so makes the target more closely resemble a normal distribution. If the price is normally distributed, the residuals are more likely to be normally distributed, which is a key assumption of linear regression.

This is a histogram of sale price:

This is a histogram of the log of sale price.  This distribution more closely resembles a normal distribution.
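A short sketch of the transformation and the two histograms, assuming y_train from the split above and matplotlib for plotting:

```python
import numpy as np
import matplotlib.pyplot as plt

# Model the log of sale price so the target more closely resembles a normal distribution
y_train_log = np.log(y_train)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(y_train, bins=40)
axes[0].set_title("Sale Price")
axes[1].hist(y_train_log, bins=40)
axes[1].set_title("Log of Sale Price")
plt.show()
```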


Imputation of Numeric Features

All of the area and bathroom features were imputed with zero because, most likely, the house just doesn't have that feature. For example, where basement square footage is missing, the house probably just doesn't have a basement.

These features were imputed with zero:

  • MasVnrArea
  • BsmtFullBath
  • BsmtHalfBath
  • BsmtFinSF1
  • BsmtFinSF2
  • BsmtUnfSF
  • TotalBsmtSF
  • GarageCars


GarageYrBlt was imputed with the year that the house was built because there is a good chance that the garage was built at the same time.
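A sketch of the zero imputation and the GarageYrBlt rule, assuming the same df DataFrame:

```python
# Missing area and bathroom values imply the house lacks the feature, so impute zero
zero_cols = ["MasVnrArea", "BsmtFullBath", "BsmtHalfBath", "BsmtFinSF1",
             "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF", "GarageCars"]
df[zero_cols] = df[zero_cols].fillna(0)

# A missing garage year most likely means the garage was built with the house
df["GarageYrBlt"] = df["GarageYrBlt"].fillna(df["YearBuilt"])
```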


Lot Frontage

  • 17% of houses have missing values for Lot Frontage, so I took a more sophisticated approach.
  • The median value of Lot Frontage varies by neighborhood, as the graph below demonstrates. So, Lot Frontage was imputed with the median value for the house's neighborhood (see the sketch below).
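A sketch of the neighborhood-median imputation with a pandas group-wise transform, again assuming df:

```python
# Impute missing lot frontage with the median frontage of the house's neighborhood
df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median())
)
```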

Transformations for linear data models

  • I looked at the correlation of each continuous feature with log price.
  • For features without any zero values, I tried the Box-Cox and log transformations.
  • For features with zero values, I tried the log(1 + feature value) and Yeo-Johnson transformations.
  • I used the transformation that had the highest correlation.
  • If the untransformed feature had the highest correlation, I used it instead (see the sketch below).
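A sketch of this selection logic using scipy; the helper function best_transform is a hypothetical name for illustration, not code from the original analysis:

```python
import numpy as np
from scipy import stats

def best_transform(x, log_price):
    """Return the name and values of the candidate most correlated with log price."""
    candidates = {"raw": x.values}
    if (x > 0).all():
        candidates["log"] = np.log(x.values)
        candidates["boxcox"] = stats.boxcox(x.values)[0]
    else:
        candidates["log1p"] = np.log1p(x.values)
        candidates["yeojohnson"] = stats.yeojohnson(x.values)[0]
    corrs = {name: abs(np.corrcoef(vals, log_price)[0, 1])
             for name, vals in candidates.items()}
    best = max(corrs, key=corrs.get)
    return best, candidates[best]

# Example: pick a transformation for above-grade living area
name, values = best_transform(df["GrLivArea"], np.log(df["SalePrice"]))
```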

Features Created

Indicator features created due to poor coverage:

  • has_pool
  • has_miscfeature
  • alley_access


Two discrete features were created:

  • house_age = YrSold - YearBuilt
  • years_since_remodeled = YrSold - YearRemodAdd

Additional Indicator features created:

  • has_wood_deck
  • has_openporch
  • has_EnclosedPorch
  • has_basement
  • has_finished_basement

The additional indicators were created in case the feature each was derived from caused any multicollinearity. It turned out that finished basement square footage was highly correlated with total basement square footage. Although it didn't end up in the final model, I used has_finished_basement instead of finished basement square footage.
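A sketch of these engineered features, assuming df and the standard Ames column names, and assuming the None imputation from the categorical step above:

```python
# Indicator features for columns with poor coverage
df["has_pool"] = (df["PoolArea"] > 0).astype(int)
df["has_miscfeature"] = (df["MiscFeature"] != "None").astype(int)
df["alley_access"] = (df["Alley"] != "None").astype(int)

# Discrete age features
df["house_age"] = df["YrSold"] - df["YearBuilt"]
df["years_since_remodeled"] = df["YrSold"] - df["YearRemodAdd"]

# Additional indicators in case their source features cause multicollinearity
df["has_wood_deck"] = (df["WoodDeckSF"] > 0).astype(int)
df["has_openporch"] = (df["OpenPorchSF"] > 0).astype(int)
df["has_EnclosedPorch"] = (df["EnclosedPorch"] > 0).astype(int)
df["has_basement"] = (df["TotalBsmtSF"] > 0).astype(int)
df["has_finished_basement"] = (df["BsmtFinSF1"] > 0).astype(int)
```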

Grouping of Neighborhoods

There are many neighborhoods in Ames, Iowa, and sale price varies considerably by neighborhood.

Neighborhoods were grouped together based on median sale price.
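A sketch of one way to group neighborhoods by median sale price; the number of tiers and their labels are assumptions for illustration:

```python
import pandas as pd

# Median sale price for each neighborhood
nbhd_median = df.groupby("Neighborhood")["SalePrice"].median()

# Bin neighborhoods into price tiers (four equal-sized tiers assumed)
tiers = pd.qcut(nbhd_median, q=4, labels=["low", "mid_low", "mid_high", "high"])
df["nbhd_group"] = df["Neighborhood"].map(tiers)
```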

Multicollinearity

I examined features that intuitively could be correlated, particularly the square footage and area predictors.

The following features had variance inflation factors greater than five:

  • BsmtFinSF1
  • BsmtFinSF2
  • BsmtUnfSF
  • GarageYrBlt
  • GarageCars
  • bc_GrLivArea
  • bc_LotArea
  • yeo_TotalBsmtSF
  • log_first_FlrSF
  • yeo_GarageArea
  • yeo_LotFrontage

The variance inflation factor doesn't indicate what is correlated with what, so I also looked at a correlation matrix; both checks are sketched after the list below.

I consider a correlation of 0.4 or above to be high.

Of the features highly correlated with one another, I kept the ones with the highest correlation with log price.

Out of all the features in the correlation matrix, I kept these three:

  • Box-Cox transformation of above-grade (ground) living area (bc_GrLivArea)
  • Yeo-Johnson transformation of total basement square footage (yeo_TotalBsmtSF)
  • Box-Cox transformation of lot area (bc_LotArea)
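A sketch of the VIF and correlation checks with statsmodels and pandas, assuming the transformed features already exist in the training data; the column subset here is illustrative:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Features suspected of being collinear (subset of the list above)
cols = ["BsmtFinSF1", "BsmtUnfSF", "GarageCars", "bc_GrLivArea",
        "bc_LotArea", "yeo_TotalBsmtSF", "yeo_GarageArea", "yeo_LotFrontage"]
X_vif = X_train[cols]

# A VIF above roughly five flags a feature as involved in multicollinearity
vif = pd.Series(
    [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
    index=cols,
)
print(vif.sort_values(ascending=False))

# The correlation matrix shows which features are correlated with which
print(X_vif.corr().round(2))
```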

Models Built

Five different models were built:

Three Linear Models:

  • Ridge Regression
  • Lasso Regression
  • Elastic Net Regression

Two Tree Models:

  • Random Forest
  • Gradient Boosting

Linear Models Approach

  • All features were standardized.
  • Those features causing multicollinearity were excluded.
  • In an earlier iteration of each model, I excluded dummy features with a mean of less than 0.05, which markedly reduced the variance.
  • Features were sorted by the absolute value of their coefficients in descending order.
  • Features with the smallest absolute coefficients were gradually removed, which reduced the RMSE (root mean squared error) on the test data.
  • Once the RMSE started to increase, no additional features were removed, and the model iteration with the smallest RMSE was selected (see the sketch after this list).
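A sketch of this approach for the Elastic Net model; the penalty grid and cross-validation settings are assumptions, and the coefficient-based pruning loop is summarized rather than reproduced in full:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize the features and fit Elastic Net with cross-validated penalties
model = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], alphas=np.logspace(-4, 1, 50), cv=5),
)
model.fit(X_train, np.log(y_train))

# Rank features by the absolute value of their coefficients; the smallest ones
# are candidates for removal in the next iteration
coefs = pd.Series(model[-1].coef_, index=X_train.columns)
print(coefs.abs().sort_values(ascending=False).head(10))

# Test RMSE on the log of sale price
rmse = np.sqrt(mean_squared_error(np.log(y_test), model.predict(X_test)))
print(round(rmse, 4))
```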

Tree Models Approach

All features were standardized.

All categorical features were label encoded.

For Random Forest, depth curves using 500 trees were used to get a sense of what range of hyperparameters to try.

For Gradient Boosting, R-squared curves at different depths were used to determine which depths, learning rates, and numbers of trees to try. A sketch of the tree-model setup follows.
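A sketch of the label encoding and the Random Forest depth curves, assuming X_train and X_test here hold the non-dummified feature set with categoricals already imputed; the variable names and depth range are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder

# Label-encode the categorical features for the tree models
X_train_le, X_test_le = X_train.copy(), X_test.copy()
for col in X_train_le.select_dtypes(include="object").columns:
    le = LabelEncoder().fit(pd.concat([X_train_le[col], X_test_le[col]]))
    X_train_le[col] = le.transform(X_train_le[col])
    X_test_le[col] = le.transform(X_test_le[col])

# Depth curves: test RMSE of a 500-tree forest at each maximum depth
for depth in range(3, 13):
    rf = RandomForestRegressor(n_estimators=500, max_depth=depth, random_state=42)
    rf.fit(X_train_le, np.log(y_train))
    rmse = np.sqrt(mean_squared_error(np.log(y_test), rf.predict(X_test_le)))
    print(depth, round(rmse, 4))
```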

Depth Curves Random Forest

At a depth of 5, the RMSE is too high. Somewhere between depths of 6 and 10, the test error reaches its minimum, which is why I searched the range from 6 to 10.

I used a broad range for minimum number of samples for each split and minimum number of samples for each leaf.


R-Squared Curves Gradient Boosting

For a depth of three, the curves almost all bend at a right angle, suggesting that there is probably overfitting.


For a depth of two, things look different: not all of the curves suggest overfitting.


Based on the curves above, maximum depths of one and two, tree counts starting at 10,000, and three different learning rates were tried. Stumps (a depth of one) worked best, with 25,000 trees and a learning rate of 0.01, as sketched below.
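A sketch of the hyperparameter search for the two tree models; the minimum-samples grids and the cross-validation settings are assumptions, while the depth range, tree count, and learning rate follow the text above:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Random Forest: search depths 6-10 plus broad ranges for the split/leaf minimums
rf_grid = GridSearchCV(
    RandomForestRegressor(n_estimators=500, random_state=42),
    param_grid={"max_depth": range(6, 11),
                "min_samples_split": [2, 5, 10, 20],
                "min_samples_leaf": [1, 2, 5, 10]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
rf_grid.fit(X_train_le, np.log(y_train))

# Gradient Boosting: stumps with 25,000 trees and a 0.01 learning rate worked best
gbm = GradientBoostingRegressor(
    n_estimators=25000, learning_rate=0.01, max_depth=1, random_state=42
)
gbm.fit(X_train_le, np.log(y_train))
```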

Results


Elastic Net is the best model. Of the three linear models, it has the smallest root mean squared error, the smallest variance, and the smallest number of features.

The root mean squared error of the gradient boosting model is smaller than that of Elastic Net, but the variance is much higher.

In terms of dollars, the variance of Elastic Net is in the hundreds of dollars whereas with Gradient Boosting, it’s a few thousand.

The underlying relationship is most likely linear given the number of continuous features highly correlated with price.

Feature Importances

Similar predictors had the highest feature importances in the tree models.

Given more time, these steps would be taken:

  • I would group the neighborhoods into four categories of similar size.  The two largest neighborhood groupings did not end up in the final model nor did the two smallest.
  • I would exclude correlated features from the tree models.  Although mathematically, with the tree models, multicollinearity is not a big issue, there is no need to include redundant features.  Doing so makes the feature importances more difficult to interpret and harder to explain to the client.

 

About Author

Denise Garbato

I am a Statistician and Business Analyst who supports strategic decision making in digital and traditional marketing channels by discovering insights, applying statistical and programming skills with a results-focused approach. I am skilled in data analysis and predictive...
