Ames Iowa Housing Market Data Analysis

Posted on Aug 18, 2021
The skills I demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

Introduction

The Ames, Iowa Housing Dataset is a public dataset made available by Kaggle.  The dataset is a record of houses sold in Ames, Iowa between 2006 and 2010, with each entry described by 79 explanatory variables and the sale price of the house.  This project uses regression techniques to predict the sale price of houses from the explanatory variables and to determine which variables are most likely to increase the property value of a home in Ames.

The code used for this project can be found on GitHub.

About the dataset

The Ames, Iowa dataset was collected over the course of several years in order to characterize the housing market of Ames, Iowa.  Before conducting this analysis, I hypothesized which variables would correlate most strongly with an increase in housing prices.  I predicted that the following four variables would be the most important.

  • Lot Area - The size of the lot in square feet
  • Overall Quality - A rating (1-10) of the material and finish quality of the home
  • Ground Living Area - The home's square footage above ground level (excluding the basement level)
  • Neighborhood Location - The name of the neighborhood where the house is located

Background Data

Before exploring regression techniques, I reviewed the dataset documentation and followed its recommendations.  The most important caveat in the documentation concerns outliers.  The study that collected this data noted that several entries record the sale of properties with a very high square footage at a very low sale price relative to other houses sold.  Because of this, the documentation recommends omitting any house with a ground living area greater than 4000 square feet.  After removing these outliers, sale price shows a clear positive trend as ground living area increases.
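
As a minimal sketch of the outlier filter described above, assuming the data is loaded into a pandas DataFrame that uses the Kaggle column names `GrLivArea` and `SalePrice`: the helper name `drop_large_homes` and the toy rows are made up for illustration and are not taken from the project code.

```python
import pandas as pd

def drop_large_homes(df: pd.DataFrame, max_area: float = 4000) -> pd.DataFrame:
    """Keep only rows whose above-ground living area is at most max_area sq ft."""
    return df[df["GrLivArea"] <= max_area].reset_index(drop=True)

# Toy rows (hypothetical values): two ordinary homes, two very large cheap ones.
toy = pd.DataFrame({
    "GrLivArea": [1500, 2200, 4500, 5200],
    "SalePrice": [180000, 250000, 150000, 170000],
})

cleaned = drop_large_homes(toy)
print(cleaned)
```

Filtering on the explanatory variable rather than on price keeps the rule simple and matches the documentation's recommendation.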

In addition to assessing potential outliers, I investigated sale price by year built.  I wanted to see how the distribution of sale price relates to the year a home was built in order to gauge the impact of the market on how many houses were built, the sale price of those houses, and the general effect of inflation.  The plot shows an upward trend consistent with inflation, and it suggests that certain economic events affected the number of houses built in each year from 1991 until 2010.
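
One way to summarize sale price by year built, sketched under the assumption that the Kaggle column names `YearBuilt` and `SalePrice` are used.  The rows below are invented purely to show the shape of the output.

```python
import pandas as pd

# Hypothetical rows; real data would come from the Kaggle training file.
toy = pd.DataFrame({
    "YearBuilt": [1995, 1995, 2003, 2003, 2003, 2008],
    "SalePrice": [150000, 170000, 200000, 220000, 240000, 260000],
})

# For each construction year: how many homes were sold and their median price.
by_year = toy.groupby("YearBuilt")["SalePrice"].agg(
    homes="count",
    median_price="median",
)
print(by_year)
```

The `homes` column tracks how many houses were built in each year, while `median_price` gives a price level that is less sensitive to a few expensive homes than the mean.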

Missing Data

Many entries in this dataset contain missing values.  In total, 19 variables demonstrate some degree of missingness.  Of those variables, four have a very high share of missing values (over 75% of all entries).  Those variables are:

  • Pool Quality
  • Fence
  • Alley
  • Misc Features
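
The share of missing values per variable can be computed along these lines.  This is a sketch on invented rows; `missing_fraction` is a hypothetical helper name, and the 75% cutoff is the one quoted above.

```python
import numpy as np
import pandas as pd

def missing_fraction(df: pd.DataFrame) -> pd.Series:
    """Fraction of missing values per column, only for columns with any gaps."""
    frac = df.isna().mean()
    return frac[frac > 0].sort_values(ascending=False)

# Hypothetical rows using Kaggle-style column names.
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 80.0, 70.0, 60.0],
    "GrLivArea": [1500, 1200, 2000, 1700, 900],
})

report = missing_fraction(toy)
mostly_missing = report[report > 0.75].index.tolist()
print(report)
print(mostly_missing)
```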

In order to get the most value out of this dataset, instead of dropping variables with missing data, I imputed the missing data based on characteristics of the available data.  Based on the most recent city planning report that I was able to find, it is clear that neighborhoods within Ames have some uniform characteristics because of the city's water resource management plan.  Because of the uniformity of neighborhood development and lot size, I determined that it would be reasonable to impute missing data for the following variables based on the average characteristics of the neighborhood location of each entry.

  • Garage year built
  • Lot frontage

For the remaining 17 variables, I imputed the data based on the average for all values of the variable.
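
The neighborhood-based imputation can be sketched as follows, assuming the Kaggle column names `Neighborhood` and `LotFrontage`.  The function name and the toy rows are illustrative, and the fallback to the overall mean is my own assumption for neighborhoods with no observed values.

```python
import numpy as np
import pandas as pd

def impute_by_neighborhood(df, columns, group_col="Neighborhood"):
    """Fill gaps in `columns` with the mean of the row's neighborhood."""
    out = df.copy()
    for col in columns:
        # Mean of the column within each neighborhood, aligned to every row.
        group_mean = out.groupby(group_col)[col].transform("mean")
        out[col] = out[col].fillna(group_mean)
        # Assumed fallback: if a whole neighborhood is missing, use the overall mean.
        out[col] = out[col].fillna(out[col].mean())
    return out

# Hypothetical rows with two gaps.
toy = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "Gilbert", "Gilbert"],
    "LotFrontage": [60.0, 80.0, np.nan, 50.0, np.nan],
})

filled = impute_by_neighborhood(toy, ["LotFrontage"])
print(filled)
```

The global-average imputation used for the other variables is the same idea without the grouping step, for example `df[col].fillna(df[col].mean())` for a numeric column.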

Data Cleaning

Several regression algorithms were used to predict the price of unlabeled housing entries, and two data cleaning methods were used.

Multiple Linear Regression Data

Because Multiple Linear Regression (MLR) is not robust to multicollinearity, variables that were found to be highly correlated with one another were dropped from the dataset for MLR analysis.  The 29 least correlated variables were judged to have an acceptably low correlation with all other variables and were kept; all other variables were dropped.
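
The exact selection rule is not spelled out here, so the following is only one plausible way to do it: keep a variable if its absolute correlation with every variable already kept stays under a threshold.  The 0.8 threshold, the helper name, and the synthetic data are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """Greedily keep columns whose |correlation| with all kept columns <= threshold."""
    corr = df.corr().abs()
    kept = []
    for col in corr.columns:
        if all(corr.loc[col, k] <= threshold for k in kept):
            kept.append(col)
    return df[kept]

# Synthetic data: room count is built from living area, lot area is independent.
rng = np.random.default_rng(0)
area = rng.normal(1500, 400, size=200)
toy = pd.DataFrame({
    "GrLivArea": area,
    "TotRmsAbvGrd": area / 250 + rng.normal(0, 0.2, size=200),
    "LotArea": rng.normal(10000, 2500, size=200),
})

reduced = drop_correlated(toy)
print(list(reduced.columns))
```

A greedy pass like this depends on column order, so in practice it may be worth ordering the columns by how strongly each correlates with sale price before filtering.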

All Other Regression Algorithms

The distribution of the Sale Price data appeared to have a slight right skew.  Because regression algorithms are used in the analysis, it is preferable for the Sale Price data to be approximately normally distributed.  To correct the skew, the natural log transform was applied to the Sale Price data.  The distribution of the sale price data before and after the natural log transform can be seen in the figures below.
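
A small sketch of the log transform and its effect on skewness.  The prices below are invented, and the only thing to remember in practice is that predictions made on the log scale need `np.exp` to get back to dollars.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices: one expensive home pulls the tail out.
prices = pd.Series(
    [100000, 120000, 135000, 150000, 180000, 250000, 600000],
    name="SalePrice",
)

log_prices = np.log(prices)        # natural log transform of the target
skew_before = prices.skew()        # sample skewness on the dollar scale
skew_after = log_prices.skew()     # sample skewness on the log scale

# Predictions made on the log scale are mapped back to dollars with exp.
recovered = np.exp(log_prices)

print(skew_before, skew_after)
```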

Regression Results

Five regression-based and tree-based algorithms were used to predict the sale price of the housing data.

  • Multiple Linear Regression (MLR)
  • Ridge Regression
  • Lasso Regression
  • Elastic Net Regression
  • Gradient Boosting

The root mean squared error (RMSE) and R-squared values of each algorithm's training results, after hyperparameter tuning, can be seen in the figure below.
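
For reference, the two metrics can be written out directly with NumPy.  This is a sketch of the standard definitions on made-up numbers, not the evaluation code used for the figure.

```python
import numpy as np

def rmse(y_true, y_pred) -> float:
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up targets and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

model_rmse = rmse(y_true, y_pred)
model_r2 = r_squared(y_true, y_pred)
print(model_rmse, model_r2)
```

If the model is fit on the log of sale price, the RMSE is on the log scale too, so it is not directly comparable with an RMSE computed in dollars.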

It is clear that in training, gradient boosting achieved the lowest error, best fitting the sale price of the held-out validation entries.  In gradient boosting, the most accessible hyperparameter is the number of boosting rounds.  The image below shows the root mean squared error plotted against the number of boosting rounds.  The mean test RMSE shows diminishing returns as the number of rounds increases past 50.  To conserve computing resources, 50 boosting rounds were used in testing.
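
The "diminishing returns" judgment can also be made programmatically.  The sketch below assumes you already have a list of mean test RMSE values, one per boosting round, from whatever cross-validation routine you use; the helper name, the tolerance, and the numbers are hypothetical and are not the results from this project.

```python
def rounds_before_small_gain(rmse_by_round, min_gain):
    """Return the number of rounds after which one more round
    improves RMSE by less than min_gain."""
    for i in range(1, len(rmse_by_round)):
        gain = rmse_by_round[i - 1] - rmse_by_round[i]
        if gain < min_gain:
            return i
    return len(rmse_by_round)

# Hypothetical mean test RMSE after rounds 1, 2, 3, ...
curve = [0.30, 0.20, 0.15, 0.14, 0.139, 0.1389]

chosen_rounds = rounds_before_small_gain(curve, min_gain=0.005)
print(chosen_rounds)
```

A rule like this trades a little accuracy for compute time, which is the same trade-off as stopping at a fixed round count by eye.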

Variable Importance

A price per unit change analysis was performed on the variables used in MLR.  MLR is used here instead of the other regression algorithms because its coefficients can be read directly as the change in sale price per unit change in a variable, whereas the penalty hyperparameters of the regularized models shrink the coefficients and gradient boosting does not fit a single linear coefficient per variable.  In MLR, pool quality appeared to receive a strong per unit importance because of the granularity of the variable itself.  Of the five highest-ranking variables by per unit sale price increase, four were related to the pool.

This is because dummy variables were created to represent ordinal data.  After eliminating the ordinal data, ground living area and roof material become the most dominant features in the per unit change analysis.
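
To see why dummy variables can dominate a per unit ranking, here is a sketch using ordinary least squares on synthetic data.  The price formula, the `Absent` label, and the rows are all invented; only the column names follow the Kaggle convention.

```python
import numpy as np
import pandas as pd

# Synthetic homes: price = 50000 + 100 per sq ft + 20000 if the pool is rated "Gd".
toy = pd.DataFrame({
    "GrLivArea": [1000, 1500, 2000, 2500, 1200, 1800],
    "PoolQC": ["Absent", "Absent", "Gd", "Gd", "Absent", "Gd"],
})
toy["SalePrice"] = 50000 + 100 * toy["GrLivArea"] + 20000 * (toy["PoolQC"] == "Gd")

# One dummy column per non-baseline category ("Absent" is dropped as baseline).
X = pd.get_dummies(toy[["GrLivArea", "PoolQC"]], drop_first=True, dtype=float)
X.insert(0, "Intercept", 1.0)

coef, *_ = np.linalg.lstsq(
    X.to_numpy(dtype=float),
    toy["SalePrice"].to_numpy(dtype=float),
    rcond=None,
)
per_unit = pd.Series(coef, index=X.columns)
print(per_unit)
```

The living-area coefficient is dollars per square foot, while the dummy coefficient is the whole price difference between two categories.  A "unit" of a dummy is therefore a much bigger step than a unit of a continuous variable, which is why the raw ranking favors the dummies.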

Conclusions

It is clear that regression algorithms can be used to accurately predict the price of houses in Ames, Iowa based on the variables recorded in this dataset.  Additionally, there are several features that greatly impact the price of a house.  For anyone who owns a home in Ames, Iowa and is interested in listing it, it would be worth considering building an addition or updating the roof material before selling in order to maximize the return on a potential sale.

About Author

Matthew Boubin

Matt Boubin is an electrical engineer with three years of digital signal processing experience in commercial aviation.