Analyzing Data to Predict Home Prices in Ames, IA

Project GitHub | Aditya's LinkedIn | Ryan's LinkedIn | Gabby's LinkedIn

Introduction: 

For this project, we took on the role of Data Scientists hired by Zillow Offers to create a pricing model for homes in Ames, Iowa. Using data on over 2,500 home sales between 2006 and 2010, we were responsible for thoroughly cleaning and analyzing the data, and then building machine learning models to accurately price homes and answer questions about the real estate market in Ames. We implemented both supervised and unsupervised ML methods to predict housing prices and to create a handbook of data-driven insights for use by Zillow-affiliated real estate agents.

Business Angle:

Zillow Offers is an instant buying service provided to individual homeowners. Simply fill out a short online form about your home and Zillow will send an offer to purchase in just two business days, contingent upon an inspection. You don't have to show your home, don't have to prep it for sale, can set your own move-out date, and have surety of payment.

Also, if you choose not to sell your home to Zillow, you can list it on the open market through a Zillow Premier Agent. Zillow views its Offers segment as a service to home sellers and as a way to drive volume into Zillow's other businesses. The segment runs on 'razor-thin' margins, with a 2019 analysis of instant home buyers showing an average profit over cost of renovations of just 1.3%. Zillow Offers bought 4,162 homes in 2020 and is currently available in 25 metro areas across the country.

Zillow specifically cites its "superior data science and technology" as a competitive advantage, which makes the business angle we chose for our project particularly relevant. We decided to tackle two significant business problems for Zillow Offers as it hypothetically expands into Ames, Iowa: accurately predicting home prices and maintaining positive relations with the existing real estate industry in the area. The second point can be particularly difficult for Zillow as it continues to make inroads into the home buying industry.

To provide incentive for real estate agents to work with Zillow rather than isolate the company, we will assemble a handbook for Zillow Premier Agents that provides useful data-driven insights into the local housing market in Ames, Iowa. This information will help agents provide more value to their clients and encourage them to work with Zillow, not against it.

Data on Feature Engineering:

In order to proceed with modeling, the data needed to be cleaned and transformed. We used a combination of domain knowledge (for example, when combining classes of exterior coverings of similar value) and statistical tests (such as t-tests and chi-squared tests) to condense and simplify our dataset while retaining useful information.
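A minimal sketch of the kind of statistical check described above, assuming Python with SciPy. The prices are made up for illustration, the 5% threshold is an assumption, and Welch's t-test is one reasonable choice rather than necessarily the variant used in the project.

```python
from scipy import stats

# Hypothetical sale prices for two exterior classes (illustrative values only).
vinyl = [182000, 175500, 190000, 168000, 201000, 179000]
plywood = [178000, 185000, 171000, 196000, 174500, 188000]

# Welch's t-test: do the two classes differ in mean sale price?
result = stats.ttest_ind(vinyl, plywood, equal_var=False)

# Assumed decision rule: if we cannot reject equal means at the 5% level,
# treat the two classes as candidates for merging into one category.
ALPHA = 0.05
merge_candidates = result.pvalue > ALPHA
print(f"p-value = {result.pvalue:.3f}, merge candidates: {merge_candidates}")
```

A non-significant difference is only a hint that two classes can be combined, which is why it makes sense to pair the test with domain knowledge.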

Through the process of dummification and binarization, we were able to create interpretable datasets for our machine learning models. Some features were transformed to show slightly different information, such as combining separate porch type features into a binary feature for whether a home has a porch or not. 
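The two transformations above could look roughly like this in pandas. The column names follow the public Ames dataset and the rows are toy values, so treat both as assumptions rather than the project's actual code.

```python
import pandas as pd

# Toy rows; column names follow the public Ames dataset (an assumption here).
homes = pd.DataFrame({
    "OpenPorchSF": [0, 40, 0],
    "EnclosedPorch": [0, 0, 112],
    "ScreenPorch": [0, 0, 0],
    "Exterior1st": ["VinylSd", "Plywood", "AsbShng"],
})

porch_cols = ["OpenPorchSF", "EnclosedPorch", "ScreenPorch"]

# Binarization: 1 if the home has any porch area at all, else 0.
homes["HasPorch"] = (homes[porch_cols].sum(axis=1) > 0).astype(int)

# Dummification: one 0/1 column per exterior class. drop_first removes one
# class so the remaining dummies are not perfectly collinear.
features = pd.get_dummies(
    homes.drop(columns=porch_cols),
    columns=["Exterior1st"],
    drop_first=True,
    dtype=int,
)
print(features)
```

Dropping one dummy per categorical feature matters for linear models, where a full set of dummies plus an intercept is perfectly collinear.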

Neighborhood Clustering

Neighborhood Clustering: Another feature we engineered was created by performing hierarchical clustering on the Neighborhood feature of our original data. This column originally consisted of almost 30 unique values, most of which were sparsely represented.

Though we thought the neighborhood of a house would be important in its sale price, we wanted to narrow the column down to the most informative neighborhoods to minimize the number of columns when dummified while maintaining as much information as possible.

To do so, we treated each neighborhood in the test data as a single instance and clustered on the following attributes: gross living area, year built, school district, sale price and distance from campus. In doing so we identified three unique clusters. We assigned these cluster labels to our data and used them in our modeling process.
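A rough sketch of neighborhood-level hierarchical clustering, assuming SciPy. The six neighborhoods and their values are invented, only three of the five attributes are used, and Ward linkage is an assumed choice since the linkage method is not stated above.

```python
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage

# One row per neighborhood, aggregated from sale records (made-up values).
hoods = pd.DataFrame(
    {
        "GrLivArea": [1100, 1150, 1600, 1650, 2300, 2400],
        "YearBuilt": [1935, 1942, 1975, 1980, 2004, 2006],
        "SalePrice": [105000, 112000, 165000, 172000, 310000, 325000],
    },
    index=["A", "B", "C", "D", "E", "F"],
)

# Standardize so no attribute dominates the distance through its units alone.
scaled = (hoods - hoods.mean()) / hoods.std()

# Ward linkage merges the pair of clusters that least increases
# within-cluster variance at each step.
tree = linkage(scaled.to_numpy(), method="ward")

# Cut the tree so that at most three clusters remain.
hoods["Cluster"] = fcluster(tree, t=3, criterion="maxclust")
print(hoods["Cluster"])
```

The resulting label can then replace the original neighborhood column, so dummification yields a handful of columns instead of nearly 30.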

Row Selection

Row Selection: Our business angle was kept in mind at all times as we went through the process of cleaning our data. Since our model is used to predict the prices of homes in Ames, Iowa for Zillow Offers, we wanted to (1) stick to the type of property Zillow would potentially provide offers for, and (2) narrow the ranges of key features in our model to values we have pricing data for.

Examples of our row filtering included limiting the model’s use to homes below 3,000 sq ft (there were very few homes larger than this) and dropping homes in poor overall condition, as Zillow is not in the business of flipping or renovating homes. We also dropped home sale records that didn’t represent normal open-market transactions, such as foreclosure sales.
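The filters described above could be expressed as a single boolean mask in pandas. The column names and the "Normal" sale label follow the public Ames dataset, while the condition cutoff is a hypothetical value, since the exact threshold for "poor" is not given.

```python
import pandas as pd

# Toy sale records (illustrative values only).
sales = pd.DataFrame({
    "GrLivArea": [1500, 3400, 1800, 2100],
    "OverallCond": [5, 6, 2, 7],
    "SaleCondition": ["Normal", "Normal", "Normal", "Abnorml"],
    "SalePrice": [185000, 420000, 90000, 230000],
})

MAX_AREA_SQFT = 3000  # stated above: keep homes below 3,000 sq ft
MIN_CONDITION = 3     # assumption: ratings below this count as "poor"

keep = (
    (sales["GrLivArea"] < MAX_AREA_SQFT)
    & (sales["OverallCond"] >= MIN_CONDITION)
    & (sales["SaleCondition"] == "Normal")  # normal open-market sales only
)
filtered = sales[keep].reset_index(drop=True)
print(filtered)
```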

Data on Our Models: 

The machine learning models we implemented to predict sale prices were penalized and stepwise regression, random forest, and support vector regressor. Feature selection was extremely important in creating a properly fit model as we had over 100 attributes after dummifying necessary columns and implementing our initial feature screening process. Our original thought was to use lasso regression to help prune our 100+ columns to the most important ones and limit variance inflation and multicollinearity.
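To illustrate how lasso prunes columns, here is a small sketch with scikit-learn on synthetic data. The data, the number of columns, and the penalty strength `alpha=0.2` are all assumptions for the example. In practice the penalty would be tuned, for instance by cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the housing matrix: 10 columns, only the first 3 matter.
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(scale=0.5, size=200)

# Lasso penalizes coefficient magnitude, so features should share a common scale.
X_scaled = StandardScaler().fit_transform(X)

# alpha sets the penalty strength; 0.2 is an arbitrary value for this toy data.
model = Lasso(alpha=0.2).fit(X_scaled, y)

# Features whose coefficients were shrunk exactly to zero are effectively dropped.
selected = np.flatnonzero(model.coef_)
print(f"kept {selected.size} of {X.shape[1]} features: {selected.tolist()}")
```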

The lasso penalized linear regression model narrowed our feature count down to 69 and scored a test R2 of 92%. We also tried stepwise regression, implemented in R, which ultimately became our preferred model due to its simplicity and interpretability. This model used only 31 features and had a comparable R2 of 91%, without significant multicollinearity concerns (as measured by coefficient VIFs).
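For readers unfamiliar with VIFs, the sketch below computes them with NumPy. It relies on the identity that the VIF of feature j, 1 / (1 - R_j^2), equals the j-th diagonal entry of the inverse correlation matrix. The three columns are synthetic and their names are purely illustrative.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return one VIF per column of X (rows are observations)."""
    corr = np.corrcoef(X, rowvar=False)
    # The diagonal of the inverse correlation matrix equals 1 / (1 - R_j^2).
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(0)
area = rng.normal(size=500)
age = rng.normal(size=500)
# A column that is almost a copy of `area`, mimicking a redundant feature.
rooms = area + 0.1 * rng.normal(size=500)

vifs = variance_inflation_factors(np.column_stack([area, age, rooms]))
for name, value in zip(["area", "age", "rooms"], vifs):
    print(f"{name}: VIF = {value:.1f}")
```

A VIF near 1 means a feature is nearly uncorrelated with the others, while values above roughly 5 to 10 are commonly treated as a warning sign.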

In addition, a random forest non-linear model was run which used 86 features and had an R2 of 90%. Support Vector Regression was also tried. It performed worse than the null model with the default RBF kernel, but performed well using a linear kernel with an R2 score of 95%. We, however, opted for the more interpretable linear regression model given its similar R2 performance.
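One common reason a default RBF support vector regressor scores at or below a null model is an unscaled target. We cannot confirm from the write-up that this is what happened here, so the sketch below is only an illustration of that general effect. It uses synthetic data and scikit-learn defaults.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic data: price in dollars driven by living area in square feet.
area = rng.uniform(800, 3000, size=(200, 1))
price = 50_000 + 100 * area[:, 0] + rng.normal(scale=10_000, size=200)
X_train, X_test = area[:150], area[150:]
y_train, y_test = price[:150], price[150:]

# Default RBF SVR on raw dollars. With C=1 the fitted function can deviate
# from its intercept by at most C times the number of training points,
# which is tiny next to dollar-scale prices.
raw = SVR().fit(X_train, y_train)
r2_raw = r2_score(y_test, raw.predict(X_test))

# Same model, but with both the feature and the target standardized first.
scaled = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR()),
    transformer=StandardScaler(),
).fit(X_train, y_train)
r2_scaled = r2_score(y_test, scaled.predict(X_test))

print(f"raw target R2: {r2_raw:.2f}, standardized R2: {r2_scaled:.2f}")
```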

Pricing Model:

For our pricing model we chose a multiple linear regression with 31 features determined through stepwise feature selection. This model exhibited predictive metrics similar to those of more complex models, while maintaining the interpretability of its calculations.

The features used in the model, and their coefficients, can be found in the accompanying table. Unsurprisingly, we found the most important features to be the size of the house, its age, and its overall quality. A surprising finding was that the stepwise procedure did not select the number of bathrooms as a feature. When we added this feature back to the model, it increased multicollinearity between features while adding no predictive power. This suggests that the number of bathrooms in a house in Ames can generally be predicted from the features already used in the model, including the size and age of the home.

Running our test data set through the pricing model resulted in an R2 of 92% and a residual standard error of $19,390. While this error is large compared to the estimated average profit per house of 1.3% for instant homebuying programs, we assume that it adds variance to our per-home profit rather than causing systematic mispricing. Collecting more home sale records in Ames would likely reduce our residual standard error further, resulting in a more accurate model.
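The assumption of no systematic mispricing matters because unbiased, independent errors average out across many purchases. Under that assumption, the standard deviation of the average pricing error over n homes is the residual standard error divided by the square root of n. The purchase counts below are hypothetical.

```python
import math

RESIDUAL_STD_ERROR = 19_390  # dollars, from the test-set result above

def average_error_std(n_homes: int) -> float:
    """Std. dev. of the mean pricing error across n independent, unbiased purchases."""
    return RESIDUAL_STD_ERROR / math.sqrt(n_homes)

for n in (1, 100, 1_000):  # hypothetical purchase counts
    print(f"{n:>5} homes: about ${average_error_std(n):,.0f} average error per home")
```

If the errors were biased or correlated, for example across a whole neighborhood, this averaging would not hold and the thin margin would be at much greater risk.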

Data Driven Handbook of Insights:

With the pricing model completed, it was time to turn our attention to building a handbook of data-driven insights for use by Zillow-affiliated real estate agents. We investigated several interesting questions and used our models to provide answers for our agents.

Kitchen

Kitchen Renovations: Kitchens are very expensive to remodel, with online sources indicating a midrange kitchen renovation often totals $40,000 to $50,000. Through interpretation of our machine learning model, we found that homes in Ames with excellent kitchens command just a $24,300 premium.

Since it is cheaper to buy a home with a high-quality kitchen than to upgrade one post-purchase, buyers looking for excellent-quality kitchens should purchase a home with one rather than planning a remodel, and sellers should avoid doing full kitchen renovations as they are unlikely to recoup their investment.

Asbestos Shingles

Asbestos Shingles: Asbestos was a widely used house exterior material until the 1970s, when health concerns led to increasing restrictions on its use. The negative health effects of asbestos are widely known, yet many homes still have their original asbestos exteriors. It is not a serious health issue until the asbestos becomes cracked and worn and fibers become airborne, but the shingles will eventually need to be replaced. Replacement is an expensive process requiring special safety gear and permitting.

Our model indicates a $13,400 hit to home resale value from having asbestos shingles rather than more typical vinyl siding or plywood covering. The cost to both dispose of and replace asbestos with another exterior material is likely to be significantly higher than this, so we recommend that buyers avoid homes with asbestos siding and that sellers forgo replacing it and sell their home as-is.

Bathrooms

Bathrooms: Available online sources indicate that the cost of adding a new bathroom to a home averages $10,000. Given that our pricing algorithm does not increase in accuracy when including bathroom count, it is not recommended that sellers add to the bathroom count of their home. It is, however, recommended that buyers make sure to purchase a home with the number of bathrooms they desire. Money invested in adding bathrooms after purchase is unlikely to be recouped at a future sale of the home.

Remodeling

Remodeling: Another question we explored was how much the sale prices of remodeled and unremodeled houses built in the same year differ. To answer this, we isolated the range of construction years in our data that had a sufficient split between both types, 1950-1999, and split the data into remodeled and unremodeled houses.

We performed a simple linear regression on each dataset to isolate the base price and the effect of year built on the sale price. We found that a remodeled home built in 1950 sells for about $20,000 more than its unremodeled counterpart. In addition, the effect of age on the home values of both remodeled and unremodeled homes is roughly the same. 
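The mechanics of this comparison can be sketched with NumPy. The prices below are synthetic and deliberately built with a $20,000 gap and identical slopes, so the output shows how the two fits are compared rather than reproducing our result.

```python
import numpy as np

# Synthetic prices, built so the remodel gap is $20,000 by design.
years = np.arange(1950, 2000)
years_since_1950 = years - 1950  # makes the intercept the fitted 1950 price
unremodeled = 100_000 + 1_000 * years_since_1950
remodeled = 120_000 + 1_000 * years_since_1950

# np.polyfit with degree 1 returns (slope, intercept) of the least-squares line.
slope_u, base_u = np.polyfit(years_since_1950, unremodeled, 1)
slope_r, base_r = np.polyfit(years_since_1950, remodeled, 1)

gap_1950 = base_r - base_u
print(f"1950 price gap: ${gap_1950:,.0f}")
print(f"price change per year built: {slope_u:,.0f} vs {slope_r:,.0f}")
```

Comparing intercepts gives the price gap for a 1950 home, and comparing slopes shows whether age affects both groups equally.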

Data on Sale Comps

Automatic Sales Comps: Real estate agents spend much of their time gathering sales comps for the properties they are selling. We decided to automate this process by using K-Nearest Neighbors on our data set. An agent can input housing features such as gross living area, lot area, and year built, and get back a list of comparable properties within a specified distance, along with their sale prices.

We did not build this tool to predict the sale price of the inputted house, because we already had a robust, interpretable model that performed well. It can, however, save real estate agents time and give them insight into similar properties in the neighborhood.
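A comps lookup along these lines could be sketched with scikit-learn's `NearestNeighbors`. The helper `find_comps`, its parameters, the toy sales, and the choice to measure distance in standardized feature space are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Toy sold homes (illustrative values only).
sold = pd.DataFrame({
    "GrLivArea": [1500, 1550, 2400, 1480, 900],
    "LotArea": [9000, 9200, 14000, 8800, 6000],
    "YearBuilt": [1995, 1998, 2005, 1992, 1950],
    "SalePrice": [185000, 192000, 310000, 179000, 98000],
})
feature_cols = ["GrLivArea", "LotArea", "YearBuilt"]

# Standardize so square feet, lot size and year contribute comparably.
scaler = StandardScaler().fit(sold[feature_cols])
index = NearestNeighbors().fit(scaler.transform(sold[feature_cols]))

def find_comps(home, k=3, max_distance=1.0):
    """Return up to k sold homes within max_distance (standardized units)."""
    query = scaler.transform(pd.DataFrame([home])[feature_cols])
    dist, idx = index.kneighbors(query, n_neighbors=k)
    comps = sold.iloc[idx[0]].copy()
    comps["Distance"] = dist[0]
    return comps[comps["Distance"] <= max_distance]

comps = find_comps({"GrLivArea": 1500, "LotArea": 9000, "YearBuilt": 1995})
print(comps)
```

The sale price is left out of the distance calculation on purpose, so the comps reflect similar homes rather than similarly priced ones.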

Future Work:

Given the timeframe and resources we had for this assignment, there were certain elements that left us restricted. For example, machine learning models perform better the more high-quality data they have as inputs. With much more data, we would likely achieve higher performance from our models, and could even potentially use more complicated models.

A future path of investigation would be to source more data for our model training. We could use the value of the Case-Shiller HPI to adjust sales prices to a reference year of our choosing, allowing us to use more data while minimizing the error introduced by using home sales prices at different points in time.
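The adjustment amounts to scaling each sale price by the ratio of the index in the reference year to the index in the sale year. The index levels below are invented placeholders. Real values would come from the published Case-Shiller series, and the function name is hypothetical.

```python
# Hypothetical index levels by sale year (placeholders, not real HPI values).
HPI_BY_YEAR = {2006: 180.0, 2008: 160.0, 2010: 140.0}

def adjust_price(sale_price: float, sale_year: int, reference_year: int) -> float:
    """Restate a sale price in reference-year terms using the ratio of index levels."""
    return sale_price * HPI_BY_YEAR[reference_year] / HPI_BY_YEAR[sale_year]

adjusted = adjust_price(200_000, sale_year=2008, reference_year=2010)
print(adjusted)  # 175000.0
```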

Another limiting element was domain knowledge. With no prior knowledge of real estate markets, we were left vulnerable in certain aspects. For example, the feature with the highest importance in all our initial models was ‘Lnd_AcS’, a column that we had no knowledge of and that was not included in the reference data dictionary.

For this reason, we had to drop the feature entirely, but with more knowledge we might achieve better modeling results with it included. Interviewing real estate agents and home inspectors would increase our understanding of the data we have and likely help us in our feature engineering process.

To learn more about our machine learning project on Ames, Iowa Housing Prices, visit our Dash App in the link above to experience an interactive run-through of our research and findings.

 

About Authors

Gabrielle Klein

Recent graduate from the University of Chicago with a Master's in Applied and Computational Mathematics. Experience programming in Python, R, and Matlab. Passionate about all things math, looking forward to launching a career in data science.

Ryan Burakowski

Ryan Burakowski is a current NYC Data Science Academy fellow with experience in capital markets and a passion for working on difficult problems. He spent the last three years as a proprietary trader, traveling the world and living...

Aditya Jayasuri

Aditya is a recent Data Science graduate at NYC Data Science Academy with hopes of paving a new pathway in his career. A graduate from Drexel University with a B.S. in Entertainment & Arts Management previous experience includes...
