NYC Data Science Academy| Blog

Data-driven Predictions of Property Sale Value in Ames, Iowa

Rishi Goutam, James Goudreault and Srikar Pamidimukkala
Posted on Mar 9, 2022

The Ames housing dataset is famous as an open Kaggle competition and for its use in undergraduate statistics courses as an end-of-term regression project. Unlike the similar Boston housing data, it has a relatively large number of variables (81 versus 14) and observations (2,580 versus 506). In tackling this problem, we would have to go beyond a simple, automatic algorithm such as stepwise selection, to construct a final model.

In this project, we took on the persona of a data science team at a made-up local realty firm, Regression Realty, in early 2011. Our stakeholders were primarily our firm's listing and selling agents, and our goals were to:

  1. Provide insight into Ames' housing market conditions, given that we were just coming out of a major financial crisis.
  2. Surface advice they could give to their clients. For instance, tell a buyer a house is overpriced or a seller that they could make some money by remodeling a fireplace.
  3. Create predictive models that can be meaningfully used in our firm's mobile or web app. This means not using too many features (agents have to input data) or taking a long time to run a model.

You can find our code and presentation slides on GitHub.

The Ames dataset

Cleaning

The data comprises information collected from the Ames City Assessor's Office (rather than from traditional MLS data sources) during 2006-2010, with variables being nominal, ordinal, continuous, and discrete in nature. As a real-world dataset, it required extensive cleaning and feature engineering before we could garner insights and create models.

We dropped one duplicate observation and removed two outliers (where GrLivArea > 4,000 ft²) as per the recommendation of the dataset author. Another approach would have been to define outliers as properties whose sale prices were more than, say, four standard deviations from the mean. This would have dropped five points in our case.
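The two outlier rules described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the code from our notebooks; the function names and the sample rows are made up for the example.

```python
from statistics import mean, pstdev

def drop_large_homes(rows, max_area=4000):
    """Keep only rows whose GrLivArea is at most max_area square feet."""
    return [r for r in rows if r["GrLivArea"] <= max_area]

def drop_price_outliers(rows, n_std=4.0):
    """Keep rows whose SalePrice is within n_std standard deviations of the mean."""
    prices = [r["SalePrice"] for r in rows]
    mu, sigma = mean(prices), pstdev(prices)
    return [r for r in rows if abs(r["SalePrice"] - mu) <= n_std * sigma]

# Made-up sample rows, for illustration only
homes = [
    {"GrLivArea": 1500, "SalePrice": 180_000},
    {"GrLivArea": 2100, "SalePrice": 240_000},
    {"GrLivArea": 4500, "SalePrice": 185_000},  # larger than the 4,000 ft² cutoff
]
kept = drop_large_homes(homes)
```

The first rule depends only on a fixed cutoff, while the second depends on the sample itself, so the two can remove different properties.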

Imputation

We imputed missing values for features by taking the mean, median, or mode, as applicable, of the feature category (for implementation details, see clean.ipynb).

Sometimes, imputation was done by making reasonable assumptions. For instance, a missing GarageYrBlt can be imputed by the year the property was built. Finally, we imputed None (e.g., MasVnrType) or 0 (e.g., MasVnrArea) for values that might not exist or for which an educated guess could not be made.
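As a rough sketch of these imputation rules, assuming rows are plain dictionaries and missing values are stored as None: the GarageYrBlt, MasVnrType, and MasVnrArea rules follow the text above, while filling LotFrontage with the median of its Neighborhood is a hypothetical example of a per-category statistic, not necessarily the choice made in clean.ipynb.

```python
from statistics import median

def impute(rows):
    """Return copies of rows with missing values (None) filled in."""
    # Median LotFrontage per Neighborhood, from the observed values only
    # (hypothetical choice of feature and grouping, for illustration)
    observed = {}
    for r in rows:
        if r["LotFrontage"] is not None:
            observed.setdefault(r["Neighborhood"], []).append(r["LotFrontage"])
    hood_median = {hood: median(vals) for hood, vals in observed.items()}

    filled = []
    for r in rows:
        r = dict(r)  # do not modify the caller's data
        if r["LotFrontage"] is None:
            # Stays None if the neighborhood has no observed values at all
            r["LotFrontage"] = hood_median.get(r["Neighborhood"])
        if r["GarageYrBlt"] is None:
            r["GarageYrBlt"] = r["YearBuilt"]  # assume garage built with the house
        if r["MasVnrType"] is None:
            r["MasVnrType"] = "None"           # feature may not exist
        if r["MasVnrArea"] is None:
            r["MasVnrArea"] = 0
        filled.append(r)
    return filled
```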

Feature Engineering

In addition to the given dataset's features, we added two more that we thought might influence a property's SalePrice:

  1. The public school district for a property
  2. The interest rate for the month the property was sold, as determined by 10-year Treasury yields from the ^TNX index

Ames has five school districts, and we might expect property value to vary with the quality of the schools. By determining whether a property's latitude and longitude fall within a district's boundaries, we can determine its school district. First, we used geopy (see geo_locations.ipynb) to query OpenStreetMap's Nominatim geocoding software for the property coordinates. There are several Python and R tools that help solve the point-in-polygon problem; we used R's sf package (see districts.R).
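To show what the point-in-polygon step involves, here is a small ray-casting sketch in plain Python. It is not what we ran (we used R's sf package), and the function names are hypothetical. It also treats coordinates as flat x/y values, which is a simplification that is reasonable over an area the size of a city.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray casting: count how many polygon edges a ray to the east crosses.

    polygon is a list of (lon, lat) vertices in order; an odd number of
    crossings means the point is inside.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

def assign_district(lon, lat, districts):
    """Return the name of the first district containing the point, else None."""
    for name, polygon in districts.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None
```

In practice the district polygons would come from a boundary shapefile, and a geospatial library is the safer choice because it also handles holes, multi-part polygons, and points that lie exactly on a boundary.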

 

We also derived features we believed might be useful, such as whether a property is near a park, an arterial road, or a rail line; if it is a Planned Unit Development (PUD), has been renovated, has a pool, the number of floors, etc.

We also used feature engineering to:

  • Ordinalize some categorical features (*Qual, *Cond, Neighborhood, etc.)
  • Combine multiple features into a single feature (StreetAlley, total outdoor ft², etc.)
  • Collapse features into a smaller set of categories (MSSubClass, etc.)

See engineer.ipynb
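The first two kinds of transformation can be sketched as follows. The quality codes (Po, Fa, TA, Gd, Ex) are the ones used in the Ames data; the exact integer scale, the handling of missing codes, and the set of columns summed into TotalOutdoorSF are assumptions made for this illustration and may differ from engineer.ipynb.

```python
# Ames quality codes, from worst to best
QUALITY_ORDER = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

# Assumed set of outdoor area columns (illustrative, not the exact list we used)
OUTDOOR_COLUMNS = ["WoodDeckSF", "OpenPorchSF", "EnclosedPorch", "ScreenPorch"]

def ordinalize_quality(code, missing=0):
    """Map a quality code to an integer; unknown or missing codes map to `missing`."""
    return QUALITY_ORDER.get(code, missing)

def total_outdoor_sf(row):
    """Combine several outdoor area columns into a single total."""
    return sum(row.get(col, 0) for col in OUTDOOR_COLUMNS)

home = {"KitchenQual": "Gd", "WoodDeckSF": 120, "OpenPorchSF": 40,
        "EnclosedPorch": 0, "ScreenPorch": 0}
home["KitchenQualOrd"] = ordinalize_quality(home["KitchenQual"])
home["TotalOutdoorSF"] = total_outdoor_sf(home)
```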

Notable Findings from Data

Comprehensive exploratory data analysis produced a lot of insights and informed feature selection and engineering. Here follows the distilled version.

First, we see Ames properties plotted against school districts. We also plot the schools, a hospital, and Iowa State University.

We will later see whether school districts matter (and how much) for a city like Ames. Compared to features such as Neighborhood, we will see that district is not as important.

Still, all that mapping was useful as an EDA tool and to check our assumptions.

Area features display increasing variance as we see below. We see the plots of SalePrice (y) versus GrLivArea (x) as well as the residuals.

We went with a log-linear transformation in our models, as it showed a lot of improvement in reducing the "<"-shaped spread in property sale prices. Log-log shows marginal further improvement, but is also harder to explain to a layperson. When applying a transformation, we must remember to reverse it after our model makes a prediction.

Size matters... in most cases. The price of a single-family home is strongly correlated with the number of rooms in the home. Not so with other home types.
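A minimal sketch of the log-linear idea with a single predictor: fit a straight line to log(SalePrice) against GrLivArea, then exponentiate the prediction to get back to dollars. The natural logarithm and the one-variable setup are assumptions made to keep the example short; our actual models used many features.

```python
import math

def fit_log_linear(areas, prices):
    """Least-squares fit of log(price) = intercept + slope * area."""
    logs = [math.log(p) for p in prices]
    n = len(areas)
    mean_x = sum(areas) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in areas)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(areas, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict_price(area, intercept, slope):
    """Predict on the log scale, then reverse the transformation with exp()."""
    return math.exp(intercept + slope * area)
```

Forgetting the final exp() is an easy mistake to make: the model would then report values around 12 instead of prices in the hundreds of thousands of dollars.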

Neighborhood is very strongly predictive of price. As they say, location, location, location!

What this means is that we can treat Neighborhood as an ordinal feature in our models.
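One way to turn Neighborhood into an ordinal feature is to rank neighborhoods by a summary of their sale prices. The sketch below ranks by median SalePrice, which is an assumption for illustration; the prices in the usage example are made up.

```python
from statistics import median

def neighborhood_ranks(rows):
    """Rank neighborhoods from cheapest (1) to most expensive by median SalePrice."""
    prices = {}
    for r in rows:
        prices.setdefault(r["Neighborhood"], []).append(r["SalePrice"])
    ordered = sorted(prices, key=lambda hood: median(prices[hood]))
    return {hood: rank for rank, hood in enumerate(ordered, start=1)}

# Made-up prices, for illustration only
sales = [
    {"Neighborhood": "OldTown", "SalePrice": 120_000},
    {"Neighborhood": "NridgHt", "SalePrice": 310_000},
    {"Neighborhood": "NAmes", "SalePrice": 145_000},
    {"Neighborhood": "OldTown", "SalePrice": 130_000},
]
ranks = neighborhood_ranks(sales)
```

To avoid leaking information from the test set, the ranks should be computed on the training data only and then applied to new properties.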

 

Market Data Analysis

In addition to exploring the dataset itself, we wanted to focus on the dynamics of the Ames housing market. We came into this with a lot of assumptions: the 2008 crash must have caused house prices to fall. Seasonality patterns in the housing market must have changed too. Surely, single-family homes would be more resilient than multi-dwelling units to the depressed market conditions?

We used a SARIMA model to look at average SalePrice for the dataset. The properties were split up by quintile to see if there was a difference between expensive and cheap houses.

We can see that from 2006 to 2010, house prices in each quintile stayed relatively constant. This is counterintuitive: the housing market collapsed in 2008, yet we do not see that reflected in any of the quintiles. Given that we lived through this recession and knew house prices were falling where we grew up, we were perplexed at what was going on in Ames. We conducted a Ljung-Box test (see code) to see if we could reject H0, the null hypothesis that the data are independently distributed. That is, the correlations in the population from which the sample is taken are zero, so that any observed correlations in the data result from the randomness of the sampling process.

# Ljung-Box test: we test to reject H_0, looking for p < .05
test1 <- Box.test(pur.ts, type = "Ljung-Box", lag = log(nrow(pur.ts))) # p = .99
test2 <- Box.test(o.ts,   type = "Ljung-Box", lag = log(nrow(o.ts)))   # p = .14
test3 <- Box.test(t.ts,   type = "Ljung-Box", lag = log(nrow(t.ts)))   # p = .34
test4 <- Box.test(th.ts,  type = "Ljung-Box", lag = log(nrow(th.ts)))  # p = .94
test5 <- Box.test(fo.ts,  type = "Ljung-Box", lag = log(nrow(fo.ts)))  # p = .33
test6 <- Box.test(fi.ts,  type = "Ljung-Box", lag = log(nrow(fi.ts)))  # p = .31

The Ljung-Box test for each quintile reports a p-value above the 0.05 threshold, so we cannot reject H0: the values are consistent with white noise around an average value. This is a remarkably stationary dataset.
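For readers who want to see what the test computes, here is a from-scratch sketch of the Ljung-Box statistic, Q = n(n+2) Σ ρ_k² / (n − k) summed over lags k = 1..h, compared against a chi-squared distribution with h degrees of freedom. It uses only the Python standard library and assumes an integer maximum lag; it is an illustration of the statistic, not a replacement for the R code above.

```python
import math

def autocorrelation(series, lag):
    """Sample autocorrelation of the series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    numer = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return numer / denom

def chi2_sf(x, df):
    """P(X > x) for a chi-squared variable with integer df >= 1."""
    half = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-half)
    else:
        a, q = 0.5, math.erfc(math.sqrt(half))
    # Recurrence for the regularized upper incomplete gamma function
    while a < df / 2.0:
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q

def ljung_box(series, max_lag):
    """Return (Q statistic, p-value) for lags 1..max_lag."""
    n = len(series)
    q = n * (n + 2) * sum(
        autocorrelation(series, k) ** 2 / (n - k) for k in range(1, max_lag + 1)
    )
    return q, chi2_sf(q, max_lag)

# A strongly autocorrelated (alternating) series: H0 should be rejected
q_stat, p_value = ljung_box([1.0, -1.0] * 10, max_lag=1)
```

A small p-value, as for the alternating series here, is evidence of autocorrelation. The large p-values we observed for the Ames price series are what we would expect from white noise.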

However, looking at the U.S. Federal Housing Finance Agency data from 2006-2010, we can see that in Iowa, prices stayed remarkably constant even through the housing market crisis. This accounts for our unintuitive results: housing in Iowa itself behaved in a counterintuitive way.

Looking at the average number of houses sold at each date, we can see that overall sales oscillate with a regular seasonality through the year.

However, when we split the housing stock into an upper and a lower half, we can see that, before 2008, the more expensive houses were sold more often. After the crash, the cheaper houses were sold more often. We start to see the original pattern return around 2010, the year in which the housing market began to recover significantly.

This is important for Regression Realty's realtors, as it tells them that they should now switch back to focusing on expensive houses and that the market might be recovering.

Finally, we use a SARIMA model to predict property prices given past data with an 85% confidence interval. SARIMA (Seasonal Autoregressive Integrated Moving Average) is an extension of ARIMA that supports direct modeling of the seasonal component of a time series. It is a statistical model that uses seasonality, autocorrelation, differencing, and moving averages to predict future data given only past data.

Our model uses the past four years of data to predict the following year with an RMSE of $14,167. Remember that this is a prediction for the housing market as a whole, not for a particular house. We could adapt this model to train on a particular Neighborhood or quintile to get more targeted average predictions for a realtor.

Predictive Data Models

We used several models in this project to predict SalePrice. In order, they were

  1. Linear Regression
  2. Elastic-Net
  3. Random Forest
  4. Gradient Boosting
  5. SVR
  6. Neural Networks

And separately for the market analysis,

  1. Time Series

Data Model Scoring

Although we applied multiple complicated models to the dataset, they did not drastically outperform linear regression in predictive power. This dataset is very well suited to linear regression, which is unsurprising as it was designed to be an end-of-semester regression project.

Some models take a long time to code for a data science team and others take a long time to train, so there are trade-offs to be made. For our case, we wanted a model that served our three goals.

Below is a comparison of our models.

Model               R² train   R² test   RMSE
MLR                 0.938      0.911     0.050
Elastic-Net         0.933      0.922     0.046
Random Forest       0.986      0.915     0.047
Gradient Boosting   0.994      0.927     0.043
SVR                 0.926      0.922     0.045
Neural Network      0.937      0.895     0.032

Table 1. Predicting SalePrice for a specific property
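For reference, the two metrics in the table can be computed as in the short sketch below, written with the standard library only. The function names are ours for this illustration; in a real pipeline a library implementation would normally be used.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot
```

A large gap between R² on the training data and R² on the test data, as with Random Forest and Gradient Boosting in Table 1, is the usual sign of overfitting.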

In addition, we used a time series model to predict average SalePrice for the entire Ames housing market. This would be more robust with more observations, such as in a larger city or if we had more complete data for Ames.

Model    R² train   R² test   RMSE
SARIMA   0.528      0.077     $14,167

Table 2. Predicting mean SalePrice for the Ames housing market

Statistical Validation of Data

Applying linear regression blindly is folly. For the Ames dataset, the four assumptions of linear regression are

  1. Linearity: there exists a linear relationship between SalePrice and the predictors (features)
  2. Independence: the residuals are independent
  3. Homoscedasticity: the residuals have constant variance
  4. Normality: the residuals are normally distributed

The following shows the ways we checked the assumptions of multiple linear regression. We paid special attention not to inflate VIF by introducing multicollinearity when adding new features during feature selection.

  • First, from the correlation matrix (the image is illustrative; we did not use these exact features in our linear model), we see that there is low multicollinearity between features.
  • The residuals are normally distributed.

  • We see that the residuals are linear.

  • Visibly, the residuals are linear.

  • The Variance Inflation Factor (VIF) quantifies the severity of multicollinearity in regression analysis. It provides an estimate of how much the variance of a regression coefficient is increased because of collinearity. The table below shows a low generalized VIF for our predictors (we use the vif function from R's car package).
vif(model)                VIF       df   VIF^(1/(2·df))
GrLivArea                 4.4       1    2.10
TotalFinBsmtSF            1.57      1    1.25
TotalOutdoorSF            1.39      1    1.18
LotArea                   1.48      1    1.22
BedroomAbvGr              2.33      1    1.53
MSSubClass                98.42     14   1.18
MSZoning                  37.02     6    1.35
OverallQual               3.42      1    1.85
OverallCond               1.48      1    1.22
Neighborhood              2261.42   27   1.15
KitchenQual               2.2       1    1.48
SaleCondition             1.51      5    1.04
YrSold                    1.21      4    1.02
CentralAir                1.51      1    1.23
Fireplaces                1.72      1    1.31
HasPool                   1.06      1    1.03
IsNearNegativeCondition   1.12      1    1.06
LandContour               1.87      3    1.11

Table 3. Generalized VIF with values mostly under 2
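The last column rescales the generalized VIF by its degrees of freedom so that categorical predictors with many levels (such as Neighborhood) can be compared with single-coefficient predictors. A small sketch of that arithmetic, checked against rows of Table 3 (the function names are ours for this illustration):

```python
def vif_from_r_squared(r_squared):
    """Classic VIF for one predictor, given the R² of regressing it on the others."""
    return 1.0 / (1.0 - r_squared)

def adjusted_gvif(gvif, df):
    """Generalized VIF rescaled by degrees of freedom: GVIF^(1 / (2 * df))."""
    return gvif ** (1.0 / (2.0 * df))

# Values taken from Table 3
neighborhood = round(adjusted_gvif(2261.42, 27), 2)
gr_liv_area = round(adjusted_gvif(4.4, 1), 2)
ms_sub_class = round(adjusted_gvif(98.42, 14), 2)
```

This is why Neighborhood's raw value of 2261.42 is not alarming: spread over its 27 degrees of freedom, the adjusted value is small.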

  • By looking at the predictor slopes and their confidence intervals, we see that some encompass 0. HasPool was thus removed as a predictor.

Realtor Recommendations Based on Data

The simplest model, multiple linear regression, has the benefit of being easily explainable. We can see whether changing a feature by one unit has a positive or negative effect on SalePrice. This is useful for a realtor, who might recommend that a client install central air conditioning should they find a good deal on installation.

Feature                  Expected Cost   Sale Price Increase
GrLivArea                NA              $57/ft²
BsmtLivArea              NA              $22/ft²
Finished outdoor space   NA              $11/ft²
Installing a fireplace   $2-5K           $4,380
Installing central air   $3-15K          $23,599
Near railroad            NA              -$7,675

Table 4. Effect of 1-unit change on mean SalePrice
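An agent can turn Table 4 into a quick cost-benefit range by subtracting the expected cost from the estimated price increase. The sketch below does that arithmetic with the figures from the table; the function name is hypothetical.

```python
def net_gain_range(price_increase, cost_low, cost_high):
    """Return (worst case, best case) net gain for a renovation, in dollars."""
    return price_increase - cost_high, price_increase - cost_low

# Figures from Table 4
fireplace = net_gain_range(4_380, 2_000, 5_000)
central_air = net_gain_range(23_599, 3_000, 15_000)
```

On these numbers, a fireplace only pays off if installation lands toward the cheaper end of the cost range, while central air comes out ahead even at the top of its range. These are average effects from the model, not guarantees for any particular house.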

Conclusion

We achieved our aim of providing value for the (fictitious) realtors in our organization through our data analysis and predictive modeling. The insights and predictions can be surfaced through a Regression Realty app so that an agent can see recommendations for renovating or setting a sale price on a particular house.

Our fictitious firm would provide us access to MLS listings, which would be incorporated into the app. (Unfortunately, MLS data is not publicly available, so we could not include it in our analysis.) This would cut down on the number of features a realtor would have to manually enter into our app, as the data would be auto-populated wherever possible from MLS.

Future Development

If we had more time, we would have liked to

  • Use log-log transformations for our models
  • Create a mobile app for our fake realtors at Regression Realty
  • Incorporate household income and MLS data into our models
  • Investigate the impact of distance from Iowa State University or other locations. This is not promising, as we can see from the maps that YearBuilt would trump distance from a major location.
  • Add interaction and polynomial features. For instance, some Quality features appear to be quadratic.

Engineering Process Data

We tackled this project together with team members taking on different models. The data pipeline is shown below. Note that we would have liked to have a train-test split that was the same across all models, but did not have time to implement that. As a result, the model results are not directly comparable.

[Figure: data pipeline]

References

Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project
http://jse.amstat.org/v19n3/decock.pdf

Iowa House Prices over Time

Treasury Yields (TNX)

Ames City Neighborhoods

Ames School Districts (boundaries, shapefile)

Fireplace Installation Cost

Central Air Installation Cost

Lillicrap, T. P. et al., "Backpropagation and the brain". Nat. Rev. Neurosci. 21, 335-346 (2020)

About Authors

Rishi Goutam

Data scientist and software engineer. Previously, I was a software developer at Microsoft (Azure) and Amazon (Kindle). You can find me at www.goutam.io

James Goudreault

Data Scientist with nine years prior experience in manufacturing and operations. I'm passionate about using data to solve complex problems with demonstrated ability to plan, communicate, and achieve sustainable results. I love finding clever ways to approach situations...

Srikar Pamidimukkala

Mathematically fluent data scientist with 5 years of technical and engineering communication skills across a wide range of audiences and expertise levels. B.S. in Materials Science and Engineering from Georgia Institute of Technology. Currently studying in an M.S....
