Predicting Ames Housing Sale Prices Using Machine Learning Models

Posted on Dec 6, 2022

Introduction

Ames is the 9th largest city in Iowa, with 66,427 residents as of the 2020 census. Iowa State University accounts for approximately half of the city's population and is its largest employer.

Putting myself in the shoes of a data scientist at an online real estate database company that provides house price estimates (like Zillow's home price estimates), I performed this analysis with two goals:

  • Investigate possible major factors and features that influence house sale prices through exploration and analysis of Ames housing sales data
  • Build predictive machine learning models to make accurate predictions

Data

The Ames house price dataset used for the analysis has 2,580 records of house sales from 2006 to 2010. Its 81 attributes cover sales prices and a wide range of characteristics, including exterior and internal features, conditions, quality, etc. Additionally, the real estate market data provides the longitude and latitude of each house which supported the analysis of neighborhoods.

Exploratory Data Analysis

The analysis started with EDA to understand housing sales and their relationships with different house features. Of the 81 attributes, 79 describe characteristics of the houses. To analyze features one by one and detect connections between similar features, I grouped them into the following categories.

  • Sale
  • Neighborhood
  • Age
  • Lot
  • Other exterior
  • Quality & Condition
  • Size
  • No. of rooms
  • Basement
  • Others
  • Utility
  • Garage

For each feature group, I checked the variable distributions and visualized their relationships with house sale prices. Specifically, I drew scatter plots for numerical features and, for categorical features, box plots of sale prices by category. Variables were selected for modeling based on the following rules:

  • Keep attributes that have clear relationships with sale prices
  • Drop attributes that showed no relationship with prices, both technically and intuitively
  • Drop attributes correlated with other independent variables. Only keep the major one among the correlated independent variables for modeling to avoid multicollinearity
  • Keep attributes with ambiguous relationships with sale prices for further technical analysis
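The scatter plots and box plots described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the analysis: the rows are made up, and the column names (GrLivArea, KitchenQual, SalePrice) are assumed to follow the common Ames naming.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows for illustration only; column names are assumptions.
df = pd.DataFrame({
    "GrLivArea": [850, 1200, 1500, 1800, 2400, 3000],
    "KitchenQual": ["TA", "TA", "Gd", "Gd", "Ex", "Ex"],
    "SalePrice": [95000, 140000, 165000, 205000, 310000, 420000],
})

fig, (ax_num, ax_cat) = plt.subplots(1, 2, figsize=(10, 4))

# Numerical feature: scatter plot against sale price.
ax_num.scatter(df["GrLivArea"], df["SalePrice"])
ax_num.set_xlabel("GrLivArea")
ax_num.set_ylabel("SalePrice")

# Categorical feature: one box of sale prices per category.
categories = ["TA", "Gd", "Ex"]
groups = [
    df.loc[df["KitchenQual"] == c, "SalePrice"].to_numpy() for c in categories
]
ax_cat.boxplot(groups)
ax_cat.set_xticks(range(1, len(categories) + 1))
ax_cat.set_xticklabels(categories)
ax_cat.set_xlabel("KitchenQual")
ax_cat.set_ylabel("SalePrice")

fig.tight_layout()
```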

Along with the above analysis, I explored interesting patterns in sales prices and their relationships with other features. Major highlights are as follows,

1) Sale prices

Sale prices are right-skewed with a long tail from $200k to $755k. The middle 50% of prices range from $130k to $210k, and the median is about $160k.
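As a rough sketch of how such summary figures can be checked, the snippet below computes quartiles and a skewness coefficient with the standard library. The price list is made up for illustration and is not the Ames data; a positive skewness value indicates a right-skewed distribution.

```python
import statistics

# Made-up prices for illustration only (not the Ames records).
prices = [95_000, 130_000, 145_000, 160_000, 175_000, 210_000, 320_000, 755_000]

# Three cut points that split the data into four equal-sized groups.
q1, median, q3 = statistics.quantiles(prices, n=4)


def skewness(values):
    """Population skewness: the mean of the cubed standardized deviations."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


price_skew = skewness(prices)  # positive for a long right tail
```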

Regarding the sale year, there were no major differences in price or sales count over the years, even during the financial crisis in 2008-2009.

The seasonality in sales activity is obvious. The housing market is more active in the summer (June and July), but this is not associated with higher prices.

2) Neighborhood

Among the 19 neighborhoods in Ames, there are 2 major clusters: one west of the university and one on the north side of Ames. North Ames (the pink spots on the graph) is the neighborhood with the most sales over the years.

I used violin plots to visualize sale prices and house ages by neighborhood, sorted by distance to downtown Ames. The result shows that neighborhoods closer to downtown generally have older houses.

Northridge and Northridge Heights, the two newest neighborhoods (average age of 2 years), also have the highest average prices. On the other hand, IDOTRR (railroad) and Southwest of ISU, the two oldest neighborhoods, have mid-to-low-end sale prices on average.

3) Size

Two types of features were found to be highly correlated with prices: size-related features (including the number of rooms) and quality and condition features.

The living area above ground, the total number of rooms above ground (excluding bathrooms), and the area of the garage are all correlated with sale prices.

4) Quality & condition

As for the quality features, higher quality is associated with higher prices, as intuition suggests. For the condition levels, however, the relationship is not significant: the ratings are heavily concentrated at the middle level (5/10), which distorts the relationship with sale prices.

5) Miscellaneous

  • Pools - Excellent pool quality is associated with significantly higher prices. For modeling, this pattern is better handled either by transforming the variable into a Y/N feature or by using tree-based models.

  • Number of fireplaces - Might be correlated with house size, as larger houses may have more fireplaces.

  • Utilities and Exterior - Houses with public utilities generally have higher prices than those with septic tanks, and sale prices vary among exterior types.
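The Y/N transformation mentioned for pools could look like the sketch below. It assumes a pool quality column named PoolQC in which a missing value means the house has no pool; both the column name and the rows are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows; "PoolQC" and its missing-means-no-pool convention
# are assumptions about the data layout.
df = pd.DataFrame({"PoolQC": [None, "Ex", None, "Gd", None]})

# Collapse the sparse quality levels into a single yes/no indicator.
df["HasPool"] = df["PoolQC"].notna().astype(int)
```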

Feature Engineering

New features were created or condensed to assist the analysis.

  • Age of the house: year the house was sold minus the year it was built or remodeled
  • Age of the garage: same approach
  • Number of bathrooms: the sum of full bathrooms and half bathrooms
  • Number of bathrooms in the basement: same approach
  • Low-quality area above ground / total area: this “low-quality ratio” is included because it negatively affects the house’s sale price
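A minimal pandas sketch of the features listed above follows. The column names follow the common Ames naming and are assumptions, as are two interpretive choices: house age uses the more recent of the build and remodel years, and "total area" is taken to be the above-ground living area. The two rows are made up.

```python
import pandas as pd

# Made-up rows; column names are assumed, not confirmed by the post.
df = pd.DataFrame({
    "YrSold": [2008, 2010],
    "YearBuilt": [1960, 2005],
    "YearRemodAdd": [1995, 2005],
    "GarageYrBlt": [1960, 2006],
    "FullBath": [1, 2],
    "HalfBath": [1, 0],
    "BsmtFullBath": [0, 1],
    "BsmtHalfBath": [0, 1],
    "LowQualFinSF": [80, 0],
    "GrLivArea": [1600, 2000],
})

# Assumption: age is measured from the more recent of build and remodel year.
last_update = df[["YearBuilt", "YearRemodAdd"]].max(axis=1)
df["HouseAge"] = df["YrSold"] - last_update
df["GarageAge"] = df["YrSold"] - df["GarageYrBlt"]

# Bathroom counts are plain sums of full and half bathrooms.
df["Bathrooms"] = df["FullBath"] + df["HalfBath"]
df["BsmtBathrooms"] = df["BsmtFullBath"] + df["BsmtHalfBath"]

# Assumption: "total area" means above-ground living area.
df["LowQualRatio"] = df["LowQualFinSF"] / df["GrLivArea"]
```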

With the features selected and created up to this step, I first label-encoded the ordinal variables as numbers. Then, for all numerical features, two analyses were conducted to further refine the selection:

  • correlation matrix among all numerical features to detect and resolve multicollinearity

  • univariate analysis with house sale prices to refine features selection
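The encoding and the two checks above can be sketched as follows. The quality scale (Po to Ex), the column names, the 0.8 threshold, and the rows are all assumptions for illustration; the post does not state which cutoff was used.

```python
import pandas as pd

# Made-up rows; column names and the quality scale are assumptions.
df = pd.DataFrame({
    "KitchenQual": ["TA", "Gd", "Ex", "Fa", "Gd", "TA"],
    "GarageCars": [1, 2, 3, 1, 2, 2],
    "GarageArea": [280, 480, 820, 250, 520, 440],
    "SalePrice": [120000, 190000, 350000, 98000, 215000, 160000],
})

# Label-encode an ordinal variable so that order is preserved.
quality_order = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
df["KitchenQual"] = df["KitchenQual"].map(quality_order)

# Correlation matrix among the predictors to flag multicollinearity.
corr = df.drop(columns="SalePrice").corr().abs()
threshold = 0.8  # hypothetical cutoff
correlated_pairs = [
    (a, b, round(corr.loc[a, b], 3))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]

# Univariate view: correlation of each predictor with the sale price.
price_corr = (
    df.corr()["SalePrice"].drop("SalePrice").sort_values(ascending=False)
)
```

For each flagged pair, one of the two features would be kept, for example the one with the stronger correlation with sale price.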

Through this selection process, 58 numerical and categorical features were kept for modeling, prior to dummification.
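Dummification of the remaining categorical features might look like the sketch below, using pandas. The rows are made up, and dropping the first level of each category (to avoid redundant columns in linear models) is a choice made for this illustration rather than something the post specifies.

```python
import pandas as pd

# Made-up rows; neighborhood codes are used only as example categories.
df = pd.DataFrame({
    "Neighborhood": ["NAmes", "NridgHt", "IDOTRR", "NAmes"],
    "GrLivArea": [1200, 2400, 900, 1500],
})

# One indicator column per category level, minus the first level.
encoded = pd.get_dummies(df, columns=["Neighborhood"], drop_first=True)
```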

Modeling

The data was split 70%/30% into training and test sets. Seven models were fitted, covering linear, non-linear, and tree-based approaches (MLR, ridge, lasso, elastic net, SVR, random forest, and XGBoost). The models were tuned with grid search, and the results are as follows:

  • XGBoost is the best model, with the best performance on both the training and test sets
  • From a prediction perspective, tree-based models generally perform better than linear models
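The split-and-tune workflow described above can be sketched with scikit-learn as shown below. This is a simplified illustration, not the original setup: it uses synthetic data, only two of the seven models, and small hypothetical grids. Because the synthetic data is linear by construction, the scores it produces say nothing about how the models rank on the Ames data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the prepared feature matrix and sale prices.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# 70%/30% split for training and testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Hypothetical, deliberately small hyperparameter grids.
candidates = {
    "ridge": (Ridge(), {"alpha": [0.1, 1.0, 10.0]}),
    "random_forest": (
        RandomForestRegressor(random_state=0),
        {"n_estimators": [50], "max_depth": [3, None]},
    ),
}

results = {}
for name, (model, grid) in candidates.items():
    # Cross-validated grid search on the training set only.
    search = GridSearchCV(model, grid, cv=3, scoring="r2")
    search.fit(X_train, y_train)
    results[name] = {
        "best_params": search.best_params_,
        "train_r2": search.score(X_train, y_train),
        "test_r2": search.score(X_test, y_test),
    }
```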

Further Analysis

To better understand the models, I checked feature importance for the Lasso and XGBoost models, even though importances from predictive models may not be intuitive. Lasso is a good modeling choice for feature selection, and XGBoost is the best predictor.

Although some important features cannot be explained by common sense, it is clear that overall quality, garage capacity, above-ground living area, number of fireplaces, etc. are the major price influencers in predictive models as well, which is also consistent with EDA conclusions.
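One way to read feature importance from a Lasso model is through the absolute size of its coefficients on standardized features, as in the sketch below. The data is synthetic, the feature names are hypothetical labels attached to random columns, and the alpha value is arbitrary; features whose coefficients shrink to zero are effectively dropped by the model.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical labels on synthetic columns; "Noise" has no effect on y.
feature_names = ["OverallQual", "GrLivArea", "GarageCars", "Noise"]
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Standardize so coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=0.1).fit(X_scaled, y)

# Rank features by absolute coefficient, largest first.
importance = sorted(
    zip(feature_names, np.abs(lasso.coef_)),
    key=lambda pair: pair[1],
    reverse=True,
)
```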

As for next steps, the analysis could dig deeper into feature selection and feature engineering to generate better features for modeling. Additionally, model tuning is time-consuming; with more time, the analysis could be expanded to a wider range of hyperparameter values to further improve predictive performance.
