EDA and machine learning: Ames housing price prediction project
Introduction
Buying or investing in real estate is one of the biggest decisions people make. To be confident that we're making a good purchase, we need to know whether a house is priced fairly or even underpriced.
For this project, I used a Kaggle dataset to predict housing sale prices. The dataset contains 2580 records with 79 attributes covering sales from 2006 to 2010, with detailed information about each house and its sale price. In my analysis, I predicted the price of Ames homes based on features that correlate with sale price, including OverallQual, GrLivArea, GarageCars, GarageArea, TotalBsmtSF, 1stFlrSF, YearBuilt, FullBath, etc.
Project description
This project aims to analyze the Ames housing market and predict house sale prices, applying data science and machine learning techniques to a real-world problem.
Source of the data: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview
I was inspired to work on this project because of its relevance to almost everyone who may consider investing in real estate or just purchasing a home. While this home pricing model is linked to one geographic location, it provides insight that is relevant to home prices in general.
Target audience
The target audience is home buyers and real estate investors who need to evaluate home prices in Ames, IA.
Dataset
We are using a Kaggle dataset with 2580 records and 79 attributes. It is a sample dataset of the housing market in Ames, Iowa, and contains the corresponding features for each home, like PID, GrLivArea, SalePrice, MSSubClass, MSZoning, LotFrontage, LotArea, Street, Alley, LotShape, LandContour, Utilities, LotConfig, LandSlope, Neighborhood, Condition1, Condition2, BldgType, HouseStyle, OverallQual, OverallCond, YearBuilt, YearRemodAdd, RoofStyle, RoofMatl, Exterior1st, Exterior2nd, MasVnrType, MasVnrArea, ExterQual, etc.
You can download this dataset from Kaggle here: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview
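As a starting point, here is a minimal sketch of loading the data with pandas. The file name `AmesHousing.csv` is an assumption; adjust it to match the file downloaded from Kaggle.

```python
import pandas as pd

# Load the Ames housing data (file name is an assumption; use the name of the downloaded file)
ames = pd.read_csv("AmesHousing.csv")

# Quick look at the shape and at the columns with the most missing values
print(ames.shape)
print(ames.isna().sum().sort_values(ascending=False).head(10))
```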
Research question
My goal was to build a model using a Kaggle dataset collected from Ames, Iowa that can help both realtors and prospective homeowners accurately price a house based on its available features.
Steps completed:
- Conducted preprocessing, cleaning, and feature engineering on the dataset
- Performed EDA of the Ames Housing dataset using Python
- Developed house sale price prediction models (Linear Regression, KNN, and Decision Tree) using Python
Data preprocessing and exploratory data analysis
The dataset contains missing values for 27 variables.
I cleaned and preprocessed the dataset, including removing duplicate rows, examining rows and columns with missing values, imputing some of those missing values, and engineering a few new variables. For example, I removed variables such as Alley, PoolQC, Fence, and MiscFeature with over 80% missing values. Also, I deleted all data with MSZoning C (commercial, agriculture, and industrial) as these are not residential units. I deleted the variable Condition2 because 99% of the values are Norm. I created a few new variables: Age, Slope_Gentle, RR_prox, and BsmtLivArea.
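A minimal sketch of these cleaning and feature engineering steps, assuming the dataframe `ames` loaded above. The exact category codes (for example the MSZoning and Condition1 labels) and the definitions of the engineered variables are assumptions based on the variable names given.

```python
# Remove duplicate rows
ames = ames.drop_duplicates()

# Drop variables with over 80% missing values
ames = ames.drop(columns=["Alley", "PoolQC", "Fence", "MiscFeature"])

# Keep residential zoning only (drop commercial, agricultural, and industrial units)
ames = ames[~ames["MSZoning"].isin(["C (all)", "A (agr)", "I (all)"])]

# Drop Condition2: roughly 99% of its values are "Norm"
ames = ames.drop(columns=["Condition2"])

# Engineered variables (definitions are assumptions based on the names)
ames["Age"] = ames["YrSold"] - ames["YearBuilt"]
ames["Slope_Gentle"] = (ames["LandSlope"] == "Gtl").astype(int)
ames["RR_prox"] = ames["Condition1"].isin(["RRNn", "RRAn", "RRNe", "RRAe"]).astype(int)
ames["BsmtLivArea"] = ames["TotalBsmtSF"] - ames["BsmtUnfSF"]
```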
From the correlation matrix of the Ames dataset variables (see the matrix below), we found the variables most correlated with our target variable SalePrice: OverallQual, GrLivArea, GarageCars, GarageArea, TotalBsmtSF, 1stFlrSF, YearBuilt, and FullBath.
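As a sketch, the correlation of the numeric features with SalePrice can be computed as follows (using the cleaned `ames` dataframe):

```python
# Correlation of numeric features with SalePrice, strongest first
corr = ames.corr(numeric_only=True)["SalePrice"].drop("SalePrice")
print(corr.abs().sort_values(ascending=False).head(10))
```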
The plot below shows that the most popular neighborhoods in the city of Ames are NAmes (North Ames), Old Town, and CollgCr (College Creek).
From the chart below, we can see that the houses most commonly sold in the city of Ames were single-family homes and townhomes.
Analyzing the conditions of sold houses reveals that most sale conditions were Normal (93.57%).
One-story and two-story homes were in the most demand, at 50% and 30% respectively (see the chart below).
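A minimal sketch of the count plots behind these observations, using seaborn (the plotting library is an assumption; the column names are from the dataset):

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(14, 4))

# Sales by neighborhood
sns.countplot(data=ames, x="Neighborhood",
              order=ames["Neighborhood"].value_counts().index, ax=axes[0])
axes[0].tick_params(axis="x", rotation=90)
axes[0].set_title("Sales by neighborhood")

# Sales by house style (1-story, 2-story, ...)
sns.countplot(data=ames, x="HouseStyle",
              order=ames["HouseStyle"].value_counts().index, ax=axes[1])
axes[1].set_title("Sales by house style")

plt.tight_layout()
plt.show()
```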
The sale price data was fairly right-skewed, with a few outliers at higher sale prices. After a log transformation of the sale price, the data became more normally distributed, which improved the regression models.
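A sketch of the log transformation of the target (`np.log1p` is an assumed choice; plain `np.log` works equally well since all prices are positive):

```python
import numpy as np

# Log-transform the target to reduce the right skew
ames["LogSalePrice"] = np.log1p(ames["SalePrice"])

print("Skewness before:", round(ames["SalePrice"].skew(), 2))
print("Skewness after: ", round(ames["LogSalePrice"].skew(), 2))
```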
In preparation for fitting a machine learning model, I removed outliers. Also, to perform machine learning on the dataset, I created dummy variables for the categorical features, like MSSubClass, MSZoning, Street, LotShape, LandContour, LotConfig, Neighborhood, Condition1, BldgType, HouseStyle, RoofStyle, Exterior1st, Exterior2nd, etc.
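A sketch of the dummy-encoding step, assuming all remaining categorical (object-typed) columns are encoded with `pd.get_dummies`:

```python
# One-hot encode the remaining categorical variables;
# drop_first=True avoids perfectly collinear dummy columns
categorical_cols = ames.select_dtypes(include="object").columns
ames_encoded = pd.get_dummies(ames, columns=categorical_cols, drop_first=True)
print(ames_encoded.shape)
```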
Modeling
After completing the data preprocessing, exploratory data analysis, and feature engineering, I built a few machine learning models, selected for their likelihood of achieving the highest accuracy: Linear Regression, KNN, and Decision Tree. I split the dataset into training and test sets, holding out 30% of the data for testing, and trained each model on the training split. I used grid search and cross-validation to find optimal parameters for the tree-based model.
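A minimal sketch of this modeling setup with scikit-learn. The exact feature set, hyperparameter grid, and random seeds are assumptions, and missing values are assumed to have been imputed during preprocessing.

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Features and log-transformed target (identifiers and the raw target are dropped;
# PID is dropped here on the assumption it is still present)
X = ames_encoded.drop(columns=["SalePrice", "LogSalePrice", "PID"], errors="ignore")
y = ames_encoded["LogSalePrice"]

# 70/30 train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Linear regression and KNN baselines
lin_reg = LinearRegression().fit(X_train, y_train)
knn = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)

# Grid search with 5-fold cross-validation for the decision tree
tree_grid = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid={"max_depth": [4, 6, 8, 10], "min_samples_leaf": [1, 5, 10]},
    cv=5, scoring="r2")
tree_grid.fit(X_train, y_train)
tree = tree_grid.best_estimator_
```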
During the model evaluation phase, the linear regression model demonstrated the highest R² score.
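The comparison can be sketched with scikit-learn's `r2_score` on the held-out test set, using the models fitted above:

```python
from sklearn.metrics import r2_score

# Compare the three models on the held-out test set
for name, model in [("Linear Regression", lin_reg),
                    ("KNN", knn),
                    ("Decision Tree", tree)]:
    r2 = r2_score(y_test, model.predict(X_test))
    print(f"{name}: R^2 = {r2:.4f}")
```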
Conclusions
Putting Linear Regression, KNN, and Decision Tree to the test to find the best predictor of sale prices for homes in Ames, Iowa, I discovered that Linear Regression is the best ML model, with an R² of 91.47%.
Next steps
Future work could include:
- Refining the model to improve accuracy, for example by considering interactions between the variables and working on model stacking
- Incorporating additional features
- Combining multiple models (ensembling) for greater predictive power.