EDA and machine learning: Ames housing price prediction project
Introduction
Buying or investing in real estate is one of the biggest decisions people make. To be confident that we're making a good purchase, we need to know whether a house is priced fairly or even underpriced.
For this project, I used a Kaggle dataset to predict housing sale prices. The dataset contains 2580 records with 79 attributes covering sales from 2006 to 2010, with detailed information about each house and its sale price. In my analysis, I predicted the price of Ames homes based on features that correlate with sale price, including OverallQual, GrLivArea, GarageCars, GarageArea, TotalBsmtSF, 1stFlrSF, YearBuilt, FullBath, etc.
Project description
This project aims to analyze the Ames housing market and predict house sale prices, applying data science and machine learning techniques to a real-world problem.
Source of the data: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview
I was inspired to work on this project because of its relevance to almost everyone who may consider investing in real estate or just purchasing a home. While this home pricing model is linked to one geographic location, it provides insight that is relevant to home prices in general.
Target audience
The target audience is home buyers and real estate investors who need to evaluate home prices in Ames, IA.
Dataset
We are using a Kaggle dataset with 2580 records and 79 attributes. It is a sample dataset of the housing market in Ames, Iowa, and contains the corresponding features for each home, like PID, GrLivArea, SalePrice, MSSubClass, MSZoning, LotFrontage, LotArea, Street, Alley, LotShape, LandContour, Utilities, LotConfig, LandSlope, Neighborhood, Condition1, Condition2, BldgType, HouseStyle, OverallQual, OverallCond, YearBuilt, YearRemodAdd, RoofStyle, RoofMatl, Exterior1st, Exterior2nd, MasVnrType, MasVnrArea, ExterQual, etc.
You can download this dataset from Kaggle here: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview
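As a starting point, here is a minimal sketch of loading the data with pandas. The file name `AmesHousing.csv` is an assumption; adjust it to match the file downloaded from Kaggle.

```python
import pandas as pd

# Load the Ames housing data (file name is an assumption; use the name of the downloaded file)
ames = pd.read_csv("AmesHousing.csv")

# Quick look at the shape and at the columns with the most missing values
print(ames.shape)
print(ames.isna().sum().sort_values(ascending=False).head(10))
```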
Research question
My goal was to build a model using a Kaggle dataset collected from Ames, Iowa that can help both realtors and prospective homeowners accurately price a house based on its available features.
Steps completed:
- Conducted preprocessing, cleaning, and feature engineering on the dataset
- Performed EDA of the Ames Housing dataset using Python
- Developed house sale price prediction models (Linear Regression, KNN, and Decision Tree) using Python
Data preprocessing and exploratory data analysis
The dataset contains missing values for 27 variables.
I cleaned and preprocessed the dataset, including removing duplicate rows, examining rows and columns with missing values, imputing some of those missing values, and engineering a few new variables. For example, I removed variables such as Alley, PoolQC, Fence, and MiscFeature with over 80% missing values. Also, I deleted all data with MSZoning C (commercial, agriculture, and industrial) as these are not residential units. I deleted the variable Condition2 because 99% of the values are Norm. I created a few new variables: Age, Slope_Gentle, RR_prox, and BsmtLivArea.
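A minimal sketch of these cleaning and feature engineering steps, assuming the dataframe `ames` loaded above. The exact category codes (for example the MSZoning and Condition1 labels) and the definitions of the engineered variables are assumptions based on the variable names given.

```python
# Remove duplicate rows
ames = ames.drop_duplicates()

# Drop variables with over 80% missing values
ames = ames.drop(columns=["Alley", "PoolQC", "Fence", "MiscFeature"])

# Keep residential zoning only (drop commercial, agricultural, and industrial units)
ames = ames[~ames["MSZoning"].isin(["C (all)", "A (agr)", "I (all)"])]

# Drop Condition2: roughly 99% of its values are "Norm"
ames = ames.drop(columns=["Condition2"])

# Engineered variables (definitions are assumptions based on the names)
ames["Age"] = ames["YrSold"] - ames["YearBuilt"]
ames["Slope_Gentle"] = (ames["LandSlope"] == "Gtl").astype(int)
ames["RR_prox"] = ames["Condition1"].isin(["RRNn", "RRAn", "RRNe", "RRAe"]).astype(int)
ames["BsmtLivArea"] = ames["TotalBsmtSF"] - ames["BsmtUnfSF"]
```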
From the correlation matrix of the Ames dataset variables (see the matrix below), we found the variables most correlated with our target variable SalePrice: OverallQual, GrLivArea, GarageCars, GarageArea, TotalBsmtSF, 1stFlrSF, YearBuilt, and FullBath.
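As a sketch, the correlation of the numeric features with SalePrice can be computed as follows (using the cleaned `ames` dataframe):

```python
# Correlation of numeric features with SalePrice, strongest first
corr = ames.corr(numeric_only=True)["SalePrice"].drop("SalePrice")
print(corr.abs().sort_values(ascending=False).head(10))
```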
The plot below shows that the most popular neighborhoods in the city of Ames are NAmes (North Ames), Old Town, and CollgCr (College Creek).
From the chart below, we can see that the houses most commonly sold in the city of Ames were single-family homes and townhomes.
Analyzing the conditions of sold houses reveals that most sale conditions were Normal (93.57%).
One-story and two-story homes were in the most demand, at 50% and 30% respectively (see the chart below).
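A minimal sketch of the count plots behind these observations, using seaborn (the plotting library is an assumption; the column names are from the dataset):

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(14, 4))

# Sales by neighborhood
sns.countplot(data=ames, x="Neighborhood",
              order=ames["Neighborhood"].value_counts().index, ax=axes[0])
axes[0].tick_params(axis="x", rotation=90)
axes[0].set_title("Sales by neighborhood")

# Sales by house style (1-story, 2-story, ...)
sns.countplot(data=ames, x="HouseStyle",
              order=ames["HouseStyle"].value_counts().index, ax=axes[1])
axes[1].set_title("Sales by house style")

plt.tight_layout()
plt.show()
```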
The sale price data was fairly right-skewed, with a few outliers at higher sale prices. After a log transformation of the sale price, the data became more normally distributed, which improved the regression models.
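A sketch of the log transformation of the target (`np.log1p` is an assumed choice; plain `np.log` works equally well since all prices are positive):

```python
import numpy as np

# Log-transform the target to reduce the right skew
ames["LogSalePrice"] = np.log1p(ames["SalePrice"])

print("Skewness before:", round(ames["SalePrice"].skew(), 2))
print("Skewness after: ", round(ames["LogSalePrice"].skew(), 2))
```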
In preparation for fitting a machine learning model, I removed outliers. Also, to perform machine learning on the dataset, I created dummy variables for the categorical features, like MSSubClass, MSZoning, Street, LotShape, LandContour, LotConfig, Neighborhood, Condition1, BldgType, HouseStyle, RoofStyle, Exterior1st, Exterior2nd, etc.
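A sketch of the dummy-encoding step, assuming all remaining categorical (object-typed) columns are encoded with `pd.get_dummies`:

```python
# One-hot encode the remaining categorical variables;
# drop_first=True avoids perfectly collinear dummy columns
categorical_cols = ames.select_dtypes(include="object").columns
ames_encoded = pd.get_dummies(ames, columns=categorical_cols, drop_first=True)
print(ames_encoded.shape)
```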
Modeling
After completing the data preprocessing, exploratory data analysis, and feature engineering, I built a few machine learning models, selected for their likelihood of achieving the highest accuracy: Linear Regression, KNN, and Decision Tree. I split the dataset into training and test sets, holding out 30% of the data for testing, and trained each model on the training split. I used grid search and cross-validation to find optimal parameters for the tree-based model.
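A minimal sketch of this modeling setup with scikit-learn. The exact feature set, hyperparameter grid, and random seeds are assumptions, and missing values are assumed to have been imputed during preprocessing.

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Features and log-transformed target (identifiers and the raw target are dropped;
# PID is dropped here on the assumption it is still present)
X = ames_encoded.drop(columns=["SalePrice", "LogSalePrice", "PID"], errors="ignore")
y = ames_encoded["LogSalePrice"]

# 70/30 train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Linear regression and KNN baselines
lin_reg = LinearRegression().fit(X_train, y_train)
knn = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)

# Grid search with 5-fold cross-validation for the decision tree
tree_grid = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid={"max_depth": [4, 6, 8, 10], "min_samples_leaf": [1, 5, 10]},
    cv=5, scoring="r2")
tree_grid.fit(X_train, y_train)
tree = tree_grid.best_estimator_
```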
During the model evaluation phase, the linear regression model demonstrated the highest R² score.
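The comparison can be sketched with scikit-learn's `r2_score` on the held-out test set, using the models fitted above:

```python
from sklearn.metrics import r2_score

# Compare the three models on the held-out test set
for name, model in [("Linear Regression", lin_reg),
                    ("KNN", knn),
                    ("Decision Tree", tree)]:
    r2 = r2_score(y_test, model.predict(X_test))
    print(f"{name}: R^2 = {r2:.4f}")
```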
Conclusions
Putting Linear Regression, KNN, and Decision Tree to the test to find the best predictor of sale prices for homes in Ames, Iowa, I discovered that Linear Regression is the best ML model, with an R² of 91.47%.
Next steps
Future work could include:
- Refining the model to improve accuracy, for example by considering interactions between the variables and working on model stacking
- Incorporating additional features
- Combining multiple models (ensembling) for greater predictive power.