Machine Learning: Iowa House Price Prediction
The skills I demonstrated here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Introduction
The aim of this machine learning project is to predict the sale prices of homes based on a Kaggle dataset of 79 explanatory variables describing nearly every aspect of residential homes in Ames, Iowa.
The model results will be evaluated based on the root mean squared error (RMSE) between the logarithm of the predicted values and the logarithm of the observed sale prices.
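For reference, here is a minimal sketch of that metric in Python (the function name rmse_log is mine, and the arrays are assumed to hold positive prices):

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """Root mean squared error between the logs of observed and predicted prices."""
    return np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2))
```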
Dataset
The dataset for this project was obtained from Kaggle and is divided into a "training dataset" and a "test dataset". The training set consists of 1460 houses (i.e., observations) with 79 attributes (i.e., features, variables, or predictors) as well as the sale price of each house.
The test set contains 1459 houses with the same 79 attributes, but without the sale price, since that is the target variable.
A statistical summary of the sale price shows that the average sale price in the training data is $180,921.
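A minimal sketch of loading the data and summarizing the target, assuming the Kaggle competition's standard file names train.csv and test.csv:

```python
import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Summary statistics of the target variable (the mean is roughly $180,921)
print(train["SalePrice"].describe())
```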
Preprocessing
The first step in this stage of the project was to combine the training and test datasets into a single dataframe, so that every transformation could be applied to both datasets at the same time.
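A sketch of this step, continuing from the dataframes loaded above (the target is set aside first, since the test set does not contain it):

```python
# Set the target aside, then stack the two datasets so that every
# transformation is applied to both at the same time
y = train["SalePrice"]
all_data = pd.concat([train.drop("SalePrice", axis=1), test], ignore_index=True)
```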
The second step was to count the missing values in the dataset. Among the features with a high number of missing values were "LotFrontage", "Alley", "GarageType", and "GarageYrBlt".
The third step involved dropping all features with more than 1000 missing values, since such sparse features were judged likely to hurt the accuracy of the model. For the remaining features, missing values were replaced with the mean of the corresponding feature.
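A sketch of both missing-value steps, continuing from the combined dataframe above. Mean imputation is only meaningful for numeric columns, so the numeric restriction below is my assumption; the post does not specify how categorical gaps were handled:

```python
# Count missing values per feature, highest first
missing = all_data.isnull().sum().sort_values(ascending=False)
print(missing[missing > 0])

# Drop features with more than 1000 missing values
all_data = all_data.drop(columns=missing[missing > 1000].index)

# Impute the remaining gaps in numeric columns with the column mean
# (assumption: mean imputation is applied to numeric features only)
numeric_cols = all_data.select_dtypes(include="number").columns
all_data[numeric_cols] = all_data[numeric_cols].fillna(all_data[numeric_cols].mean())
```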
The fourth step of preprocessing involved adding dummy variables to the dataset. A large proportion of the housing features are categorical, so I used "pd.get_dummies", which converts each categorical variable into a series of zero/one indicator columns.
After all these steps were taken, the final action was to split the combined dataframe back into the training and test sets.
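A sketch of these last two preprocessing steps, continuing from above:

```python
# Convert categorical features into 0/1 indicator columns
all_data = pd.get_dummies(all_data)

# Split the processed dataframe back into the original training and test portions
n_train = len(train)
X_train_full = all_data.iloc[:n_train]
X_test_full = all_data.iloc[n_train:]
```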
Visualization
Several visualizations were undertaken in order to better understand the datasets.
1. Analysis of housing sale price distribution
The target variable of the training dataset is the sale price of each house. The histogram below was created to examine its distribution.
Sale prices between $100K and $200K are by far the most common.
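A sketch of how such a histogram might be produced (the bin count is my choice):

```python
import matplotlib.pyplot as plt

plt.hist(train["SalePrice"], bins=50)
plt.xlabel("Sale price ($)")
plt.ylabel("Number of houses")
plt.title("Distribution of housing sale prices")
plt.show()
```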
2. Analysis of housing sales price and lot area
The chart above examines the relationship between sale price and lot area. No clear relationship emerges; most sale prices are clustered at lot areas between 0 and 50,000 square feet.
3. Analysis of housing sales price and ground living area
The sale price appears positively correlated with the ground living area of a house: the larger the living area, the higher the price.
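A sketch of such a scatter plot; the lot-area chart in the previous item is analogous, with "LotArea" on the x-axis:

```python
import matplotlib.pyplot as plt

plt.scatter(train["GrLivArea"], train["SalePrice"], alpha=0.4)
plt.xlabel("Ground living area (sq ft)")
plt.ylabel("Sale price ($)")
plt.title("Sale price vs. ground living area")
plt.show()
```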
4. Correlation heatmap
A correlation heatmap was also created using "Seaborn" to explore which features are most important.
The five features most correlated with the sale price are listed below (a sketch of how they can be extracted follows the list):
- OverallQual: Rates the overall material and finish of the house (1 = Very Poor, 10 = Very Excellent)
- GrLivArea: Above grade (ground) living area square feet
- GarageCars: Size of garage in car capacity
- GarageArea: Size of garage in square feet
- TotalBsmtSF: Total square feet of basement area
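A sketch of how the heatmap and this ranking might be produced, continuing from the raw training dataframe (the plot styling is my choice):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix over the numeric features of the training set
corr = train.corr(numeric_only=True)
sns.heatmap(corr, cmap="coolwarm")
plt.show()

# The five features most correlated with the sale price
# (nlargest(6) includes SalePrice itself, so it is skipped)
print(corr["SalePrice"].nlargest(6).iloc[1:])
```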
Modelling Using the Full Feature Set
All the features were used in several different models to determine which one best predicts the housing sale price. To evaluate the models, the data was split 70%/30% into training and test portions. Each model was instantiated, fit on "X_train" and "y_train", and then scored on "X_test" and "y_test".
The models evaluated were "Linear Regression", "Ridge", and "Lasso". Performance was measured by the r-squared value; a higher r-squared means the model explains more of the variance in the sale price.
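A sketch of this procedure with scikit-learn, continuing from the processed features above (the random_state and the default model hyperparameters are my assumptions):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# 70% / 30% split of the processed training data
X_train, X_test, y_train, y_test = train_test_split(
    X_train_full, y, test_size=0.3, random_state=0)

# Fit each model, then report r-squared on both splits
for model in (LinearRegression(), Ridge(), Lasso()):
    model.fit(X_train, y_train)
    print(type(model).__name__,
          "train r2:", round(model.score(X_train, y_train), 4),
          "test r2:", round(model.score(X_test, y_test), 4))
```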
The r-squared values for each model were:
- Linear Regression: 0.9467 (training), 0.7996 (test)
- Ridge: 0.9297 (training), 0.8908 (test)
- Lasso: 0.7835 (training), 0.8205 (test)
The "Ridge" regression had the best performance with respect to to the "test dataset" with an r-squared of 0.8908.
Conclusion
In conclusion, the "Ridge" regression was the best model for predicting the sale prices of the test dataset. A new dataframe with an "Id" and a "SalePrice" column was created for the submission:
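A sketch of that final step, continuing from the processed data above (the "Id" spelling follows the Kaggle competition's column name):

```python
# Refit Ridge on the full training data and predict the held-out test set
ridge = Ridge()
ridge.fit(X_train_full, y)

submission = pd.DataFrame({"Id": test["Id"], "SalePrice": ridge.predict(X_test_full)})
submission.to_csv("submission.csv", index=False)
```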
Greater accuracy might be achieved by exploring whether features can be removed or engineered to add more predictive value. In addition, one could test whether using only a subset of the most highly correlated features increases or decreases accuracy.