Machine Learning - Iowa House Price Prediction

Posted on Dec 31, 2019

Introduction

The aim of this machine learning project is to predict the sales prices of homes based on a Kaggle dataset of 79 explanatory variables describing nearly every aspect of residential homes in Ames, Iowa.

Model results are evaluated on the root mean squared error (RMSE) between the logarithm of the predicted values and the logarithm of the observed sales prices.
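As a concrete reference, here is a minimal sketch of that metric, assuming y_true and y_pred are arrays of positive prices:

```python
import numpy as np

def rmsle(y_true, y_pred):
    """RMSE between the logs of predicted and observed prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2))
```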

Dataset

The dataset for this project was obtained from Kaggle and was divided into a "training dataset" and a "test dataset". The training dataset consisted of 1,460 houses (i.e., observations), each with 79 attributes (i.e., features, variables, or predictors) plus the sales price of the house.

The test set included 1,459 houses with the same 79 attributes, but without the sales price, since that is the target variable to be predicted.
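As a quick sketch, loading the two files might look like this (the file names train.csv and test.csv are assumed from the Kaggle competition downloads):

```python
import pandas as pd

# File names assumed to match the Kaggle competition downloads.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

print(train.shape)  # (1460, 81): Id + 79 features + SalePrice
print(test.shape)   # (1459, 80): Id + 79 features
```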

An analysis of the sales price statistics showed that the average housing sales price in the dataset is $180,921.

Preprocessing

The first step in this stage of the project was to combine the "training dataset" and "test dataset" into one dataframe, so that every preprocessing step could be applied to both datasets at the same time.
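A minimal sketch of that combination step, continuing the loading code above (variable names are illustrative):

```python
# Hold the target aside, then stack train and test so each
# preprocessing step touches both datasets at once.
y = train["SalePrice"]
all_data = pd.concat(
    [train.drop("SalePrice", axis=1), test],
    ignore_index=True,
)
```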

The second step was to count the missing values in the dataset. Features with a high number of missing values included "LotFrontage", "Alley", "GarageType", and "GarageYrBlt", among many others.

The third step involved dropping all features with more than 1,000 missing values, since such sparse features were judged likely to hurt the accuracy of the model. For all other features, the remaining missing values were replaced with the "mean" of the corresponding feature.
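A sketch of both steps follows. One assumption worth flagging: a mean only exists for numeric columns, so the fill below is restricted to those; categorical gaps would need a mode or a placeholder category instead.

```python
# Count missing values per feature, largest first.
missing = all_data.isnull().sum().sort_values(ascending=False)
print(missing.head(10))

# Drop features with more than 1,000 missing values.
all_data = all_data.drop(columns=missing[missing > 1000].index)

# Fill remaining gaps in numeric features with the column mean.
numeric_cols = all_data.select_dtypes(include="number").columns
all_data[numeric_cols] = all_data[numeric_cols].fillna(all_data[numeric_cols].mean())
```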

The fourth step of preprocessing was to add dummy variables to the dataset. A large proportion of the housing features were categorical variables, so I used "pd.get_dummies", which converts each categorical variable into a series of zero/one indicator columns.
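The encoding itself is a one-liner on the combined frame:

```python
# One-hot encode every categorical feature into 0/1 indicator columns;
# numeric columns pass through unchanged.
all_data = pd.get_dummies(all_data)
```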

After all these steps were taken, the final action was to split the dataframe back into the "training set" and the "testing set".
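Since the concatenation preserved row order, the split back is positional (a sketch, reusing the variables above):

```python
# The first len(train) rows of the combined frame are the training houses.
n_train = len(train)
train_processed = all_data.iloc[:n_train]
test_processed = all_data.iloc[n_train:]
```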

Visualization

Several visualizations were undertaken in order to better understand the datasets.

  1. The target variable of the "training dataset" is the sales price of the corresponding houses. The histogram below was created to examine its distribution.

Housing sales prices between $100K and $200K are the most common.
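A sketch of that histogram, reusing the target series defined earlier:

```python
import matplotlib.pyplot as plt

# Histogram of the target variable.
plt.hist(y, bins=50)
plt.xlabel("SalePrice ($)")
plt.ylabel("Number of houses")
plt.title("Distribution of housing sales prices")
plt.show()
```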

  2. Analysis of housing sales price and lot area

The chart above examines the relationship between housing sales prices and the corresponding lot areas. No clear relationship is visible; most of the sales are clustered at lot areas between 0 and 50,000 square feet (a plotting sketch for this chart and the next follows item 3).

  3. Analysis of housing sales price and ground living area

The sales price appears positively correlated with the ground living area of a house: the larger the living area, the higher the price.
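Both scatter plots could be reproduced along these lines (a sketch, reusing the variables defined earlier):

```python
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Left: sales price vs. lot area -- no clear pattern.
axes[0].scatter(train["LotArea"], y, alpha=0.3)
axes[0].set_xlabel("LotArea (sq ft)")
axes[0].set_ylabel("SalePrice ($)")

# Right: sales price vs. above-grade living area -- positive trend.
axes[1].scatter(train["GrLivArea"], y, alpha=0.3)
axes[1].set_xlabel("GrLivArea (sq ft)")
axes[1].set_ylabel("SalePrice ($)")

plt.show()
```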

  4. Correlation heatmap

A correlation heatmap was also created using "Seaborn" to explore which features are most important.
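A sketch of the heatmap and the correlation ranking (the numeric_only flag assumes a recent pandas version):

```python
import seaborn as sns

# Correlation matrix over numeric features, drawn as a heatmap.
corr = train.corr(numeric_only=True)
sns.heatmap(corr, cmap="coolwarm")
plt.show()

# Rank features by their correlation with the sales price
# (the top entry is SalePrice itself, with correlation 1.0).
print(corr["SalePrice"].sort_values(ascending=False).head(6))
```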

The five features most correlated with the sales price are:

  1. OverallQual: Rates the overall material and finish of the house (1 = Very Poor, 10 = Very Excellent)
  2. GrLivArea: Above grade (ground) living area square feet
  3. GarageCars: Size of garage in car capacity
  4. GarageArea: Size of garage in square feet
  5. TotalBsmtSF: Total square feet of basement area

Modelling Using the Full Feature Set

All the features were used in several different models to determine which one best predicts the housing sales price. To test the models, the training data was split into a 70%/30% train-test split. Each model was instantiated, fit using "X_train" and "y_train", and then scored using "X_test" and "y_test".

The models evaluated were "Linear Regression", "Ridge", and "Lasso". Their performance was measured using the r-squared value; a higher r-squared on the test set indicates better predictive accuracy.
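A sketch of that fit-and-score loop, using scikit-learn defaults since the post does not report the regularization strengths used for Ridge and Lasso:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# 70/30 split of the processed training data.
X_train, X_test, y_train, y_test = train_test_split(
    train_processed, y, test_size=0.3, random_state=0
)

for model in (LinearRegression(), Ridge(), Lasso()):
    model.fit(X_train, y_train)
    name = type(model).__name__
    print(f"{name}: train R^2 = {model.score(X_train, y_train):.4f}, "
          f"test R^2 = {model.score(X_test, y_test):.4f}")
```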

These are the r-squared values for each model:

  • Linear Regression

R-squared for training dataset - 0.9467

R-squared for test dataset - 0.7996

  • Ridge

R-squared for training dataset - 0.9297

R-squared for test dataset - 0.8908

  • Lasso

R-squared for training dataset - 0.7835

R-squared for test dataset - 0.8205

The "Ridge" regression had the best performance with respect to to the "test dataset" with an r-squared of 0.8908.

Conclusion

In conclusion, the "Ridge" regression was the best model for predicting the sales price of the "test dataset". A new dataframe with an "Id" and a "SalePrice" column was created for the submission:
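A minimal sketch of that final step, continuing the code above (the "Id" column name is assumed from the Kaggle files):

```python
# Refit ridge on the full processed training data, then predict
# the held-out Kaggle test set and write the submission file.
ridge = Ridge().fit(train_processed, y)
submission = pd.DataFrame({
    "Id": test["Id"],  # "Id" column name assumed from the Kaggle data
    "SalePrice": ridge.predict(test_processed),
})
submission.to_csv("submission.csv", index=False)
```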

Greater accuracy might be achieved by exploring whether features can be removed or engineered to add more predictive value. In addition, we could test whether using only a subset of the most correlated features increases or decreases accuracy.

About Author


Steven Owusu

Steven Owusu has several years' experience working as a credit analyst. He holds a Master of Business Administration from Columbia Business School. Steven loves applying data science techniques to solve real-world business problems.
