Price: Machine Learning - Iowa House Price Prediction

Posted on Dec 31, 2019
The skills I demonstrate here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.


The aim of this machine learning project is to predict the sales prices of homes using a Kaggle dataset with 79 explanatory variables describing every aspect of residential homes in Ames, Iowa.

The model results will be evaluated based on the Root-Mean-Squared-Error between the logarithm of the predicted values and the logarithm of the observed sales price.
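The metric described above can be sketched in a few lines. This is a minimal illustration, assuming NumPy is available; the function name `rmse_of_logs` and the sample prices are made up for the example and do not come from the project.

```python
import numpy as np

def rmse_of_logs(y_true, y_pred):
    """Root-mean-squared error between log(predicted) and log(observed) prices."""
    log_true = np.log(np.asarray(y_true, dtype=float))
    log_pred = np.log(np.asarray(y_pred, dtype=float))
    return float(np.sqrt(np.mean((log_pred - log_true) ** 2)))

# Made-up prices, for illustration only
observed = [200000, 150000, 320000]
predicted = [210000, 140000, 300000]
score = rmse_of_logs(observed, predicted)
print(score)
```

Because the error is taken on logarithms, being off by 10% on a cheap house costs about as much as being off by 10% on an expensive one.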


The dataset for this project was obtained from Kaggle and was divided into a "training dataset" and a "test dataset". The training dataset consisted of 1460 houses (i.e., observations), each described by 79 attributes (i.e., features, variables, or predictors) along with its sales price.

On the other hand, the testing set included 1459 houses with the corresponding 79 attributes, but without the sales price since this would be our target variable.

An analysis of the sales price provided several summary statistics. Note that the average sales price of the houses in the training dataset is $180,921.


The first step in this stage of the project was to combine the "training dataset" and "test dataset" into one dataframe in order to process all the data. This made things easier, since every "process" applied to the data was applied to both datasets at the same time.
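A minimal sketch of this combining step, assuming pandas. The two small dataframes stand in for the Kaggle files and their values are made up; the variable names (`train`, `test`, `combined`, `n_train`) are illustrative, not taken from the project code.

```python
import pandas as pd

# Tiny stand-ins for the Kaggle training and test files (values are made up)
train = pd.DataFrame({
    "Id": [1, 2, 3],
    "LotArea": [8450, 9600, 11250],
    "SalePrice": [208500, 181500, 223500],
})
test = pd.DataFrame({"Id": [4, 5], "LotArea": [11622, 14267]})

n_train = len(train)            # remember where the training rows end
target = train["SalePrice"]     # keep the target aside; the test set has none

# Stack the feature columns of both sets into a single dataframe
combined = pd.concat([train.drop(columns="SalePrice"), test], ignore_index=True)
print(combined.shape)
```

Keeping `n_train` around makes it straightforward to separate the two sets again once preprocessing is done.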

The second step was to ascertain the number of "missing values" in the dataset. Some of the features with a high number of missing values were "LotFrontage", "Alley", "GarageType", and "GarageYrBlt", amongst many others.
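One way to count missing values per feature is sketched below, assuming pandas. The toy dataframe is made up; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data with gaps (values are made up)
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Number of missing entries per column, keeping only columns that have any
missing_counts = df.isnull().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_counts)
```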

The third step involved dropping all features that had more than 1000 missing values, because it was felt that this data would hurt the accuracy of the model. All other features with missing values had those values replaced with the "mean" value of the corresponding feature.
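The drop-then-impute rule could look roughly like the sketch below, assuming pandas. The helper name `drop_and_impute` is hypothetical. A mean only exists for numeric columns, so this sketch leaves non-numeric columns untouched; that is an assumption, since the post does not say how they were handled.

```python
import numpy as np
import pandas as pd

def drop_and_impute(df, max_missing=1000):
    """Drop columns with more than max_missing NaNs, then mean-fill numeric gaps."""
    missing = df.isnull().sum()
    kept = df.loc[:, missing <= max_missing].copy()
    # Assumption: only numeric columns are mean-imputed
    numeric_cols = kept.select_dtypes(include="number").columns
    kept[numeric_cols] = kept[numeric_cols].fillna(kept[numeric_cols].mean())
    return kept

# Toy data (made up); a threshold of 2 stands in for the post's 1000
toy = pd.DataFrame({
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],
    "LotArea": [8450, 9600, 11250, 9550],
})
cleaned = drop_and_impute(toy, max_missing=2)
print(cleaned)
```

In the toy data, "Alley" has three missing values and is dropped, while the single gap in "LotFrontage" is filled with the column mean.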

The fourth step of preprocessing involved adding dummy variables to the data. A large proportion of the housing features in the dataset were categorical variables. In order to better process them, I used the function "pd.get_dummies", which converts each categorical variable into a series of zero/one columns.
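A small illustration of what "pd.get_dummies" does, using a made-up dataframe. The `dtype=int` argument is a choice made here so that the output shows zeros and ones rather than booleans.

```python
import pandas as pd

# Toy data (made up): one numeric and one categorical feature
df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "GarageType": ["Attchd", "Detchd", "Attchd"],
})

# Each category becomes its own 0/1 column; numeric columns pass through unchanged
encoded = pd.get_dummies(df, dtype=int)
print(encoded)
```

Encoding the combined dataframe, rather than each set separately, means a category that appears in only one of the two sets still produces the same columns in both.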

After all these steps were taken, the final action was to split the dataframe back into the "training set" and the "testing set".
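Splitting back can be done by row position, assuming the training rows were stacked first. The names below are illustrative and the data is made up.

```python
import pandas as pd

# Stand-in for the preprocessed, combined dataframe (values are made up)
combined = pd.DataFrame({"LotArea": [8450, 9600, 11250, 11622, 14267]})
n_train = 3  # number of training rows at the top (1460 in the real data)

train_features = combined.iloc[:n_train]
test_features = combined.iloc[n_train:].reset_index(drop=True)
print(len(train_features), len(test_features))
```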


Several visualizations were undertaken in order to better understand the datasets.

  1. The target variable of the "training dataset" is the sales price of the corresponding houses. The histogram below was created to evaluate the sales price.

Sales prices between $100K and $200K are the most common.

  2. Analysis of housing sales price and lot area

The chart above examines the relationship between "housing sales prices" and the corresponding "lot areas". No clear relationship is apparent. Most of the housing "sales prices" are clustered in "lot areas" between "0" and "50,000" square feet.

  3. Analysis of housing sales price and ground living area

The sales price seems to be positively correlated with the ground living area of a house. The larger the ground living area, the higher the price.

  4. Correlation heatmap

A correlation heatmap was also created using "Seaborn" to explore which features are most strongly related to the sales price.

The 5 features most correlated with the "sale price" are:

  1. OverallQual: Rates the overall material and finish of the house (1 = Very Poor, 10 = Very Excellent)
  2. GrLivArea: Above grade (ground) living area square feet
  3. GarageCars: Size of garage in car capacity
  4. GarageArea: Size of garage in square feet
  5. TotalBsmtSF: Total square feet of basement area
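A ranking like the one above can be read off the correlation matrix directly. The sketch below uses synthetic data (generated with a fixed seed, not the Ames data) and assumes pandas and NumPy; the heatmap call is left as a comment so the example runs without plotting libraries.

```python
import numpy as np
import pandas as pd

# Synthetic data: price depends strongly on quality, weakly on area
rng = np.random.default_rng(0)
n = 200
qual = rng.integers(1, 11, size=n)
area = rng.normal(1500, 400, size=n)
unrelated = rng.normal(size=n)
price = 20000 * qual + 50 * area + rng.normal(0, 10000, size=n)

df = pd.DataFrame({
    "OverallQual": qual,
    "GrLivArea": area,
    "Unrelated": unrelated,
    "SalePrice": price,
})

corr = df.corr()
# With seaborn installed, the heatmap itself would be: sns.heatmap(corr)

# Features ranked by absolute correlation with the target
top = corr["SalePrice"].drop("SalePrice").abs().sort_values(ascending=False).head(5)
print(top)
```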

Modelling Using the Full Feature Set

All the features were used in several different models to determine which one is the best at predicting the housing sales price. To test the models, the data was split using a 70%-30% train-test split. The models were first instantiated, then fitted using "X_Train" and "y_train", and finally scored using "X_Test" and "y_test".
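The split, fit, and score sequence might look like the sketch below, assuming scikit-learn. The data is synthetic and the `random_state` value is arbitrary; neither comes from the project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the housing features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=100)

# 70% of rows for fitting, 30% held out for scoring
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LinearRegression()          # instantiate
model.fit(X_train, y_train)         # fit on the 70%
test_r2 = model.score(X_test, y_test)  # r-squared on the held-out 30%
print(test_r2)
```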

The models that were evaluated were: "Linear Regression", "Ridge", and "Lasso". The performance of each model was evaluated using the r-squared value. A higher r-squared value means the model explains more of the variation in the sales price.

These are the r-squared values for each model:

  • Linear Regression

R-squared for training dataset - 0.9467

R-squared for test dataset - 0.7996

  • Ridge

R-squared for training dataset - 0.9297

R-squared for test dataset - 0.8908

  • Lasso

R-squared for training dataset - 0.7835

R-squared for test dataset - 0.8205

The "Ridge" regression had the best performance with respect to the "test dataset", with an r-squared of 0.8908.
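A comparison of this kind can be run in a short loop, as sketched below with scikit-learn. The data is synthetic and the `alpha` values are placeholders, since the post does not state which hyperparameters were used; the scores it prints are unrelated to the figures reported above.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic data; two of the five features carry no signal
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([4.0, 0.0, -3.0, 0.0, 2.0]) + rng.normal(scale=0.5, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# alpha values are assumptions chosen for illustration
models = {
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # (r-squared on training rows, r-squared on held-out rows)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))

# Pick the model with the highest held-out r-squared
best = max(scores, key=lambda name: scores[name][1])
print(scores, best)
```

Comparing the training and test scores side by side also shows how much each model overfits: a large gap between the two suggests the model has memorised the training rows.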


In conclusion, the "Ridge" regression was the best model to predict the sales price of the "test dataset". A new dataframe with "ID" and "SalePrice" columns was created for the predictions.
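Building such a dataframe could look like the sketch below, assuming pandas. The ids and prices are made up, and the `Id` spelling of the column and the output file name are assumptions rather than details from the post.

```python
import pandas as pd

# Made-up ids and predictions, for illustration only
test_ids = [1461, 1462, 1463]
predicted_prices = [169000.0, 187500.0, 183600.0]

submission = pd.DataFrame({"Id": test_ids, "SalePrice": predicted_prices})
# To write it out for upload (hypothetical file name):
# submission.to_csv("submission.csv", index=False)
print(submission)
```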

Greater accuracy may be derived by exploring whether any features can be removed or engineered to add more predictive value. In addition, we can determine if using a sample of the most correlated features would increase or decrease accuracy.

About Author

Steven Owusu

Steven Owusu has several years of experience working as a credit analyst. He holds a Master of Business Administration from Columbia Business School. Steven loves applying data science techniques to solving real-world business problems.