EDA and machine learning Ames housing price prediction project

Posted on Apr 12, 2023


Buying and investing in real estate is one of the biggest financial decisions most people make. To be confident we are making a good purchase, we need to know whether a house is priced fairly or even underpriced.

For this project, I used a Kaggle dataset to predict housing sale prices. The dataset contains 2,580 records with 79 attributes describing houses sold in Ames, Iowa between 2006 and 2010, with detailed information about each house's characteristics and its sale price. In my analysis, I predicted the price of Ames homes based on features that correlate with sale price, including OverallQual, GrLivArea, GarageCars, GarageArea, TotalBsmtSF, 1stFlrSF, YearBuilt, and FullBath.


Project description

This project aims to analyze the Ames housing market and predict home sale prices, applying data science and machine learning techniques to a real-world problem.

Source of the data: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview

I was inspired to work on this project because of its relevance to almost everyone who may consider investing in real estate or just purchasing a home. While this home pricing model is tied to one geographic location, it provides insight that is relevant to home prices in general.


Target audience

The target audience is home buyers and real estate investors who need to evaluate home prices in Ames, IA.



The dataset is a sample of the housing market in Ames, Iowa, with 2,580 records and 79 attributes. Its features include PID, GrLivArea, SalePrice, MSSubClass, MSZoning, LotFrontage, LotArea, Street, Alley, LotShape, LandContour, Utilities, LotConfig, LandSlope, Neighborhood, Condition1, Condition2, BldgType, HouseStyle, OverallQual, OverallCond, YearBuilt, YearRemodAdd, RoofStyle, RoofMatl, Exterior1st, Exterior2nd, MasVnrType, MasVnrArea, ExterQual, etc.

You can download this dataset from Kaggle here: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview


Research question

My goal was to build a model using a Kaggle dataset collected from Ames, Iowa that can help both realtors and prospective homeowners accurately price a house based on its available features.


Steps completed:

  • Preprocessed and cleaned the dataset and engineered new features
  • Performed EDA of the Ames housing dataset using Python
  • Developed house sale price predictive models (Linear Regression, KNN, and Decision Tree) using Python


Data Preprocessing and Exploratory data analysis

The dataset contains missing values for 27 variables.

I cleaned and preprocessed the dataset, including removing duplicate rows, examining rows and columns with missing values, imputing some of those missing values, and engineering a few new variables. For example, I removed variables such as Alley, PoolQC, Fence, and MiscFeature, each with over 80% missing values. I also removed all rows zoned MSZoning C (commercial, agricultural, and industrial), as these are not residential units, and dropped the variable Condition2 because 99% of its values are Norm. Finally, I created a few new variables: Age, Slope_Gentle, RR_prox, and BsmtLivArea.
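These cleaning steps can be sketched with pandas as follows. This is a minimal sketch, not the exact project code: the raw DataFrame `ames`, the 80% threshold, and the `"C (all)"` zoning label are assumptions based on the description above, and only the Age feature is shown as an example of the engineered variables.

```python
import pandas as pd

def preprocess(ames: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps described above to the raw Ames data."""
    df = ames.drop_duplicates()

    # Drop variables with over 80% missing values (e.g. Alley, PoolQC, Fence).
    sparse = [c for c in df.columns if df[c].isna().mean() > 0.80]
    df = df.drop(columns=sparse)

    # Keep only residential zoning; MSZoning C covers non-residential units.
    if "MSZoning" in df.columns:
        df = df[df["MSZoning"] != "C (all)"]

    # Condition2 is ~99% "Norm", so it carries almost no information.
    df = df.drop(columns=["Condition2"], errors="ignore")

    # One example of feature engineering: house age at the time of sale.
    df = df.assign(Age=df["YrSold"] - df["YearBuilt"])
    return df
```

The same pattern extends to the other engineered variables (Slope_Gentle, RR_prox, BsmtLivArea).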

From the correlation matrix of the Ames dataset variables (please see the matrix below), I identified the variables most correlated with the target variable SalePrice: OverallQual, GrLivArea, GarageCars, GarageArea, TotalBsmtSF, 1stFlrSF, YearBuilt, and FullBath.
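Ranking features by their correlation with the target can be done in a few lines of pandas. The helper below is a generic sketch (the DataFrame here is synthetic, standing in for the Ames data):

```python
import pandas as pd

def top_correlations(df: pd.DataFrame, target: str = "SalePrice", n: int = 8) -> pd.Series:
    """Return the n numeric features most correlated (in absolute value) with the target."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    return corr.abs().sort_values(ascending=False).head(n)
```

Calling `top_correlations(ames)` on the cleaned data surfaces the variables listed above.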

The plot below shows that the most popular neighborhoods in the city of Ames are NAmes (North Ames), Old Town, and CollgCr (College Creek).


From the chart below, we can see that single-family homes and townhomes accounted for most house sales in the city of Ames.

Analyzing the conditions of sold houses reveals that most sales (93.57%) occurred under normal conditions.


One-story and two-story homes were most in demand, at 50% and 30% respectively; please see the chart below.

The sale price data was fairly right-skewed, with a few outliers at higher sale prices. After a log transformation of the house sale price, the data became more normally distributed, which improved the regression models.
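The log transformation is a one-liner with NumPy. A common choice is `np.log1p` (log(1+x)), which is safe at zero, with `np.expm1` to convert model predictions back to dollars; the small price series below is illustrative, not the actual Ames data:

```python
import numpy as np
import pandas as pd

# A small right-skewed example: one high-priced outlier pulls the tail right.
prices = pd.Series([120_000, 150_000, 200_000, 755_000])

log_prices = np.log1p(prices)     # train the regression on this target
recovered = np.expm1(log_prices)  # invert predictions back to dollars

print("raw skew:", prices.skew(), "log skew:", log_prices.skew())
```

The skewness statistic drops after the transform, which is exactly the effect that helped the regression models here.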

In preparation for fitting a machine learning model, I removed outliers. To make the dataset usable for machine learning, I also created dummy variables for categorical features such as MSSubClass, MSZoning, Street, LotShape, LandContour, LotConfig, Neighborhood, Condition1, BldgType, HouseStyle, RoofStyle, Exterior1st, Exterior2nd, etc.
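Dummy encoding with pandas might look like the sketch below (the tiny DataFrame is illustrative; `drop_first=True` is one common convention for avoiding perfect collinearity in a linear model, not necessarily the setting used in the project):

```python
import pandas as pd

df = pd.DataFrame({
    "GrLivArea": [1500, 2000, 1200],
    "Neighborhood": ["NAmes", "OldTown", "CollgCr"],
    "BldgType": ["1Fam", "TwnhsE", "1Fam"],
})

# One indicator column per category level, dropping one baseline level each.
encoded = pd.get_dummies(df, columns=["Neighborhood", "BldgType"], drop_first=True)
print(encoded.columns.tolist())
```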



After completing the data preprocessing, exploratory data analysis, and feature engineering, I built several machine learning models, selected based on which were most likely to achieve the highest accuracy: Linear Regression, KNN, and Decision Tree. I held out 30% of the dataset as a test set and trained each model on the remaining 70%. I used grid search and cross-validation to find optimal hyperparameters for the tree-based models.
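The modeling workflow could be sketched with scikit-learn as below. The synthetic `X` and `y` stand in for the preprocessed Ames features and log sale price so the sketch runs on its own, and the parameter grid is a hypothetical example rather than the grid actually searched:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Stand-in for the preprocessed Ames features and target.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)

# 70/30 train-test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Linear regression baseline, fit on the training split only.
lr = LinearRegression().fit(X_train, y_train)

# Grid search with 5-fold cross-validation for the tree model.
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid={"max_depth": [3, 5, 10], "min_samples_leaf": [1, 5, 10]},
    cv=5,
)
grid.fit(X_train, y_train)

# .score() returns R2 for regressors, the metric used in the evaluation below.
print("Linear R2:", lr.score(X_test, y_test))
print("Tree R2:  ", grid.score(X_test, y_test))
```

A KNN regressor slots into the same pattern via `sklearn.neighbors.KNeighborsRegressor`.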

During the model evaluation phase, the linear regression model demonstrated the highest R2 score.



Putting Linear Regression, KNN, and Decision Tree to the test to find the best predictor of sale prices for homes in Ames, Iowa, I found that Linear Regression was the best model, with an R2 of 91.47%.


Next steps

Future work could include:

  • Refining the model to improve accuracy, e.g., by considering interactions between variables and working on model stacking
  • Incorporating additional features
  • Combining multiple models (ensembling) for greater predictive power

About Author

Diana Dent

A Data Scientist with 15 years of business experience in financial and project management. A proactive, detail-oriented data enthusiast with exceptional problem-solving skills. Interested in contributing SQL, Python, Excel, and Tableau mastery paired with advanced data analytics, machine...
