ML: Predicting House Prices

Muhammad Ihsanulhaq Sarfraz
Posted on Jan 30, 2020


There has been a rise in machine learning applications in real estate. To that end, this project seeks to build a model to predict the final sale price of each home in the Ames Housing dataset. The data uses 79 variables to describe different aspects of the residential homes. This project uses exploratory data analysis, data cleaning, feature engineering, and modeling with linear and non-linear models to predict house prices.


The data consists of 36 numerical and 43 categorical variables, with 1460 instances of training data and 1459 of test data. The source code for this analysis and modeling can be found on GitHub. As a first step, the relationship between some of the numerical variables and SalePrice is explored, namely GrLivArea and TotalBsmtSF. There appears to be a strong linear relationship between GrLivArea and SalePrice. There is also a somewhat linear relationship between TotalBsmtSF and SalePrice, although for some houses TotalBsmtSF appears to have less influence on SalePrice.
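
The strength of these relationships can be checked numerically with pandas. The sketch below uses a tiny hypothetical sample in place of the real training data:

```python
import pandas as pd

# Tiny hypothetical sample standing in for the Ames training data
df = pd.DataFrame({
    "GrLivArea":   [1710, 1262, 1786, 961, 2198],
    "TotalBsmtSF": [856, 1262, 920, 756, 1145],
    "SalePrice":   [208500, 181500, 223500, 140000, 250000],
})

# Pearson correlation of each area feature with SalePrice
print(df[["GrLivArea", "TotalBsmtSF"]].corrwith(df["SalePrice"]))
```

On the real data, a scatter plot of each feature against SalePrice makes the same relationship visible.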

In addition to the numerical variables, categorical variables are also explored, namely OverallQual and YearBuilt. The boxplot below demonstrates that OverallQual is directly related to SalePrice: the better the quality, the higher the SalePrice. YearBuilt also seems related to SalePrice, but the relationship is stronger for OverallQual, where the box plot shows sale prices rising steadily with overall quality.
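
The pattern in the boxplot can also be summarized numerically. The sketch below groups sale price by quality level on a hypothetical handful of rows:

```python
import pandas as pd

# Hypothetical rows mimicking the boxplot's pattern
toy = pd.DataFrame({
    "OverallQual": [5, 5, 7, 7, 9, 9],
    "SalePrice":   [130000, 140000, 200000, 210000, 340000, 360000],
})

# Median sale price per quality level rises with quality
medians = toy.groupby("OverallQual")["SalePrice"].median()
print(medians)
```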

In order to understand how the dependent variable and independent variables are related, a correlation matrix heatmap is prepared to visualize the correlation between features. The correlation heatmap below shows how strongly pairs of features are related. To further narrow the dependency and zoom into the 10 most correlated features, a correlation matrix heatmap of top 10 correlated features is shown.
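
A sketch of how such a top-k ranking can be extracted is shown below; seaborn's heatmap would plot the resulting matrix, and synthetic data stands in here for the real features:

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for a few Ames features
rng = np.random.default_rng(0)
n = 200
qual = rng.integers(1, 11, n).astype(float)
area = rng.normal(1500, 400, n)
noise = rng.normal(0, 50, n)  # a feature unrelated to price
price = 20000 * qual + 90 * area + rng.normal(0, 15000, n)
df = pd.DataFrame({"OverallQual": qual, "GrLivArea": area,
                   "Noise": noise, "SalePrice": price})

# Rank features by absolute correlation with SalePrice
corr = df.corr()
top = corr["SalePrice"].abs().nlargest(3)  # nlargest(10) on the full data
print(top)
```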

  • 'OverallQual', 'GrLivArea' and 'TotalBsmtSF' are strongly correlated with 'SalePrice'.
  • 'GarageCars' and 'GarageArea' are also among the most strongly correlated variables. However, the number of cars that fit into the garage is a consequence of the garage area, so they carry largely redundant information.
  • 'TotalBsmtSF' and '1stFlrSF' also seem to be strongly correlated and can be merged into one feature.
  • 'TotRmsAbvGrd' and 'GrLivArea' are also correlated and can be merged into one feature.
  • 'YearBuilt' is slightly correlated with 'SalePrice'.

With the data exploration completed, the next step is to clean the data and impute missing values. The table below shows the missingness statistics for the features in the dataset. Most of the missing entries occur because the house lacks the feature in question (for example, a garage, basement, pool, fence, or alley); these can be replaced with binary values. Other variables, such as LotFrontage and the MasVnr features, are imputed using the neighborhood median.

Feature        Total   Percent
PoolQC          1453   0.995205
MiscFeature     1406   0.963014
Alley           1369   0.937671
Fence           1179   0.807534
FireplaceQu      690   0.472603
LotFrontage      259   0.177397
GarageCond        81   0.055479
GarageType        81   0.055479
GarageYrBlt       81   0.055479
GarageFinish      81   0.055479
GarageQual        81   0.055479
BsmtExposure      38   0.026027
BsmtFinType2      38   0.026027
BsmtFinType1      37   0.025342
BsmtCond          37   0.025342
BsmtQual          37   0.025342
MasVnrArea         8   0.005479
MasVnrType         8   0.005479
Electrical         1   0.000685
Utilities          0   0.000000
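
The imputation logic can be sketched with pandas on a hypothetical slice of the data; filling absent features with a "None" placeholder category is one common variant of the binary encoding described above:

```python
import pandas as pd

# Hypothetical slice of the training data with missing entries
df = pd.DataFrame({
    "PoolQC":       [None, "Gd", None],
    "GarageType":   ["Attchd", None, "Detchd"],
    "Neighborhood": ["NAmes", "NAmes", "CollgCr"],
    "LotFrontage":  [70.0, None, 60.0],
})

# Missing because the house lacks the feature: fill with a placeholder category
for col in ["PoolQC", "GarageType"]:
    df[col] = df[col].fillna("None")

# LotFrontage: impute with the median of the house's neighborhood
df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median())
)
print(df)
```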

With the data cleaned and missing values imputed, some feature engineering can be introduced to better set up the data for modeling. Features such as Street and Utilities can be dropped since they are not correlated with SalePrice. Other features, such as the floor-space, bathroom, and garage variables, can be combined into single features. Features such as KitchenAbvGr and HalfBath can be converted to categorical variables so they are handled appropriately by the model. Finally, the categorical variables are dummified (one-hot encoded) so that linear regression models can use them.
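
These steps can be sketched as follows; the column names come from the dataset, while the exact combination rules shown are illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    "TotalBsmtSF":  [856, 1262],
    "1stFlrSF":     [856, 1262],
    "2ndFlrSF":     [854, 0],
    "FullBath":     [2, 2],
    "HalfBath":     [1, 0],
    "KitchenAbvGr": [1, 2],
    "Street":       ["Pave", "Pave"],
})

# Combine floor-space columns into a single square-footage feature
df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]

# Combine bathroom counts, weighting half baths by 0.5
df["TotalBath"] = df["FullBath"] + 0.5 * df["HalfBath"]

# Treat small integer counts as categories rather than numbers
df["KitchenAbvGr"] = df["KitchenAbvGr"].astype(str)

# Drop uninformative columns, then one-hot encode ("dummify")
df = pd.get_dummies(df.drop(columns=["Street"]))
print(sorted(df.columns))
```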

The data is now ready for modeling. Both linear and non-linear models are used to predict SalePrice. The training data is divided into an 80-20 split. The linear models used are Ridge, Lasso and ElasticNet regression, and the non-linear models used are Support Vector Machine, Random Forest and Gradient Boosting. The table below shows the R-squared and Mean Squared Error values for the different models.

Model             R-squared   MSE
gradient boost    0.102793    0.002742
ridge             0.117614    0.011920
lasso             0.119479    0.013152
ElasticNet        0.119755    0.013267
forest            0.138547    0.001879
svm               0.764708    0.001828
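
The modeling loop can be sketched with scikit-learn; synthetic data replaces the engineered features here, and the default hyperparameters shown are assumptions, not the tuned values behind the results above:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic regression data standing in for the engineered features
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 1.5, 0.0, 0.5]) + rng.normal(0, 0.1, size=200)

# 80-20 train/validation split, as in the post
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "ridge": Ridge(),
    "lasso": Lasso(alpha=0.01),
    "ElasticNet": ElasticNet(alpha=0.01),
    "svm": SVR(),
    "forest": RandomForestRegressor(random_state=0),
    "gradient boost": GradientBoostingRegressor(random_state=0),
}
scores = {}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    scores[name] = (r2_score(y_te, pred), mean_squared_error(y_te, pred))
    print(f"{name:15s} R2={scores[name][0]:.3f}  MSE={scores[name][1]:.3f}")
```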

The following table shows a snippet of the predicted house SalePrice given by different models.

[Table: predicted SalePrice by house Id for the ridge, lasso, ElasticNet, random forest, SVM, and gradient boosting models]


Linear models such as Ridge, Lasso, and ElasticNet are simple to implement and easy to tune, and they performed similarly to one another. Among the non-linear models, SVM reported the lowest mean squared error and the highest R-squared score.

About Author

Muhammad Ihsanulhaq Sarfraz


Ihsan is an NYC Data Science Academy Fellow currently pursuing his PhD in Computer Engineering from Purdue University with a dissertation on analyzing patterns of learner behaviors in MOOCs. He has a passion for building dashboards and interfaces...
