House price prediction using machine learning

Posted on Mar 11, 2019


This project was carried out to predict housing prices in Ames, Iowa, using supervised machine learning techniques. The Ames housing dataset was obtained from Kaggle, a Google-owned online platform where data scientists and machine learning practitioners collaborate and compete. Kaggle hosts many datasets and competitions, one of which is the Ames Housing dataset, compiled by Dean De Cock.
The dataset has test and train files with 81 features. In the train file, there are 1460 observations while the test file has 1459 observations.

Data Cleaning

The first step towards modelling is data exploration and cleaning, which is done to understand each feature and the patterns within the dataset. The train and test files were combined for uniform data engineering and explored for missing values. Below is a map that gives a glance at the locations of missing values in the data.

The heatmap gave a clue into feature missingness, especially for columns with high numbers of missing values. Further analysis was done on the combined data to extract the exact numbers for each feature. 34 columns were observed to have missing values, with the PoolQC, LotFrontage, FireplaceQual, Fence, Alley, and MiscFeatures columns ranked highest. Below is a bar plot that shows how they stack up against one another.

Each feature with missing values was treated differently. Some of the factors considered before imputation were whether the feature is categorical or numerical, and whether the values are missing completely at random, missing at random, or missing not at random.

For numerical features with values missing at random, e.g. 'LotFrontage', the median of houses in the same neighborhood was used for imputation. Other numerical features with missing values were mostly filled with the mode, which in most cases is zero. Some categorical features appeared to have missing values, but the NAs actually mean that the house lacks that feature. For instance, an NA in 'PoolQC' means the house has no pool, so it was imputed with 'No_pool'. A similar approach was applied to most of the features.
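The per-feature counts described above can be obtained along the following lines. This is a minimal sketch with pandas, not the original analysis code; the small `combined` frame and its values are made up to stand in for the combined train and test data.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the combined train+test frame (values are made up).
combined = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, "Gd", np.nan],
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Count missing values per column, keep only the columns that have any,
# and rank them from most to least missing.
missing_counts = combined.isnull().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_counts)
```

The resulting series can be passed directly to a bar plot to compare the features against one another.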
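The three imputation rules above might look like this in pandas. This is a sketch under assumptions: the toy `houses` frame is invented, and 'MasVnrArea' is used only as an illustrative numerical column filled with its mode, since the post does not list which columns received that treatment.

```python
import numpy as np
import pandas as pd

# Toy data (made up) with the kinds of gaps described in the text.
houses = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "OldTown", "OldTown"],
    "LotFrontage": [60.0, 80.0, np.nan, 50.0, np.nan],
    "PoolQC": [np.nan, "Gd", np.nan, np.nan, np.nan],
    "MasVnrArea": [0.0, 0.0, np.nan, 120.0, 0.0],  # illustrative column
})

# Numerical, missing at random: fill with the median of the same neighborhood.
houses["LotFrontage"] = houses.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median())
)

# Other numerical features: fill with the mode (zero here).
houses["MasVnrArea"] = houses["MasVnrArea"].fillna(houses["MasVnrArea"].mode()[0])

# Categorical NA that means "feature absent": use an explicit label.
houses["PoolQC"] = houses["PoolQC"].fillna("No_pool")

print(houses)
```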

Removing outliers - Exploration of the dataset for possible outliers was the next step after imputation. This was done by visualizing each column on a scatter plot.

'AboveGroundLivingArea', 'BasementSquareFootage' and 'LotFrontage' are the features with clear outliers, and those variables were dropped from the dataset.

Feature Engineering

Prior to engineering the features, correlation analysis was conducted to gain more insight into their variance and correlation with one another and the target feature - SalePrice.

Correlation matrix

From the heatmap, it can be observed that 'GarageCars' is highly correlated with 'GarageArea', 'GarageYearBuilt' with 'YearBuilt', and 'TotalRoomsAboveGround' with 'GroundLivingArea'. These are correlations of 80% and above. The matrix also gave insight into the features that are best suited for engineering.

Feature addition - 9 new features were created from the existing ones, mostly as binary features. The first feature added, "GrYrAfter", which stands for 'Garage year difference', records whether a difference exists between the year the building was built and the year the garage was built (the 'YearBuilt' and 'GarageYrBuilt' features). A difference is worth noting because it might mean the building took a very long time to complete or that major renovations have taken place, in which case it is necessary to capture it. The next new feature added is "Remodelling". Two features were generated to capture the difference in remodeling: the first is a binary feature that records whether the building year is the same as the remodeling year, and the second is a numeric feature that records the difference if it exists (YearRemodAdd - YearBuilt). Houses built after 1980 and before 1960 were also captured as binary features. The rest are shown in the table below.
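Highly correlated pairs can also be listed programmatically instead of being read off the heatmap. The sketch below uses synthetic data in which 'GarageArea' is deliberately built from 'GarageCars'; the 0.8 threshold follows the text, and everything else is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: GarageArea is constructed to track GarageCars closely,
# while LotArea is independent of both.
rng = np.random.default_rng(0)
n = 200
garage_cars = rng.integers(0, 4, size=n)
df = pd.DataFrame({
    "GarageCars": garage_cars,
    "GarageArea": garage_cars * 250 + rng.normal(0, 40, size=n),
    "LotArea": rng.normal(10000, 2000, size=n),
})

corr = df.corr().abs()

# Keep only the upper triangle so each pair appears once.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# Pairs with an absolute correlation of 80% or more.
high_pairs = pairs[pairs >= 0.8]
print(high_pairs)
```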

Added features                          Transformed features
Garage year difference (binary)         Ground Living Area (log)
Remodeling (binary & numeric)           First Floor Square Footage (log)
House built after 1980 (binary)         Lot Frontage (boxcox)
House built before 1960 (binary)        Lot Area (log)
Lot Frontage (binary)
Unfinished Basement (binary)            Dropped features
Low Quality Square Footage (binary)     Garage Year Built
Pool Area (binary)
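A few of the added features could be derived as follows. This is a hedged sketch: only "GrYrAfter" is a name taken from the post, while `Remodeled`, `YearsToRemodel`, `BuiltAfter1980` and `BuiltBefore1960` are hypothetical column names, and coding "a difference exists" as 1 is an assumption about the encoding direction.

```python
import pandas as pd

# Toy rows (made up) using the column names quoted in the post.
houses = pd.DataFrame({
    "YearBuilt": [1955, 1985, 2003],
    "GarageYrBuilt": [1970, 1985, 2003],
    "YearRemodAdd": [1955, 1999, 2003],
})

# 1 if the garage year differs from the building year, else 0.
houses["GrYrAfter"] = (houses["GarageYrBuilt"] != houses["YearBuilt"]).astype(int)

# Remodeling captured twice: as a flag and as a number of years.
houses["Remodeled"] = (houses["YearRemodAdd"] != houses["YearBuilt"]).astype(int)
houses["YearsToRemodel"] = houses["YearRemodAdd"] - houses["YearBuilt"]

# Era flags.
houses["BuiltAfter1980"] = (houses["YearBuilt"] > 1980).astype(int)
houses["BuiltBefore1960"] = (houses["YearBuilt"] < 1960).astype(int)

print(houses)
```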

Upon careful analysis of the missingness in 'Garage Year Built', its high correlation with 'YearBuilt', and the new feature derived from it, 'Garage Year Built' was dropped, as its influence had already been captured with fewer complications.

Feature transformation - 4 features were transformed to correct skewness observed in their distribution plot.

LotArea was observed not to be normally distributed and required transformation. Taking the logarithm gave a better distribution, as shown in the image below:

Transformation of LotArea feature

Another transformed feature is 'GroundLivingArea', which was observed to be right-skewed. A log transformation of the feature also gave a better distribution, as seen in the image below:

Transformation of GroundLivingArea feature

The other transformed features are 'First Floor Square Footage', transformed with the logarithm, and 'Lot Frontage', transformed with Box-Cox.

The last transformation was applied to the dependent variable, 'SalePrice'. The target variable was right-skewed, so a log transformation was applied to bring its distribution closer to normal.
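The effect of these transformations on skewness can be checked numerically. The sketch below uses synthetic log-normal samples as stand-ins for 'SalePrice' and 'Lot Frontage' (the real columns are not reproduced here) and relies on `scipy.stats.boxcox`, which estimates the Box-Cox lambda when none is given.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic right-skewed samples standing in for the real columns.
rng = np.random.default_rng(42)
sale_price = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000))
lot_frontage = rng.lognormal(mean=4.2, sigma=0.3, size=1000)

# Log transform of the target.
log_price = np.log(sale_price)

# Box-Cox transform; returns the transformed values and the fitted lambda.
boxcox_frontage, fitted_lambda = stats.boxcox(lot_frontage)

print("SalePrice skew before/after:", sale_price.skew(), log_price.skew())
print("LotFrontage skew before/after:",
      stats.skew(lot_frontage), stats.skew(boxcox_frontage))
```

When the model is trained on the log of the target, predictions have to be mapped back with `np.exp` before they are reported as prices.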

Transformation of target variable - SalePrice

Feature Dummification - categorical features were dummified before modeling. Upon addition, removal and transformation of features, 87 features and 2919 observations were left in the dataframe:

After dummification and removal of one dummy variable from each of the dummified features, 201 columns were added to the dataframe making it a total of 288 features.
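Dropping one dummy per categorical feature avoids perfectly collinear columns. A minimal sketch with `pandas.get_dummies` and `drop_first=True`, on a made-up frame:

```python
import pandas as pd

# Made-up frame with one numerical and two categorical columns.
houses = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Grvl", "Pave"],
    "PoolQC": ["No_pool", "Gd", "No_pool"],
})

# One dummy column per category level, minus the first level of each feature.
dummies = pd.get_dummies(houses, columns=["Street", "PoolQC"], drop_first=True)
print(dummies.columns.tolist())
```

Each two-level feature here contributes a single column, which is how the column count grows by fewer than the total number of category levels.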


The combined dataframe was split back into train and test dataframes, with 'train' having 1460 observations and 'test' having 1459 observations. In total, 5 models were trained on the train dataset.

Linear based regressions - Ridge, Lasso, and ElasticNet were the first models built for the prediction. These penalized multiple linear regression techniques were employed because they can accommodate multicollinearity among features while keeping prediction error low.

Ridge regression - For this model, the number of folds needed for cross validation was explored, as displayed graphically below:

As observed from the plot, as the number of folds increases, the standard deviation increases while the mean error remains constant. The number of folds used was 10.
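The fold exploration can be reproduced in outline by scoring the same model under several fold counts. This is a sketch on synthetic data from `make_regression`; the fold counts tried and the shuffling seed are assumptions, not the values used in the study.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem standing in for the housing data.
X, y = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=0)

fold_summary = {}
for n_folds in (5, 10, 20):
    cv = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    # scikit-learn reports negated RMSE, so flip the sign back.
    rmse = -cross_val_score(
        Ridge(alpha=10), X, y, cv=cv, scoring="neg_root_mean_squared_error"
    )
    fold_summary[n_folds] = (rmse.mean(), rmse.std())

for n_folds, (mean_rmse, std_rmse) in fold_summary.items():
    print(n_folds, round(mean_rmse, 3), round(std_rmse, 3))
```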

The Coefficient Plot of Ridge Against Regularization/Penalization Strength 𝛼

The image above is the coefficient plot of ridge against regularization/penalization strength 𝛼. As alpha increases, the coefficients drop as they are shrunk towards 0. Not only do the coefficients drop; the R^2 (R-squared), which measures the proportion of variance in the target that the model explains, also drops. This reflects the bias-variance trade-off: the penalty introduces some bias in exchange for lower variance, which can reduce the overall prediction error. Alpha = 10, tol = 1e-05, solver='svd' gave the best cross-validation score of 0.1132.

Lasso regression - Lasso is also a regularized model, just like ridge, employed to mitigate the effect of multicollinearity. Upon optimization of hyperparameters, alpha = 10, max_iter = 25 gave the best cross-validation score of 0.1147.

Elastic-Net model - This is a regularized model that combines the ridge (L2) and lasso (L1) penalties.

Again, as with the other penalized models, the coefficients shrink towards zero as alpha increases. With lasso they drop to exactly zero, as observed from the graphical representations of the features. Upon tuning, the best parameters are l1_ratio = 0.001 and alpha = 0.1, which gave a cross validation score of 0.11204.
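The shrinkage behaviour described above can be observed directly by refitting ridge over a range of alphas. The sketch below uses synthetic data and an arbitrary alpha grid; only the `tol` and `solver` settings are taken from the text.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic regression problem standing in for the housing data.
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

alphas = [0.1, 10, 1000]
coef_norms = []
train_r2 = []
for alpha in alphas:
    model = Ridge(alpha=alpha, tol=1e-5, solver="svd").fit(X, y)
    coef_norms.append(np.linalg.norm(model.coef_))  # overall coefficient size
    train_r2.append(model.score(X, y))              # R^2 on the training data

for alpha, norm, r2 in zip(alphas, coef_norms, train_r2):
    print(alpha, round(norm, 2), round(r2, 4))
```

Both the coefficient norm and the training R^2 fall as alpha grows, which is why alpha is chosen by cross-validation rather than by training fit.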
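The difference between shrinking towards zero and reaching exactly zero can be checked by counting zero coefficients. This is a sketch on synthetic data with only 5 informative features out of 20; the lasso alpha is chosen for the synthetic scale and is an assumption, while the elastic-net settings mirror the tuned values quoted above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# 20 features, of which only 5 actually drive the target.
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=0
)

lasso = Lasso(alpha=5.0).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.001).fit(X, y)

# Lasso sets uninformative coefficients exactly to zero; an elastic net
# with a tiny l1_ratio behaves much more like ridge.
n_zero_lasso = int((lasso.coef_ == 0).sum())
n_zero_enet = int((enet.coef_ == 0).sum())
print("zero coefficients:", n_zero_lasso, "(lasso)", n_zero_enet, "(elastic net)")
```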

NON-LINEAR MODELS - Three non-linear models were trained: a Support Vector Regressor (a kernel-based model rather than a tree-based one) and two tree-based ensembles, Gradient Boosting Regressor and XGBoost.

Support Vector Regressor - For this kernel-based model, 3 major parameters were tuned: gamma, epsilon and C. C, the most influential, controls how strongly violations of the margin are penalized, while epsilon sets the width of the tube around the regression function within which errors are ignored.

The grid search's best parameters for the regressor are gamma = .000001, C = 100, epsilon=0, which corroborates what is observed from the graph. A low gamma and a high C value are required for a low root mean squared error.
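A grid search over these three parameters might be set up as below. This is a sketch on synthetic data: the grid values other than the reported optimum are assumptions, and the best combination found here will not necessarily match the one reported for the housing data.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic regression problem; the target is rescaled to unit variance.
X, y = make_regression(n_samples=150, n_features=5, noise=0.1, random_state=0)
y = y / y.std()

param_grid = {
    "gamma": [1e-6, 1e-3, 1e-1],
    "C": [1, 10, 100],
    "epsilon": [0, 0.1],
}

search = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",  # higher (closer to 0) is better
)
search.fit(X, y)

print(search.best_params_, -search.best_score_)
```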

Other models are:

Gradient Boosting Regressor - this model gave cross validation score of 0.12020; and

XGBoost - the last model trained gave 0.11872 cross validation score.

Below is a table that shows all the models used and how they performed, not just locally but on Kaggle. The ridge Kaggle score of 0.12286 sits in the top 20% of the overall Kaggle rankings.

Feature Importance

Using the tree-based models, feature importance was plotted, first from Gradient Boosting Regressor:

The 'OverallQual' feature, which rates the overall material and finish of the house, has the most impact on the prediction according to the gradient boosting regressor model. This is not surprising, as the sale price of a house is expected to increase with its quality. The next feature of high importance is 'GrLivArea', which describes the house's living area in square feet. This implies that the size of the house affects its price and that the model captured this. The first thing that came to my mind while analyzing the feature importance was the correlation matrix. Do the highly correlated (>80%) features have the same impact on the prediction? Earlier, it was observed that 'GrLivArea' is 83% correlated with 'TotRmsAbvGrd', meaning the living area in square feet and the total rooms above ground are correlated. However, while 'GrLivArea' is among the most impactful features, 'TotRmsAbvGrd' is not in the top 15 important features. This suggests that the Gradient Boosting Regressor is robust to multicollinearity.

Another model used to analyze feature importance is Random Forest:

The feature importance obtained from the random forest model is very similar to that obtained from gradient boosting. The 3 most impactful features are the same for both, although the degree of impact differs between the models. Again, as with gradient boosting, random forest is robust to multicollinearity because it considers a random subset of features when growing each tree in the forest.
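Feature importances from both model types can be extracted and ranked in the same way. In this sketch the data are synthetic, built so that a quality score dominates the price, and 'Noise' is a hypothetical uninformative column; the rankings illustrate the mechanics and do not reproduce the study's results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Synthetic data: price depends strongly on quality, moderately on area.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "OverallQual": rng.integers(1, 11, size=n),
    "GrLivArea": rng.normal(1500, 400, size=n),
    "Noise": rng.normal(0, 1, size=n),  # hypothetical uninformative feature
})
y = 20000 * X["OverallQual"] + 50 * X["GrLivArea"] + rng.normal(0, 5000, size=n)

gbr = GradientBoostingRegressor(random_state=0).fit(X, y)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Rank features by impurity-based importance for each model.
gbr_importance = pd.Series(gbr.feature_importances_, index=X.columns).sort_values(ascending=False)
rf_importance = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)

print(gbr_importance)
print(rf_importance)
```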


A possible next step is stacking the models to see whether the overall prediction can be improved.
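One way stacking could be tried is with scikit-learn's `StackingRegressor`. This is only a sketch on synthetic data, since the post does not describe a stacking setup; the choice of base models, the ridge meta-model, and all hyperparameters are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression problem standing in for the housing data.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Base models make out-of-fold predictions; a ridge meta-model combines them.
stack = StackingRegressor(
    estimators=[
        ("ridge", Ridge(alpha=10)),
        ("gbr", GradientBoostingRegressor(random_state=0)),
    ],
    final_estimator=Ridge(alpha=1.0),
    cv=5,
)

rmse = -cross_val_score(stack, X, y, cv=5, scoring="neg_root_mean_squared_error")
print(rmse.mean())
```

Comparing this cross-validated error with that of each base model alone would show whether stacking helps.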

This study was conducted by three data science fellows: Oluwole Alowolodu, David Levy, and Benjamin Rosen.

About Author

Oluwole Alowolodu

Recent graduate of Biotechnology - MS. Data science fellow and AI enthusiast.