House price prediction using machine learning
Introduction
This project was carried out to predict housing prices in Ames, Iowa, using supervised machine learning techniques. The Ames Housing dataset was obtained from Kaggle, a Google-owned online platform where data scientists and machine learning practitioners collaborate and compete. Among the datasets and competitions Kaggle hosts is the Ames Housing dataset, compiled by Dean De Cock.
The dataset consists of train and test files with 81 features. The train file contains 1460 observations, while the test file contains 1459.
Data Cleaning
The first step towards modelling is data exploration and cleaning, done to understand each feature and the patterns within the dataset. The train and test files were combined for uniform data engineering and explored for missing values. Below is a heatmap giving a first glance at the locations of missing values in the data.
The heatmap gave a clue into feature missingness, especially the columns with large numbers of missing values. Further analysis was done on the combined data to extract the exact count for each feature. 34 columns were observed to have missing values, with PoolQC, LotFrontage, FireplaceQual, Fence, Alley, and MiscFeatures ranked highest. Below is a bar plot that shows how they stack up against one another.
Each feature with missing values was treated differently. The factors considered before imputation included whether the feature is categorical or numerical, and whether the value is missing completely at random, missing at random, or missing not at random.
For numerical features with values missing at random, e.g. 'LotFrontage', the median value of houses in the same neighborhood was used for imputation. Other numerical features with missing values were mostly filled with the mode, which in most cases is zero. For some categorical features the NAs are not truly missing: they indicate that the house lacks the feature in question. For instance, an NA in 'PoolQC' means the house has no pool, so it was imputed with 'No_pool'. A similar approach was applied to most of the categorical features.
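As a minimal sketch of this imputation logic, assuming the combined dataframe is called `df` and uses the original Kaggle column names (the exact set of columns treated each way is an assumption):

```python
import pandas as pd

# 'LotFrontage': impute with the median of houses in the same neighborhood
df['LotFrontage'] = (
    df.groupby('Neighborhood')['LotFrontage']
      .transform(lambda s: s.fillna(s.median()))
)

# Numeric features where a missing value effectively means "none": fill with 0
for col in ['MasVnrArea', 'GarageArea', 'TotalBsmtSF']:
    df[col] = df[col].fillna(0)

# Categorical features where NA means the house lacks that feature
df['PoolQC'] = df['PoolQC'].fillna('No_pool')
for col in ['Alley', 'Fence', 'FireplaceQu', 'MiscFeature']:
    df[col] = df[col].fillna('None')
```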
Removing outliers - Exploration of the dataset for possible outliers was the next step after imputation. This was done by visualizing each column on a scatter plot.
'AboveGroundLivingArea', 'BasementSquareFootage' and 'LotFrontage' are the features with clear outliers, and the outlying observations were removed from the dataset.
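A sketch of how such outlying rows could be filtered out; the cutoff values below are illustrative placeholders chosen by eye, not the exact thresholds used in this project, and the original Kaggle column names are assumed:

```python
# Remove rows whose values lie far beyond the bulk of the scatter plots
# (thresholds are illustrative, not the project's exact values)
train = train[train['GrLivArea'] < 4500]
train = train[train['TotalBsmtSF'] < 5000]
train = train[train['LotFrontage'] < 300]
```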
Feature Engineering
Prior to engineering the features, correlation analysis was conducted to gain more insight into their variance and their correlation with one another and with the target feature - SalePrice.
From the heatmap, it can be observed that 'GarageCars' is highly correlated with 'GarageArea', 'GarageYearBuilt' with 'YearBuilt', and 'TotalRoomsAboveGround' with 'GroundLivingArea'; these are correlations of 80% and above. The matrix also gave insight into which features are best suited for engineering.
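The correlation analysis can be reproduced with a sketch along these lines, assuming the training dataframe `train` still contains SalePrice:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix of the numeric features, including SalePrice
corr = train.select_dtypes(include='number').corr()

plt.figure(figsize=(14, 12))
sns.heatmap(corr, cmap='coolwarm', center=0, square=True)
plt.title('Correlation matrix of numeric features')
plt.show()

# List the feature pairs with absolute correlation of 0.8 or above
pairs = corr.abs().stack()
print(pairs[(pairs >= 0.8) & (pairs < 1.0)].sort_values(ascending=False))
```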
Feature addition - 9 new features were created from the existing ones, mostly as binary features. The first feature added, "GrYrAfter" (garage year difference), records whether there is a difference between the year the building was built and the year the garage was built - the 'YearBuilt' and 'GarageYrBuilt' features. Such a difference is worth capturing, as it may mean the building took a very long time to complete or that major renovations took place. The next addition is "Remodelling": two features were generated to capture remodeling, the first a binary flag recording whether the building year equals the remodeling year, and the second a numeric feature recording the difference if one exists (YearRemodAdd - YearBuilt). Houses built after 1980 and before 1960 were also captured as binary features; the rest are shown in the table below.
| ADDED | TRANSFORMED | DROPPED |
| --- | --- | --- |
| Garage year difference (binary) | Ground Living Area (log) | Garage Year Built |
| Remodeling (binary & numeric) | First Floor Square Footage (log) | |
| House built after 1980 (binary) | Lot Frontage (Box-Cox) | |
| House built before 1960 (binary) | Lot Area (log) | |
| Lot Frontage (binary) | | |
| Unfinished Basement (binary) | | |
| Low Quality Square Footage (binary) | | |
| Pool Area (binary) | | |
Given its missingness, its high correlation with 'YearBuilt', and the new feature carved out from it, 'Garage Year Built' was dropped, as its influence had been captured with fewer complications.
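A sketch of these additions and the drop of 'Garage Year Built', assuming the original Kaggle column names; the new feature names below are illustrative, not necessarily those used in the project:

```python
# Binary flag: garage built in a different year than the house
df['GrYrAfter'] = (df['GarageYrBlt'] != df['YearBuilt']).astype(int)

# Remodeling: binary flag plus the numeric gap in years
df['Remodeled'] = (df['YearRemodAdd'] != df['YearBuilt']).astype(int)
df['RemodelGap'] = df['YearRemodAdd'] - df['YearBuilt']

# Era flags
df['BuiltAfter1980'] = (df['YearBuilt'] > 1980).astype(int)
df['BuiltBefore1960'] = (df['YearBuilt'] < 1960).astype(int)

# Presence flags for sparse numeric features
df['HasLotFrontage'] = (df['LotFrontage'] > 0).astype(int)
df['HasUnfinishedBsmt'] = (df['BsmtUnfSF'] > 0).astype(int)
df['HasLowQualSF'] = (df['LowQualFinSF'] > 0).astype(int)
df['HasPool'] = (df['PoolArea'] > 0).astype(int)

# Drop 'GarageYrBlt' now that its signal is captured by 'GrYrAfter'
df = df.drop(columns=['GarageYrBlt'])
```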
Feature transformation - 4 features were transformed to correct the skewness observed in their distribution plots.
'LotArea' was observed to be far from normally distributed and required transformation; taking the logarithm gave a better distribution, as shown in the image below:
Another transformed feature is 'GroundLivingArea', which was observed to be right-skewed; log transformation also gave a better distribution, as seen in the image below:
The other features are 'First Floor Square Footage', transformed with a logarithm, and 'Lot Frontage', transformed with Box-Cox.
Last on feature transformation is the dependent variable, 'SalePrice'. The right skewness observed in the target variable was corrected with a log transformation to obtain an approximately normal distribution.
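A sketch of the four feature transformations and the target transformation, assuming the original Kaggle column names ('GrLivArea', '1stFlrSF', 'LotArea', 'LotFrontage'):

```python
import numpy as np
from scipy import stats

# Log-transform the right-skewed features (log1p handles zero values safely)
for col in ['GrLivArea', '1stFlrSF', 'LotArea']:
    df[col] = np.log1p(df[col])

# Box-Cox transform 'LotFrontage' (requires strictly positive values)
df['LotFrontage'], _ = stats.boxcox(df['LotFrontage'] + 1)

# Log-transform the target on the training portion only
train['SalePrice'] = np.log1p(train['SalePrice'])
```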
Feature Dummification - categorical features were dummified before modeling. After the addition, removal and transformation of features, 87 features and 2919 observations were left in the dataframe:
After dummification and the removal of one dummy variable from each dummified feature, 201 columns were added to the dataframe, bringing the total to 288 features.
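A sketch of the dummification step with pandas, dropping one dummy level per categorical feature:

```python
import pandas as pd

# One-hot encode the categorical columns, dropping the first level of each
# to avoid the dummy-variable trap
cat_cols = df.select_dtypes(include='object').columns
df = pd.get_dummies(df, columns=cat_cols, drop_first=True)

print(df.shape)  # expected: (2919, 288) per the counts reported above
```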
Modelling
The combined dataframe was split back into train and test dataframes, with 'train' having 1460 observations and 'test' having 1459. In total, six models were trained on the train dataset.
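Splitting the combined dataframe back by row position might look like this, assuming the first 1460 rows are the training observations and the log-transformed SalePrice is still attached to them:

```python
# Recover the train and test sets from the combined dataframe
train_df = df.iloc[:1460].copy()
test_df = df.iloc[1460:].copy()

X_train = train_df.drop(columns=['SalePrice'])
y_train = train_df['SalePrice']            # log-transformed target
X_test = test_df.drop(columns=['SalePrice'])
```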
Linear-based regressions - namely Ridge, Lasso, and Elastic-Net - were the first models built for the prediction. These penalized multiple linear regression techniques were employed because they can accommodate multicollinearity among features while minimizing prediction error.
Ridge regression - For this model, the number of folds needed for cross-validation was explored first, displayed graphically below:
As observed from the plot, as the number of folds increases, the standard deviation increases while the mean error remains roughly constant; 10 folds were used.
The image above is the coefficient plot of ridge against the regularization/penalization strength α. As alpha increases, the coefficients drop as they are shrunk towards 0. Not only do the coefficients drop, the R² (R-squared), the statistical measure of how closely the data fit the regression line, also drops. This means the model reduces variance and therefore introduces bias, but it generates a reduced prediction error - the 'bias-variance trade-off'. Alpha = 10, tol = 1e-05, solver = 'svd' gave the best cross-validation score of 0.1132.
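A sketch of the ridge model with the parameters reported above, assuming the cross-validation score is the RMSE of the log-transformed SalePrice over 10 folds:

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

ridge = Ridge(alpha=10, tol=1e-05, solver='svd')
scores = cross_val_score(ridge, X_train, y_train,
                         scoring='neg_root_mean_squared_error', cv=10)
print(f'Ridge CV RMSE: {-scores.mean():.4f}')  # ~0.1132 reported above
```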
Lasso regression - Lasso is also a regularized model, employed like ridge to mitigate the effect of multicollinearity. Upon optimization of the hyperparameters, alpha = 10 and max_iter = 25 gave the best cross-validation score of 0.1147.
Elastic-Net model - This is a regularization model that combines the ridge and lasso penalties.
Again, as with the other penalized models, as alpha increases the coefficients are shrunk towards zero; under lasso they drop completely to zero, as observed from the graphical representations of the features. Upon tuning, the best parameters were l1_ratio = 0.001 and alpha = 0.1, which gave a cross-validation score of 0.11204.
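The lasso and elastic-net models with the tuned values reported above can be evaluated the same way; the use of 10-fold RMSE here is an assumption carried over from the ridge setup:

```python
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.model_selection import cross_val_score

models = {
    'Lasso': Lasso(alpha=10, max_iter=25),
    'ElasticNet': ElasticNet(alpha=0.1, l1_ratio=0.001),
}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train,
                             scoring='neg_root_mean_squared_error', cv=10)
    print(f'{name} CV RMSE: {-scores.mean():.4f}')
```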
NON-LINEAR MODELS - Three more models were trained, namely the Support Vector Regressor, Gradient Boosting Regressor and XGBoost; the latter two are tree-based ensembles.
Support Vector Regressor - For this model, 3 major parameters dictate the tuning: gamma, epsilon and C. C, the most influential, is a tuning parameter that, together with epsilon, determines the threshold of tolerable violations of the margin around the regression hyperplane.
The grid search's best parameters for the regressor are gamma = 1e-06, C = 100 and epsilon = 0, corroborating what is observed from the graph: a low gamma and a high C value are required for a low root mean squared error.
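A sketch of the grid search over those three parameters; the grid values other than the reported best combination are illustrative:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

param_grid = {
    'C': [1, 10, 100],
    'gamma': [1e-6, 1e-4, 1e-2],
    'epsilon': [0, 0.01, 0.1],
}
grid = GridSearchCV(SVR(kernel='rbf'), param_grid,
                    scoring='neg_root_mean_squared_error', cv=10)
grid.fit(X_train, y_train)
print(grid.best_params_)  # reported best: C=100, gamma=1e-06, epsilon=0
```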
Other models are:
Gradient Boosting Regressor - this model gave a cross-validation score of 0.12020; and
XGBoost - the last model trained gave a cross-validation score of 0.11872.
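Both boosting models can be scored the same way; the default hyperparameters below are placeholders, not the tuned values used in the project:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

for name, model in [('GradientBoosting', GradientBoostingRegressor(random_state=0)),
                    ('XGBoost', XGBRegressor(random_state=0))]:
    scores = cross_val_score(model, X_train, y_train,
                             scoring='neg_root_mean_squared_error', cv=10)
    print(f'{name} CV RMSE: {-scores.mean():.4f}')
```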
Below is a table that shows all the models used and how they perform, not just locally but on Kaggle. The ridge Kaggle score of 0.12286 sits in the top 20% of the overall Kaggle rankings.
Feature Importance
Using the tree-based models, feature importance was plotted, first from Gradient Boosting Regressor:
The 'OverallQual' feature, which rates the overall material and finish of the house, has the most impact on the prediction according to the gradient boosting regressor model. This is not surprising, as the sale price of a house is expected to increase with its quality. The next feature of high importance is 'GrLivArea', which describes the house's living area in square feet; this implies the size of the house impacts its price, and the model captured that. The first thing that came to my mind while analyzing the feature importance was the correlation matrix: do the highly correlated (>80%) features have the same impact on the prediction? Earlier, it was observed that 'GrLivArea' is 83% correlated with 'TotRmsAbvGrd', meaning the living area in square feet and the total rooms above ground are correlated. However, while 'GrLivArea' ranks among the most impactful features, 'TotRmsAbvGrd' is not in the top 15; this suggests the Gradient Boosting Regressor is quite robust to multicollinearity.
Another model used to analyze feature importance is Random Forest:
The feature importance obtained from the random forest model is very similar to that obtained from gradient boosting. The 3 most impactful features are the same for both, although the degree of impact on each model differs. Again, as with gradient boosting, random forest is robust to multicollinearity because it chooses a random subset of features for each tree in the forest.
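A sketch of how the feature importance plots can be produced; refitting both models with default hyperparameters here is an assumption, not the project's tuned setup:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

for name, model in [('Gradient Boosting', GradientBoostingRegressor(random_state=0)),
                    ('Random Forest', RandomForestRegressor(random_state=0))]:
    model.fit(X_train, y_train)
    importances = pd.Series(model.feature_importances_, index=X_train.columns)
    importances.nlargest(15).sort_values().plot(kind='barh', title=name)
    plt.xlabel('Feature importance')
    plt.tight_layout()
    plt.show()
```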
FUTURE WORK
Stacking the models to see if there can be improvement in the overall prediction.
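A possible starting point for that future work, using scikit-learn's StackingRegressor to combine the linear and tree-based models (this was not part of the current project):

```python
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import ElasticNet, Ridge

# Stack the penalized linear models with a boosting model,
# blending their predictions with a simple ridge meta-learner
stack = StackingRegressor(
    estimators=[
        ('ridge', Ridge(alpha=10)),
        ('enet', ElasticNet(alpha=0.1, l1_ratio=0.001)),
        ('gbr', GradientBoostingRegressor(random_state=0)),
    ],
    final_estimator=Ridge(),
    cv=10,
)
stack.fit(X_train, y_train)
```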
This study was conducted by three data science fellows: Oluwole Alowolodu, David Levy, and Benjamin Rosen.