Predicting House Prices using Machine Learning: What Features Matter Most?
Estimating the price of a home can be a complex task due to the multitude of features that can impact its value. Fortunately, machine learning and regression techniques offer a solution to this problem. By using these techniques, we can predict the sale price of a home based on various features and even identify which features have a significant impact on price. This information could prove useful to a variety of people, from real estate agents to first-time homebuyers.
In this project, we will use these techniques to explore a dataset from a previous Kaggle competition that looked at house sale prices in Ames, Iowa. The objective was to predict the sale prices of other homes. The original dataset can be viewed here: House Prices - Advanced Regression Techniques | Kaggle.
The dataset includes 1460 records of houses in Ames, Iowa with 81 different features for training and 1459 records of homes for testing the regression and machine learning techniques. For this project, I decided to focus on feature engineering, multiple linear regression models, and a random forest ensemble model to predict house prices.
Data Processing
Log-Transformation of Sale Price
The first step of this project involved processing the data to make it suitable for multiple linear regression (MLR) techniques. Looking at the Sale Price variable, it was clear that it was not normally distributed and exhibited a right skew. To address this, a log transformation was applied to compress the prices and create a more normal distribution. This transformation makes the data more manageable and usable for MLR techniques. The image below displays the original and log-transformed Sale Price distributions along with their skew calculations.
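A minimal sketch of this step, assuming the Kaggle training file has been loaded into a pandas DataFrame (the file path, and the use of log1p rather than a plain log, are illustrative):

```python
import numpy as np
import pandas as pd

# Load the Kaggle training data (path is illustrative).
train = pd.read_csv("train.csv")

print("Skew before:", train["SalePrice"].skew())
train["SalePrice"] = np.log1p(train["SalePrice"])  # log(1 + x) compresses the right tail
print("Skew after:", train["SalePrice"].skew())
```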
Limited Data Removal
The second processing step focused on removing columns with missing or limited data. Columns with a significant amount of missing data, or features that only a small percentage of houses had, were removed from the dataset. These columns, such as the Alley and Pool features (only a handful of houses had them), provided too little information for the models to learn from.
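A rough sketch of this step, continuing with the `train` DataFrame from above; the 80% missing-data cutoff is illustrative, since the exact threshold is not stated here:

```python
# Drop columns that are mostly missing (e.g. Alley and the pool-related columns).
missing_frac = train.isnull().mean()
sparse_cols = missing_frac[missing_frac > 0.8].index
train = train.drop(columns=sparse_cols)
print("Dropped:", list(sparse_cols))
```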
Outlier Removal
The third processing step identified and removed outliers in the dataset. These outliers were identified using scatterplots of the variables against Sale Price to check for the expected linear relationships. Two outliers were found and removed to ensure model accuracy: two homes with unusually large living areas that sold for lower-than-expected prices. Refer to the image below for the scatterplot with the outliers circled in red.
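The sketch below shows how such points can be flagged and removed, assuming the plotted variable is the above-ground living area column `GrLivArea`; the cutoff values are illustrative rather than the exact rule used here:

```python
import matplotlib.pyplot as plt
import numpy as np

# Scatterplot of living area against (log) sale price to spot the outliers.
plt.scatter(train["GrLivArea"], train["SalePrice"], s=10)
plt.xlabel("GrLivArea (sq ft)")
plt.ylabel("log(SalePrice)")
plt.show()

# Remove the two large homes that sold well below the trend (cutoffs illustrative).
mask = (train["GrLivArea"] > 4000) & (train["SalePrice"] < np.log1p(300_000))
train = train[~mask]
```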
Linear Assumptions Check
The data was then checked to ensure it met all the assumptions required for multiple linear regression.
Linearity
Linearity is the first assumption of MLR. It was assessed by examining scatterplots of each variable against Sale Price to determine whether a linear relationship existed. Below is an example of some of the linear relationships found in the data.
Constant Variance of Errors
The residuals vs fitted plot was used to confirm constant error variance, which held true after outlier removal. Below is the plot used to confirm the variance was constant.
Normality of Errors
Errors were plotted as a histogram, and normality was observed after outlier removal and the log-transformation. Below is the histogram of the residuals.
Independent Errors
The residuals vs fitted plot showed no discernible relationship or pattern, indicating independence of errors.
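As a rough sketch of how these residual checks can be produced, assuming `X_clean` holds the prepared predictors and `y` the log-transformed sale price (both names are illustrative):

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Fit a plain OLS model and inspect its residuals.
ols = sm.OLS(y, sm.add_constant(X_clean)).fit()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(ols.fittedvalues, ols.resid, s=10)
ax1.axhline(0, color="red", linewidth=1)
ax1.set(xlabel="Fitted values", ylabel="Residuals", title="Residuals vs fitted")
ax2.hist(ols.resid, bins=40)
ax2.set(xlabel="Residual", title="Histogram of residuals")
plt.tight_layout()
plt.show()
```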
Multicollinearity
A correlation matrix between variables was used to identify and remove highly correlated variables (using a threshold of roughly 0.7-0.8 or higher). For example, GarageArea and GarageCars were highly correlated with each other, so GarageCars was removed because it provided less descriptive information than the area feature. Below is a section of the correlation matrix used.
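A sketch of this check, with the 0.8 threshold standing in for the range mentioned above:

```python
import numpy as np

# Pairwise absolute correlations between numeric columns, upper triangle only.
corr = train.select_dtypes(include="number").corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = upper.stack().sort_values(ascending=False)
print(high_pairs[high_pairs > 0.8])  # e.g. GarageCars vs GarageArea

# Keep GarageArea, the more descriptive of the pair.
train = train.drop(columns=["GarageCars"])
```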
Models and Scoring
Models
This project utilized three different MLR models and one ensemble model, a Random Forest. Different MLR models were used to assess the impact of different penalization schemes and parameters on scoring. The Random Forest was used to see how an ensemble model would perform against the MLR models.
Lasso
Lasso was the first regression technique used. Lasso Regression is a linear regression technique that introduces a penalty term based on the absolute values of the coefficients. This penalty can reduce feature coefficients to exactly 0 if they are not important for the model's predictions. A lower penalization term means fewer coefficients are reduced to 0, and more retained coefficients mean a more complex model with a larger regression equation for prediction. Cross-validation found the optimal alpha value to be 0.5, resulting in 96 of the original 230 features being retained.
The model achieved an R² (coefficient of determination) of 0.9069, meaning about 90.69% of the variance in the sale price is explained by the independent variables in the model (the 96 retained features).
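A hedged sketch of the Lasso fit with scikit-learn, assuming `X` is the one-hot-encoded feature matrix (around 230 columns) and `y` is the log-transformed sale price; the candidate alpha grid is illustrative:

```python
from sklearn.linear_model import LassoCV

# Cross-validated Lasso; the grid of candidate alphas is illustrative.
lasso = LassoCV(alphas=[0.1, 0.5, 1.0, 5.0], cv=5).fit(X, y)
print("alpha:", lasso.alpha_)                      # 0.5 in this project
print("features kept:", (lasso.coef_ != 0).sum())  # 96 of ~230 here
print("R^2:", lasso.score(X, y))
```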
Ridge Regression
Ridge Regression was the second regression technique used. Like Lasso, Ridge Regression is a linear regression technique that introduces a penalty term, but its penalization (based on the squared values of the coefficients) shrinks coefficients toward 0 without setting them exactly to 0; higher penalization means stronger shrinkage. Cross-validation found the optimal alpha value to be 14. All of the model's features were retained, though many were shrunk heavily due to the relatively high alpha, which keeps the model as complex as possible.
The model achieved an R² (coefficient of determination) of 0.9077, meaning about 90.77% of the variance in the sale price is explained by the independent variables in the model (all 230 features, with many coefficients shrunk).
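The Ridge fit follows the same pattern (the alpha grid is again illustrative):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

ridge = RidgeCV(alphas=np.arange(1, 31)).fit(X, y)
print("alpha:", ridge.alpha_)   # 14 in this project
print("R^2:", ridge.score(X, y))
```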
Elastic Net
Elastic Net was the last regression technique used. Elastic Net combines Lasso and Ridge regression, using both of their penalization terms. Cross-validation found an optimal alpha parameter of 0.001 and an L1 ratio of 0.5. The L1 ratio controls whether the regularization behaves more like Ridge or Lasso: an L1 ratio of 1 is pure Lasso, and an L1 ratio of 0 is pure Ridge.
The model achieved an R² (coefficient of determination) of 0.9071, meaning about 90.71% of the variance in the sale price is explained by the independent variables in the model.
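And a corresponding Elastic Net sketch, where the candidate alpha and L1-ratio grids are illustrative:

```python
from sklearn.linear_model import ElasticNetCV

enet = ElasticNetCV(alphas=[0.0005, 0.001, 0.005], l1_ratio=[0.1, 0.5, 0.9], cv=5).fit(X, y)
print("alpha:", enet.alpha_, "l1_ratio:", enet.l1_ratio_)  # 0.001 and 0.5 here
print("R^2:", enet.score(X, y))
```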
Random Forest
Random Forest was the last model used to see how ensemble modeling would perform. The dataset was label encoded instead of one-hot encoded for use in the Random Forest.
The optimal parameters for the Random Forest were found using cross-validation with grid search (a sketch of the search follows the parameter list below). The optimal parameters were found to be:
- max_depth = 20 (maximum depth of each tree, i.e. the longest allowed path from root to leaf)
- min_samples_leaf = 2 (minimum number of samples required in each leaf node)
- min_samples_split = 4 (minimum number of samples required to split a node)
- n_estimators = 200 (number of individual trees in the final ensemble)
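A sketch of that grid search, assuming `X_label` is the label-encoded feature matrix; the candidate grid is a plausible neighborhood of the reported optimum rather than the exact grid used:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_depth": [10, 20, 30],
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 4, 8],
    "n_estimators": [100, 200, 400],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_label, y)
print(search.best_params_)
```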
Feature importance analysis highlighted overall quality (OverallQual) as the most important feature, likely because it interacts with many other features.
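The importances can be pulled directly from the fitted forest, continuing from the grid-search sketch above:

```python
import pandas as pd

best_rf = search.best_estimator_
importances = pd.Series(best_rf.feature_importances_, index=X_label.columns)
print(importances.sort_values(ascending=False).head(10))  # OverallQual ranks first here
```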
Below is a visualization of the feature importance from the model that includes the top features.
Scoring
Root Mean Squared Error
Root Mean Squared Error (RMSE) was used for scoring all the models. RMSE is a common error measure: the square root of the average squared difference between the observed and predicted values. Below is a bar graph of the RMSE of each model. All of the models scored very closely, with Ridge having the best score by a slight margin.
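Computed with scikit-learn, where `y_valid` and `preds` stand in for a held-out split and a model's predictions on it:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# RMSE = sqrt(mean((observed - predicted)^2))
rmse = np.sqrt(mean_squared_error(y_valid, preds))
print("RMSE:", rmse)
```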
Insights and Future Work
Applications
These models hold potential applications in real estate, including estimating home prices and helping homebuyers set realistic offer or bidding prices.
Personal Insights
This project emphasized how much more feature engineering improved model scoring than parameter tuning did. The biggest changes in model scoring came from effectively engineering features rather than from optimally tuning a parameter.
Future Work
Future work could involve exploring additional feature engineering techniques, importing external features, and testing more complex models to further enhance performance and scoring.