Data-driven Predictions of House Prices in Ames, Iowa
The skills the author demonstrated here can be learned by taking the Data Science with Machine Learning bootcamp with NYC Data Science Academy.
Background
Jones Inc. is a hypothetical real estate agency/development company that wants a model to predict housing prices in Ames, Iowa so they know what they can ask for before putting a house on the market. The data used is the House Prices: Advanced Regression Techniques dataset from Kaggle.
This company:
- Sells homes for homeowners
- Buys homes, makes improvements, and sells them for a profit.
- Builds new homes.
Based on the data found, the population of Ames is growing: it was 58,965 in 2010 and 67,029 in 2022, a sizable increase. The rate of home ownership is almost 41%, so there is a lot of opportunity for this company to make more money.
The model I built not only predicts sale price but also provides information about the factors driving it.
Some key features influencing housing prices in Ames, Iowa:
- Above grade (ground) living area in square feet
- Quality of the overall material and finish of the house
- Kitchen quality
- House age
- Overall condition of the house.
Kitchen quality is important: if a house was built in 1950 and the kitchen was never redone, it is probably worth remodeling the kitchen before putting the house on the market.
I am going to walk you through the process that I went through to build the model for this company.
Outliers in the data
The first thing I did was remove outliers from the data because they can lead to seriously distorted linear models.
A point on the lower left of the graph below, where the sale price is extremely low, was removed, as were the two points on the extreme right where above ground living area is above 4,000.
Four points on the right of the graph below where lot area is extremely high were excluded.
The point on the extreme right of the graph below where lot frontage is above 300 was removed.
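Here is a rough pandas sketch of that filtering. The GrLivArea and LotFrontage cutoffs come from the plots above; the sale price and lot area cutoffs are placeholders standing in for the points identified visually.

```python
import pandas as pd

# Raw Kaggle training data for "House Prices: Advanced Regression Techniques"
df = pd.read_csv("train.csv")

# Two points with above grade living area over 4,000 sq ft
df = df[df["GrLivArea"] <= 4000]

# One point with lot frontage over 300 ft (rows where it is simply missing are kept)
df = df[df["LotFrontage"].isna() | (df["LotFrontage"] <= 300)]

# Placeholder cutoffs for the extremely low sale price and the four extreme lot areas
df = df[df["SalePrice"] > 40000]
df = df[df["LotArea"] < 100000]
```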
Additional observations dropped
Properties not zoned as residential were excluded.
Two neighborhoods were dropped:
- GrnHill is a private senior citizen community and is not listed as a neighborhood in the data description.
- Landmrk is not listed as a neighborhood in the data description or online.
Categorical features
- I imputed categorical features that had missing values with No or None, as suggested by the data description. For example, a null value for Fence means No Fence.
- Some categorical features were actually ordinal. I recoded these as integers. Example: Kitchen Quality
- Dummies were created for linear models
- After doing all of this, I split the data into training data (70%) and test data (30%)
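A rough sketch of these steps, using KitchenQual as the ordinal example; the list of columns filled with None is illustrative rather than exhaustive.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# df is the cleaned DataFrame from the outlier step above.

# Missing categorical values mean the feature is absent (e.g. no fence).
for col in ["Fence", "Alley", "PoolQC", "MiscFeature", "FireplaceQu"]:
    df[col] = df[col].fillna("None")

# Ordinal quality codes recoded as integers.
quality_map = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
df["KitchenQual"] = df["KitchenQual"].map(quality_map)

# Dummies for the remaining nominal features, then a 70/30 train/test split.
X = pd.get_dummies(df.drop(columns="SalePrice"), drop_first=True)
y = df["SalePrice"]  # the log of this becomes the target, as described below
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```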
Transformation of sale price data
Log of sale price was used instead of the actual price because doing so makes it more closely resemble a normal distribution. If the price is normally distributed, the residuals are more likely to be normally distributed, which is a key assumption of linear regression.
This is a histogram of sale price:
This is a histogram of the log of sale price. This distribution more closely resembles a normal distribution.
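The transformation and the two histograms take only a few lines:

```python
import numpy as np
import matplotlib.pyplot as plt

# Use the log of sale price as the modeling target.
df["log_price"] = np.log(df["SalePrice"])

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df["SalePrice"].hist(bins=40, ax=axes[0])
axes[0].set_title("SalePrice")
df["log_price"].hist(bins=40, ax=axes[1])
axes[1].set_title("log(SalePrice)")
plt.tight_layout()
plt.show()
```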
Imputation of Numeric Features
I imputed all of the area and bathroom features with zero because most likely the house just doesn't have that feature. For example, where basement square footage is missing, the house probably just doesn’t have a basement.
These features were imputed with zero:
- MasVnrArea
- BsmtFullBath
- BsmtHalfBath
- BsmtFinSF1
- BsmtFinSF2
- BsmtUnfSF
- TotalBsmtSF
- GarageCars
GarageYrBlt was imputed with the year that the house was built because there is a good chance that the garage was built at the same time.
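In pandas this imputation is only a few lines, assuming the raw Kaggle column names:

```python
# Missing area and bathroom values most likely mean the house lacks that feature.
zero_fill = [
    "MasVnrArea", "BsmtFullBath", "BsmtHalfBath", "BsmtFinSF1",
    "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF", "GarageCars",
]
df[zero_fill] = df[zero_fill].fillna(0)

# A missing garage year most likely means the garage went up with the house.
df["GarageYrBlt"] = df["GarageYrBlt"].fillna(df["YearBuilt"])
```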
Lot Frontage
- 17% of houses have missing values for Lot Frontage, so I took a more sophisticated approach.
- The median value of Lot Frontage varies by neighborhood as the graph below demonstrates. So, Lot Frontage was imputed with the median value of the neighborhood.
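A compact way to express this in pandas, grouping by the Neighborhood column:

```python
# Fill missing LotFrontage values with the median of the house's neighborhood.
df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median())
)
```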
Transformations for linear data models
- I looked at the correlation of each continuous feature with log price.
- For features that did not have any zero values, I used the Box Cox and log transformations.
- For features that did have zero values, I used the log(1 + feature value) and Yeo Johnson transformations.
- I used the transformation that had the highest correlation.
- If the untransformed feature had the highest correlation, I used that instead.
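The helper below sketches that selection; the function name is my own, and it assumes df already has the log_price column created earlier.

```python
import numpy as np
import pandas as pd
from scipy import stats

def best_transform(x, log_price):
    """Pick the candidate transformation most correlated with log price."""
    candidates = {"raw": x}
    if (x > 0).all():
        # No zero values: try log and Box Cox.
        candidates["log"] = np.log(x)
        candidates["boxcox"] = pd.Series(stats.boxcox(x)[0], index=x.index)
    else:
        # Zero values present: try log(1 + x) and Yeo Johnson.
        candidates["log1p"] = np.log1p(x)
        candidates["yeojohnson"] = pd.Series(stats.yeojohnson(x)[0], index=x.index)
    scores = {name: abs(np.corrcoef(t, log_price)[0, 1])
              for name, t in candidates.items()}
    best = max(scores, key=scores.get)
    return best, candidates[best]

# Example: above grade living area.
name, transformed = best_transform(df["GrLivArea"], df["log_price"])
df["bc_GrLivArea"] = transformed  # here the Box Cox version had the highest correlation
```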
Features Created
Indicator features created due to poor coverage:
- has_pool
- has_miscfeature
- alley_access
Two discrete features were created:
- house_age = YrSold – YearBuilt
- years_since_remodeled = YrSold – YearRemodAdd
Additional Indicator features created:
- has_wood_deck
- has_openporch
- has_EnclosedPorch
- has_basement
- has_finished_basement
The additional indicators were created in case the features they were derived from caused any multicollinearity. It turned out that finished basement square footage was highly correlated with total basement square footage. Although it didn’t end up in the final model, I used has_finished_basement instead of finished basement square footage.
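Roughly, the new features look like this; the raw column each indicator is derived from is my best reading of the data description, and the miscellaneous/alley flags assume the earlier None imputation.

```python
# Age features
df["house_age"] = df["YrSold"] - df["YearBuilt"]
df["years_since_remodeled"] = df["YrSold"] - df["YearRemodAdd"]

# Indicators for sparsely populated features
df["has_pool"] = (df["PoolArea"] > 0).astype(int)
df["has_miscfeature"] = (df["MiscFeature"] != "None").astype(int)
df["alley_access"] = (df["Alley"] != "None").astype(int)

# Indicators kept on hand in case the source features caused multicollinearity
df["has_wood_deck"] = (df["WoodDeckSF"] > 0).astype(int)
df["has_openporch"] = (df["OpenPorchSF"] > 0).astype(int)
df["has_EnclosedPorch"] = (df["EnclosedPorch"] > 0).astype(int)
df["has_basement"] = (df["TotalBsmtSF"] > 0).astype(int)
df["has_finished_basement"] = (df["BsmtFinSF1"] > 0).astype(int)
```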
Grouping of Neighborhoods
There are a lot of neighborhoods in Ames, Iowa. Sale price varies considerably by neighborhood.
Neighborhoods were grouped together based on median sale price.
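One way to sketch the grouping; the number of tiers and the quantile cut are illustrative, not necessarily the exact binning used in the final model.

```python
# Median sale price per neighborhood, cut into price tiers.
neigh_median = df.groupby("Neighborhood")["SalePrice"].median()
tiers = pd.qcut(neigh_median, q=4, labels=["low", "mid_low", "mid_high", "high"])
df["neigh_group"] = df["Neighborhood"].map(tiers)
```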
Multicollinearity
I examined features that intuitively could be correlated, particularly the square footage and area predictors.
The following features had variance inflation factors greater than five:
- BsmtFinSF1
- BsmtFinSF2
- BsmtUnfSF
- GarageYrBlt
- GarageCars
- bc_GrLivArea
- bc_LotArea
- yeo_TotalBsmtSF
- log_first_FlrSF
- yeo_GarageArea
- yeo_LotFrontage
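The VIF check itself can be done with statsmodels; this sketch assumes the transformed columns listed above already exist in the DataFrame.

```python
import pandas as pd
from statsmodels.api import add_constant
from statsmodels.stats.outliers_influence import variance_inflation_factor

candidates = [
    "BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF", "GarageYrBlt", "GarageCars",
    "bc_GrLivArea", "bc_LotArea", "yeo_TotalBsmtSF", "log_first_FlrSF",
    "yeo_GarageArea", "yeo_LotFrontage",
]
X_vif = add_constant(df[candidates])
vif = pd.Series(
    [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
    index=X_vif.columns,
).drop("const")

print(vif[vif > 5].sort_values(ascending=False))
```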
The variance inflation factor doesn’t indicate what is correlated with what. So, I looked at a correlation matrix.
I consider a correlation of 0.4 or above to be high.
Of the features highly correlated with one another, I kept the ones with the highest correlation with log price.
Out of all the features in the correlation matrix I kept these three:
- The Box Cox transformation of above grade (ground) living area (bc_GrLivArea)
- The Yeo Johnson transformation of total basement square footage (yeo_TotalBsmtSF)
- The Box Cox transformation of lot area (bc_LotArea)
Data Models Built
Five different models were built:
Three Linear Models:
- Ridge Regression
- Lasso Regression
- Elastic Net Regression
Two Tree Models:
- Random Forest
- Gradient Boosting
Linear Models Approach for Data
- All features were standardized.
- Those features causing multicollinearity were excluded.
- In an earlier iteration of each model, I excluded dummy features with a mean of less than 0.05. This resulted in a marked reduction in the variance.
- Features were sorted by the absolute value of the coefficients in descending order.
- Features with the smallest absolute coefficients were gradually removed, which resulted in a reduction in the RMSE (root mean squared error) on the test data.
- Once the RMSE started to increase, additional features were not removed and the model iteration with the smallest RMSE was selected.
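A condensed sketch of that loop for the elastic net, assuming X_train/X_test from the split above with the log of sale price as the target. In the write-up I stopped removing features once the RMSE began to rise; the sketch simply tracks every iteration and keeps the best one.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)
X_tr, X_te = scaler.transform(X_train), scaler.transform(X_test)

features = list(X_train.columns)
results = []

while len(features) > 1:
    idx = [X_train.columns.get_loc(f) for f in features]
    model = ElasticNetCV(cv=5).fit(X_tr[:, idx], y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_te[:, idx])))
    results.append((rmse, list(features)))
    # Drop the feature with the smallest absolute coefficient and refit.
    features.pop(int(np.argmin(np.abs(model.coef_))))

best_rmse, best_features = min(results, key=lambda r: r[0])
```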
Tree Models Approach for Data
All features were standardized.
All categorical features were label encoded.
For Random Forest, depth curves using 500 trees were used to get a sense of what range of hyperparameters to try.
For Gradient Boosting, R-squared curves at different depths were used to determine what depths and learning rates to try as well as the number of trees.
Depth Curves Random Forest
At a depth of 5, the RMSE is too high. Somewhere between depths of 6 and 10, the test error reaches its minimum. That's why I used a range from 6 to 10.
I used a broad range for minimum number of samples for each split and minimum number of samples for each leaf.
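In scikit-learn terms that translates into a grid search along these lines, assuming a label-encoded feature matrix X_train and the log-price target y_train; the split and leaf ranges are illustrative.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rf_search = GridSearchCV(
    RandomForestRegressor(n_estimators=500, random_state=42),
    param_grid={
        "max_depth": [6, 7, 8, 9, 10],        # range suggested by the depth curves
        "min_samples_split": [2, 5, 10, 20],  # broad range
        "min_samples_leaf": [1, 2, 5, 10],    # broad range
    },
    scoring="neg_root_mean_squared_error",
    cv=5,
    n_jobs=-1,
)
rf_search.fit(X_train, y_train)
```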
R-Squared Curves Gradient Boosting
For a depth of three, the curves are almost all at a right angle, suggesting that there is probably overfitting.
For a depth of two, things look different: not all of the curves suggest overfitting.
Based on the curves above, maximum depths of one and two, a tree range starting at 10,000, and three different learning rates were tried. The stumps worked best with 25,000 trees and a learning rate of 0.01.
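The winning configuration looks roughly like this, again assuming the label-encoded feature matrix and log-price target:

```python
from sklearn.ensemble import GradientBoostingRegressor

# Stumps (max_depth=1) with many trees and a small learning rate.
gbm = GradientBoostingRegressor(
    n_estimators=25000,
    learning_rate=0.01,
    max_depth=1,
    random_state=42,
)
gbm.fit(X_train, y_train)
```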
Results from Data
Elastic Net is the best model. Of the three linear models, it has the smallest root mean squared error, the smallest variance, and the smallest number of features.
The root mean squared error of the gradient boosting model is smaller than that of Elastic Net, but the variance is much higher.
In terms of dollars, the variance of Elastic Net is in the hundreds of dollars whereas with Gradient Boosting, it’s a few thousand.
The underlying relationship is most likely linear given the number of continuous features highly correlated with price.
Feature Importances
Similar predictors had the highest feature importances in the tree models.
Given more time, these steps would be taken
- I would group the neighborhoods into four categories of similar size. The two largest neighborhood groupings did not end up in the final model nor did the two smallest.
- I would exclude correlated features from the tree models. Although multicollinearity is not mathematically a big issue for tree models, there is no need to include redundant features. Doing so makes the feature importances more difficult to interpret and harder to explain to the client.