Data-driven Predictions of House Prices in Ames, Iowa
The skills the author demonstrated here can be learned by taking the Data Science with Machine Learning bootcamp with NYC Data Science Academy.
Background
Jones Inc. is a hypothetical real estate agency/development company that wants a model to predict housing prices in Ames, Iowa so they know what they can ask for before putting a house on the market. The data used is the House Prices: Advanced Regression Techniques dataset from Kaggle.
This company:
- Sells homes for homeowners
- Buys homes, makes improvements, and sells them for a profit.
- Builds new homes.
Based on the data found, the population of Ames is growing: it was 58,965 in 2010 and 67,029 in 2022, a sizable increase. The rate of home ownership is almost 41%, so there is a lot of opportunity for this company to make more money.
The model I built not only predicts sale price but also provides information about the factors driving it.
Some key features influencing housing prices in Ames, Iowa:
- Above grade (ground) living area in square feet
- Quality of the overall material and finish of the house
- Kitchen quality
- House age
- Overall condition of the house.
Kitchen quality is important: if a house was built in 1950 and the kitchen was never redone, it is probably worth remodeling the kitchen before putting the house on the market.
I am going to walk you through the process that I went through to build the model for this company.
Outliers in the data
The first thing I did was remove outliers from the data because they can lead to seriously distorted linear models.
A point on the lower left of the graph below, where the sale price is extremely low, was removed, as were the two points on the extreme right where above ground living area is above 4,000.
Four points on the right of the graph below where lot area is extremely high were excluded.
The point on the extreme right of the graph below where lot frontage is above 300 was removed.
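Here is a rough pandas sketch of that filtering. The GrLivArea and LotFrontage cutoffs come from the plots above; the sale price and lot area cutoffs are placeholders standing in for the points identified visually.

```python
import pandas as pd

# Raw Kaggle training data for "House Prices: Advanced Regression Techniques"
df = pd.read_csv("train.csv")

# Two points with above grade living area over 4,000 sq ft
df = df[df["GrLivArea"] <= 4000]

# One point with lot frontage over 300 ft (rows where it is simply missing are kept)
df = df[df["LotFrontage"].isna() | (df["LotFrontage"] <= 300)]

# Placeholder cutoffs for the extremely low sale price and the four extreme lot areas
df = df[df["SalePrice"] > 40000]
df = df[df["LotArea"] < 100000]
```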
Additional observations dropped
Properties not zoned as residential were excluded.
Two neighborhoods were dropped:
- GrnHill is a private senior citizen community and is not listed as a neighborhood in the data description.
- Landmrk is not listed as a neighborhood in the data description or online.
Categorical features
- I imputed categorical features that had missing values with No or None, as suggested by the data description. For example, a null value for Fence means No Fence.
- Some categorical features were actually ordinal. I recoded these as integers. Example: Kitchen Quality
- Dummies were created for linear models
- After doing all of this, I split the data into training data (70%) and test data (30%)
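A rough sketch of these steps, using KitchenQual as the ordinal example; the list of columns filled with None is illustrative rather than exhaustive.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# df is the cleaned DataFrame from the outlier step above.

# Missing categorical values mean the feature is absent (e.g. no fence).
for col in ["Fence", "Alley", "PoolQC", "MiscFeature", "FireplaceQu"]:
    df[col] = df[col].fillna("None")

# Ordinal quality codes recoded as integers.
quality_map = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
df["KitchenQual"] = df["KitchenQual"].map(quality_map)

# Dummies for the remaining nominal features, then a 70/30 train/test split.
X = pd.get_dummies(df.drop(columns="SalePrice"), drop_first=True)
y = df["SalePrice"]  # the log of this becomes the target, as described below
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```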
Transformation of sale price data
Log of sale price was used instead of the actual price because doing so makes it more closely resemble a normal distribution. If the price is normally distributed, the residuals are more likely to be normally distributed, which is a key assumption of linear regression.
This is a histogram of sale price:
This is a histogram of the log of sale price. This distribution more closely resembles a normal distribution.
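The transformation and the two histograms take only a few lines:

```python
import numpy as np
import matplotlib.pyplot as plt

# Use the log of sale price as the modeling target.
df["log_price"] = np.log(df["SalePrice"])

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df["SalePrice"].hist(bins=40, ax=axes[0])
axes[0].set_title("SalePrice")
df["log_price"].hist(bins=40, ax=axes[1])
axes[1].set_title("log(SalePrice)")
plt.tight_layout()
plt.show()
```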
Imputation of Numeric Features
I imputed all of the area and bathroom features with zero because most likely the house just doesn't have that feature. For example, where basement square footage is missing, the house probably just doesn’t have a basement.
These features were imputed with zero:
- MasVnrArea
- BsmtFullBath
- BsmtHalfBath
- BsmtFinSF1
- BsmtFinSF2
- BsmtUnfSF
- TotalBsmtSF
- GarageCars
GarageYrBlt was imputed with the year that the house was built because there is a good chance that the garage was built at the same time.
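In pandas this imputation is only a few lines, assuming the raw Kaggle column names:

```python
# Missing area and bathroom values most likely mean the house lacks that feature.
zero_fill = [
    "MasVnrArea", "BsmtFullBath", "BsmtHalfBath", "BsmtFinSF1",
    "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF", "GarageCars",
]
df[zero_fill] = df[zero_fill].fillna(0)

# A missing garage year most likely means the garage went up with the house.
df["GarageYrBlt"] = df["GarageYrBlt"].fillna(df["YearBuilt"])
```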
Lot Frontage
- 17% of houses have missing values for Lot Frontage, so I took a more sophisticated approach.
- The median value of Lot Frontage varies by neighborhood as the graph below demonstrates. So, Lot Frontage was imputed with the median value of the neighborhood.
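A compact way to express this in pandas, grouping by the Neighborhood column:

```python
# Fill missing LotFrontage values with the median of the house's neighborhood.
df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median())
)
```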
Transformations for linear data models
- I looked at the correlation of each continuous feature with log price.
- For features that did not have any zero values, I used the Box Cox and log transformations.
- For features that did have zero values, I used the log(1 + feature value) and Yeo Johnson transformations.
- I used the transformation that had the highest correlation.
- If the untransformed feature had the highest correlation, I used that instead.
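The helper below sketches that selection; the function name is my own, and it assumes df already has the log_price column created earlier.

```python
import numpy as np
import pandas as pd
from scipy import stats

def best_transform(x, log_price):
    """Pick the candidate transformation most correlated with log price."""
    candidates = {"raw": x}
    if (x > 0).all():
        # No zero values: try log and Box Cox.
        candidates["log"] = np.log(x)
        candidates["boxcox"] = pd.Series(stats.boxcox(x)[0], index=x.index)
    else:
        # Zero values present: try log(1 + x) and Yeo Johnson.
        candidates["log1p"] = np.log1p(x)
        candidates["yeojohnson"] = pd.Series(stats.yeojohnson(x)[0], index=x.index)
    scores = {name: abs(np.corrcoef(t, log_price)[0, 1])
              for name, t in candidates.items()}
    best = max(scores, key=scores.get)
    return best, candidates[best]

# Example: above grade living area.
name, transformed = best_transform(df["GrLivArea"], df["log_price"])
df["bc_GrLivArea"] = transformed  # here the Box Cox version had the highest correlation
```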
Features Created
Indicator features created due to poor coverage:
- has_pool
- has_miscfeature
- alley_access
Two discrete features were created:
- house_age = YrSold – YearBuilt
- years_since_remodeled = YrSold – YearRemodAdd
Additional Indicator features created:
- has_wood_deck
- has_openporch
- has_EnclosedPorch
- has_basement
- has_finished_basement
The additional indicators were created in case the features they were derived from caused any multicollinearity. It turned out that finished basement square footage was highly correlated with total basement square footage. Although it didn’t end up in the final model, I used has_finished_basement instead of finished basement square footage.
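Roughly, the new features look like this; the raw column each indicator is derived from is my best reading of the data description, and the miscellaneous/alley flags assume the earlier None imputation.

```python
# Age features
df["house_age"] = df["YrSold"] - df["YearBuilt"]
df["years_since_remodeled"] = df["YrSold"] - df["YearRemodAdd"]

# Indicators for sparsely populated features
df["has_pool"] = (df["PoolArea"] > 0).astype(int)
df["has_miscfeature"] = (df["MiscFeature"] != "None").astype(int)
df["alley_access"] = (df["Alley"] != "None").astype(int)

# Indicators kept on hand in case the source features caused multicollinearity
df["has_wood_deck"] = (df["WoodDeckSF"] > 0).astype(int)
df["has_openporch"] = (df["OpenPorchSF"] > 0).astype(int)
df["has_EnclosedPorch"] = (df["EnclosedPorch"] > 0).astype(int)
df["has_basement"] = (df["TotalBsmtSF"] > 0).astype(int)
df["has_finished_basement"] = (df["BsmtFinSF1"] > 0).astype(int)
```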
Grouping of Neighborhoods
There are a lot of neighborhoods in Ames, Iowa. Sale price varies considerably by neighborhood.
Neighborhoods were grouped together based on median sale price.
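One way to sketch the grouping; the number of tiers and the quantile cut are illustrative, not necessarily the exact binning used in the final model.

```python
# Median sale price per neighborhood, cut into price tiers.
neigh_median = df.groupby("Neighborhood")["SalePrice"].median()
tiers = pd.qcut(neigh_median, q=4, labels=["low", "mid_low", "mid_high", "high"])
df["neigh_group"] = df["Neighborhood"].map(tiers)
```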
Multicollinearity
I examined features that intuitively could be correlated, particularly the square footage and area predictors.
The following features had variance inflation factors greater than five:
- BsmtFinSF1
- BsmtFinSF2
- BsmtUnfSF
- GarageYrBlt
- GarageCars
- bc_GrLivArea
- bc_LotArea
- yeo_TotalBsmtSF
- log_first_FlrSF
- yeo_GarageArea
- yeo_LotFrontage
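The VIF check itself can be done with statsmodels; this sketch assumes the transformed columns listed above already exist in the DataFrame.

```python
import pandas as pd
from statsmodels.api import add_constant
from statsmodels.stats.outliers_influence import variance_inflation_factor

candidates = [
    "BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF", "GarageYrBlt", "GarageCars",
    "bc_GrLivArea", "bc_LotArea", "yeo_TotalBsmtSF", "log_first_FlrSF",
    "yeo_GarageArea", "yeo_LotFrontage",
]
X_vif = add_constant(df[candidates])
vif = pd.Series(
    [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
    index=X_vif.columns,
).drop("const")

print(vif[vif > 5].sort_values(ascending=False))
```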
The variance inflation factor doesn’t indicate what is correlated with what. So, I looked at a correlation matrix.
I consider a correlation of 0.4 or above to be high.
Of the features highly correlated with one another, I kept the ones with the highest correlation with log price.
Out of all the features in the correlation matrix I kept these three:
- The Box Cox transformation of above grade (ground) living area (bc_GrLivArea)
- The Yeo Johnson transformation of total basement square footage (yeo_TotalBsmtSF)
- The Box Cox transformation of lot area (bc_LotArea)
Data Models Built
Five different models were built:
Three Linear Models:
- Ridge Regression
- Lasso Regression
- Elastic Net Regression
Two Tree Models:
- Random Forest
- Gradient Boosting
Linear Models Approach for Data
- All features were standardized.
- Those features causing multicollinearity were excluded.
- In an earlier iteration of each model, I excluded dummy features with a mean of less than 0.05. This resulted in a marked reduction in the variance.
- Features were sorted by the absolute value of the coefficients in descending order.
- Features with the smallest absolute coefficients were gradually removed, which resulted in a reduction in the RMSE (root mean squared error) on the test data.
- Once the RMSE started to increase, additional features were not removed and the model iteration with the smallest RMSE was selected.
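A condensed sketch of that loop for the elastic net, assuming X_train/X_test from the split above with the log of sale price as the target. In the write-up I stopped removing features once the RMSE began to rise; the sketch simply tracks every iteration and keeps the best one.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)
X_tr, X_te = scaler.transform(X_train), scaler.transform(X_test)

features = list(X_train.columns)
results = []

while len(features) > 1:
    idx = [X_train.columns.get_loc(f) for f in features]
    model = ElasticNetCV(cv=5).fit(X_tr[:, idx], y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_te[:, idx])))
    results.append((rmse, list(features)))
    # Drop the feature with the smallest absolute coefficient and refit.
    features.pop(int(np.argmin(np.abs(model.coef_))))

best_rmse, best_features = min(results, key=lambda r: r[0])
```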
Tree Models Approach for Data
All features were standardized.
All categorical features were label encoded.
For Random Forest, depth curves using 500 trees were used to get a sense of what range of hyperparameters to try.
For Gradient Boosting, R-squared curves at different depths were used to determine what depths and learning rates to try as well as the number of trees.
Depth Curves Random Forest
At a depth of 5, the RMSE is too high. Somewhere between depths of 6 and 10, the test error reaches its minimum. That's why I used a range from 6 to 10.
I used a broad range for minimum number of samples for each split and minimum number of samples for each leaf.
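In scikit-learn terms that translates into a grid search along these lines, assuming a label-encoded feature matrix X_train and the log-price target y_train; the split and leaf ranges are illustrative.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rf_search = GridSearchCV(
    RandomForestRegressor(n_estimators=500, random_state=42),
    param_grid={
        "max_depth": [6, 7, 8, 9, 10],        # range suggested by the depth curves
        "min_samples_split": [2, 5, 10, 20],  # broad range
        "min_samples_leaf": [1, 2, 5, 10],    # broad range
    },
    scoring="neg_root_mean_squared_error",
    cv=5,
    n_jobs=-1,
)
rf_search.fit(X_train, y_train)
```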
R-Squared Curves Gradient Boosting
For a depth of three, the curves are almost all at a right angle, suggesting that there is probably overfitting.
For a depth of two, things look different: not all of the curves suggest overfitting.
Based on the curves above, maximum depths of one and two, a tree range starting at 10,000, and three different learning rates were tried. The stumps worked best with 25,000 trees and a learning rate of 0.01.
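The winning configuration looks roughly like this, again assuming the label-encoded feature matrix and log-price target:

```python
from sklearn.ensemble import GradientBoostingRegressor

# Stumps (max_depth=1) with many trees and a small learning rate.
gbm = GradientBoostingRegressor(
    n_estimators=25000,
    learning_rate=0.01,
    max_depth=1,
    random_state=42,
)
gbm.fit(X_train, y_train)
```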
Results from Data
Elastic Net is the best model. Of the three linear models, it has the smallest root mean squared error, the smallest variance, and the smallest number of features.
The root mean squared error of the gradient boosting model is smaller than that of Elastic Net, but the variance is much higher.
In terms of dollars, the variance of Elastic Net is in the hundreds of dollars whereas with Gradient Boosting, it’s a few thousand.
The underlying relationship is most likely linear given the number of continuous features highly correlated with price.
Feature Importances
Similar predictors had the highest feature importances in the tree models.
Given more time, these steps would be taken
- I would group the neighborhoods into four categories of similar size. The two largest neighborhood groupings did not end up in the final model nor did the two smallest.
- I would exclude correlated features from the tree models. Although multicollinearity is not mathematically a big issue for tree models, there is no need to include redundant features. Doing so makes the feature importances more difficult to interpret and harder to explain to the client.