# Fireplaces in Ames Cost HOW MUCH Now?

## Introduction

Imagine you're a postdoctoral scholar in 2010 who has toiled through countless sleepless nights to perfect your experiments in molecular chemistry and publish a vast array of papers in the most prestigious journals. With this glowing curriculum vitae, the possibilities for your next move in the esteemed world of academia are endless! You could begin your endeavors at a prestigious private institution such as Harvard, Stanford, or UChicago. Or you may opt for public universities in California, Michigan, or Florida that provide plenty of funding and a pipeline of students who can keep your lab running. With such an impressive body of work, the academic world is your oyster.

After interviewing with countless departments, the results are in! You'll start your illustrious academic career with a tenure track job at... Iowa State University in Ames, Iowa. Yeah, maybe not quite the NCAA football powerhouse you and your spouse were hoping to push your future kids into for those lucrative NIL deals, but hey, you have a job and you should feel blessed. Once you've accepted the offer, the next goal is to find a home. (I doubt in this day and age the most economical postdocs can afford a single-family home even in the smallest of towns, but 2010 was a different time fresh off a housing market crash, so please suspend your disbelief.)

While researching the housing market in Ames, your curiosity is piqued by this very comprehensive dataset from Kaggle that contains information on sale prices and a variety of attributes for over 2,500 houses in the city built in or before 2010. Upon perusing the data, you realize that you can gain quantitative insight into which housing features might impact the sale price the most by developing machine learning models that can achieve different objectives, depending on whether you want to predict the price with strong accuracy or clearly describe how modifying various features changes the price. These models can inform not only how homebuyers might make a choice on what kind of housing to purchase, but also what features can improve the value of a home upon resale.

Using Python in a Jupyter Notebook environment, my process of analyzing and modeling the data, with some iteration back through certain steps, is as follows:

- Preprocess the data by imputing null values, changing data types and engineering new features.
- Identify various tree-based machine learning models that can be tested to develop a predictive model.
- Tune the hyper-parameters of each model through grid searching.
- Select the best model based on the cross-validation score.
- Identify the most impactful features on sale price through feature selection.
- Construct a descriptive model using linear regression with that selected feature subset.

## Pre-Processing

The first step in pre-processing the dataset before training and testing any machine learning models was to fill all null values for input variables. This is important because the presence of any nulls would cause runtime errors during training. The method of imputation depends on the data type and feature that contains the nulls:

- For all categorical features and ordinal rating features with non-numerical ratings (values generally ranging from "Po" for poor to "Ex" for excellent), where the data type was a string, the nulls were replaced with a "None" string, since those nulls reflected a lack of that attribute for that observation. (**Note:** the "None" input had to be a string, and NOT the built-in None value used as a placeholder in Python.)
- For this same reason, most numerical features where the data type was an integer or float had their nulls replaced with 0. However, one exception was an observation where the GarageCars and GarageArea features had null values even though the house contained a garage. Because the garage type was labeled as detached, I imputed the null values for this observation to match the values for the other house with a detached garage: 1 for GarageCars and 360 for GarageArea.
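As a minimal sketch of these imputation rules in pandas (the column names match the dataset, but the values here are toy examples, and the notebook's actual code may differ):

```python
import pandas as pd

# Toy frame mimicking the Ames data; values are invented for illustration.
df = pd.DataFrame({
    "PoolQC":     ["Ex", None, None],        # ordinal rating, string-typed
    "GarageType": ["Detchd", None, "Attchd"],
    "GarageCars": [1.0, None, 2.0],
    "GarageArea": [360.0, None, 500.0],
})

# String-typed columns: replace nulls with the literal string "None",
# NOT Python's None, so the absence becomes its own category level.
for col in df.select_dtypes(include="object"):
    df[col] = df[col].fillna("None")

# Numeric columns: a null means the attribute is absent, so impute 0.
for col in df.select_dtypes(include="number"):
    df[col] = df[col].fillna(0)
```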

Next I recast the MSSubClass feature, which encodes numerical codes for all the building classes in the dataset, from an integer to a string. That enables it to be properly handled as a categorical feature. Without this recasting, the models risk fitting a spurious numeric ordering in the training process and inflating the feature's importance when performing feature selection.
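The recast itself is a one-liner; here is a sketch on a hypothetical slice of the data:

```python
import pandas as pd

# MSSubClass arrives as integer codes (e.g. 20, 60, 120)
df = pd.DataFrame({"MSSubClass": [20, 60, 120]})

# Recast to string so downstream encoders treat it as categorical,
# not as a numeric feature with a meaningful ordering.
df["MSSubClass"] = df["MSSubClass"].astype(str)
```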

Finally, I engineered new features that normalize the various area-related features of the data and make the models easier to interpret (i.e. how the price of a house changes for each additional 1% of a whole area that a room or other attribute of the house composes). The one new feature not related to area was HouseAge, which calculated how many years before 2010 a house was built. Area ratio features were engineered as the table below specifies, with all null values where the whole area was zero imputed to zero as well:

| Whole Area | Smaller Space Area |
| --- | --- |
| GrLivArea (above ground living) | 1stFlrSF, 2ndFlrSF |
| TotalBsmtSF (total basement) | BsmtFinSF1 (first finished basement section), BsmtFinSF2 (second finished basement section), BsmtUnfSF (unfinished basement section) |
| LotArea (entire lot) | GarageArea, WoodDeckSF, OpenPorchSF, EnclosedPorch, 3SsnPorch (three season porch), ScreenPorch, PoolArea |
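The ratio features can be engineered along these lines (toy values; the renaming of `SF` to `Ratio` is my shorthand for names like 1stFlrRatio):

```python
import pandas as pd

# Toy slice of the data with invented values
df = pd.DataFrame({
    "GrLivArea": [1500, 1000, 0],
    "1stFlrSF":  [1000, 1000, 0],
    "2ndFlrSF":  [500, 0, 0],
    "YearBuilt": [1990, 2005, 1960],
})

# Ratio of each sub-area to its whole; where the whole area is 0 the
# division yields NaN, which is imputed back to 0.
for part in ["1stFlrSF", "2ndFlrSF"]:
    ratio = df[part] / df["GrLivArea"]
    df[part.replace("SF", "Ratio")] = ratio.replace([float("inf")], 0).fillna(0)

# HouseAge: years before the dataset's 2010 cutoff
df["HouseAge"] = 2010 - df["YearBuilt"]
```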

Plotting the distributions of these new area ratio features, as shown below, reveals that many of them remain sparse: the tall bars at value 0 reflect the lack of data observations with the corresponding attribute (or at value 1 for 1stFlrRatio, where the first floor was the only living area above ground and 2ndFlrRatio was correspondingly 0). Nevertheless, putting all these area-related features on the same scale will be useful for interpreting models, particularly descriptive ones constructed through such techniques as linear regression.

As for predictive tree-based models, the engineered features whose values are nonzero and more spread out in their distributions, even where other features are zero, could prove more important than the rest. The GarageRatio feature stands out in particular. That makes sense in light of the fact that a garage is more of a necessity than nice-to-have features like a screen porch or a pool, for which the area values are almost entirely zero across the data.

## Predictive Model Training and Optimization

Next I detail how I trained and optimized various regression models, imported from the Scikit-Learn library for Python-based machine learning. This enabled me to get the best regression score for optimal predictive value and also to identify the most important features that could then be inputted into a more descriptive and interpretable regression model.

### Machine Learning Models

It's important to note that there is usually a tradeoff between predictive accuracy and interpretability involved in the base estimator for these models - the singular decision tree. If the goal is to have a model that's easier to understand, then you risk a more biased model due to the lower number of split node levels leading to less accurate predictions against the training data. Conversely, a decision tree with more levels can be more accurate, though its large number of conditions renders it much tougher to interpret. Such a model has higher variance, as it fits the training data well, but it's often a rather poor fit for the testing data.

This tradeoff can be resolved by utilizing tree-based ensemble models. Ensembles of decision trees are constructed either in parallel or sequentially in the training process, depending on the type of model used. This allows simpler base estimators to be aggregated in a manner that yields stronger accuracy in the regression score, even if the ensembles generally lend themselves to less interpretability. The ensemble models used are as follows:

- Bagging, where multiple subsets of the data are sampled to train simple estimators in parallel that are then averaged to form a final model
- Random forest, which is a similar technique to bagging but also trains parallel estimators on subsets of data features as well as observations
- AdaBoost, which sequentially builds upon a base estimator by fitting new tree models that give more weight to the training data observations with the highest residual errors against the predicted values
- Gradient Boost, which works similarly to AdaBoost but sequentially builds upon the estimator by fitting each new model on the residual errors themselves instead of on the data values
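Assuming the standard Scikit-Learn estimators, the five candidate regressors can be instantiated along these lines (default hyper-parameters here, not the tuned ones):

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (BaggingRegressor, RandomForestRegressor,
                              AdaBoostRegressor, GradientBoostingRegressor)

# The single decision tree plus the four ensembles described above,
# each fixed with a random seed for reproducibility.
models = {
    "tree":     DecisionTreeRegressor(random_state=0),
    "bagging":  BaggingRegressor(random_state=0),
    "forest":   RandomForestRegressor(random_state=0),
    "adaboost": AdaBoostRegressor(random_state=0),
    "gboost":   GradientBoostingRegressor(random_state=0),
}
```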

### Grid Searching and Cross-Validation

When testing the decision tree and the four ensemble models described above, each of those five models has a different set of hyper-parameters tuned in order to optimize its predictive accuracy. The metric used in the optimization process is the 5-fold cross-validation score. For a given set of model parameters, all observations are assigned to five folds, each of which has a different train-test split of the dataset where the model is built using the training data and then gets scored against the testing data. The average of those testing scores is outputted, and this process is iterated over a grid of all permutations of parameters specified until the best score is found. The parameters and their grid search values tested for each model are shown below.
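A sketch of the grid search process on synthetic stand-in data; the parameter grid values here are illustrative, not the exact ones used in the project:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the Ames features and sale price
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

# Every combination of these values is tried (2 x 2 x 2 = 8 fits per fold)
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

# 5-fold CV: each parameter combination is scored by the mean R^2
# across the five train/test splits, and the best combination is kept.
search = GridSearchCV(GradientBoostingRegressor(random_state=0),
                      param_grid, cv=5)
search.fit(X, y)
best_model, best_score = search.best_estimator_, search.best_score_
```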

Decision Tree:

Random Forest:

Bagging:

AdaBoost:

Gradient Boost:

The parameter names denote the following:

- min_samples_leaf - minimum number of observations required to populate a leaf
- min_samples_split - minimum number of observations a node must contain before it can be split further
- max_depth - maximum depth of the simple decision tree model
- n_estimators - number of base decision tree estimators calculated in the ensemble model
- max_samples - maximum portion of data observations sampled in bagging
- learning_rate - rate at which each successive estimator's contribution is added when sequentially building a boosting model

### Model Selection and Feature Importance

After the optimization process is finished for all five tree-based models, the model with the best cross-validation score overall is selected. The means and standard deviations of the tuned models' cross-validation scores across all five data folds are shown below.

Based on these figures, the model with the best mean score appears to be the tuned gradient boosting regressor, which means this provides the best predictive value in terms of determining housing sale price based on the features provided in the data. The downside to this model is that there is greater variability across its five cross-validation scores than for the other four models, especially compared to the other three ensemble models. However, this is an acceptable tradeoff to make if the goal is simply to get the best predictive accuracy, especially since the scale of the standard deviation is similar across all five models and rather minuscule.

The graph below shows the relative importance for all five tuned models across a subset of the most important data features.

The three most important features that are common across all models are OverallQual (the overall quality rating of a house), GrLivArea and TotalBsmtSF. All other features matter much less to the models in terms of affecting their accuracy and to varying degrees as well. This shows that a simpler and more descriptive model could be constructed with just those three features and possibly any others that might provide greater accuracy for that model. The question now is: how do we determine those features?
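Extracting and ranking impurity-based importances from a fitted gradient boosting model might look like this (synthetic data for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Fit on a synthetic stand-in, then read off the importances, which
# Scikit-Learn normalizes to sum to 1 across features.
X, y = make_regression(n_samples=200, n_features=6, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

importances = model.feature_importances_
ranked = np.argsort(importances)[::-1]  # feature indices, most important first
```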

## Descriptive Model Training

### Sequential Feature Selection

From the feature importance values of the tuned gradient boost model, i.e. the predictive regressor, we can get an idea of which feature subset can be drawn from the data to create a simpler linear regression model, one that is inherently more descriptive because it assigns a change in the target variable per unit change of a given input feature. However, in order to determine more precisely which features help such a model retain a strong regression score that is not significantly worse than that of the predictive model, we need to run cross-validation with the tuned gradient boost estimator on candidate subsets of a specified number of features, in a process known as sequential feature selection (SFS).

First I determine the optimal number of selected features, above which there is minimal improvement in the regression score when running cross-validation with the tuned gradient boost regressor on the selected features. To do this, I iterated over the number of selected features from 1 to 20 and ran SFS at each count to find the feature subset yielding the best cross-validation score. SFS builds that subset greedily: starting from an empty set, it repeatedly adds the feature whose inclusion most improves the cross-validation score until the given count is reached. Below is a graph that shows the best score as a function of the number of features selected in SFS.
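A sketch using Scikit-Learn's SequentialFeatureSelector on synthetic data (a small n_estimators keeps the example fast; the real run would use the tuned regressor):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SequentialFeatureSelector

# Greedy forward selection: at each step the feature whose addition
# gives the best 5-fold cross-validated score is kept.
X, y = make_regression(n_samples=150, n_features=8, random_state=0)
sfs = SequentialFeatureSelector(
    GradientBoostingRegressor(n_estimators=20, random_state=0),
    n_features_to_select=3, direction="forward", cv=5)
sfs.fit(X, y)
selected = sfs.get_support(indices=True)  # indices of the kept features
```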

Beyond around nine selected features, each additional feature adds little to no improvement to the regression score. Therefore, nine features form an optimal subset for developing a condensed linear regression model. That is especially useful when the goal is to avoid having so many features that the risk of multicollinearity increases. The nine features that will be selected for developing the linear regression model are as follows:

- OverallQual (on 1-10 scale)
- GrLivArea
- TotalBsmtSF
- HouseAge
- Fireplaces (number of fireplaces)
- OverallCond (overall condition rating on 1-10 scale)
- Neighborhood (which neighborhood in Ames the house is located in)
- GarageRatio (ratio of garage area to lot area)
- BsmtUnfRatio (ratio of unfinished basement area to total basement area)

### Linear Regression Model

With the nine most important features identified, I now demonstrate how a linear regression model is developed to determine how those variables might affect the housing sale price. Before running any model fitting, it's important to ensure that the fundamental assumptions of linearity are not violated by the data. While multicollinearity is tough to fully avoid in most multiple linear regression problems in the real world, that risk is mitigated by minimizing the number of selected features compared to the total number of features in the dataset. Therefore, the main assumption to validate is whether the target variable is normally distributed.

Below are two histogram plots, one on the left showing the distribution and density overlay of the sale price target feature, and another on the right also showing the distribution of the sale price, though this one is on a log scale. It's evident that the untransformed target feature does not follow a normal distribution as it is right-skewed. After a log transformation is applied, though, the skewness is removed. Therefore, it's best to develop a linear regression model fitting the nine selected variables against the log-transformed sale price.
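The skewness check can be illustrated on simulated right-skewed prices (log-normal draws standing in for SalePrice, not the real data):

```python
import numpy as np
from scipy.stats import skew

# Simulated right-skewed prices: log-normal, so the log is exactly normal
rng = np.random.default_rng(0)
price = np.exp(rng.normal(loc=12, scale=0.4, size=1000))

# The raw prices are right-skewed (positive skew); the log transform
# brings the skewness back near zero.
raw_skew = skew(price)
log_skew = skew(np.log(price))
```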

Below is a formulation of how to interpret the linear model fitting the log of the sale price to the selected feature subset of the data. Essentially it yields an exponential model in which each coefficient, instead of showing how the sale price changes linearly with each step for a given input variable, shows the *relative* change in that price with each input variable step. This will be elaborated on later with some examples. Also, in order to fit the model against the Neighborhood feature, binary dummy variables were created to encode the neighborhood for each observation, with the NAmes (North Ames neighborhood) column dropped to remove collinearity between the new dummy variables.
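A sketch of the dummy encoding and log-price fit on a toy frame (the prices and areas are invented; only the structure mirrors the real model):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy frame: one numeric feature plus the neighborhood category
df = pd.DataFrame({
    "GrLivArea":    [1500, 2000, 1200, 1800, 1600, 1400],
    "Neighborhood": ["NAmes", "StoneBr", "NAmes", "BrDale", "StoneBr", "BrDale"],
    "SalePrice":    [150000, 260000, 130000, 140000, 240000, 120000],
})

# Binary dummy variables for the neighborhoods, with the NAmes column
# dropped so North Ames becomes the baseline level.
X = pd.get_dummies(df[["GrLivArea", "Neighborhood"]], columns=["Neighborhood"])
X = X.drop(columns="Neighborhood_NAmes")

# Fit against the LOG of the sale price: each coefficient then encodes
# a relative (percentage) effect rather than a dollar effect.
model = LinearRegression().fit(X, np.log(df["SalePrice"]))
```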

The R^{2} score for this model is 0.895, which means that 89.5% of the variability of the log of the sale price is explained by the nine selected features. This is an improvement on the score of a regression model fitting the untransformed sale price against the data subset, which is around 0.86. Thus the approach to developing an exponential model for the sale price via a linear model for the log-transformed sale price is validated by the higher R^{2} score.

As for the parameters of the model itself, the intercept of the linear model comes out to 10.7374. Accordingly, the predicted price of a house where all features are zero, which corresponds to an empty house in North Ames, is exp(10.7374) ≈ $46,044. (This is obviously not a real house, but it sets the baseline for the model.) The coefficients for the other numerical features are shown below, as well as the resultant percent change in the sale price per unit change in that variable. The scatterplots show the relationship between sale price and the corresponding features.
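As a quick check of the baseline, exponentiating the intercept recovers the quoted dollar figure:

```python
import math

# Baseline price implied by the model: with all features at zero, the
# log-price equals the intercept, so the price is exp(intercept).
intercept = 10.7374
baseline = math.exp(intercept)   # ≈ $46,000
```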

In order to illustrate more clearly what these percentage changes in the sale price for each unit step of a given feature mean, below are some tables showing how the price of a house would change if a feature like Fireplaces or GarageRatio changed while everything else is kept the same. For example, a house with one fireplace that costs $100,000 would see a 4.535% increase, about $4,500, in its sale price if it added another fireplace to now have two, and the price would increase further by another $4,800, again a 4.535% increase, were it to get three fireplaces. A more expensive house priced at $200,000 with one fireplace would see a higher dollar amount increase even at the same 4.535% rate, going up by $9,100 at two fireplaces and then by another $9,500 at three.

| Starting price at Fireplaces=1 | Expected price, all else the same, at Fireplaces=2 | Expected price, all else the same, at Fireplaces=3 |
| --- | --- | --- |
| $100,000 | $104,500 | $109,300 |
| $150,000 | $156,800 | $163,900 |
| $200,000 | $209,100 | $218,600 |
| $300,000 | $313,600 | $327,800 |
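The fireplace figures can be reproduced by compounding the 4.535% effect (the table rounds to the nearest hundred dollars):

```python
# Each additional fireplace multiplies the price by the same factor,
# so the dollar gain compounds rather than adding a fixed amount.
rate = 0.04535   # per-fireplace relative effect from the fitted model
start = 100000
price_two = start * (1 + rate)         # ≈ $104,500 after rounding
price_three = start * (1 + rate) ** 2  # ≈ $109,300 after rounding
```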

If we look at another feature like GarageRatio, the ratio of the garage area to the lot area, the question becomes how much the value of the house changes per additional 1% of the lot area the garage composes, given the starting price. Take a house that costs $140,000 with a garage-to-lot area ratio of 5%. If the ratio were increased to 10%, the price goes down by a rate of 1.24%, which is the per-point change of 0.249% compounded five times (once for each 1% increase in the ratio), or by a dollar amount of about $1,700. Meanwhile a house with the same garage ratio starting at $200,000 would decrease by the same 1.24% rate, but the dollar amount change would be greater at $2,500 this time. One reason a house might be cheaper when its garage area ratio is higher is that the garage takes up a greater portion of the lot that could instead be dedicated to more valuable uses of the space, such as a den, another bedroom, or outdoor amenities.

| Starting price at GarageRatio=5% | Expected price, all else the same, at GarageRatio=10% |
| --- | --- |
| $140,000 | $138,300 |
| $170,000 | $167,900 |
| $200,000 | $197,500 |
| $250,000 | $246,900 |
| $300,000 | $296,300 |
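The garage-ratio figures follow the same compounding logic, here with a negative per-point rate:

```python
# Per-point relative effect is negative: each additional 1% of the lot
# taken by the garage shaves about 0.249% off the price.
rate = -0.00249
factor = (1 + rate) ** 5       # ratio raised from 5% to 10%: five steps
drop_pct = 1 - factor          # ≈ 1.24% total decrease
new_price = 140000 * factor    # ≈ $138,300 after rounding
```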

Finally, the coefficients of each neighborhood's dummy variable in the linear model alongside their corresponding change in price relative to the average house in North Ames are shown below a box plot of the house price by neighborhood in Ames. The neighborhoods most similar to North Ames' average sale price include BrkSide (Brookside), Sawyer, Sawyer West and NWAmes (Northwest Ames), which come out to around 0-2% cheaper or more expensive than North Ames. On the other hand, the neighborhoods most expensive compared to North Ames are GrnHill (Green Hills) at 59.4% higher than North Ames, StoneBr (Stone Brook) at 17.6% higher and NridgHt (Northridge Heights) at 15.5% higher; meanwhile the cheapest neighborhoods are MeadowV (Meadow Village) at 18.4% lower than North Ames and BrDale (Briardale) at 14.4% lower.

## Conclusions and Future Work

In conclusion, I have demonstrated in this project that a gradient boost ensemble model optimized through cross-validation is the best regressor for more accurately predicting the sale price of a house in Ames based on multiple features pertaining to quality, condition, area, etc. I have also shown that a descriptive model that is also more explainable can be constructed by using sequential feature selection of a smaller subset of features that carry the most information and fitting them to a transformed sale price feature using linear regression.

The following work could be done in the future to build upon the models developed in this project:

- Investigate effect of outliers (e.g. GrLivArea > 4000 sq. ft) on model accuracy and further understand their presence in the data
- Implement more advanced techniques outside the scope of Scikit-Learn like XGBoost to determine what other features, if any, are emphasized in importance
- Design support vector machine-based regression models that examine how sale price relates to latitude and longitude coordinates of the addresses to establish if there is a relationship to distance from a common location (e.g. Iowa State University)