Ames Iowa Home Sale Prediction
Introduction
Determining the market value of a home can be a challenging task. To the untrained, the process may seem arbitrary. How can anyone weigh all the unique aspects of a property and compare them against every other property in the local area? Is location truly as critical to home value as people claim? Does a ranch-style home with low square footage but modern appliances hold more value than a three-story house that has begun to show its age? Which enhances a property's value more: a pool or a garage? Anyone who does not take an active interest in the real estate market would likely be lost in the face of such questions. As a result, home buyers and investors often turn to professional assessors to help sort through the mess. But perhaps there is a better way.
The Ames housing dataset is a great tool for understanding how modern machine learning techniques can be used to predict an accurate market value for any given home. The dataset comprises 2580 homes located around the college town of Ames, Iowa, and contains 81 features covering a variety of factors that help determine the value of a home. These include numeric data such as acreage, square footage, and the number of bedrooms and bathrooms. Various ordinal columns detail the quality of different aspects of the home, graded on a 1-5 scale, with 1 being the worst quality and 5 the best. Additionally, the dataset holds categorical data documenting information such as the neighborhood the home is located in or the type of siding used on the exterior. Using this wide variety of available data, this project endeavors to produce accurate predictions of home values with the help of machine learning.
Data Preprocessing
The first step taken toward interpreting the dataset was to clean the data. The dataset has a number of columns with missing data that needed to be accounted for prior to generating any predictive models.

Figure 1: Count and Percentage of Feature Null Values
Figure 1 shows that 27 of the 81 features contain missing data. A number of different approaches could be used to fill in the missing values. For each column, I applied the approach most appropriate to its data type.
For the categorical columns, a new category was created to stand in for the missing values. The MiscFeature column, for example, indicates whether the home has an additional asset not captured elsewhere in the dataset, such as a shed, a second garage, or a tennis court. Its missing values were given the placeholder 'NMF', for No MiscFeature. This naming convention was held consistent across all other categorical features.
A quick verification confirmed that the ordinal columns with missing data were simply blank because the home lacked the feature they describe. For example, the GarageFinish and GarageQual columns contain the same number of missing values because that many homes did not have a garage; there is no way to assign a quality score to a garage that does not exist. Similar to the categorical columns, a placeholder of the form 'NX' was used for these missing values, so all of the ordinal garage features had their missing values replaced with 'NG'.
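A minimal sketch of this placeholder imputation is shown below. It assumes the raw data has been loaded into a pandas DataFrame named df from an illustrative file path, with column names following the Ames dataset.

```python
import pandas as pd

# Illustrative sketch of the placeholder imputation; the file path is an assumption.
df = pd.read_csv("Ames_Housing.csv")

# Categorical example: homes with no miscellaneous feature get 'NMF' (No MiscFeature).
df["MiscFeature"] = df["MiscFeature"].fillna("NMF")

# Ordinal garage columns: a missing value means the home has no garage,
# so fill each with the 'NG' (No Garage) placeholder.
garage_ordinals = ["GarageFinish", "GarageQual", "GarageCond"]
df[garage_ordinals] = df[garage_ordinals].fillna("NG")
```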
The last group was the numerical features. Missing values in numerical columns were replaced with either 0 or an average, decided on a feature-by-feature basis. Zero was used for columns such as GarageArea: since the home in question has no garage, its garage area is 0. This strategy does not apply to columns that should always be non-zero. One such column is LotFrontage, which records the length of the property that borders the public road. To remain as accurate as possible, the average LotFrontage for each neighborhood was calculated and imputed into that neighborhood's missing rows.
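Continuing the imputation sketch above, the numerical fills could look roughly like this:

```python
# Absence-driven numeric blanks become 0 (no garage means zero garage area).
df["GarageArea"] = df["GarageArea"].fillna(0)

# LotFrontage is never truly zero, so impute the neighborhood average
# rather than a global constant or zero.
df["LotFrontage"] = df["LotFrontage"].fillna(
    df.groupby("Neighborhood")["LotFrontage"].transform("mean")
)
```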
EDA
After cleaning the data, I created several visualizations to gain a better perspective on the data and its relationship with the home sale price. To use linear regression techniques properly, the data needed to be checked against the assumptions of linear regression with respect to the target feature, SalePrice: a linear relationship, independence of errors, constant variance of residuals (homoscedasticity), and a normal distribution of residuals.
Figure 2 below is representative of the overall trend for the majority of the numeric features, and its subplots show how some of the linearity assumptions are not being met. The first subplot is simply a plot of the feature against the sale price of the home. The data may very loosely follow a linear form, but it splays outward as GrLivArea increases. This trend is mirrored in the residual plot in the top-right subplot, whose conic shape is indicative of heteroscedasticity. Based on these findings, two of the assumptions are not upheld. Additionally, the Q-Q plot in the fourth subplot does not follow the target red line very closely, an indication that the residuals do not follow a normal distribution, so a third assumption is not met.

Figure 2: Linearity Assumptions Check for Non-Transformed Data
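For readers who want to reproduce diagnostics like Figure 2, the following is a rough sketch using matplotlib and scipy; the single-feature fit, the figure layout, and the use of GrLivArea are illustrative assumptions rather than the exact code behind the figure.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Diagnostic plots for one feature (GrLivArea) against SalePrice.
x, y = df["GrLivArea"], df["SalePrice"]
slope, intercept = np.polyfit(x, y, 1)        # simple one-feature linear fit
residuals = y - (slope * x + intercept)

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes[0, 0].scatter(x, y, s=8)                 # feature vs. sale price
axes[0, 0].set(xlabel="GrLivArea", ylabel="SalePrice")
axes[0, 1].scatter(x, residuals, s=8)         # residual plot
axes[0, 1].axhline(0, color="red")
axes[1, 0].hist(residuals, bins=40)           # residual distribution
stats.probplot(residuals, plot=axes[1, 1])    # Q-Q plot against a normal
plt.tight_layout()
plt.show()
```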
In order to bring the data into compliance with the linear regression assumptions, each numerical feature following the trends discussed for Figure 2 was transformed using the log function. Figure 3 shows the result of the transformation.

Figure 3: Linearity Assumptions Check for Transformed Data
In contrast to Figure 2, the top two subplots now show a linear relationship and constant variance of residuals. The Q-Q plot also hews far more closely to the target line, indicating a more normal distribution of residuals for the transformed features. With this simple transformation, the numerical data has become a valid candidate for linear regression models. One last step taken for some of the numerical features was to remove outliers that might still skew the final results, accomplished by filtering out the few rows that stood out conspicuously in any of the transformed data visualizations.
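A hedged sketch of the log transformation and outlier filtering described above; the specific feature list and the GrLivArea cutoff are assumptions chosen only for illustration.

```python
import numpy as np

# Log-transform the skewed numeric features; log1p tolerates zero values.
skewed = ["GrLivArea", "LotArea", "1stFlrSF", "TotalBsmtSF"]
for col in skewed:
    df[col] = np.log1p(df[col])

# Drop the handful of rows that still stand out in the transformed
# scatter plots (threshold picked by visual inspection, illustrative here).
df = df[df["GrLivArea"] < np.log1p(4500)]
```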
Moving on from the numerical data, we can glean insights from the ordinal and categorical data, namely a preview of feature importance. Figures 4 and 5 provide examples of how this data can be flagged as potentially impactful prior to model generation.

Figure 4: Overall Home Quality vs Sale Price

Figure 5: Home Foundation Type vs Sale Price
In Figure 4, a clear relationship can be seen between the overall quality rating of the home and the sale price. This is expected, as a home of finer quality should hold more value. The strength of the relationship leads to the expectation that OverallQual will have high feature importance in the predictive models. For the categorical data, some columns exhibit diverging trends across their levels. Figure 5 showcases this, showing that some foundation types have a marked advantage over others in sale price. Poured concrete is the clear winner among foundations, with the typical price of homes in that bucket exceeding even the upper quartile of the next best foundation types. Categorical columns such as Foundation that have clear winners and losers in sale price should be useful indicators for the predictive models to come.
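As an example of how a categorical column can be inspected this way, the sketch below orders foundation types by median sale price and box-plots them, roughly mirroring Figure 5 (seaborn is assumed; the original figure may have been produced differently).

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Order the foundation types by median sale price and compare distributions.
order = (df.groupby("Foundation")["SalePrice"]
           .median().sort_values(ascending=False).index)
sns.boxplot(data=df, x="Foundation", y="SalePrice", order=order)
plt.show()
```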
Predictive Modeling
Prior to analyzing the categorical and ordinal data, I generated a linear regression model using only the transformed numerical data. To ensure there was no multicollinearity among the selected columns, I first generated a correlation heatmap.

Figure 6: Correlation Heatmap of Numerical Features
From Figure 6, the only columns found to be too highly correlated were YearBuilt and YearRemodAdd. A quick background investigation showed that homes with no remodel simply copy the build year into YearRemodAdd as a placeholder, so many rows of YearRemodAdd duplicate YearBuilt, a clear sign of multicollinearity. YearRemodAdd was removed from the dataset and an initial multiple linear regression model was generated. Unfortunately, this model yielded poor results, with a final correlation score of only 0.77. The numeric columns alone will not yield accurate predictions of the home sale price.
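A short sketch of the heatmap check and the removal of YearRemodAdd, continuing the earlier DataFrame df; seaborn and the color settings are illustrative choices.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation heatmap of the numeric columns (as in Figure 6).
numeric = df.select_dtypes(include="number")
sns.heatmap(numeric.corr(), cmap="coolwarm", center=0)
plt.show()

# Drop the column that duplicates YearBuilt for never-remodeled homes.
df = df.drop(columns=["YearRemodAdd"])
```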
Learning from the first attempt, the next model was built using all features. Dummy variables were created to transform the categorical data into a usable numeric form. From there, several models were run on the transformed dataset, with the results shown in Table 1. I used three linear regression models as well as two tree-based models to find the best possible predictive result for this dataset: multiple linear regression, ridge regression, lasso regression, a decision tree, and a random forest. For each model, I employed a train-test split and five-fold cross-validation to gauge whether the model was overfitting. Additionally, for the ridge and lasso models, I tested multiple alpha values to find the optimal tuning parameter.
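The sketch below outlines this modeling setup under a few assumptions: the alpha grids, the 80/20 split, and the random seeds are illustrative and not necessarily the values behind the results in Table 1.

```python
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Dummy-encode the categorical columns, then fit and score each model.
X = pd.get_dummies(df.drop(columns=["SalePrice"]), drop_first=True, dtype=float)
y = df["SalePrice"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "Multiple Linear": LinearRegression(),
    "Ridge": RidgeCV(alphas=[0.1, 1, 10, 100]),
    "Lasso": LassoCV(alphas=[1e-4, 1e-3, 1e-2, 1e-1], max_iter=10000),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=200, random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    cv_r2 = cross_val_score(model, X_train, y_train, cv=5).mean()
    print(f"{name}: train R2={model.score(X_train, y_train):.3f} "
          f"test R2={model.score(X_test, y_test):.3f} CV R2={cv_r2:.3f}")
```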

Table 1: Results of Pass 1 Models
The results in Table 1 show that the ridge and lasso models are the best performers when the entire transformed dataset is used. They have the highest test R2 scores while showing a much smaller gap between train and test R2 than the other models, a good sign that the tuning parameter is doing its job of limiting overfitting. Both tree models performed poorly with the entire dataset, as they suffered from substantial overfitting.
While the ridge and lasso models of Pass 1 were decent predictive models, each still had over a 1% gap between its train and test R2 scores. Passes 2 and 3 attempted to reduce this gap by shrinking the feature set. After Pass 1 was completed, a feature importance list was generated from the random forest model, as shown in Figure 7.

Figure 7: Pass 1 Random Forest Feature Importance
OverallQual is by far the most significant indicator of sale price for homes in this dataset, aligning with the findings from the EDA portion of the project. The next several features all relate to square footage, suggesting that the size of the home is also highly indicative of value. Before running the Pass 2 models on this feature importance list, a correlation heat map was used to ensure the square footage features were not overly correlated; logically, many homes, particularly multi-story styles such as colonials, could have highly correlated first- and second-floor square footage. Figure 8 assuages these concerns, showing that no feature pair exceeds the multicollinearity cutoff chosen for this project (0.75).

Figure 8: Pass 2 Feature Correlation Heat Map
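Continuing from the modeling sketch, the following shows one way to pull the random forest's feature importance list (Figure 7) and confirm that the top features stay under the 0.75 correlation cutoff; the number of features kept is an assumption.

```python
import pandas as pd

# Rank features by importance from the fitted random forest.
importances = pd.Series(
    models["Random Forest"].feature_importances_, index=X.columns
).sort_values(ascending=False)
top_features = importances.head(15).index     # list length is illustrative

# Flag any pair of top features whose absolute correlation exceeds 0.75.
corr = X[top_features].corr().abs()
pairs = corr.where(corr > 0.75).stack()
print(pairs[pairs < 1.0])                     # excludes the diagonal
```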
Pass 3 of the machine learning models used similar criteria to Pass 2, but instead of the feature importance list, the features showing the highest correlation with SalePrice were used. The same steps were taken to ensure there was no multicollinearity between these columns.
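A small sketch of this correlation-based selection, again with the number of retained features chosen only for illustration.

```python
# Rank numeric features by absolute correlation with SalePrice and keep
# the strongest ones before re-checking them for multicollinearity.
target_corr = (df.select_dtypes(include="number").corr()["SalePrice"]
                 .abs().sort_values(ascending=False))
pass3_features = target_corr.drop("SalePrice").head(10).index.tolist()
print(pass3_features)
```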
Table 2 showcases the results from Passes 2 and 3 compared with Pass 1.

Table 2: Train and Test R2 Scores for Passes 1-3
From Table 2 it can be seen that there are several candidates for the best model. The highest test R2 score comes from the Pass 1 ridge regression model at 0.931. However, the second pass produced much better agreement between the train and test scores. In Pass 2, all three linear regression models performed about the same, with R2 scores just south of 0.9, and the train-test gap is actually negative for the linear models, suggesting no evidence of overfitting in these models. Depending on which is valued more, a higher test R2 score or a smaller gap between train and test R2, any of these well-performing models could be used as the final model for this dataset.
Conclusion and Future Work
The models generated for this dataset were able to predict the final sale price of each home with high accuracy, and the small gap between the train and test scores provides confidence that the models are not overfitting to a significant degree. With these findings, the top-performing models from this project would be effective tools for home buyers or sellers analyzing the residential property market in Ames, Iowa. The feature importance list also paints a picture of which features are most impactful in increasing the value of a home. Renovators looking to flip a home would find this information particularly helpful: knowing that the finish quality of a home is more valuable than adding an attached garage would help them prioritize their budget to maximize the return on their overall investment.
Using the predictive models, further analysis can be done on the data. The next step for this project would be to compare predicted and actual sale prices to find which homes appear over- or undervalued, and to investigate what factors may be causing that. For example, if a neighborhood consistently shows homes as undervalued, perhaps there are negative aspects of that neighborhood not captured in the dataset, such as a higher crime rate or poor access to public services. Additionally, the models themselves could be improved through further feature transformation or feature generation. As more insights about the dataset are discovered, an iterative approach to shaping the dataset before model generation could further increase the accuracy of the models while maintaining a low level of overfitting.
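As a rough illustration of that follow-up analysis, the sketch below (reusing the ridge model and matrices from the earlier sketches) averages the gap between actual and predicted prices by neighborhood; the exact methodology for this future work is, of course, still to be decided.

```python
import pandas as pd

# Positive gap = home sold for more than the model predicts (possibly overvalued);
# negative gap = possibly undervalued relative to the model.
pred = models["Ridge"].predict(X)
gap = pd.DataFrame({
    "Neighborhood": df["Neighborhood"].values,
    "actual_minus_predicted": df["SalePrice"].values - pred,
})
print(gap.groupby("Neighborhood")["actual_minus_predicted"].mean().sort_values())
```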