Home Sales The Smart Choice: Predicting Housing Prices
The skills I demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Introduction
The scope of this project was to predict housing prices for houses sold in Ames, IA between 2006 and 2010. The data consisted of approximately 80 housing characteristics and 1460 observations. This was a collaborative project with two additional teammates.
Together, we decided to use linear regression to maintain model interpretability. Our objective was to provide data-informed insights to realtors, prospective home buyers, and home sellers in Ames, IA. Beyond predicting housing prices, our team analyzed the features that most affect those prices. This can give our target audience useful information when trying to buy or sell homes in the area.
Throughout this post I will discuss a brief background about Ames, our exploratory analysis, feature selection process, and model results. Finally, I will conclude with ideas for future work on this project.
Home Background
In 2010, according to the United States Census Bureau (2010), Ames, IA had a population of approximately 59,000 individuals. The median population age was 23.8 and about 29% of people were aged 20 to 24. Most people identified as white (84.5%), followed by Asian (8.8%), Black or African American (3.4%), or other (3.3%). Additionally, there were about 23,000 households with an average household size of 2.25 and about 24,000 total housing units.
Ames’ motto is the Smart Choice (City of Ames, 2020). Several factors make Ames a smart choice of place to live. According to CNN Money (2008 & 2010), Ames had low unemployment, low personal and property crime, short commute times, and great air quality. These, among other factors, led the magazine to rank Ames as one of the top 100 best small cities to live in for its 2008 and 2010 publications. The largest employer in the city is Iowa State University, which accounted for approximately 20.4% of total employment. The university could also explain why 29% of residents were aged 20 to 24, since many people in that age range are likely students.
Home Analysis
Going into the project, we chose linear models both to predict housing prices and to assess feature importance, so that results would be easy for our clients to interpret. Linear regression assumes that the model's residuals are normally distributed with constant variance, and a strongly skewed target variable tends to violate that assumption. When examining the distribution of home sale price there is a clear right skew. To correct this, we took the log of sale price. The log price helps to control the fanning seen in sale prices beyond the average price. In practice, this means the model should predict more accurately when trained on the log sale price.
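To make the effect of the log transform concrete, here is a minimal sketch. It uses synthetic, log-normally distributed prices as a stand-in for the real data, and the column name `SalePrice` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "sale prices" (log-normal); a stand-in, not the Ames data.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1460), name="SalePrice")

# Model the log of the price instead of the raw price.
log_prices = np.log(prices)

# Sample skewness: positive values mean a long right tail.
skew_raw = prices.skew()
skew_log = log_prices.skew()
print(f"skew of raw price: {skew_raw:.2f}")
print(f"skew of log price: {skew_log:.2f}")

# Predictions made on the log scale are mapped back to dollars with exp().
recovered = np.exp(log_prices)
```

The skewness of the log prices should sit much closer to zero than that of the raw prices. Any prediction from a model trained on the log scale needs the `np.exp` step before it is reported as a dollar amount.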
Due to the large number of housing features, I will not discuss every feature here. Instead, I will focus on a select few features for the exploratory analysis, feature selection, and feature engineering sections. Another important consideration with linear models is multicollinearity. With a large number of features there is a good chance that several are correlated with one another, which makes coefficient estimates unstable and harder to interpret.
Three variables captured the home's square footage: first floor square feet, second floor square feet, and basement square feet. These variables were one example of multicollinearity in the data set. To address it, the three were combined into a single square footage variable: total square feet.
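Combining the three square footage variables can be sketched in a few lines of pandas. The column names (`1stFlrSF`, `2ndFlrSF`, `TotalBsmtSF`, `TotalSF`) and the three rows of values are assumptions for illustration, not the project's actual code.

```python
import pandas as pd

# Tiny illustrative frame; column names and values are assumed.
homes = pd.DataFrame({
    "1stFlrSF": [856, 1262, 920],
    "2ndFlrSF": [854, 0, 866],
    "TotalBsmtSF": [856, 1262, 920],
})

sqft_cols = ["1stFlrSF", "2ndFlrSF", "TotalBsmtSF"]

# Sum the three correlated columns into one feature, then drop the originals
# so the model sees a single square footage variable.
homes["TotalSF"] = homes[sqft_cols].sum(axis=1)
homes = homes.drop(columns=sqft_cols)

print(homes)
```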
When examining the scatterplot of total square feet against log sale price, a clear linear trend is observed. However, there appear to be a few outliers with high square footage but low sale price. Linear regression is sensitive to outliers because they fall far from the general trend the regression tries to capture and can pull the fitted line toward themselves.
Examining Outliers in Home Data
Outliers in our features were examined using Cook’s distance, and decisions about removing them were made as a group. It is important to note that with a larger data set, outliers could be excluded from model training without sacrificing much valuable information. With a smaller number of observations this has to be considered more carefully, because getting rid of too many observations could cause the model to overfit.
Additional feature selection and feature engineering relied on examination of variable correlations, forward and backward selection evaluated by AIC or BIC, and lasso regression. However, to examine feature importance using penalization, variables had to be turned into a format the model could accept. This included making some variables binary, combining other features to reduce multicollinearity, removing outliers in accordance with Cook’s distance, or using one-hot encoding.
Some notable important features resulting from the lasso penalization included:
- overall quality,
- total square feet, and
- total baths
Ultimately, this helped reduce our feature space to 21 features instead of 80 (79 if sale price is excluded). The reduction in feature space helps to ensure the model does not overfit to the training data. An overfit model performs well on the data it was trained on but poorly on test and other unseen data.
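The lasso step described above can be sketched as follows. This is a minimal example on synthetic data, assuming illustrative column names (`OverallQual`, `TotalSF`, `CentralAir`, `Noise`) and a cross-validated penalty from scikit-learn's `LassoCV`. It is not our exact pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic features: two strong predictors, one weak categorical, one pure noise.
rng = np.random.default_rng(2)
n = 500
df = pd.DataFrame({
    "OverallQual": rng.integers(1, 11, size=n),
    "TotalSF": rng.uniform(800, 4000, size=n),
    "Noise": rng.normal(size=n),
    "CentralAir": rng.choice(["Y", "N"], size=n),
})
log_price = (
    10.5
    + 0.10 * df["OverallQual"]
    + 0.0003 * df["TotalSF"]
    + 0.05 * (df["CentralAir"] == "Y")
    + rng.normal(0, 0.1, size=n)
)

# One-hot encode the categorical column so the model can accept it.
X = pd.get_dummies(df, columns=["CentralAir"], drop_first=True, dtype=float)

# Standardize so the penalty treats every feature on the same scale.
X_scaled = StandardScaler().fit_transform(X)

# Lasso with the penalty strength chosen by 5-fold cross-validation.
lasso = LassoCV(cv=5, random_state=0).fit(X_scaled, log_price)

# Features whose coefficients were not shrunk to zero are the ones kept.
coefs = pd.Series(lasso.coef_, index=X.columns)
kept = coefs[coefs != 0].index.tolist()
print(coefs.sort_values(key=abs, ascending=False))
```

Because the features are standardized, the absolute size of each coefficient gives a rough ranking of importance, and coefficients driven to zero mark features that can be dropped.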
Home Model Results
After preprocessing the data it was time to fit the model to the training data. As mentioned previously, linear regression was chosen to maintain ease of model interpretability. Scikit-learn’s linear regression was used to fit and test the model. We split the training data into train and test sets, using 80% of the data for training and 20% for testing. The initial fit on this split produced a test R squared of 89.4%. In other words, the model accounted for about 89% of the variance in the log sale price.
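The split-and-fit step looks roughly like the sketch below. The data is synthetic and the `random_state` is an arbitrary assumption, so the scores it prints are not the project's results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: three numeric features and a linear log-price target.
rng = np.random.default_rng(3)
n = 1000
X = rng.normal(size=(n, 3))
y = 12.0 + X @ np.array([0.3, 0.2, 0.1]) + rng.normal(0, 0.1, size=n)

# Hold out 20% of the rows for testing; train on the remaining 80%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# score() returns R squared: the share of target variance the model explains.
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(f"train R^2: {train_r2:.3f}, test R^2: {test_r2:.3f}")
```

Train and test scores that are close to each other are the pattern to look for, since a large gap between them is the usual sign of overfitting.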
Furthermore, when fitting the same model with the statsmodels regression package, a few variables did not support inclusion in the final model. The three variables removed were bedrooms above grade, home age, and lot frontage. Each was excluded because the t-test on its coefficient returned a p-value greater than 0.05. The final model was refit after dropping these three features. The final training and test scores were both about 89%, indicating the model is not overfit.
Finally, to confirm the assumptions of our model there are a few visualizations provided below. A quantile-quantile (Q-Q) plot was used to confirm that residuals were approximately normally distributed. Prediction errors were analyzed to confirm that residuals had roughly constant variance and that the relationship between predictions and sale price was linear. There also does not appear to be any discernible pattern in the residuals that would indicate the errors are not independent.
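The numbers behind a Q-Q plot can be produced without any plotting code, which is a quick way to sanity check the normality assumption. In this sketch the residuals are synthetic stand-ins for `y_test - model.predict(X_test)`, and I use `scipy.stats.probplot`, which may differ from the plotting tool used for our figures.

```python
import numpy as np
from scipy import stats

# Stand-in residuals; in practice these would be actual minus predicted values.
rng = np.random.default_rng(5)
residuals = rng.normal(0, 0.1, size=300)

# probplot pairs each sorted residual with the matching normal quantile and
# fits a straight line through those pairs.
(theoretical_q, ordered_resid), (slope, intercept, r) = stats.probplot(
    residuals, dist="norm"
)

# r close to 1 means the points hug the line, as expected for normal residuals.
print(f"Q-Q correlation: {r:.4f}")
```

Plotting `theoretical_q` against `ordered_resid` gives the familiar Q-Q plot. Points that bend away from the line at either end suggest heavy tails or skew in the residuals.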
VIF
Additionally, the variance inflation factor (VIF) was examined to confirm that no problematic multicollinearity remained among the features. As a common rule of thumb, a VIF of five or less is taken to indicate that multicollinearity is not a concern. All features had a VIF of less than five.
Finally, important features indicated by the model that have positive impact on sale price were house square footage, overall home quality, the number of high quality features in the home, and bathrooms. High quality features were influenced by kitchen and basement finish quality. Features that impacted sale price negatively were lack of air conditioning and home age. These are features that home owners and buyers would want to consider when negotiating sale price and home value.
Conclusion
To summarize, Ames, the smart choice city, is a great place to live due to several social and economic factors which helped obtain a top 100 best places to live designation. In an effort to provide data-informed insights to realtors, home buyers, and home sellers, housing sales were analyzed between 2006 and 2010. Linear regression was chosen to maintain model interpretability for the target audience. The model suggested important features that impact sale price included home square footage, home quality, and bathrooms to name a few.
Future Work on Home Prices
Future work for this project includes additional feature selection and feature engineering. One feature to consider is the median income of each neighborhood. This data would have to be gathered for each year of sales and merged with the existing data. Non-linear models could also be explored to see whether they provide greater insight into important features or better performance. The trade-off with such models is increased complexity, which would have to be considered when discussing model performance with the target audience.
References
City of Ames. (2010). Comprehensive annual financial report: For year ending June 30, 2010. Retrieved from https://www.cityofames.org/Home/ShowDocument?id=11004
City of Ames. (2020). About us. Retrieved from https://www.cityofames.org/about-ames/about-ames
CNN Money. (2008). Best places to live: Money’s list of America’s best small cities. Retrieved from https://money.cnn.com/magazines/moneymag/bplive/2008/states/IA.html
CNN Money. (2010). Best places to live: Money’s list of America’s best small cities. Retrieved from https://money.cnn.com/magazines/moneymag/bplive/2010/snapshots/PL1901855.html
United States Census Bureau. (2010). American fact finder. Retrieved from https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?src=C