Using Data to Analyze Home Improvement Recommendations
While the types of homes in demand differ widely across regional real estate markets, important features that increase a property’s desirability and value are visible in every market. By investing in and prioritizing certain home improvements over others, sellers can maximize the return on the money they spend renovating their house. In this text we will use data to analyze home improvement recommendations.
This project seeks to investigate some of those more influential predictive factors affecting housing prices, specifically in the Ames housing dataset. Besides uncovering those predictive parameters for housing prices, the primary research question posed in this project more broadly investigates the profile of prospective homebuyers at a regional scope. That is, what are some general buying patterns and preferences of buyers in Ames? Employing a variety of predictive models, this project presents one possible set of answers to these questions.
This project used the Ames Housing dataset, sourced from Kaggle. Python was used for data preprocessing, analysis, and visualization. The training dataset included 80 variables and 1460 observations. Packages used include pandas, numpy, and sklearn. Visualizations were made with the Python packages matplotlib and seaborn.
Feature engineering in this project involved creating one additional dependent variable, which was used as the target variable. The original sale price column was cut and encoded by percentile, such that the integer ‘3’ represented the 76th – 100th percentile, ‘2’ represented the 51st – 75th percentile, ‘1’ represented the 26th – 50th percentile, and ‘0’ represented the bottom 25th percentile of house prices.
An additional feature created and ultimately discarded involved splitting the sale prices into the 0th to 50th and 51st to 100th percentile ranges.
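The percentile encoding described above can be sketched with pandas. This is a minimal illustration on synthetic prices, not the project's actual code: `SalePrice` is the Kaggle column name, while `PriceQuartile` and `AboveMedian` are hypothetical names chosen for this example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Kaggle training data (hypothetical prices).
rng = np.random.default_rng(0)
prices = rng.permutation(np.arange(50_000, 450_000, 2_000))
df = pd.DataFrame({"SalePrice": prices})

# Cut sale price into four equal-frequency bins coded 0..3,
# where 0 is the cheapest quartile and 3 the most expensive.
df["PriceQuartile"] = pd.qcut(df["SalePrice"], q=4, labels=False)

# The discarded alternative: a two-class split at the median.
df["AboveMedian"] = pd.qcut(df["SalePrice"], q=2, labels=False)

print(df["PriceQuartile"].value_counts().sort_index())
```

Because `qcut` bins by rank rather than by fixed price thresholds, each class ends up with roughly the same number of houses, which keeps the classification target balanced.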
In other preprocessing, many values were coded as NA to signify that a feature was absent from the house rather than that the value itself was missing; these were replaced with their string counterparts. Few values in the dataset were truly missing. Where one occurred, a deliberately nonsensical sentinel value was substituted (-999 for numerical columns and ‘Missing’ for string columns).
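A minimal sketch of this two-step imputation follows. The column names come from the Kaggle data dictionary, but the three-row frame, the `absent_means_na` list, and the choice of the literal string "NA" as the replacement are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame using column names from the Kaggle data.
df = pd.DataFrame({
    "GarageType": ["Attchd", np.nan, "Detchd"],  # NA means "no garage"
    "Electrical": ["SBrkr", "SBrkr", np.nan],    # NA is a truly missing value
    "LotFrontage": [65.0, np.nan, 80.0],         # NA is a truly missing value
})

# Step 1: columns where NA encodes an absent feature (assumed list)
# get an explicit string category instead of a null.
absent_means_na = ["GarageType"]
df[absent_means_na] = df[absent_means_na].fillna("NA")

# Step 2: any remaining gaps get sentinel values,
# -999 for numeric columns and 'Missing' for string columns.
numeric_cols = df.select_dtypes(include="number").columns
string_cols = df.columns.difference(numeric_cols)
df[numeric_cols] = df[numeric_cols].fillna(-999)
df[string_cols] = df[string_cols].fillna("Missing")

print(df)
```

Sentinel values like -999 tend to work acceptably with tree-based models, which can isolate them with a single split, but they can distort distance-based or linear models.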
Predictive models used in this project include random forest classification, gradient boosting classification, support vector classification, and logistic regression fitted on principal components (PCA). The train/test split was 80/20, and stratified k-fold cross-validation was used during grid search hyperparameter tuning of the supervised models.
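The setup described above might look roughly like the following for the PCA plus logistic regression model. The data here is synthetic, and the scaling step, the parameter grid, and the number of folds are assumptions rather than the project's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic four-class data standing in for the encoded housing features.
X, y = make_classification(n_samples=400, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)

# 80/20 split, stratified so each price class keeps its share in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Scale, project onto principal components, then classify.
pipe = Pipeline([
    ("scale", StandardScaler()),  # assumed step; PCA is sensitive to scale
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid; every combination is scored with stratified k-fold CV.
param_grid = {"pca__n_components": [5, 10], "clf__C": [0.1, 1.0]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```

Wrapping the steps in a `Pipeline` means the scaler and PCA are refit inside each fold, so no information from the validation fold leaks into the transformation.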
As an aside, because of memory limitations, randomized search cross-validation was used instead of an exhaustive grid search to tune the hyperparameters of the gradient boosted classifier. This did not appear to significantly detract from the model’s performance, as discussed in the Results section.
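As a rough sketch of that trade-off, scikit-learn's `RandomizedSearchCV` samples a fixed number of candidates from parameter distributions instead of trying every combination. The data, distributions, and iteration count below are illustrative assumptions, not the values used in the project.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic four-class data standing in for the housing features.
X, y = make_classification(n_samples=200, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)

# Distributions to sample from (hypothetical ranges).
param_distributions = {
    "n_estimators": randint(20, 60),   # integers in [20, 60)
    "max_depth": randint(2, 5),        # integers in [2, 5)
    "learning_rate": [0.05, 0.1, 0.2],
}

# Only n_iter candidates are fitted, which caps the time and memory cost.
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions,
    n_iter=5,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The cost is controlled directly by `n_iter`, so the search budget can be set to whatever the available hardware allows.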
Overall predictive accuracy for all models hovered around 80% on the held-out test partition of the training dataset. The multiclass confusion matrices and classification reports, which give the recall and precision scores for each class, were consistent with this figure across models. Notably, all models seemed to struggle more in predicting the ‘1’ class (the 26th to 50th percentile of house prices), and to a lesser extent the ‘2’ class (the 51st to 75th percentile of prices).
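To show how such per-class metrics are read, here is a small sketch using scikit-learn's metrics. The labels and predictions are made up for illustration and are not the project's results; they are arranged so that the middle classes are confused with their neighbors.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up true and predicted price classes (0 = cheapest, 3 = most expensive).
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
y_pred = [0, 0, 0, 1, 0, 2, 2, 2, 1, 3, 3, 3]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3])
accuracy = accuracy_score(y_true, y_pred)

print(cm)
print(accuracy)
# Per-class precision, recall, and F1.
print(classification_report(y_true, y_pred, zero_division=0))
```

Off-diagonal counts in the rows for classes 1 and 2 show the middle quartiles being mistaken for adjacent ones, which is the kind of pattern the text describes.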
Acknowledging those performance metrics, then, both the random forest and gradient boosted classifiers seemed to agree that the top two most informative predictive features for a house’s pricing were the overall quality of the house and the above grade (i.e., above ground) living area square feet.
The other three features from the top five predictive parameters in the random forest model were year built, first floor square footage, and garage area (in descending importance), while the gradient boosted classifier posited first floor square footage, garage area, and total basement area (in descending predictive importance).
Fig. 1: Random forest versus gradient boosting classifier variable importance results.
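A comparison like the one in Fig. 1 can be produced from the fitted models' `feature_importances_` attribute. The sketch below assumes impurity-based importances and uses synthetic data in which price is driven by quality and living area by construction; the `Noise` column and all values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "OverallQual": rng.integers(1, 11, n),
    "GrLivArea": rng.normal(1500, 400, n),
    "GarageArea": rng.normal(450, 150, n),
    "Noise": rng.normal(0, 1, n),  # unrelated to price by construction
})

# Hypothetical price that depends only on quality and living area.
price = 20_000 * X["OverallQual"] + 80 * X["GrLivArea"] + rng.normal(0, 10_000, n)
y = pd.qcut(price, 4, labels=False)  # quartile classes 0..3

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# One column of importances per model, indexed by feature name.
importances = pd.DataFrame(
    {name: model.fit(X, y).feature_importances_ for name, model in models.items()},
    index=X.columns,
)
print(importances.sort_values("random_forest", ascending=False))
```

Impurity-based importances sum to one within each model, so the columns are comparable in rank but should not be read as effect sizes.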
Following up on some of those features uncovered by the supervised models, the special importance of a house’s overall quality versus its overall condition seemed superficially surprising. However, grouping overall house quality and condition by their rating categories and the mean sale price associated with each rating showed that the average price of a house rated ’10’ (the maximum rating) on overall quality was more than double the price of a house rated ‘9’ for overall condition (the highest rating earned in the dataset was ‘9’, though the maximum possible was ‘10’).
This gap in average sale price suggests that, when faced with the choice, home sellers in Ames looking to boost their home value should consider investing in quality materials when renovating rather than repairing or maintaining lower quality features of their home.
Fig. 2: Pricing trends for increased overall house quality versus condition.
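The grouping behind this comparison is a pair of `groupby` aggregations. The six rows below are invented to show the mechanics only; `OverallQual`, `OverallCond`, and `SalePrice` are the Kaggle column names.

```python
import pandas as pd

# Invented rows; not taken from the Ames data.
df = pd.DataFrame({
    "OverallQual": [5, 5, 7, 7, 9, 9],
    "OverallCond": [5, 7, 5, 9, 5, 7],
    "SalePrice": [130_000, 140_000, 200_000, 210_000, 360_000, 380_000],
})

# Mean sale price for each rating level, computed separately per rating type.
mean_by_quality = df.groupby("OverallQual")["SalePrice"].mean()
mean_by_condition = df.groupby("OverallCond")["SalePrice"].mean()

print(mean_by_quality)
print(mean_by_condition)
```

Plotting the two resulting series side by side gives a chart of the kind shown in Fig. 2.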
Next, plotting above grade square footage against sale price showed a relatively robust, positive, and linear relationship. While adding square footage is not always feasible, this finding suggests that sellers should prioritize above ground renovation projects that increase, or otherwise highlight and improve, the living space available. This is in line with the predictive value of first floor square footage, as shown by both of the supervised models.
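One way to quantify a relationship like this is the Pearson correlation together with a least-squares line. The sketch below uses synthetic data with an assumed slope of 100 dollars per square foot, so the numbers it prints say nothing about Ames itself.

```python
import numpy as np
import pandas as pd

# Synthetic living areas and prices with a built-in linear trend plus noise.
rng = np.random.default_rng(1)
area = rng.uniform(600, 3000, 200)
price = 100 * area + 20_000 + rng.normal(0, 25_000, 200)
df = pd.DataFrame({"GrLivArea": area, "SalePrice": price})

# Strength of the linear association.
r = df["GrLivArea"].corr(df["SalePrice"])

# Least-squares line: price is approximately slope * area + intercept.
slope, intercept = np.polyfit(df["GrLivArea"], df["SalePrice"], deg=1)

print(round(r, 3), round(slope, 1), round(intercept))
```

The fitted slope can be read as the approximate price change per additional square foot, under the assumption that the relationship really is linear over the observed range.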
Interestingly, garage area was another feature that both supervised models identified as informing house value. Whether used purely for storage or also for protecting vehicles, a garage would be a major amenity, especially for buyers looking at higher-priced homes, who presumably also had the income to own more than one car (or wanted to protect their investment in their vehicle).
In fact, all houses in the highest pricing quartile had a garage, while houses that did not are represented by '0' values on the graph below. Thus, investing in a garage or expanding an existing one could possibly also help some sellers improve the value of their home.
Fig. 3: Garage area by pricing quartile.
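A per-quartile summary of garage ownership can be computed along these lines. The eight rows are invented, and `PriceQuartile` and `HasGarage` are hypothetical helper columns; a `GarageArea` of 0 is treated as "no garage", as in the figure.

```python
import pandas as pd

# Invented rows for illustration.
df = pd.DataFrame({
    "SalePrice": [90_000, 110_000, 140_000, 160_000,
                  190_000, 220_000, 300_000, 400_000],
    "GarageArea": [0, 240, 0, 300, 440, 480, 600, 800],
})

df["PriceQuartile"] = pd.qcut(df["SalePrice"], q=4, labels=False)
df["HasGarage"] = df["GarageArea"] > 0

# Share of houses with a garage and median garage area in each quartile.
summary = df.groupby("PriceQuartile").agg(
    share_with_garage=("HasGarage", "mean"),
    median_area=("GarageArea", "median"),
)
print(summary)
```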
Finally, examining some overall trends in the data, the Ames housing market adhered to some more conventional real estate wisdom. For example, generally only newer, more recently built homes sold in the highest pricing quartile. This suggests that most buyers willing to pay more for a home were interested in ‘turnkey’ quality properties, or houses that required little or no renovation. Lot area also had a generally positive, linear relationship with house price.
Additionally, the real estate market seemed to pick up around April to May and peak in June and July. Besides being more ideal weather for house touring and moving, certain demographics like families were likely trying to move in time to enroll their children in the upcoming school year.
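Seasonality of this kind can be checked by counting sales per calendar month. `MoSold` is the Kaggle column holding the month of sale; the values below are made up so that the example is self-contained.

```python
import pandas as pd

# Made-up months of sale (1 = January, 12 = December).
df = pd.DataFrame({"MoSold": [1, 3, 4, 5, 5, 6, 6, 6, 6, 7, 7, 7, 9, 11]})

# Count sales per month, keeping months with no sales as zero.
sales_per_month = (
    df["MoSold"].value_counts().reindex(list(range(1, 13)), fill_value=0)
)
peak_month = sales_per_month.idxmax()

print(sales_per_month)
print(peak_month)
```

Reindexing over all twelve months keeps empty months visible, which matters when the counts are plotted as a time series.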
Ultimately, in response to the research questions, Ames house sellers should focus on improving both their overall home quality (over maintaining or repairing lower quality aspects of their home) and maximizing or otherwise highlighting the above ground living space available in their homes in order to increase their home value. Garage expansion was also seen as a possible avenue for increasing home value.
In response to the research question about other buying patterns in Ames house sales, the data suggests that sales volume rises in mid-spring and peaks in the summer before falling off sharply. Additionally, buyers in the highest pricing quartile generally bought only more modern, recently built houses.
Future research might examine how feature engineering or model selection could be used to improve the overall precision and recall of the models and thus the applicability of the analysis’s findings. For example, as mentioned, all models seemed to struggle more in terms of precision and/or recall with the ‘1’ class of housing prices, and to a lesser extent the ‘2’ class.
A future unsupervised approach such as cluster analysis might reveal that there are possibly only three major pricing groups in the data (i.e., the bottom 25% of house prices, the 26th to 80th percentile of house prices, and the top 20% of house prices).
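As a sketch of what such a follow-up could look like, k-means can be run on sale price for several candidate numbers of clusters and compared by silhouette score. The three-group synthetic prices below are an assumption built into the example, so its output does not test the hypothesis on the real data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic prices drawn from three hypothetical pricing groups.
rng = np.random.default_rng(0)
prices = np.concatenate([
    rng.normal(110_000, 10_000, 50),
    rng.normal(180_000, 15_000, 110),
    rng.normal(350_000, 30_000, 40),
])
X = prices.reshape(-1, 1)  # k-means expects a 2D array

# Fit k-means for each candidate k and record the silhouette score.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

If a three-cluster solution scored best on the real prices, the cluster labels could then replace the quartile encoding as the classification target.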
Depending on the results, running supervised models with this new three class feature might reveal different, more powerful and accurate features predicting house price. Finally, as mentioned, because of memory limitations the gradient boosted model was not able to run a full grid search cross-validation process for hyperparameter tuning. XGBoost might be useful in this regard, and should be looked into for future, faster, more computationally efficient tree-based modeling.
Thus, by investigating the patterns and trends in this snapshot of the Ames real estate market, sellers can prioritize certain features of their homes in order to maximize the return on their renovation spending. While not always directly actionable, the predictive models employed in this project illuminate the home features buyers pay attention to across pricing quartiles in the Ames housing market, offering sellers in every quartile potential ways to maximize their net profit from a home sale.
Link to code: GitHub repository
House Prices: Advanced Regression Techniques. Sept, 2020. https://www.kaggle.com/c/house-prices-advanced-regression-techniques