Data Analysis on Valuable Housing Features
The central objective of this machine learning project, undertaken from the perspective of real-estate advisors, was to provide data insights for investors and home renovators on how to improve profits in the real-estate market in Ames, Iowa. To that end, a descriptive multiple linear regression model was built to analyze house features with respect to the sale price and to understand how certain features influence the overall price.
The Ames housing dataset contains entries on approximately 2,500 houses sold between 2006 and 2010. It contains over 80 features, detailing house characteristics such as square footage, year of construction, and basement quality.
Pre-Processing of the Data
The dataset had 12,254 missing values, so prior to analysis, these missing values were imputed. Most of the missing values were associated with the absence of a particular housing feature; therefore, the general scheme for imputation was to replace "NA" values with either a zero, if the feature was numeric, or "none" if the feature was categorical.
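The imputation scheme described above can be sketched in a few lines of pandas. This is a minimal illustration under stated assumptions, not the author's code: the helper name `impute_absent_features` and the toy data are hypothetical, and the column names simply follow the commonly distributed Ames data.

```python
import numpy as np
import pandas as pd

def impute_absent_features(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values with 0 in numeric columns and "none" elsewhere."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(0)
        else:
            out[col] = out[col].fillna("none")
    return out

# Toy frame for illustration only (not the real Ames data)
houses = pd.DataFrame({
    "GarageArea": [400.0, np.nan, 520.0],
    "BsmtQual": ["Gd", None, "TA"],
})
clean = impute_absent_features(houses)
```

A blanket rule like this is only appropriate when a missing value really means the feature is absent; columns where "NA" means "not recorded" would need separate handling.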
Additionally, several new features were engineered to give the model more useful information:
• Neighborhood Groupings: Neighborhoods were grouped by median sale price into five distinct groups numbered 1 to 5
• Area Ratio: Ratio of living space to the total lot space
• Remodeled: Indicated if a house was remodeled since its construction
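The three engineered features listed above might be derived roughly as follows. The source does not say how the five neighborhood groups were cut, so equal-sized quantile bins of the neighborhood medians are an assumption here, as are the helper name and the Ames-style column names (`Neighborhood`, `SalePrice`, `GrLivArea`, `LotArea`, `YearBuilt`, `YearRemodAdd`).

```python
import pandas as pd

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add NeighborhoodGroup, AreaRatio and Remodeled columns."""
    out = df.copy()
    # Median sale price per neighborhood, binned into 5 quantile groups (1 = cheapest).
    # Ranking first avoids duplicate bin edges when medians tie.
    medians = out.groupby("Neighborhood")["SalePrice"].median()
    groups = pd.qcut(medians.rank(method="first"), q=5, labels=[1, 2, 3, 4, 5]).astype(int)
    out["NeighborhoodGroup"] = out["Neighborhood"].map(groups)
    # Ratio of living space to total lot space
    out["AreaRatio"] = out["GrLivArea"] / out["LotArea"]
    # 1 if the remodel year is later than the construction year, else 0
    out["Remodeled"] = (out["YearRemodAdd"] > out["YearBuilt"]).astype(int)
    return out

# Toy data for illustration only
houses = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "D", "E"],
    "SalePrice": [200_000, 100_000, 300_000, 150_000, 250_000],
    "GrLivArea": [1000, 1200, 1500, 1800, 2000],
    "LotArea": [5000, 6000, 10000, 9000, 8000],
    "YearBuilt": [1960, 1975, 1990, 2000, 2005],
    "YearRemodAdd": [1960, 1995, 1990, 2004, 2005],
})
features = add_engineered_features(houses)
```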
In order to construct the linear regression model, the categorical features needed to be converted to a numeric format. Therefore, categorical features were converted to factors ranging from 1 to 5. Furthermore, features such as sale price and square footage were log-transformed in order to improve the linearity of these features and scale down their large values. Lastly, houses whose sale price was more than three standard deviations from the mean were excluded from the model.
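As a rough sketch of these three steps, the snippet below encodes one quality column on a 1 to 5 scale, drops price outliers, and log-transforms price and square footage. Several details are assumptions rather than facts from the project: the Ames-style grade codes (`Po` to `Ex`), the column names, and the choice to apply the three-standard-deviation filter to the raw price before taking logs.

```python
import numpy as np
import pandas as pd

# Assumed mapping from Ames-style grade codes to a 1-5 scale
QUALITY_SCALE = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

def prepare_for_regression(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["KitchenQual"] = out["KitchenQual"].map(QUALITY_SCALE)
    # Keep only rows whose sale price lies within 3 standard deviations of the mean
    z = (out["SalePrice"] - out["SalePrice"].mean()) / out["SalePrice"].std()
    out = out[z.abs() <= 3].copy()
    # Log-transform the target and a square-footage feature
    out["LogSalePrice"] = np.log(out["SalePrice"])
    out["Log1stFlrSF"] = np.log(out["1stFlrSF"])
    return out

# Toy data: 29 ordinary sales plus one extreme price
houses = pd.DataFrame({
    "SalePrice": [150_000 + 1_000 * i for i in range(29)] + [2_000_000],
    "1stFlrSF": [800 + 10 * i for i in range(30)],
    "KitchenQual": ["Po", "Fa", "TA", "Gd", "Ex"] * 6,
})
prepared = prepare_for_regression(houses)
```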
Model Training & Data Selection
After pre-processing, the data were split into a training set (80% of the data) and a test set (20% of the data). Aside from the target feature, sale price, all of the other features were kept and put through a stepwise regression using the Bayesian Information Criterion (BIC) to select the simplest multiple linear regression model among the candidate models considered. Following the selection, the model was evaluated on the test set to confirm the validity of the results, and the residuals from the model were used to evaluate its performance.
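To make the selection step concrete, here is a minimal sketch of an 80/20 split followed by forward stepwise selection on BIC, using only NumPy on synthetic data. The source does not state the direction of the stepwise search or the software used, so the forward direction, the function names, and the data are all assumptions. BIC is computed up to an additive constant as n·ln(RSS/n) + k·ln(n).

```python
import numpy as np

def with_intercept(X, cols):
    """Design matrix: a column of ones followed by the chosen columns of X."""
    return np.column_stack([np.ones(len(X)), X[:, cols]])

def bic(X, y, cols):
    """BIC of an OLS fit on the chosen columns, up to an additive constant."""
    n = len(y)
    design = with_intercept(X, cols)
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    return n * np.log(rss / n) + design.shape[1] * np.log(n)

def forward_stepwise_bic(X, y):
    """Greedily add the column that lowers BIC the most; stop when none does."""
    selected, remaining = [], list(range(X.shape[1]))
    best = bic(X, y, selected)  # intercept-only model
    while remaining:
        scores = {j: bic(X, y, selected + [j]) for j in remaining}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best:
            break
        best = scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return sorted(selected)

# Synthetic data: only columns 0 and 3 actually drive the target
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=500)

# 80/20 train/test split
idx = rng.permutation(500)
train, test = idx[:400], idx[400:]

chosen = forward_stepwise_bic(X[train], y[train])

# Refit on the training set and check the error on the held-out test set
beta, *_ = np.linalg.lstsq(with_intercept(X[train], chosen), y[train], rcond=None)
test_rmse = float(np.sqrt(np.mean((y[test] - with_intercept(X[test], chosen) @ beta) ** 2)))
```

Because the search is greedy, it examines only a path through the model space rather than every possible subset of features.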
The final model produced from the selection process contained the following housing features: Basement Quality, 1st Floor Square Footage, 2nd Floor Square Footage, Exterior Quality, Exterior Condition, Garage Quality, Kitchen Quality, Neighborhood Group, and Heating Quality. The adjusted R-squared value of the model was 0.8427.
Model Diagnostics: Evaluation of Residuals
Figure 1: Scatterplot of the final model's fitted values plotted against the model's residuals
Figure 1 shows a scatterplot of the model's fitted values plotted against the residuals. The plot is used to check whether the residuals exhibit any non-linear trend, which would violate the assumption of linearity and bring the results of the model into question, and whether their spread changes across the fitted values. In the plot, there is no visible trend in the spread of the residuals, suggesting that the assumption of constant variance is upheld in the model.
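A visual check like this can be backed by a rough numeric companion: the correlation between the fitted values and the absolute residuals, which should sit near zero when the spread is constant. This is an illustrative sketch on synthetic data, not part of the original analysis; the function name `spread_trend` is hypothetical.

```python
import numpy as np

def spread_trend(fitted, residuals):
    """Correlation between fitted values and absolute residuals."""
    return float(np.corrcoef(fitted, np.abs(residuals))[0, 1])

rng = np.random.default_rng(1)
fitted = rng.uniform(11.0, 13.0, size=2000)        # stand-in for fitted log prices
even = rng.normal(scale=0.1, size=2000)            # residuals with constant spread
fanned = rng.normal(scale=0.1 * (fitted - 10.5))   # spread grows with the fitted value

flat = spread_trend(fitted, even)      # close to zero
rising = spread_trend(fitted, fanned)  # clearly positive
```

A value near zero does not rule out every form of non-constant variance, so it complements the plot rather than replacing it.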
Figure 2: QQ-plot of the residuals of the final linear model
Figure 2 shows the normal QQ-plot of the residuals, indicating whether the residuals follow a normal distribution based on the distance from the dashed line. From the plot, most of the points are within close range of the dashed line except near the tail-ends. Although several points do significantly deviate from their theoretical values, because the model was built for descriptive purposes, this deviation was deemed acceptable for the final model.
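The points on a normal QQ-plot can be computed directly, which also gives a single number summarizing how closely they follow a straight line. The sketch below is illustrative only: it uses synthetic residuals, the hypothetical helper `qq_points`, and the plotting positions (i - 0.5)/n, which is one common convention among several.

```python
import numpy as np
from statistics import NormalDist

def qq_points(residuals):
    """Return (theoretical, sample) quantiles for a normal QQ-plot."""
    r = np.sort(np.asarray(residuals, dtype=float))
    r = (r - r.mean()) / r.std(ddof=1)           # standardize the residuals
    n = len(r)
    probs = (np.arange(1, n + 1) - 0.5) / n      # plotting positions
    theo = np.array([NormalDist().inv_cdf(p) for p in probs])
    return theo, r

rng = np.random.default_rng(2)

# Normally distributed residuals: points lie close to a straight line
theo, samp = qq_points(rng.normal(size=1000))
line_fit = float(np.corrcoef(theo, samp)[0, 1])

# Heavy-tailed residuals: the tails pull away from the line
theo_t, samp_t = qq_points(rng.standard_t(df=2, size=1000))
line_fit_heavy = float(np.corrcoef(theo_t, samp_t)[0, 1])
```

The correlation between theoretical and sample quantiles drops as the tails drift away from the line, which is the pattern described for the tail ends of Figure 2.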
Figure 3: Leverage plot of final model. The leverage values are plotted against the standardized residuals
Figure 3 shows the leverage plot of the final model, which reveals whether any points have a high influence on the model. Although there are some outliers, none of these points falls beyond the Cook's distance contours, indicating that they do not significantly affect the overall fit of the final model.
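The quantities behind a leverage plot can be computed from the hat matrix. The following NumPy sketch uses the standard formulas for leverage, standardized residuals, and Cook's distance on synthetic data with one deliberately influential point; the function name and data are hypothetical, and the common rule of thumb of flagging distances above 0.5 or 1 is an assumption about the thresholds, which the source does not state.

```python
import numpy as np

def cooks_distance(X, y):
    """Return leverage, standardized residuals and Cook's distance for an OLS fit."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    p = design.shape[1]
    hat = design @ np.linalg.inv(design.T @ design) @ design.T
    leverage = np.diag(hat)
    resid = y - hat @ y
    s2 = resid @ resid / (n - p)                       # residual variance estimate
    std_resid = resid / np.sqrt(s2 * (1 - leverage))   # standardized residuals
    cooks = std_resid**2 / p * leverage / (1 - leverage)
    return leverage, std_resid, cooks

rng = np.random.default_rng(3)
x = rng.normal(size=50)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=50)
x[0], y[0] = 6.0, -10.0   # inject a point that is both far out in x and far off the line

leverage, std_resid, cooks = cooks_distance(x, y)
flagged = np.flatnonzero(cooks > 0.5)   # indices of potentially influential points
```

A point needs both high leverage and a large residual to earn a large Cook's distance, which is why outliers with ordinary leverage can remain inside the contours.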
Based on the analysis of the residuals, the linear model reasonably upholds the assumptions of linear regression, meaning the model can now be interpreted to understand the value of each of the housing features.
Interpretation of the Model & Recommendations for Home Renovations
Through the model selection process, the final model retained the features that were most influential in predicting sale price, meaning that these features had the most influence on the overall value of a house. Given that the model predicted the logarithmically scaled price, the importance of a housing feature was interpreted by examining the magnitude of its coefficient.
Since the target feature was on a logarithmic scale, small changes in the predicted value can result in large differences once the result is exponentiated, so the coefficient values acted as a strong indicator of the impact a particular feature had on the sale price. In the model, the feature with the largest coefficient was first-floor square footage, with a value of 0.55.
First and Second Floors
Conversely, second-floor square footage had the smallest coefficient in magnitude among all features, with a value of 0.03. From this information, the Ames real-estate market appears to favor houses that prioritize first-floor development, such as ranch-style dwellings, rather than multi-level dwellings. From the builder's perspective, investing in first-floor development might be a more secure way of improving profit in the Ames real-estate market; second-floor development yields a comparatively smaller impact on the overall sale price and in some instances may not be worth the investment.
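To put the two square-footage coefficients in more concrete terms, they can be read as elasticities, assuming both square-footage features entered the model log-transformed alongside the log-scaled price (the pre-processing section suggests this, but it is an assumption here). Under that assumption, a given percentage change in area maps to a percentage change in price as sketched below; the function name is hypothetical.

```python
def pct_price_change(elasticity: float, pct_area_change: float) -> float:
    """Percent change in price for a percent change in area in a log-log model."""
    return ((1 + pct_area_change / 100) ** elasticity - 1) * 100

# Coefficients quoted in the text; a 10% increase in area is an arbitrary example
first_floor = pct_price_change(0.55, 10)    # roughly 5.4
second_floor = pct_price_change(0.03, 10)   # roughly 0.3
```

On this reading, 10% more first-floor area is associated with a price roughly 5.4% higher, while the same relative increase in second-floor area is associated with about 0.3%, holding the other features fixed.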
Garage and Basement
Garage quality and basement quality had a distinctive effect on the sale price in that these features could decrease the overall price. Basement quality referred to the height of the basement and was grouped into five distinct height ranges.
When the height of the basement was below seventy inches, the log-scaled price decreased by 0.14. Garage quality showed a similar trend; it was graded on a scale ranging from poor to excellent, and at the lowest grade, the log-scaled price decreased by 0.23. Essentially, having a low-quality basement or garage is detrimental to the overall price; therefore, builders should avoid constructing these features unless they have the capacity to build them at a higher height range or grade.
The housing exterior was the next feature that had a sizeable impact on the potential sale price. It was characterized by two columns: exterior quality and exterior condition.
Exterior Quality
The exterior quality refers to the quality of the material on the exterior of the house, while the exterior condition refers to the present condition of that material; both were on a scale ranging from poor to excellent. At the highest grade, the exterior condition feature had a coefficient of 0.47, meaning that in comparison to the lowest grade, there was a 0.47 increase in the log-scaled price when the condition was excellent. For exterior quality, the model showed a 0.33 increase in the log-scaled price when the quality was excellent.
For both features, the highest increase in log-scaled price was associated with the excellent grade, highlighting the importance of curb appeal in the marketability of a house. Therefore, a recommendation for home renovators and flippers seeking to turn a profit would be to invest in improvements to the exterior of the house.
Kitchen quality is one of the last features derived from the model that was found to have a significant impact on sale price. Similar to previous features, kitchen quality was assessed on a grading scale from poor to excellent, and at the highest grade, the log-scaled price improved by 0.44. The relatively high improvement highlights the kitchen as another viable area for home improvement; renovating this area may help to increase the overall sale price of a house.
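The grade coefficients quoted in the sections above are changes in the log-scaled price, so exponentiating them gives the implied percentage change in price. The short sketch below does that arithmetic, assuming each coefficient is the effect of a grade relative to the model's reference grade with all other features held fixed; the dictionary labels are descriptive only and the function name is hypothetical.

```python
import math

def pct_effect(coef: float) -> float:
    """Percent change in price implied by a shift of `coef` in log price."""
    return (math.exp(coef) - 1) * 100

# Coefficients as reported in the text
reported = {
    "exterior condition (excellent)": 0.47,
    "exterior quality (excellent)": 0.33,
    "kitchen quality (excellent)": 0.44,
    "basement height (below 70 in)": -0.14,
    "garage quality (lowest grade)": -0.23,
}
effects = {name: round(pct_effect(coef), 1) for name, coef in reported.items()}
```

For example, the 0.47 coefficient for excellent exterior condition corresponds to a price about 60% higher than the reference grade, while the -0.23 coefficient for the lowest garage grade corresponds to a price about 20% lower. These are associations in a descriptive model, not guaranteed returns on a renovation.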
Summary of Potential House Improvement
The multiple linear regression analysis yielded numerous insights on how to improve potential profits in the Ames real-estate market. For home builders, maximizing first-floor square footage would be a solid approach for seeing the most significant improvement in price. Investing in additional levels may not yield as much profit as focusing development on the first floor.
Furthermore, the construction of subpar basements and garages should be avoided, since they decrease the overall value of a house. However, seeking out houses with these low-quality features may yield major profits for home flippers. By buying houses with low-grade exteriors, garages, or kitchens and renovating these areas, home flippers have the potential to significantly improve the price of a house and turn a profit on it.