Data Analysis on the Valuable Features in Houses
When it comes time to sell a home, everyone hopes to sell for as high a price as possible. Many aspects of a home contribute to its overall value, but homeowners have no easy way to know which aspects matter more than others. While certain factors, such as neighborhood, are out of an owner's control, there are improvements and renovations that can potentially increase a home's value. But which features should be prioritized, and which are worth the cost? Our project sets out to use data to answer this.
Through Kaggle, we acquired data on the sale prices of homes in Ames, Iowa, including over 80 features specific to each house. Using machine learning algorithms paired with EDA and feature engineering, we created two models: one to identify the features with the most impact on value, and one to predict value based on the features selected by the first model.
From our analysis, we found that the most important features are the overall quality, number of fireplaces and full bathrooms, quality of the kitchen, and number of basement full bathrooms. Our final prediction model had an R-squared value of 0.805.
At first glance, the dataset contained an overwhelming number of NAs in many of the categorical features. A closer look at the documentation shows that each of these NAs describes the absence of that feature in the home. Thus, to improve the interpretability of the data, we imputed a string value for homes with NA under such a feature.
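A minimal sketch of this kind of imputation, assuming pandas and a small toy table. The replacement labels (`NoFireplace`, `NoGarage`) and the toy values are illustrative assumptions, not the labels or data used in the project.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data; values are made up for illustration.
homes = pd.DataFrame({
    "FireplaceQu": ["Gd", np.nan, "TA", np.nan],
    "GarageType": ["Attchd", "Detchd", np.nan, "Attchd"],
    "SalePrice": [208500, 181500, 140000, 250000],
})

# Hypothetical mapping: column name -> label that replaces NA,
# so that "no fireplace" becomes an explicit category.
na_labels = {"FireplaceQu": "NoFireplace", "GarageType": "NoGarage"}

# fillna with a dict fills each listed column with its own label.
homes_filled = homes.fillna(value=na_labels)
print(homes_filled)
```

Giving the absent feature its own label lets it act as a regular category level later, instead of being dropped as missing.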
Feature Selection Data
With 36 categorical variables, we noticed that not all levels within a category were significant enough to include in our model. In addition, dummifying every level would introduce a large amount of sparsity and increase the computing power needed.
To avoid this, we analyzed the means, medians, and distributions of each category's levels using box plots. Levels whose IQRs and medians occupied a similar range were considered for grouping into a single level. However, box plots do not show how many homes have a given categorical value. To visualize this, we binned the home prices and created bar charts. While a certain level might have a significantly different distribution and median, the bar chart shows us whether its count is large enough for it to stand alone or whether it should be grouped.
With over 80 features documented for each home, there was bound to be multicollinearity among them. To combat this issue, we regressed each numerical feature against the remaining ones. A regression with a large R-squared indicates a heavy presence of collinearity, and this was used as the basis for dropping or retaining certain numerical features.
For our final step of feature selection, we ran a lasso regression on the remaining features, using a grid search with a C value (the penalty strength) iterating from 1 to 1000. Lasso regression also helps with any multicollinearity that remains. As the C value increases, the coefficients of the less significant features shrink to 0. We continued to examine our lasso regressions until we found a C value at which only 4 upgradable features remained.
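The numbers behind the box plots and bar charts can be sketched as a table, assuming pandas. The toy prices, the bin edges, and the final grouping into `High`/`Mid`/`Low` are all hypothetical choices made for illustration.

```python
import pandas as pd

# Toy data: one categorical feature and the sale price.
homes = pd.DataFrame({
    "KitchenQual": ["Ex", "Gd", "Gd", "TA", "TA", "TA", "Fa", "Fa"],
    "SalePrice": [400000, 230000, 250000, 150000, 160000, 155000, 148000, 152000],
})

# Per-level count, quartiles, and median: what a box plot draws, plus the count.
summary = homes.groupby("KitchenQual")["SalePrice"].describe()
print(summary[["count", "25%", "50%", "75%"]])

# Bin prices and count homes per (bin, level): what the bar charts show.
homes["PriceBin"] = pd.cut(
    homes["SalePrice"],
    bins=[0, 175000, 300000, 500000],
    labels=["low", "mid", "high"],
)
counts = pd.crosstab(homes["PriceBin"], homes["KitchenQual"])
print(counts)

# Hypothetical decision: "TA" and "Fa" overlap in price, so merge them.
level_groups = {"Ex": "High", "Gd": "Mid", "TA": "Low", "Fa": "Low"}
homes["KitchenGroup"] = homes["KitchenQual"].map(level_groups)
```

In this toy table, `TA` and `Fa` have nearly the same median and range, which is the situation where merging two levels into one costs little information.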
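One way to sketch this check, assuming NumPy and synthetic data: regress each numeric column on all the others by ordinary least squares and record the R-squared. The function name `r_squared_against_rest` and the three toy columns are assumptions for illustration, not the project's code.

```python
import numpy as np

def r_squared_against_rest(X):
    """For each column of X, return R^2 from regressing it on the other columns."""
    n, p = X.shape
    scores = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Add an intercept column, then solve ordinary least squares.
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = resid @ resid
        ss_tot = ((y - y.mean()) ** 2).sum()
        scores.append(1.0 - ss_res / ss_tot)
    return np.array(scores)

# Synthetic features: total_rooms is built to track living_area closely,
# while lot_area is generated independently of both.
rng = np.random.default_rng(0)
living_area = rng.normal(1500, 400, 200)
lot_area = rng.normal(9000, 2000, 200)
total_rooms = living_area / 250 + rng.normal(0, 0.2, 200)

X = np.column_stack([living_area, lot_area, total_rooms])
r2 = r_squared_against_rest(X)
print(dict(zip(["living_area", "lot_area", "total_rooms"], r2.round(3))))
```

A column with R-squared close to 1 is almost fully explained by the others, so it is a candidate for dropping. The same quantity underlies the variance inflation factor, which is 1 / (1 - R-squared).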
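A rough sketch of this search, assuming scikit-learn and synthetic data. scikit-learn's `Lasso` calls the penalty strength `alpha` rather than C; the alpha grid, the synthetic coefficients, and the target of 4 features are illustrative and would need tuning on the real data.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first four of eight features truly affect y.
rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 8))
true_coef = np.array([50.0, 30.0, 20.0, 10.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_coef + rng.normal(scale=5.0, size=n)

# Scale features so the penalty treats every coefficient alike.
X_scaled = StandardScaler().fit_transform(X)

target = 4       # stop once this many features (or fewer) remain
selected = None
for alpha in [0.01, 0.1, 1, 2, 5]:   # increasing penalty strength
    model = Lasso(alpha=alpha, max_iter=10000).fit(X_scaled, y)
    kept = np.flatnonzero(model.coef_)   # indices of non-zero coefficients
    print(f"alpha={alpha}: {len(kept)} features kept")
    if len(kept) <= target:
        selected = (alpha, kept)
        break

print(selected)
```

Stepping the penalty upward and stopping at the first value that leaves the desired number of features mirrors the search described above, with the caveat that the right penalty scale depends on the units of the target.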
Modeling Data Selection
While there are many models to choose from, our main focus was the dollar-amount interpretability of a model's output. Certain models actually performed better in terms of R-squared; however, their outputs could only be interpreted in terms of relative importance. As a result, we decided to use a linear regression model to predict home sale values.
For our categorical features that required dummy variables (kitchen quality and basement finish type), we dropped the low-end levels, which therefore serve as the baseline for our model.
From our coefficients we see that kitchen quality has the potential to add the most value to a home, followed by number of full bathrooms, number of basement full baths, and lastly by the basement finishing type.
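The baseline coding described above can be sketched as follows, assuming pandas and NumPy. The toy prices are constructed so the fit is exact; the level names and dollar amounts are made up and are not the project's coefficients.

```python
import numpy as np
import pandas as pd

# Toy data built as: price = 90000 + 10000*FullBath + 20000*Mid + 50000*High.
homes = pd.DataFrame({
    "KitchenQual": ["Low", "Low", "Mid", "Mid", "High", "High"],
    "FullBath": [1, 2, 1, 2, 1, 2],
    "SalePrice": [100000, 110000, 120000, 130000, 150000, 160000],
})

# Order the levels so "Low" comes first; drop_first=True then removes it,
# making the low-end level the baseline.
homes["KitchenQual"] = pd.Categorical(
    homes["KitchenQual"], categories=["Low", "Mid", "High"]
)
X = pd.get_dummies(homes[["KitchenQual", "FullBath"]], drop_first=True, dtype=float)

# Ordinary least squares with an intercept.
A = np.column_stack([np.ones(len(X)), X.to_numpy(dtype=float)])
y = homes["SalePrice"].to_numpy(dtype=float)
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
coefs = pd.Series(coef, index=["Intercept", *X.columns])
print(coefs.round(2))
```

Each dummy coefficient reads directly as the dollar difference from the dropped low-end level, which is the interpretability that motivated choosing linear regression.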
For a renovation to be worth it, the value added must exceed the renovation cost. We estimated these costs using average prices quoted specifically for Iowa. For quality variables such as the kitchen and basement finish type, the cost was estimated based on how much renovation would most likely be needed to move from low-end to mid and then to high.
From our graph you can see that an excellent kitchen offers the most return on investment, followed by upgrading to a good kitchen. While the net estimated return for a full bath might be positive based on our results, the margin is narrow, so the specific home in question should be inspected more closely. Unfortunately, a renovated basement is not worth the cost.
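The comparison itself is simple arithmetic, sketched here in plain Python. The upgrade names and dollar figures are placeholders chosen for illustration; they are not the estimates from this project.

```python
def net_return(value_added, cost):
    """Dollar gain left after paying for the renovation."""
    return value_added - cost

# Placeholder (value_added, cost) pairs in dollars; not figures from this project.
upgrades = {
    "upgrade_a": (30_000, 18_000),
    "upgrade_b": (12_000, 11_500),
    "upgrade_c": (6_000, 15_000),
}

net = {name: net_return(value, cost) for name, (value, cost) in upgrades.items()}

# Keep only upgrades with a positive net return, best first.
worth_it = sorted((name for name in net if net[name] > 0), key=net.get, reverse=True)

print(net)
print(worth_it)
```

A small positive net return, such as the placeholder `upgrade_b`, is the kind of narrow margin where estimates for the specific home matter more than the averages.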