A Machine Learning Approach for Determining Valuable Housing Features

Brian Perez Joseph
Posted on Feb 20, 2021

Taking the perspective of real-estate advisors, this machine learning project set out to provide insights for investors and home renovators on how to improve profits in the real-estate market in Ames, Iowa. To that end, a descriptive multiple linear regression model was built to analyze how house features relate to sale price and how strongly each one influences it. The Ames housing dataset contains entries on approximately 2,500 houses sold between 2006 and 2010. It includes over 80 features describing house characteristics such as square footage, year of construction, and basement quality.

Pre-Processing of the Data

The dataset had 12,254 missing values, so prior to analysis, these missing values were imputed. Most of the missing values were associated with the absence of a particular housing feature; therefore, the general scheme for imputation was to replace "NA" values with either a zero, if the feature was numeric, or "none" if the feature was categorical.
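The imputation scheme described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the code used in the project; the column names and the toy rows are hypothetical, and missing values are assumed to be represented as `None`.

```python
# Hypothetical column names, chosen only for illustration.
NUMERIC_COLS = ["GarageArea", "BasementSF"]
CATEGORICAL_COLS = ["GarageQual", "BasementQual"]


def impute_missing(rows, numeric_cols, categorical_cols):
    """Return copies of the rows with missing values filled in.

    Missing numeric values become 0 and missing categorical values become
    "none", on the assumption that a missing entry means the house simply
    lacks that feature.
    """
    cleaned = []
    for row in rows:
        filled = dict(row)  # copy so the input rows are left untouched
        for col in numeric_cols:
            if filled.get(col) is None:
                filled[col] = 0
        for col in categorical_cols:
            if filled.get(col) is None:
                filled[col] = "none"
        cleaned.append(filled)
    return cleaned


# Toy data: the second house has no garage and no basement.
houses = [
    {"GarageArea": 480, "GarageQual": "typical", "BasementSF": 900, "BasementQual": "good"},
    {"GarageArea": None, "GarageQual": None, "BasementSF": None, "BasementQual": None},
]
imputed = impute_missing(houses, NUMERIC_COLS, CATEGORICAL_COLS)
print(imputed[1])
```

Treating "missing" as "absent" is only appropriate for features where that reading makes sense; a missing lot size, for example, would need a different strategy.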

Additionally, several new features were engineered to give the model more useful inputs:

• Neighborhood Groupings: Neighborhoods were grouped by median sale price into five distinct groups numbered 1 to 5

• Area Ratio: Ratio of living space to the total lot space

• Remodeled: Indicated if a house was remodeled since its construction
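The three engineered features listed above might be computed along the following lines. The post does not say how the five neighborhood groups were cut, so this sketch assumes neighborhoods are ranked by median sale price and split into equal-sized bins; all column names and the toy data are hypothetical.

```python
from statistics import median


def neighborhood_groups(rows, n_groups=5):
    """Map each neighborhood to a group from 1 (lowest median price) to n_groups."""
    prices = {}
    for row in rows:
        prices.setdefault(row["Neighborhood"], []).append(row["SalePrice"])
    # Rank neighborhoods by median sale price, cheapest first.
    ranked = sorted(prices, key=lambda name: median(prices[name]))
    # Assumption: split the ranking into n_groups roughly equal bins.
    return {name: i * n_groups // len(ranked) + 1 for i, name in enumerate(ranked)}


def add_features(rows):
    """Return copies of the rows with the three engineered features added."""
    groups = neighborhood_groups(rows)
    featured = []
    for row in rows:
        new_row = dict(row)
        new_row["NeighborhoodGroup"] = groups[row["Neighborhood"]]
        new_row["AreaRatio"] = row["LivingArea"] / row["LotArea"]
        # Flag houses whose remodel year is later than the construction year.
        new_row["Remodeled"] = int(row["YearRemodeled"] > row["YearBuilt"])
        featured.append(new_row)
    return featured


# Toy data: one house per neighborhood, in ascending price order.
houses = [
    {"Neighborhood": "A", "SalePrice": 100_000, "LivingArea": 1_000, "LotArea": 8_000, "YearBuilt": 1960, "YearRemodeled": 1960},
    {"Neighborhood": "B", "SalePrice": 150_000, "LivingArea": 1_500, "LotArea": 10_000, "YearBuilt": 1975, "YearRemodeled": 1998},
    {"Neighborhood": "C", "SalePrice": 200_000, "LivingArea": 1_800, "LotArea": 9_000, "YearBuilt": 1990, "YearRemodeled": 1990},
    {"Neighborhood": "D", "SalePrice": 250_000, "LivingArea": 2_000, "LotArea": 10_000, "YearBuilt": 2000, "YearRemodeled": 2005},
    {"Neighborhood": "E", "SalePrice": 320_000, "LivingArea": 2_400, "LotArea": 12_000, "YearBuilt": 2005, "YearRemodeled": 2005},
]
featured = add_features(houses)
print([h["NeighborhoodGroup"] for h in featured])
```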

In order to construct the linear regression model, the categorical features needed to be converted to a numeric format. Therefore, categorical features were converted to factors ranging from 1 to 5. Furthermore, features such as sale price and square footage were transformed on a logarithmic scale in order to improve the linearity of these features and scale down their large values. Lastly, any sale prices that were beyond three standard deviations from the mean were not included in the model.
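As a rough sketch of these three steps (ordinal encoding, log transformation, and outlier removal), the following uses plain Python on toy data. The grade labels and column names are assumptions, and the three-standard-deviation filter is applied here to the raw sale price; the post does not say whether the raw or log-scaled price was used.

```python
import math
from statistics import mean, stdev

# Assumed grade labels; the post only says the scale runs from poor to excellent.
QUALITY_LEVELS = {"poor": 1, "fair": 2, "typical": 3, "good": 4, "excellent": 5}


def drop_price_outliers(rows, n_sd=3.0):
    """Keep rows whose sale price is within n_sd standard deviations of the mean."""
    prices = [row["SalePrice"] for row in rows]
    mu, sd = mean(prices), stdev(prices)
    return [row for row in rows if abs(row["SalePrice"] - mu) <= n_sd * sd]


def transform(row):
    """Encode the quality grade as 1-5 and log-transform price and area."""
    return {
        "LogSalePrice": math.log(row["SalePrice"]),
        # Areas that can be zero would need separate handling, since log(0) is undefined.
        "LogFirstFloorSF": math.log(row["FirstFloorSF"]),
        "KitchenQual": QUALITY_LEVELS[row["KitchenQual"]],
    }


# Toy data: twenty ordinary sales plus one extreme price.
houses = [
    {"SalePrice": 150_000 + 1_000 * i, "FirstFloorSF": 1_000 + 10 * i, "KitchenQual": "good"}
    for i in range(20)
]
houses.append({"SalePrice": 2_000_000, "FirstFloorSF": 4_000, "KitchenQual": "excellent"})

kept = drop_price_outliers(houses)
model_rows = [transform(row) for row in kept]
print(len(houses), "->", len(kept))
```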

Model Training & Selection

After pre-processing, the data were split into a training set (80% of the data) and a test set (20% of the data). Aside from the target feature, sale price, all of the other features were kept and put through a stepwise regression that used the Bayesian Information Criterion (BIC) to select the simplest multiple linear regression model among the candidate models. Following the selection, the model was evaluated on the test set to confirm the validity of the results, and the residuals from the model were used to evaluate its performance.
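To make the procedure concrete, here is a small sketch of an 80/20 split followed by BIC-based stepwise selection on synthetic data. The post does not state the direction of the stepwise search, so forward selection is assumed; the BIC is computed as n·ln(RSS/n) + k·ln(n), which is the usual form for a Gaussian linear model up to an additive constant. The data, seed, and function names are illustrative only.

```python
import numpy as np


def design_matrix(X):
    """Prepend an intercept column to the predictors."""
    return np.column_stack([np.ones(X.shape[0]), X])


def bic(X, y):
    """BIC of an ordinary least squares fit with intercept."""
    n = len(y)
    design = design_matrix(X)
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    return n * np.log(rss / n) + design.shape[1] * np.log(n)


def forward_select(X, y):
    """Greedily add the column that lowers BIC most; stop when none does."""
    selected, remaining = [], list(range(X.shape[1]))
    best = bic(np.empty((len(y), 0)), y)  # intercept-only model
    while remaining:
        scores = {j: bic(X[:, selected + [j]], y) for j in remaining}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best:
            break
        best = scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected


# Synthetic data: columns 0 and 2 drive the target, columns 1 and 3 are noise.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=n)

# 80/20 train/test split on shuffled row indices.
order = rng.permutation(n)
train, test = order[:160], order[160:]

selected = forward_select(X[train], y[train])

# Refit on the training rows and check the error on the held-out rows.
beta, *_ = np.linalg.lstsq(design_matrix(X[train][:, selected]), y[train], rcond=None)
test_pred = design_matrix(X[test][:, selected]) @ beta
test_rmse = float(np.sqrt(np.mean((y[test] - test_pred) ** 2)))
print(selected, round(test_rmse, 3))
```

Because the BIC penalty grows with ln(n), it tends to favor smaller models than criteria such as AIC, which fits the stated goal of finding the simplest model.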

The final model produced from the selection process contained the following housing features: Basement Quality, 1st Floor Square Footage, 2nd Floor Square Footage, Exterior Quality, Exterior Condition, Garage Quality, Kitchen Quality, Neighborhood Group, and Heating Quality. The adjusted R-squared value of the model was 0.8427.
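For reference, adjusted R-squared penalizes the ordinary R-squared for the number of estimated slope coefficients. The helper below is a generic sketch of the standard formula; the example inputs are made up and are not the figures behind the 0.8427 reported above.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical inputs, for illustration only.
example = adjusted_r2(r2=0.5, n_obs=11, n_predictors=5)
print(example)
```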

Model Diagnostics: Evaluation of Residuals

Figure 1: Scatterplot of the final model's fitted values plotted against the model's residuals

Figure 1 shows a scatterplot of the model's fitted values plotted against the residuals. The plot is used to check whether the residuals show a non-linear trend across the fitted values, which would violate the assumption of linearity and bring the results of the model into question. In the plot, however, the residuals show no visible trend or change in spread, which suggests that both linearity and constant variance hold reasonably well.

Figure 2: QQ-plot of the residuals of the final linear model

Figure 2 shows the normal QQ-plot of the residuals, indicating whether the residuals follow a normal distribution based on the distance from the dashed line. From the plot, most of the points are within close range of the dashed line except near the tail-ends. Although several points do significantly deviate from their theoretical values, because the model was built for descriptive purposes, this deviation was deemed acceptable for the final model.
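The points on a normal QQ-plot can be computed directly, which may help clarify what the figure compares. This sketch pairs sorted, standardized residuals with theoretical normal quantiles; the plotting positions (i - 0.5) / n are one common convention among several, and the residuals below are invented.

```python
from statistics import NormalDist, mean, stdev


def qq_points(residuals):
    """Pair each theoretical normal quantile with a sorted, standardized residual."""
    n = len(residuals)
    mu, sd = mean(residuals), stdev(residuals)
    observed = sorted((r - mu) / sd for r in residuals)
    standard_normal = NormalDist()
    theoretical = [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, observed))


# Made-up residuals for illustration.
residuals = [-0.31, -0.12, -0.05, -0.01, 0.02, 0.06, 0.10, 0.35]
points = qq_points(residuals)
for theoretical_q, observed_q in points:
    print(f"{theoretical_q:6.2f} {observed_q:6.2f}")
```

Points that fall far from the line y = x, typically at either end, correspond to residuals in the tails that are heavier or lighter than a normal distribution would predict.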

Figure 3: Leverage plot of final model. The leverage values are plotted against the standardized residuals

Figure 3 shows the leverage plot of the final model, which indicates whether any points have a high influence on the model. Although there are some outliers, none of these points fall beyond the Cook's distance contours, indicating that they do not significantly affect the overall performance of the final model.
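The quantities behind a leverage plot can be computed by hand. The sketch below fits an ordinary least squares model to synthetic data with one deliberately influential point, then computes each observation's leverage (the diagonal of the hat matrix) and Cook's distance. The data and names are illustrative; a Cook's distance above 0.5 or 1 is a commonly cited rule of thumb for flagging influential points.

```python
import numpy as np


def leverage_and_cooks(X, y):
    """Return (leverage, Cook's distance) for an OLS fit with intercept."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    p = design.shape[1]  # number of estimated parameters, intercept included
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    # Leverage: diagonal of the hat matrix H = X (X'X)^-1 X'.
    xtx_inv = np.linalg.inv(design.T @ design)
    leverage = np.sum((design @ xtx_inv) * design, axis=1)
    s2 = resid @ resid / (n - p)  # residual variance estimate
    cooks = resid**2 / (p * s2) * leverage / (1 - leverage) ** 2
    return leverage, cooks


# Synthetic data following y = 2x plus noise.
rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(scale=0.3, size=50)
# Overwrite one observation so it sits far from the others and off the trend.
x[0], y[0] = 8.0, -5.0

leverage, cooks = leverage_and_cooks(x, y)
print(int(np.argmax(cooks)), float(cooks.max()))
```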

From the analysis of the residuals, the linear model upholds the assumptions of linear regression, meaning the model can now be interpreted to understand the value of each of the housing features.

Interpretation of the Model & Recommendations for Home Renovations

The model selection process retained the features that were most influential in predicting sale price, meaning that these features had the most influence on the overall value of a house. Given that the model predicted the log-scaled price, the importance of a housing feature was interpreted by examining the magnitude of its coefficient. Because the target was on a logarithmic scale, small changes in the prediction can translate into large differences once the result is exponentiated, so the coefficient values acted as a strong indicator of the impact a particular feature had on the sale price. The feature with the largest coefficient was first-floor square footage, with a value of 0.55. Conversely, second-floor square footage had the lowest coefficient among all features, with a value of 0.03. From this information, the Ames real-estate market appears to favor houses that prioritize first-floor development, such as ranch-style dwellings, over multi-level dwellings. From the builder's perspective, investing in first-floor development might be a more secure way of improving profit in the Ames real-estate market; second-floor development yields a comparatively smaller impact on the overall sale price and in some instances may not be worth the investment.
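To put the two square-footage coefficients on a more intuitive scale: when both the feature and the target are log-transformed, a coefficient acts as an elasticity. The sketch below assumes both square-footage features entered the model on a log scale, which the pre-processing section suggests but does not state for each feature individually; the helper name is hypothetical.

```python
def pct_change_log_log(coef, pct_increase):
    """Percent change in price for a given percent rise in a log-scaled feature.

    With log(price) = ... + coef * log(feature), scaling the feature by a
    factor f multiplies the predicted price by f ** coef.
    """
    return ((1 + pct_increase / 100) ** coef - 1) * 100


# Coefficients reported for the final model.
first_floor = pct_change_log_log(0.55, 10)
second_floor = pct_change_log_log(0.03, 10)
print(f"10% more first-floor area:  {first_floor:.2f}% higher price")
print(f"10% more second-floor area: {second_floor:.2f}% higher price")
```

Under these assumptions, a 10% increase in first-floor area corresponds to roughly a 5.4% higher predicted price, against about 0.3% for the same relative increase in second-floor area, holding the other features fixed.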

Garage quality and basement quality had a distinctive effect on the sale price in that these features could potentially decrease the overall price. Basement quality referred to the height of the basement and was grouped into five distinct height ranges. When the height of the basement was below seventy inches, the log-scaled price decreased by 0.14. Garage quality showed a similar trend; it was graded on a scale ranging from poor to excellent, and at the lowest grade, the log-scaled price decreased by 0.23. Essentially, having a low-quality basement or garage is detrimental to the overall price; therefore, builders should avoid constructing these features unless they have the capacity to build them at a higher height range or grade.

The housing exterior was the next feature that had a sizeable impact on the potential sale price. The housing exterior was characterized by two columns: exterior quality and exterior condition. Exterior quality refers to the quality of the material on the exterior of the house, while exterior condition refers to the present condition of that material; both were on a scale ranging from poor to excellent. At the highest grade, the exterior condition feature had a coefficient of 0.47, meaning that in comparison to the lowest grade, there was a 0.47 increase in the log-scaled price when the condition was excellent. For exterior quality, the model showed a 0.33 increase in the log-scaled price when the quality was excellent. For both features, the highest increase in log-scaled price was associated with the excellent grade, highlighting the importance of curb appeal in the marketability of a house. Therefore, a recommendation for home renovators and flippers seeking to turn a profit would be to invest in improvements to the exterior of the house.

Kitchen quality is one of the last features derived from the model that was found to have a significant impact on sale price. Similar to previous features, kitchen quality was assessed on a grading scale from poor to excellent, and at the highest grade, the log-scaled price improved by 0.44. The relatively high improvement highlights the kitchen as another viable area for home improvement; renovating this area may help to increase the overall sale price of a house.
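The changes in log-scaled price reported for the graded features can be converted into approximate percentage changes in price by exponentiating them. The sketch below does this for the coefficients quoted in this section. Each figure is relative to that feature's baseline grade, which the post states only for exterior condition (the lowest grade), so the conversions should be read as illustrative.

```python
import math


def pct_change_log_level(coef):
    """Percent change in price implied by a shift of coef in log(price)."""
    return (math.exp(coef) - 1) * 100


# Changes in log-scaled price quoted in the text.
reported = {
    "Exterior condition (excellent)": 0.47,
    "Kitchen quality (excellent)": 0.44,
    "Exterior quality (excellent)": 0.33,
    "Basement quality (under 70 inches)": -0.14,
    "Garage quality (lowest grade)": -0.23,
}
pct = {name: pct_change_log_level(coef) for name, coef in reported.items()}
for name, value in pct.items():
    print(f"{name}: {value:+.1f}%")
```

Because exponentiation is non-linear, a coefficient of 0.47 corresponds to roughly a 60% higher price rather than 47%, while a coefficient of -0.23 corresponds to roughly a 21% lower price.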

Summary of Potential House Improvement

The multiple linear regression analysis yielded numerous insights on how to improve potential profits in the Ames real-estate market. For home builders, maximizing first-floor square footage would be a solid approach for seeing the most significant improvement in price. Investing in additional levels may not yield as much profit as focusing development on the first floor. Furthermore, the construction of subpar basements and garages should be avoided, since they decrease the overall value of a house. However, seeking out houses with these low-quality features may yield major profits for home flippers. By finding houses with low-grade exteriors, garages, or kitchens and renovating these areas, home flippers have the potential to significantly improve the price of a house and turn a profit on it.

About Author

Brian Perez Joseph

With a background in biomedical research and data science, Brian aims to utilize his quantitative background in the sciences and data programming skills to provide data-driven decision making strategies and key insights for real-world business problems.