Data Analysis in Predictive Modeling to Enhance Home Profit
The skills I demoed here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Thank you for taking the time to read our research! Please feel free to use the links below to explore our code on GitHub.
The housing data of Ames, Iowa challenges budding Data Scientists in many aspects of data wrangling, preparation, and predictive modeling. It has proved to be a popular dataset in the data science community, providing an excellent opportunity to test machine learning models on feature-rich information. For every observation of a completed sale, it contains 80 features of the house that was sold. The success of our project was determined by three objectives:
- Create machine learning models that would accurately predict a home's sale price in Ames
- Perform Exploratory Data Analysis to understand the housing market in Ames
- Assume the role of a home-improvement advisor. Using the information gathered from our EDA and the results of our machine-learning models, we would recommend to homeowners the key home features to improve in order to have the greatest positive impact on their home value
The dataset consisted of quantitative, ordinal, and categorical variables. Prior to running the dataset through our models:
- We log-transformed the "SalePrice," "GrLivArea," and "LotArea" variables. We expected "GrLivArea" and "LotArea" to be key drivers of our target value, "SalePrice," and our EDA showed that the distributions of all three were right-skewed. Log-transformation was a quick way to address the skew.
- To address the missing data, we used our best judgment to fill the NAs with the mean or with zeros. For most features, a missing value typically meant that the home did not have that feature. For example, the size of the pool was missing for a majority of homes because those homes had no pool.
- We took the time to label-encode or dummify many of the categorical and ordinal features. The dummification was needed in order to apply linear regression models; it was not required for the random forest model.
Lastly, we removed houses with a "GrLivArea" over 4,000 square feet. When plotted, these four homes were clear outliers, which justified their removal.
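The preparation steps above can be sketched in pandas roughly as follows. This is a minimal illustration under stated assumptions, not our exact pipeline: the `preprocess` helper is hypothetical, the column names other than "SalePrice", "GrLivArea", and "LotArea" (here "PoolArea", "LotFrontage", "KitchenQual", "Neighborhood") and the quality codes are assumed, and the choice of which column gets a mean fill versus a zero fill is only an example.

```python
import numpy as np
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps described above to an Ames-style frame."""
    # Drop the very large homes treated as outliers (over 4,000 sq ft).
    # This has to happen before the log-transform, while units are raw.
    out = df[df["GrLivArea"] <= 4000].copy()

    # Assumed example: a missing pool size means "no pool", so fill with 0.
    out["PoolArea"] = out["PoolArea"].fillna(0)
    # Assumed example: for a measurement every home has, use the column mean.
    out["LotFrontage"] = out["LotFrontage"].fillna(out["LotFrontage"].mean())

    # Log-transform the right-skewed target and size variables.
    for col in ["SalePrice", "GrLivArea", "LotArea"]:
        out[col] = np.log(out[col])

    # Label-encode an ordinal quality scale so that its order is preserved.
    quality_order = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
    out["KitchenQual"] = out["KitchenQual"].map(quality_order)

    # One-hot encode ("dummify") a nominal feature for the linear models.
    return pd.get_dummies(out, columns=["Neighborhood"], drop_first=True)


# Tiny made-up frame for illustration only (not rows from the real dataset).
toy = pd.DataFrame({
    "SalePrice": [200000, 150000, 750000],
    "GrLivArea": [1500, 1200, 4500],
    "LotArea": [9000, 8000, 40000],
    "PoolArea": [np.nan, 500, np.nan],
    "LotFrontage": [60.0, np.nan, 100.0],
    "KitchenQual": ["Gd", "TA", "Ex"],
    "Neighborhood": ["NAmes", "CollgCr", "NAmes"],
})
clean = preprocess(toy)
print(clean)
```

A tree-based model could skip the `get_dummies` step and work from label-encoded columns, which matches the note above that dummification was only needed for the linear models.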
Data Model Summary
Lasso and Stepwise Model
Our highest performing models were the Lasso model (regularized linear regression) and Stepwise model (regression model that iteratively adds or removes features to achieve the best performance). Both of these models satisfied our stated objectives of creating accurate predictive models. However, we also needed assistance in narrowing down tangible features to offer our clients as home improvement advisors.
Our Lasso model initially had a very small alpha and therefore returned too many features as significant (non-zero). Raising the alpha reduced the model's R-squared, but it completed the crucial task of narrowing the number of significant features down to 20.
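The trade-off between alpha and the number of surviving features can be illustrated with scikit-learn's `Lasso`. The sketch below runs on synthetic data, not the Ames dataset; the alpha values, the data shape, and the `count_nonzero_features` helper are assumptions chosen only to show the effect.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for an encoded feature matrix: 200 rows, 30 features,
# of which only the first 5 actually influence the target.
X = rng.normal(size=(200, 30))
true_coef = np.zeros(30)
true_coef[:5] = [0.5, 0.4, 0.3, 0.2, 0.1]
y = X @ true_coef + rng.normal(scale=0.1, size=200)

# Lasso penalizes coefficient size, so features should share a common scale.
X_scaled = StandardScaler().fit_transform(X)


def count_nonzero_features(alpha: float) -> int:
    """Fit a Lasso at the given alpha and count coefficients not shrunk to 0."""
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X_scaled, y)
    return int(np.sum(model.coef_ != 0))


nonzero_counts = {
    alpha: count_nonzero_features(alpha) for alpha in (0.0001, 0.01, 0.1, 10.0)
}
for alpha, kept in nonzero_counts.items():
    print(f"alpha={alpha:<7} non-zero coefficients: {kept}")
```

A small alpha keeps nearly every feature, while a large enough alpha shrinks all of them to zero. In practice one would pick a value in between that leaves an interpretable number of features, at some cost in fit.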
Having reduced the number of features to a more manageable and interpretable quantity, we applied the OLS (ordinary least squares) regression model.
This was a critical step for us as home improvement consultants because this model, combined with our log-transformed "SalePrice" target variable, let us translate changes in home features into percentage changes in home price. As an example, we could now tell homeowners that, with all other variables held constant, a one-unit improvement in their Overall Quality score could increase their sale price by approximately 7.13%.
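The arithmetic behind this kind of statement is short. When the target is log(price), a coefficient beta on an untransformed feature implies that a one-unit increase multiplies the price by exp(beta). The sketch below uses a hypothetical helper name, and the coefficient is back-calculated from the 7.13% figure purely for illustration rather than taken from our model output.

```python
import math


def pct_change_per_unit(beta: float) -> float:
    """Percent change in price for a one-unit increase in a feature,
    given that feature's coefficient in a model of log(price)."""
    return (math.exp(beta) - 1.0) * 100.0


# Illustrative only: the coefficient that corresponds to a 7.13% effect.
beta_example = math.log(1.0713)
print(f"beta = {beta_example:.4f}")
print(f"effect = {pct_change_per_unit(beta_example):.2f}% per unit")
```

For small coefficients the percentage is close to 100 times beta, which is why the raw coefficients in a log-price model can be read as approximate percentage effects.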
Random Forest Model
The last machine learning model we created was Random Forest. The benefits of this model were two-fold: we were able to gauge the performance of a tree-based model on predicting sales prices, while garnering feature importance, as well. In assessing its predictive power, the random forest model did not perform as well as the linear regression models. We suspect that with more hyperparameter tuning and feature engineering, we would be able to increase its score.
Nevertheless, the Random Forest still provided valuable information on feature importance. We were able to compare the features deemed important by this model to the ones selected by our Lasso model. Fortunately, a majority of features were identified as important by both models, strengthening the case for their significance. Armed with strong models that yielded promising results, we were able to confidently continue with additional EDA and identify key features for our clients with critical and convincing data.
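One simple way to make this comparison concrete is to rank features by each model's score and intersect the top of the two rankings. The sketch below is an assumption about how such a check could be done, not a record of our code: `top_features` is a hypothetical helper, and the scores are made-up numbers for illustration, not our results.

```python
def top_features(scores: dict[str, float], k: int) -> list[str]:
    """Return the k feature names with the largest absolute score."""
    return sorted(scores, key=lambda name: abs(scores[name]), reverse=True)[:k]


# Made-up scores for illustration only; these are not the article's results.
lasso_coefs = {
    "OverallQual": 0.07,
    "GrLivArea": 0.12,
    "KitchenQual": 0.03,
    "PoolArea": 0.0,
    "GarageCars": 0.02,
}
forest_importances = {
    "OverallQual": 0.45,
    "GrLivArea": 0.30,
    "KitchenQual": 0.02,
    "PoolArea": 0.01,
    "GarageCars": 0.08,
}

# Features that rank in the top 3 for both models.
shared = set(top_features(lasso_coefs, 3)) & set(top_features(forest_importances, 3))
print(sorted(shared))
```

Ranking Lasso coefficients by magnitude is only meaningful when the features were put on a common scale before fitting, so that is assumed here.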
In order to get a better understanding of the town overall, we first looked at the neighborhoods of Ames. Taking a count of houses sold in each neighborhood, we visualized the top 10, which accounted for 70% of the houses in the dataset. As shown below, a comparison of the neighborhoods based on average sale price and average square foot living area displayed generally similar trends across the plots for each neighborhood. This supported the finding in our model that living area is one of the most significant indicators of sale price.
Sale Data Trends
The heat map shows a generally positive relationship between sale price and the year that the house was built, indicating that newer houses in Ames are generally more expensive. The concentrated, dark pockets in the graph can be explained by two events. The first was the post-war housing boom around the 1950s, and the second was the housing boom leading up to the 2008 financial crisis.
Knowing when to list your house is crucial to getting the most value out of your sale. When we visualized the months in which houses were sold in Ames, the roughly bell-shaped bar plot made it evident that the summer months contained many more sales. As housing consultants, we would advise a client to list their house during May, June, or July to increase their chances of making a sale.
Proximity to Iowa State University
In order to create additional features for our model, we integrated geospatial data for the houses in our data set. Our presumption was that, in a college town, proximity to Iowa State University would play a significant part in the house price. However, the first graph below shows that the majority of homes sold were between 3 and 4 miles from campus and that proximity did not necessarily translate to higher-priced homes. The second graph supports this finding: the similarity of the graphs suggests that price correlates more strongly with living area than with distance to ISU.
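A distance-to-campus feature of this kind can be derived from each home's latitude and longitude with the haversine formula. The sketch below is one possible approach, not necessarily the method we used: the campus coordinates are approximate and assumed, and the function names are hypothetical.

```python
import math

# Approximate coordinates of the Iowa State University campus (assumption).
ISU_LAT, ISU_LON = 42.0267, -93.6465
EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def miles_to_isu(lat: float, lon: float) -> float:
    """Distance in miles from a home to the assumed campus location."""
    return haversine_miles(lat, lon, ISU_LAT, ISU_LON)


# Hypothetical home location a short way north-east of campus.
print(f"{miles_to_isu(42.05, -93.62):.2f} miles from campus")
```

The result is a straight-line distance. Driving distance or travel time would need a routing service, which is outside the scope of this sketch.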
Home and Zoning Classifications
The graph above shows that the majority of home sales are in Residential Low density zones, rather than the Mid-to-High density residential areas closer to ISU and downtown Ames. The stacked histogram below reveals that the overwhelming majority of buildings were single-family homes.
To take on the role of a home improvement advisor, we investigated renovation-specific coefficients that our model deemed significant. One of the features of highest significance was “Overall Quality,” which refers to the curb appeal of your home. If you have ever been to an open house, you know the first impression when you pull up has a lasting impact and is usually a good indicator of what is actually inside the home.
With our target variable, sale price, in log form, we exponentiated our beta values in order to make them interpretable. Our OLS model indicated that, holding all else constant, every one-unit increase on the overall quality scale increases the sale price by 7.13% on average. Improving curb appeal is crucial to a home sale. It can include updates to your landscaping, front entrance, exterior paint, and siding.
To further emphasize the importance of curb appeal, our model returned specific factors in this category, including the type of driveway and the quality of the home’s exterior. The OLS model showed that upgrading the driveway from dirt or gravel to partial pavement can increase the sale price by 1.99% on average. In addition, sale price trends upward as the quality of the house exterior improves: every one-unit increase in exterior quality raises the sale price by 3.2% on average, holding all else constant.
Heating Ventilation Air Conditioning (HVAC)
We found that, on average, the presence of a central air conditioning system increased home value by approximately 4.53%. As consultants, however, we would first have to determine whether the home can accommodate the necessary ductwork, and then weigh the high installation cost against the potential return on investment.
Heating Quality contributed approximately 1.73% to a home’s value when holding other features constant. The positive trend is clear when upgrading the heating quality from poor to excellent. This is further supported by inspecting the top ten neighborhoods. The most expensive neighborhoods (Northridge Heights, Somerset, and Sawyer West) only have top-quality heating, signifying its importance.
After further research on the weather in Ames, the correlation between home value and heating quality makes sense. Home heating is a necessity, with average monthly temperatures below freezing and daily lows in the teens between December and February.
Kitchen Quality was another key feature influencing home price. Generally, we found an increase of approximately 3.07% in home price with every one-unit increase in kitchen quality. The correlation makes sense, as the kitchen is one of the most utilized spaces in the home. Furthermore, the amount of money that goes into every square foot of the kitchen (tile, cabinetry, appliances, etc.) is higher than for other rooms of the home.
Four basement-related coefficients were returned as significant in our model, indicating that the basement is an important predictor of sale price. To analyze it from a home renovator’s perspective, we decided to take a closer look at basement finish type. The box plot shows that homes with no basement generally had lower sale prices.
However, following that, one unit upgrades in your basement finish don’t show much of a change in price until you get to the last type, good living quarters. Due to this, we would advise our clients that it might not be worth your while to upgrade the basement in terms of improving sale price unless you are able to get it up to those good living quarters standards.
With the added dimension of total basement square footage--another significant variable in our model--this interesting visualization of Ames overall displays that the houses with the highest listed sale price had the largest total basement square footage and were generally all up to good living quarter standards.
In our assessment of garage size and its effect on home price, we found a positive correlation: each increase in size, measured in the number of cars that can fit in the garage, added approximately 2.23% to home value. An interesting trend to note here is that the positive correlation peaks at 3-car garages.
We can infer that the decline in value for 4- or 5-car garages is due to the smaller lot areas of the homes with garages of those sizes, as presented in the second graph. The lot area plays a significant role in the home price, and its effect can be felt in the homes with 4- or 5-car garages.
As home renovation consultants, we cannot recommend that our clients build larger garages, but we can recommend that they improve the finish of the garage. As displayed in the boxplot below, clients would see a consistent return on investment as they upgrade from an unfinished to a semi-finished to a finished garage.
In the creation of our predictive housing price model, we identified the features that had the biggest effect on sale price through various machine learning models and hyperparameter-tuning methods. As home renovation consultants, we further focused on the features we can recommend that our clients address: curb appeal, HVAC, and the quality of kitchens, basements, and garages.
Future work to expand this project would focus on more models and hyperparameter-tuning techniques. As home renovation consultants, we would also aim to explore more feature engineering in order to expand our renovation recommendations to our clients.