Data Analysis in Predictive Modeling to Enhance Home Profit

Posted on Mar 16, 2021
Thank you for taking the time to read our research! Please feel free to use the links below to explore our code on GitHub.


 

The Ames, Iowa housing dataset challenges budding data scientists with data wrangling, preparation, and predictive modeling. It is popular in the data science community and provides an excellent opportunity to test machine learning models on feature-rich information. Each observation is a completed sale and contains 80 features of the house that was sold. The success of our project was determined by three objectives:

  1. Create machine learning models that would accurately predict a home's sale price in Ames
  2. Perform Exploratory Data Analysis to understand the housing market in Ames
  3. Assume the role of a home-improvement advisor. Using the information gathered from our EDA and the results of our machine learning models, recommend to homeowners the key home features to improve for the greatest positive impact on their home value

Data Preparation

The dataset consisted of quantitative, ordinal, and categorical variables.  Prior to running the dataset through our models:

  • We log-transformed the "SalePrice," "GrLivArea," and "LotArea" variables. We expected "GrLivArea" and "LotArea" to be key drivers of our target value, "SalePrice." More importantly, our EDA showed that all three distributions were right-skewed, and a log transformation was a quick way to address the skew.
  • To address the missing data, we used our best judgment to fill the NAs with the mean or with zeros. For most features, a missing value meant that the house did not have that feature. For example, the size of the pool was missing for a majority of homes because those homes did not have a pool.
  • We label-encoded or dummified many of the categorical and ordinal features. The dummification was needed to apply the linear regression models; it was not required for the random forest model.

Lastly, we removed houses with a GrLivArea over 4,000 square feet. In our plots, these four homes were clear outliers, which justified their removal.
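
The preparation steps above can be sketched in pandas roughly as follows. This is a minimal illustration on a made-up four-row frame rather than the project's actual pipeline. The column names follow the Ames data dictionary, but the choice of which column gets a zero fill and which gets a mean fill ("PoolArea" and "LotFrontage" here), the ordinal mapping, and the `prepare` helper itself are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Assumed ordinal scale for quality ratings (Po < Fa < TA < Gd < Ex).
QUALITY_ORDER = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the preparation steps described above to a raw housing frame."""
    # Drop outliers first, while GrLivArea is still in square feet.
    df = df[df["GrLivArea"] <= 4000].copy()

    # Log-transform the right-skewed target and size variables.
    for col in ["SalePrice", "GrLivArea", "LotArea"]:
        df[col] = np.log(df[col])

    # A missing pool area means "no pool", so zero is the natural fill.
    df["PoolArea"] = df["PoolArea"].fillna(0)
    # Assumption: a genuinely unknown measurement gets the column mean.
    df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].mean())

    # Label-encode an ordinal feature, then dummify a nominal one.
    df["KitchenQual"] = df["KitchenQual"].map(QUALITY_ORDER)
    return pd.get_dummies(df, columns=["CentralAir"], drop_first=True)


# Made-up rows for illustration; the third one exceeds 4,000 square feet.
raw = pd.DataFrame({
    "SalePrice": [200000, 150000, 320000, 180000],
    "GrLivArea": [1500, 1100, 4500, 1700],
    "LotArea": [9000, 7000, 40000, 8500],
    "PoolArea": [np.nan, np.nan, 500.0, np.nan],
    "LotFrontage": [60.0, np.nan, 120.0, 70.0],
    "KitchenQual": ["Gd", "TA", "Ex", "TA"],
    "CentralAir": ["Y", "N", "Y", "Y"],
})
prepared = prepare(raw)
print(prepared)
```

Filtering before the log transform keeps the 4,000 square foot threshold in its original units; filtering afterwards would work equally well with a threshold of log(4000).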

Data Model Summary

Lasso and Stepwise Model

Our highest-performing models were the Lasso model (regularized linear regression) and the Stepwise model (a regression model that iteratively adds or removes features to achieve the best performance). Both of these models satisfied our stated objective of creating accurate predictive models. However, as home improvement advisors, we also needed to narrow the results down to tangible features we could offer our clients.

Our Lasso model initially had a very small alpha and therefore returned too many features as significant (non-zero coefficients). Raising the alpha reduced the model's R-squared but accomplished the crucial task of narrowing the number of significant features down to 20.
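
The trade-off between alpha, the number of surviving features, and R-squared can be illustrated with scikit-learn. The sketch below uses synthetic data (30 features, of which only 5 are informative) as a stand-in for the housing matrix. The alpha values and the `lasso_summary` helper are assumptions for the example, not the settings used in the project.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic stand-in for the housing matrix: 30 features, only 5 informative.
X = rng.normal(size=(200, 30))
true_coefs = np.zeros(30)
true_coefs[:5] = [3.0, 2.0, 1.5, 1.0, 0.5]
y = X @ true_coefs + rng.normal(scale=0.1, size=200)


def lasso_summary(alpha: float) -> tuple[int, float]:
    """Fit a Lasso and return (number of non-zero coefficients, R-squared)."""
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)
    return int(np.count_nonzero(model.coef_)), float(model.score(X, y))


# A larger alpha penalizes coefficients harder, zeroing out more of them.
for alpha in (0.001, 0.05, 0.3):
    n_features, r2 = lasso_summary(alpha)
    print(f"alpha={alpha}: {n_features} non-zero features, R^2={r2:.3f}")
```

In practice one would pick the alpha by scanning a grid like this until the feature count is small enough to interpret, accepting the drop in R-squared as the cost.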

Having reduced the number of features to a more manageable and interpretable quantity, we applied the OLS (ordinary least squares) regression model.

This was a critical step for us as home improvement consultants because this model, combined with our log-transformed "SalePrice" target variable, let us equate changes in home features to percentage changes in home price. As an example, we could now tell homeowners that, with all other variables held constant, a one-unit improvement in their Overall Quality score could increase their sale price by approximately 7.13%.

Random Forest Model

The last machine learning model we created was a Random Forest. The benefits of this model were twofold: we were able to gauge the performance of a tree-based model at predicting sale prices, and we obtained feature importances as well. In predictive power, the random forest model did not perform as well as the linear regression models. We suspect that more hyperparameter tuning and feature engineering would increase its score.

Nevertheless, the Random Forest still provided valuable information on feature importance. We compared the features deemed important by this model with the ones selected by our Lasso model. Fortunately, a majority of features were identified as important by both models, strengthening the case for their significance. Armed with strong models that yielded promising results, we could confidently continue our EDA and identify key features for our clients with convincing data.
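
One simple way to make this comparison concrete is to intersect the features the Lasso kept (non-zero coefficients) with the top-ranked features from the forest. The sketch below assumes both models' outputs have been collected into plain dictionaries. The `shared_features` helper and all of the numbers are hypothetical and for illustration only.

```python
def shared_features(lasso_coefs, rf_importances, top_n=20):
    """Return features kept by the Lasso that are also in the forest's top_n."""
    # The Lasso "selects" a feature when its coefficient is non-zero.
    lasso_selected = {name for name, coef in lasso_coefs.items() if coef != 0}
    # Rank forest features from most to least important.
    rf_ranked = sorted(rf_importances, key=rf_importances.get, reverse=True)
    return lasso_selected & set(rf_ranked[:top_n])


# Hypothetical values, not results from the project.
lasso_coefs = {
    "OverallQual": 0.07, "GrLivArea": 0.35, "PoolArea": 0.0, "KitchenQual": 0.03,
}
rf_importances = {
    "OverallQual": 0.40, "GrLivArea": 0.30, "PoolArea": 0.05, "KitchenQual": 0.02,
}

print(sorted(shared_features(lasso_coefs, rf_importances, top_n=3)))
# → ['GrLivArea', 'OverallQual']
```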

Discovering Ames

In order to get a better understanding of the town overall, we first looked at the neighborhoods of Ames. Taking a count of houses sold in each neighborhood, we visualized the top 10, which accounted for 70% of the houses in the dataset.  As shown below, a comparison of the neighborhoods based on average sale price and average square foot living area displayed generally similar trends across the plots for each neighborhood. This supported the finding in our model that living area is one of the most significant indicators of sale price.

Sale Data Trends

The heat map shows a generally positive relationship between sale price and the year the house was built, indicating that newer houses in Ames are generally more expensive. The concentrated, dark pockets in the graph can be explained by two events: the nationwide post-war housing boom around the 1950s and the housing boom leading up to the 2008 financial crisis.

Knowing when to list your house is crucial to getting the most value out of your sale. When we visualized the months in which houses were sold in Ames, the roughly bell-shaped bar plot made it evident that the summer months had many more sales. As housing consultants, we would advise a client to list their house in May, June, or July to increase their chances of making a sale.

Proximity to Iowa State University

To create additional features for our model, we integrated geospatial data for the houses in our dataset. Because Ames is a college town, we presumed that proximity to Iowa State University (ISU) would play a significant part in house price. However, the first graph below shows that the majority of homes sold were between 3 and 4 miles from campus and that proximity did not necessarily translate to higher-priced homes. The second graph confirms this: the similarity of the graphs points to a stronger correlation between living area and price than between distance to ISU and price.
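
A distance-to-campus feature of this kind can be derived from latitude and longitude with the haversine formula. The sketch below is one possible way to compute it, not necessarily the method used in the project. The campus coordinates are approximate, and the house coordinates are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8
# Approximate coordinates of the ISU campus (an assumption for this sketch).
ISU_LAT, ISU_LON = 42.0267, -93.6465


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def miles_to_isu(lat, lon):
    """Distance from a house to the (approximate) ISU campus location."""
    return haversine_miles(lat, lon, ISU_LAT, ISU_LON)


# Hypothetical house location somewhere in Ames.
print(round(miles_to_isu(42.05, -93.62), 2))
```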

Home and Zoning Classifications

The graph above shows that the majority of home sales are in Residential Low Density zones, rather than in the medium-to-high-density residential areas closer to ISU and downtown Ames. The stacked histogram below reveals that the overwhelming majority of buildings were single-family homes.

Home Improvements

Curb Appeal

To take on the role of a home improvement advisor, we investigated renovation-specific coefficients that our model deemed significant. One of the features of highest significance was “Overall Quality,” which rates the overall material and finish of the house and which we treat here as a proxy for curb appeal. If you have ever been to an open house, you know the first impression when you pull up has a lasting impact and is usually a good indicator of what is actually inside the home.

Because our target variable, sale price, was in log form, we exponentiated our beta values to make them interpretable. Our OLS model indicated that, holding all else constant, every one-unit increase on the overall quality scale increases the sale price by 7.13% on average. Improving curb appeal is crucial to a home sale. It can include updates to your landscaping, front entrance, exterior paint, and siding.
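
The exponentiation step can be written out explicitly. In a regression on log(price), a one-unit increase in a feature with coefficient beta multiplies the predicted price by exp(beta), which is a change of (exp(beta) - 1) × 100 percent. In the sketch below, the coefficient value is hypothetical, chosen only so that the result lands near the 7.13% quoted above, and `pct_change_per_unit` is an illustrative helper.

```python
import math


def pct_change_per_unit(beta: float, units: float = 1.0) -> float:
    """Percentage change in price for a `units` increase in a feature,
    given its coefficient `beta` from a regression on log(price)."""
    return (math.exp(beta * units) - 1.0) * 100.0


# Hypothetical coefficient, chosen to land near the 7.13% figure.
beta_quality = 0.0689

print(round(pct_change_per_unit(beta_quality), 2))
# Effects compound multiplicatively: two units is more than twice one unit.
print(round(pct_change_per_unit(beta_quality, units=2), 2))
```

For small coefficients the percentage change is close to beta × 100, which is why the raw coefficient is often read directly as an approximate percentage.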

To further emphasize the importance of curb appeal, our model returned specific factors in this category, including the type of driveway and the quality of the home’s exterior. The OLS model showed that upgrading the driveway from dirt or gravel to partial pavement can increase the sale price by 1.99% on average. In addition, the upward trend in sale price as exterior quality improves shows that every one-unit increase in exterior quality can increase the sale price by 3.2% on average, holding all else constant.

Heating, Ventilation, and Air Conditioning (HVAC)

We found that, on average, the presence of a central air conditioning system increased the home value by approximately 4.53%. Because installation is costly, however, as consultants we would first have to determine whether the home can accommodate the necessary ductwork, and then weigh the cost against the potential return on investment.

Heating quality contributed approximately 1.73% to a home’s value when holding other features constant. The positive trend is clear when upgrading the heating quality from poor to excellent, and it is further supported by the top ten neighborhoods: the most expensive of them (Northridge Heights, Somerset, and Sawyer West) have only top-quality heating, signifying its importance.

Further research on the weather in Ames shows why the correlation between home value and heating quality makes sense. Home heating is a necessity, with average monthly temperatures below freezing and daily lows in the teens between December and February.

Kitchen Quality

Kitchen quality was another key feature influencing home price. Generally, we found an approximately 3.07% increase in home price with each step up in kitchen quality. The correlation makes sense, as the kitchen is one of the most utilized spaces in the home. Furthermore, the amount of money that goes into every square foot of the kitchen (tile, cabinetry, appliances, etc.) is higher than in other rooms of the home.

Basement

Four basement related coefficients were returned as significant in our model, indicating that it is an important predictor of sale price. To analyze from a home renovator’s perspective, we decided to take a closer look at basement finish type. From the box plot one can see that having no basement generally showed lower sales prices.

Beyond that, however, one-unit upgrades in basement finish don’t show much of a change in price until you reach the last type, good living quarters. For this reason, we would advise our clients that upgrading the basement may not be worthwhile in terms of sale price unless they can bring it up to good living quarters standards.

Adding the dimension of total basement square footage, another significant variable in our model, this visualization of Ames overall shows that the houses with the highest sale prices had the largest total basement square footage and were generally all up to good living quarters standards.

Garages

In our assessment of garage size and its effect on home price, we found a positive relationship: each increase in size, measured in the number of cars that can fit in the garage, added approximately 2.23% to home value. An interesting trend to note here is that this positive relationship peaks at 3-car garages.

We can infer that the decline in value for 4- or 5-car garages is due to the smaller lot areas of the homes with garages of those sizes, as presented in the second graph. Lot area plays a significant role in home price, and its effect can be felt in the homes with 4- or 5-car garages.

As home renovation consultants, we cannot recommend that our clients build out more garages, but we can recommend that they improve the finish of the garage. As displayed in the boxplot below, clients would see a consistent return on investment as they upgrade from unfinished to semi-finished to finished garages.

Conclusion

In creating our predictive housing price model, we identified the features that had the biggest effect on sale price through various machine learning models and hyperparameter-tuning methods. As home renovation consultants, we then focused on features we can recommend that our clients address: curb appeal, HVAC, and the quality of kitchens, basements, and garages.

Future work to expand this project would focus on more models and hyperparameter-tuning techniques. As home renovation consultants, we would also aim to explore more feature engineering to expand our renovation recommendations for our clients.

About Authors

David Kim

Prior to enrolling in the NYCDSA bootcamp, I worked as an Operations Development Manager for a multinational hospitality brand. I used my skills in data analysis to help gather insight on the business and translate my findings to...

Eugene Ng

Aspiring data scientist

Jessica Joy

Recent graduate from Binghamton University with a Bachelor of Science in Financial Economics. Highly motivated problem solver seeking opportunities to leverage data wrangling and analysis skills to provide key insights in real-world business problems.
