Using Data to Analyze Home Renovations

, and
Posted on Oct 13, 2020


Investing in real estate can be an excellent way to make a profit. One way to increase the value of a home is to renovate, but often, homeowners spend a lot of money on renovations expecting to recoup more than they actually will. In reality some projects just offer better returns than others. In this text we will use data to analyze home renovations.

This return on investment depends on a myriad of conditions, some of which include: the type of renovation, the condition of the house (or part of the house), and location.  Our purpose is to guide homeowners and house flippers through the difficult decision making process of deciding where to focus renovation efforts and budget by constructing an interpretable machine learning model to pinpoint the most profitable projects.

Background and Data

Professor Dean De Cock of Truman State University compiled a dataset looking at house prices in Ames, Iowa from 2006 to 2010. This dataset is often used for training in regression techniques and Kaggel competitions. Our data set contains 2,580 observations and 81 features, two of which differ from the Kaggle dataset: the target feature, Sales Price, and PID, a land parcel ID that can be used to find the property address and the long-lat coordinates of the houses using geopy.

Data Exploration 

There are 80 columns in the dataset describing the house attributes range from location, quality, condition, basement information, garage information and so on. Since our target variable is the house price, we looked at its distribution:

The distribution of the target feature, sales price is right skewed. As this violated the normality assumption for Multiple Linear Regression, we log transformed our target variable:

Data Cleaning

Many features contained missing values. Often this was linked to the absence of a house feature. For example any basement related values would be missing if the house did not have a basement. In these cases we imputed 0 for continuous features, and ‘None’ for categorical variables. Other missing data was not as frequent and seemed to be a case of missing completely at random, so we imputed the median or mode values.

Feature Engineering

Since the primary goal of our project is to build a machine learning model that can be easily interpreted to obtain insights for housing renovations, we wanted to keep the model simple as possible while still balancing the model predictive power. Feature engineering is an important step towards building streamlined models.

Ordinal Categorical Features: 

There were 14 ordinal features in the original dataset rating the quality / condition of specific parts of the houses (e.g. excellent, good, average, fair, and poor). These were represented by strings, so we converted them to numeric values so that these features could be used in our linear regression models. 

Continuous Features:

Then we reduced the number of continuous variables that expressed related or redundant information. For example, there were four variables related to bathrooms - Number of full baths, number of half baths, number of basement full baths, number of basement half baths - we combined these four variables into a single variable: total bathroom count. We also created a custom variable that reflected the ‘age’ of the house based on year built, year remodeled and year sold.

Lastly, we realized the variable GrLivArea (Above Ground Living Area) was the sum of three other existing variables (1stFlrSF, 2ndFlrSF, and LowQualFinSF). We decided to drop GrLivArea in favor of 1stFlrSF and 2ndFlrSF to avoid perfect multicolinearity and ran multiple linear regression on the 1st and 2nd floor features against our target features. We compared the resulting R^2 and decided to move forward with just 1stFlrSF, 2ndFlrSF by creating our custom variable log(1stFlrSF + 2ndFlrSF). 

Categorical Features: 

Through our exploratory data analysis, we looked into each of the categorical features to understand its relationship with our target variable log(SalePrice) and tried to reduce the number of factors within a given categorical variable. For example, there are 28 neighborhoods in the dataset, we simplified this feature by combining all neighborhoods into three tiers based on price/Sqft, illustrated below.

This strikes a good balance between reducing the complexity of our model while preserving the relevant location characteristics of neighborhoods. We performed similar data engineering on a few other categorical features such as House Types, Paved Drive, and Sales Conditions, binning similar categories together.

A Brief Note on Neighborhood Tiers 

This graph shows how the three neighborhood tiers lay out geographically. We can see that the cheapest neighborhood houses (the red dots), tend to be near the downtown/ urban areas of Ames, whereas the more expensive neighborhoods (green and blue dots) are spread further into the suburb areas. After some investigation on, we found this pattern correlated with crime rates, where low priced neighborhoods tended to be areas with higher crime rates. Crime rate may be one of the external factors that influence house value that was not exposed in this dataset.


We used Random Forest and BIC stepwise regression as a first round to roughly determine feature importance. We used these insights to inform another round of feature selection/engineering. For example, after looking at the VIF(variance inflation factor) of our forward BIC stepwise model, we examined features with high VIF scores and found the quality scores highly multicollinear.

We decided to omit overall quality due to this multicollinearity in favor of the individual quality scores. Square footage features also had high VIF scores, so we omitted the total square footage features and created a new feature that combined the 1st and 2nd floor square footage in its place.


Second Round: Forward stepwise BIC














We fit multiple models using different combinations of features that were both relatively important in LASSO and utilized in the stepwise regression. Then we compared these model’s cross validation scores and standard deviations, favoring the simplest model that kept our variables of interest with little comparative loss in R^2.

Conclusions and Business Recommendations

We wanted to provide a list of renovation suggestions not just based off of our model’s coefficients, but based off a relative return on specific renovations by including renovation costs. We pulled national averages for household renovations from to estimate the relative costs of each type of renovation that could be done in these houses to improve home value.

Some renovation types represented below include total and partial renovations where a total kitchen renovation included hiring a designer, replacing countertops, sink, appliances, flooring and cabinets, while a partial kitchen renovation included replacing flooring and countertops.

Our target region on this graph is the lower right corner, where the renovation not only potentially provides more value to the home based off of our model, but is also a relatively cheaper type of home renovation. These are the types of renovations we suggest prioritizing when renovating a home.

To better represent how we rank these renovations, we created a cost profit ratio by dividing the log increase in house price by the cost of renovation. Essentially, the larger the ratio, the better the potential return on investment. Adding a fireplace ranks highest, followed by paving a driveway if unpaved, and partially renovating a kitchen.

To explore price sensitivity of these renovations in different neighborhoods, we used the neighborhood tiers we engineered earlier to model and compare high, medium, and low mean price/sqft neighborhoods. Looking at the tiers one at a time, we suggest, if planning to flip a home or if intending to look for a home to invest in, consider homes with lower quality kitchens and unpaved driveways in tier 1 (high mean price/sqft) neighborhoods and renovate these features to get the greatest return on investment.

For tier 2 (medium mean price/sqft) neighborhoods(, consider houses with lower exterior quality and no fireplaces. Lastly in tier three (low mean price/sqft) neighborhoods, consider houses with low quality kitchens, no fireplaces or unpaved drives to make the most cost effective and profitable renovations.

Key Takeaways

Overall, disregarding location and neighborhood, our model suggests that the best way to increase home value in Ames, Iowa, is to build a fireplace, pave the driveway, or do moderate kitchen renovations

When considering homes to invest in, look for houses with lower quality kitchens and unpaved driveways in more expensive neighborhoods, houses with no fireplaces in medium priced neighborhoods, and houses without fireplaces, unpaved drives, or lower quality kitchens in lower priced neighborhoods. These homes have the potential to give you better returns on your investment after some targeted renovations.

Contact Us

If you have any questions or comments, please feel free to reach out to any of us on LinkedIn or GitHub.

The skills we demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.


About Authors

Judy Chung

Data Scientist with five years of lab manager experience in microbiology and ecology. Strong research professional with a Bachelor's degree in Integrative Biology from University of California, Berkeley. Excited to generate data-driven insights to drive business value.
View all posts by Judy Chung >

Qing Ying

Qing (Sophie) Ying graduated from UC Berkeley with a Master's degree in Industrial Engineering and Operations Research. She has been a product manager in a healthcare technology startup for 3 years, where she developed various data analytics products...
View all posts by Qing Ying >

Xiaogang Zhu

I am graduating from the University of Florida majoring in Ocean Engineering. When I am pursuing my master's degree I also earned a Statistics Minor degree. I got my Ocean Engineering bachelor's degree at the Ocean University of...
View all posts by Xiaogang Zhu >

Leave a Comment

No comments found.

View Posts by Categories

Our Recent Popular Posts

View Posts by Tags

#python #trainwithnycdsa 2019 2020 Revenue 3-points agriculture air quality airbnb airline alcohol Alex Baransky algorithm alumni Alumni Interview Alumni Reviews Alumni Spotlight alumni story Alumnus ames dataset ames housing dataset apartment rent API Application artist aws bank loans beautiful soup Best Bootcamp Best Data Science 2019 Best Data Science Bootcamp Best Data Science Bootcamp 2020 Best Ranked Big Data Book Launch Book-Signing bootcamp Bootcamp Alumni Bootcamp Prep boston safety Bundles cake recipe California Cancer Research capstone car price Career Career Day citibike classic cars classpass clustering Coding Course Demo Course Report covid 19 credit credit card crime frequency crops D3.js data data analysis Data Analyst data analytics data for tripadvisor reviews data science Data Science Academy Data Science Bootcamp Data science jobs Data Science Reviews Data Scientist Data Scientist Jobs data visualization database Deep Learning Demo Day Discount disney dplyr drug data e-commerce economy employee employee burnout employer networking environment feature engineering Finance Financial Data Science fitness studio Flask flight delay gbm Get Hired ggplot2 googleVis H20 Hadoop hallmark holiday movie happiness healthcare frauds higgs boson Hiring hiring partner events Hiring Partners hotels housing housing data housing predictions housing price hy-vee Income Industry Experts Injuries Instructor Blog Instructor Interview insurance italki Job Job Placement Jobs Jon Krohn JP Morgan Chase Kaggle Kickstarter las vegas airport lasso regression Lead Data Scienctist Lead Data Scientist leaflet league linear regression Logistic Regression machine learning Maps market matplotlib Medical Research Meet the team meetup methal health miami beach movie music Napoli NBA netflix Networking neural network Neural networks New Courses NHL nlp NYC NYC Data Science nyc data science academy NYC Open Data nyc property NYCDSA NYCDSA Alumni Online Online Bootcamp Online Training Open Data painter pandas Part-time performance phoenix pollutants Portfolio Development precision measurement prediction Prework Programming public safety PwC python Python Data Analysis python machine learning python scrapy python web scraping python webscraping Python Workshop R R Data Analysis R language R Programming R Shiny r studio R Visualization R Workshop R-bloggers random forest Ranking recommendation recommendation system regression Remote remote data science bootcamp Scrapy scrapy visualization seaborn seafood type Selenium sentiment analysis sentiment classification Shiny Shiny Dashboard Spark Special Special Summer Sports statistics streaming Student Interview Student Showcase SVM Switchup Tableau teachers team team performance TensorFlow Testimonial tf-idf Top Data Science Bootcamp Top manufacturing companies Transfers tweets twitter videos visualization wallstreet wallstreetbets web scraping Weekend Course What to expect whiskey whiskeyadvocate wildfire word cloud word2vec XGBoost yelp youtube trending ZORI