Upgrade Everything But the Kitchen Sink?

Posted on Jun 8, 2020


When it comes time to sell a home, everyone hopes to sell for as high a price as possible. Many aspects of a home contribute to its overall value, but homeowners rarely know which aspects matter more than others. While certain factors, such as neighborhood, are out of an owner's control, there are improvements and renovations owners can make to potentially increase their home's value. But which features should be prioritized, and which are worth the cost? Our project sets out to answer this.

Through Kaggle.com, we acquired data on the sale prices of homes in Ames, Iowa, including over 80 features specific to each house. Using machine learning algorithms paired with EDA and feature engineering, we created two models: one to identify the features with the most impact on value, and one to predict value based on the features selected by the first model.

Through our analysis, we found that the most important features are the overall quality, number of fireplaces and full bathrooms, quality of the kitchen, and number of basement full bathrooms. Our final prediction model had an R-squared value of 0.805.


At first glance, the dataset had an overwhelming number of NAs in many of the categorical features. A closer look at the documentation shows that each NA describes the lack of a feature. Thus, to improve the interpretability of the data, we imputed string values for homes with NA under a feature.
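As a rough illustration of this imputation step, here is a minimal pandas sketch. The column names follow the Ames data dictionary, but the toy values and the replacement labels ("NoFireplace", "NoGarage") are assumptions made for this example, not the labels we necessarily used.

```python
import pandas as pd

# Toy rows; column names follow the Ames data dictionary, values are made up.
homes = pd.DataFrame({
    "FireplaceQu": ["Gd", None, "TA"],
    "GarageType": [None, "Attchd", None],
    "LotArea": [8450, 9600, 11250],
})

# Hypothetical labels: each NA means the home lacks that feature.
na_labels = {"FireplaceQu": "NoFireplace", "GarageType": "NoGarage"}

# Fill NAs column by column using the mapping above.
imputed = homes.fillna(value=na_labels)
print(imputed)
```

Filling with an explicit label keeps "no fireplace" as its own category rather than treating those homes as missing data.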

Feature Selection

With 36 categorical variables, we noticed that not all levels within a category were significant enough to include in our model. In addition, dummifying every level would introduce a large amount of sparsity and increase the computing power needed. To avoid this, we analyzed the means, medians, and distributions of each category's levels using box plots. Levels with IQRs and medians occupying a similar range were considered for grouping into one level. However, box plots do not show the number of homes in each level. To visualize this, we binned the home prices and created bar charts. While a certain level might have a significantly different distribution and median, the bar chart shows us whether its count is large enough for it to stand alone or whether it should be grouped.
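The numbers behind those box plots and bar charts can be sketched as a per-level summary table. This is a simplified illustration on made-up prices: the minimum count of 3 and the "Other" label are hypothetical, and it groups only on count, whereas our actual grouping also weighed how similar the distributions were.

```python
import pandas as pd

# Toy data; "Exterior1st" is an Ames column, prices are made up.
homes = pd.DataFrame({
    "Exterior1st": ["VinylSd"] * 3 + ["MetalSd"] * 3 + ["Stone"],
    "SalePrice": [100, 110, 120, 105, 115, 125, 300],
})

# Count, quartiles, and median per level: what a box plot plus a count chart shows.
stats = homes.groupby("Exterior1st")["SalePrice"].describe()[
    ["count", "25%", "50%", "75%"]
]

# Hypothetical rule: levels with too few homes are folded into "Other".
min_count = 3
rare = stats.index[stats["count"] < min_count]
homes["ExteriorGrouped"] = homes["Exterior1st"].where(
    ~homes["Exterior1st"].isin(rare), "Other"
)
print(stats)
print(homes["ExteriorGrouped"].value_counts())
```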

With over 80 features of a home documented, there was bound to be multicollinearity among them. To combat this issue, we regressed each numerical feature against the remaining ones. A regression with a large R-squared indicates heavy collinearity, and we used this as the basis for dropping or retaining certain numerical features.
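A minimal sketch of this check, assuming scikit-learn and synthetic data: each column is regressed on all the others and the R-squared is recorded. The column names are borrowed from the Ames data, but the values are randomly generated, with the room count deliberately tied to living area.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data: room count is built from living area, lot size is independent.
rng = np.random.default_rng(0)
n = 200
living_area = rng.normal(1500, 300, n)
X = pd.DataFrame({
    "GrLivArea": living_area,
    "TotRmsAbvGrd": living_area / 250 + rng.normal(0, 0.3, n),
    "LotArea": rng.normal(10000, 2000, n),
})

def r2_against_rest(frame):
    """Regress each column on the remaining columns and return the R-squared."""
    scores = {}
    for col in frame.columns:
        others = frame.drop(columns=col)
        model = LinearRegression().fit(others, frame[col])
        scores[col] = model.score(others, frame[col])
    return pd.Series(scores).sort_values(ascending=False)

r2 = r2_against_rest(X)
print(r2)
```

A high score flags a feature that the others largely explain. This is the same quantity used in the variance inflation factor, VIF = 1 / (1 - R-squared).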

For our final step of feature selection, we ran a lasso regression on the remaining features, using a grid search over a penalty value C from 1 to 1000 (C here acts as the penalty strength). Lasso also helps with any multicollinearity that remains. As C increases, the coefficients of the less significant features shrink to 0. We kept increasing C until only 4 upgradable features remained.
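The search over penalty values can be sketched as follows, assuming scikit-learn, whose Lasso names the penalty strength alpha rather than C. The data is synthetic, with four strong features and four weak ones, so this illustrates the stopping rule rather than reproducing our run.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: features 0-3 matter a lot, features 4-7 barely matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
true_coef = np.array([50.0, 40.0, 30.0, 20.0, 2.0, 2.0, 1.0, 1.0])
y = X @ true_coef + rng.normal(scale=5.0, size=300)

# Standardize so the penalty treats every feature on the same scale.
X_std = StandardScaler().fit_transform(X)

target = 4  # stop once this many features keep a nonzero coefficient
kept, chosen_alpha = None, None
for alpha in range(1, 1001):
    model = Lasso(alpha=alpha).fit(X_std, y)
    nonzero = np.flatnonzero(model.coef_)
    if len(nonzero) <= target:
        kept, chosen_alpha = nonzero.tolist(), alpha
        break

print(chosen_alpha, kept)
```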


While there are many models to choose from, our main focus was being able to interpret a model's output as a dollar amount. Certain models actually performed better in terms of R-squared; however, their outputs could only be interpreted in terms of relative importance. As a result, we decided to use a linear regression model to predict home sale values.

For our categorical features that required dummy variables (kitchen quality and basement finish type), we dropped the low-end level of each. The low-end level therefore serves as the baseline for our model, and each remaining coefficient is read as the value added relative to it.
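To show how the baseline works, here is a small sketch assuming pandas and scikit-learn. The level names ("Low", "Mid", "High") and the prices are hypothetical. The prices are built from a known formula so the fitted coefficients can be read directly as dollars over the low-end baseline.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data built as: 100000 + 10000 * FullBath + 15000 if Mid + 40000 if High.
homes = pd.DataFrame({
    "KitchenQual": ["Low", "Mid", "High", "Low", "Mid", "High"],
    "FullBath": [1, 1, 2, 2, 2, 3],
    "SalePrice": [110000, 125000, 160000, 120000, 135000, 170000],
})

# Put the low-end level first so drop_first removes it and makes it the baseline.
homes["KitchenQual"] = pd.Categorical(
    homes["KitchenQual"], categories=["Low", "Mid", "High"]
)
X = pd.get_dummies(
    homes[["KitchenQual", "FullBath"]],
    columns=["KitchenQual"],
    drop_first=True,
    dtype=float,
)

model = LinearRegression().fit(X, homes["SalePrice"])
coefs = pd.Series(model.coef_, index=X.columns)
print(coefs.round(2))
```

Each dummy coefficient is the estimated price difference from a low-end kitchen, holding the number of full baths fixed.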


From our coefficients we see that kitchen quality has the potential to add the most value to a home, followed by number of full bathrooms, number of basement full baths, and lastly by the basement finishing type.

For a renovation to be worth it, the value added must exceed the renovation cost. We estimated these costs using average prices quoted specifically for Iowa. For quality variables such as the kitchen and basement finish type, the cost was estimated based on how much renovation would most likely be needed to move from low-end to mid and then to high.
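The comparison itself is simple arithmetic, sketched below. The upgrade names and dollar figures are placeholders invented for this example. They are not our estimated coefficients or the Iowa cost quotes.

```python
def net_return(value_added, cost):
    """Estimated value added by a renovation minus what it costs."""
    return value_added - cost

# Placeholder figures only, for illustration.
renovations = {
    "upgrade_a": {"value_added": 30000.0, "cost": 20000.0},
    "upgrade_b": {"value_added": 8000.0, "cost": 12000.0},
}

results = {name: net_return(**figures) for name, figures in renovations.items()}
worth_it = [name for name, net in results.items() if net > 0]
print(results, worth_it)
```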

From our graph you can see that an excellent kitchen offers the highest return on investment, followed by upgrading to a good kitchen. The net estimated return for a full bath is positive in our results, but the margin is narrow, so the specific home in question should be inspected more closely. Unfortunately, a renovated basement is not worth the costs.

