Data Analysis on the Valuable Features in Houses

Posted on Jun 8, 2020


When it comes time to sell a home, every owner hopes to sell for as high a price as possible. Many aspects of a home contribute to its overall value, but homeowners rarely know which ones matter most. Some factors, such as neighborhood, are out of an owner’s control, yet there are improvements and renovations that can potentially increase a home’s value. Which features should be prioritized, and which are worth the cost? Our project sets out to use data to answer this.

Through Kaggle, we acquired sale price data for homes in Ames, Iowa, including over 80 features specific to each house. Using machine learning algorithms paired with exploratory data analysis (EDA) and feature engineering, we built two models: one to identify the features with the most impact on value, and one to predict value based on the features selected by the first.

From our analysis, we found that the most important features are the overall quality, number of fireplaces and full bathrooms, quality of the kitchen, and number of basement full bathrooms. Our final prediction model had an R-squared value of 0.805.


At first glance, the dataset had an overwhelming number of NAs in many of the categorical features. A closer look at the documentation shows that each of these NAs describes the absence of a feature. To improve the interpretability of the data, we imputed string values for homes with an NA under such a feature.
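The imputation step can be sketched in pandas as follows. This is a minimal illustration, not the code used in the project: the toy values, the two column names (taken from the Ames data dictionary), and the replacement labels are all assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; the column names are assumed from the
# Ames data dictionary and are not named in this post.
homes = pd.DataFrame({
    "FireplaceQu": ["Gd", np.nan, "TA"],
    "BsmtFinType1": [np.nan, "GLQ", "Unf"],
    "SalePrice": [200000, 150000, 180000],
})

# Hypothetical labels: NA in these columns means the home lacks the feature.
na_labels = {"FireplaceQu": "NoFireplace", "BsmtFinType1": "NoBasement"}

# Passing a dict fills each listed column with its own label.
homes_filled = homes.fillna(value=na_labels)
print(homes_filled)
```

Using an explicit label per column keeps "no fireplace" distinct from a truly missing value in later summaries.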

Feature Selection Data

With 36 categorical variables, we noticed that not all levels within a category were significant enough to include in our model. In addition, dummifying every level would introduce a large amount of sparsity and increase the computing power needed.

To avoid this, we analyzed the means, medians, and distributions of each category’s levels using box plots. Levels whose IQRs and medians occupied a similar space were considered for grouping into one level. However, box plots do not show how many homes fall under a given level. To visualize this, we grouped the home prices into bins and created bar charts. Even when a level has a clearly different distribution and median, the bar chart shows whether its count is large enough for it to stand alone or whether it should be grouped.
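The same check can be done numerically alongside the plots. The sketch below is an assumption-laden illustration: the data is made up, `KitchenQual` is an assumed column name, and `MIN_COUNT` is a hypothetical threshold, not one stated in this post.

```python
import pandas as pd

# Made-up sale prices for a single categorical feature.
homes = pd.DataFrame({
    "KitchenQual": ["Fa", "TA", "TA", "Gd", "Gd", "Ex", "TA", "Gd"],
    "SalePrice": [90000, 140000, 150000, 210000, 220000, 350000, 145000, 200000],
})

# Count and median per level: the median mirrors the box plot,
# the count mirrors the bar chart.
summary = (
    homes.groupby("KitchenQual")["SalePrice"]
    .agg(count="count", median="median")
    .sort_values("median")
)

MIN_COUNT = 2  # hypothetical cutoff for a level to stand alone
rare_levels = summary.index[summary["count"] < MIN_COUNT].tolist()
print(summary)
print("Candidates for grouping:", rare_levels)
```

Levels flagged here are only candidates; whether to merge them still depends on how close their distributions are to a neighboring level.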

With over 80 features documented for each home, there was bound to be multicollinearity among them. To combat this issue, we regressed each numerical feature against the remaining ones. A regression with a large R-squared indicates a heavy presence of collinearity, and we used this as the basis for dropping or retaining certain numerical features.
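A minimal sketch of this check, using only NumPy, is shown below. The function name `auxiliary_r_squared` and the synthetic columns are illustrative assumptions; the project's actual code is not shown in this post.

```python
import numpy as np

def auxiliary_r_squared(X):
    """R-squared from regressing each column of X on all other columns."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    scores = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])  # add intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = residuals @ residuals
        ss_tot = ((target - target.mean()) ** 2).sum()
        scores.append(1.0 - ss_res / ss_tot)
    return np.array(scores)

# Synthetic data: total_area is almost living_area + garage_area,
# while lot_frontage is unrelated to the other three columns.
rng = np.random.default_rng(0)
living_area = rng.normal(1500, 400, size=200)
garage_area = rng.normal(450, 120, size=200)
total_area = living_area + garage_area + rng.normal(0, 10, size=200)
lot_frontage = rng.normal(70, 20, size=200)

X = np.column_stack([living_area, garage_area, total_area, lot_frontage])
r2 = auxiliary_r_squared(X)
print(np.round(r2, 3))
```

In this toy setup the three area columns all score close to 1 because each can be reconstructed from the other two, so dropping any one of them breaks the collinearity.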

For our final step of feature selection, we ran a lasso regression on the remaining features, using a grid search with a C value iterating from 1 to 1000. Lasso regression also helps with any multicollinearity that remains. As the C value increases, the coefficients of the less significant features shrink to 0. We continued to examine our lasso regressions until we found a C value at which only 4 upgradable features remained.
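The search for a penalty that leaves a fixed number of features can be sketched with scikit-learn. Several things here are assumptions: scikit-learn names the lasso penalty `alpha`, and this sketch treats the post's "C value" as playing that role; the helper `first_penalty_with_k_features`, the synthetic data, and the penalty grid are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: four strong features, one weak, one irrelevant.
rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 6))
true_coef = np.array([50.0, 30.0, 20.0, 10.0, 0.5, 0.0])
y = X @ true_coef + rng.normal(scale=5.0, size=n)

# Scale features so the penalty treats every coefficient equally.
X_scaled = StandardScaler().fit_transform(X)

def first_penalty_with_k_features(X, y, penalties, k):
    """Return the first penalty (ascending order) leaving at most k nonzero coefficients."""
    for penalty in penalties:
        model = Lasso(alpha=penalty, max_iter=10_000).fit(X, y)
        kept = np.flatnonzero(model.coef_)
        if len(kept) <= k:
            return penalty, kept
    return None, None

penalty, kept = first_penalty_with_k_features(
    X_scaled, y, np.linspace(0.1, 20, 200), k=4
)
print("penalty:", penalty, "kept feature indices:", kept)
```

Stopping at the first penalty that reaches the target count keeps the surviving coefficients as lightly shrunk as possible.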

Modeling Data Selection

While there are many models to choose from, our main focus was being able to interpret a model’s output in dollar amounts. Certain models actually performed better in terms of R-squared; however, their outputs could only be interpreted in terms of relative importance. As a result, we decided to use a linear regression model to predict home sale values.

For the categorical features that required dummy variables (kitchen quality and basement finish type), we dropped the low-end level. These low-end levels therefore serve as the baseline for our model.
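The effect of dropping the low-end level can be seen in a small sketch. The data below is invented so that the prices follow an exact linear rule, and the column names and the choice of `Fa` as the low-end kitchen level are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up prices built as: 100000 + 10000 * FullBath
# + 15000 (TA) / 30000 (Gd) / 60000 (Ex) over a "Fa" kitchen.
homes = pd.DataFrame({
    "KitchenQual": ["Fa", "TA", "Gd", "Ex", "TA", "Gd", "Fa", "Ex"],
    "FullBath": [1, 1, 2, 2, 2, 1, 2, 3],
    "SalePrice": [110000, 125000, 150000, 180000, 135000, 140000, 120000, 190000],
})

# One dummy column per level, then drop the low-end level explicitly so it
# becomes the baseline (drop_first=True would drop the alphabetically first).
dummies = pd.get_dummies(homes["KitchenQual"], prefix="KitchenQual", dtype=float)
dummies = dummies.drop(columns="KitchenQual_Fa")

design = pd.concat([homes[["FullBath"]].astype(float), dummies], axis=1)
design.insert(0, "Intercept", 1.0)

# Ordinary least squares fit.
coef, *_ = np.linalg.lstsq(design.to_numpy(), homes["SalePrice"].to_numpy(), rcond=None)
coefficients = pd.Series(coef, index=design.columns)
print(coefficients.round(0))
```

Each kitchen coefficient then reads directly as the dollar difference relative to the dropped low-end level, which is the interpretability the linear model was chosen for.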


From our coefficients we see that kitchen quality has the potential to add the most value to a home, followed by number of full bathrooms, number of basement full baths, and lastly by the basement finishing type.

In order for a renovation to be worth it, the value added must exceed the renovation costs. We estimated these costs using average prices quoted specifically for Iowa. For quality variables such as the kitchen and basement finish type, the cost was estimated based on how much renovation would most likely be needed to move from low-end to mid and then to high.
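The comparison itself is simple arithmetic: net return equals value added minus cost. The sketch below uses placeholder dollar figures that are not the project's estimates, and the renovation names and the `net_return` helper are hypothetical.

```python
def net_return(value_added, cost):
    """Net dollar return of a renovation: value added minus cost."""
    return value_added - cost

# Placeholder figures for illustration only; not the estimates from this project.
renovations = {
    "kitchen_upgrade": {"value_added": 40000, "cost": 25000},
    "extra_full_bath": {"value_added": 12000, "cost": 11000},
    "basement_finish": {"value_added": 5000, "cost": 15000},
}

results = {
    name: net_return(r["value_added"], r["cost"])
    for name, r in renovations.items()
}

# A renovation is worth considering only if its net return is positive.
worthwhile = [name for name, net in results.items() if net > 0]
print(results)
print(worthwhile)
```

A small positive net return, as in the second placeholder entry, is the kind of narrow margin that calls for a closer look at the specific home.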

From our graph, you can see that an excellent kitchen offers the most return on investment, followed by upgrading to a good kitchen. The net estimated return for a full bath is positive based on our results, but because the margin is narrow, the specific home in question should be inspected more closely. Unfortunately, a renovated basement is not worth the cost.

