Evaluating Home Buying Opportunities with Machine Learning

Marek Kwasnica
Posted on Jun 10, 2020


Purchasing a home is one of the most consequential decisions a person can make. Whether the goal is to house a family, invest in a high-returning asset, or a bit of both, a prospective buyer needs to pay the right price. Determining that price is a significant challenge, however, because a home's value depends on a seemingly endless number of features. To address this challenge, this project applies machine learning techniques to predict the sale prices of houses in Ames, Iowa.

Which home features are the most impactful on a home price?

Which of these can make the life-changing decision to purchase a home a correct decision?

Please join me to find out. 

Data Set and Model Evaluation

The dataset used in this exercise was obtained from Kaggle and contains information on nearly 3,000 homes in Ames, Iowa. Each home is described by 79 features that together paint a full picture of the property. These features, both categorical and continuous, will be considered by our machine learning models in predicting the final price of each home.

The performance of the machine learning models in this project will be evaluated by comparing the predicted home values with the actual sales price. The metric through which we measure this will be the Root Mean Squared Error (RMSE).
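As a minimal sketch of this metric, assuming the actual and predicted prices are plain lists of numbers (the function name `rmse` and the sample values are illustrative and not taken from the project):

```python
import math

def rmse(actual, predicted):
    """Root Mean Squared Error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical sale prices and model predictions, in dollars.
actual_prices = [200000, 150000, 310000]
predicted_prices = [210000, 140000, 300000]
error = rmse(actual_prices, predicted_prices)
```

Because the errors are squared before averaging, a single large miss raises the RMSE more than several small ones.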

Data Cleaning

Before training our models, it is pivotal to examine our data and transform them in a manner that is suitable for our predictive models.

Observing the Dependent Variable (Sale Price)

To start, when we look at the variable we are trying to predict, we see that Sale Price is positively (right) skewed, with most home prices landing in the $100k to $200k range. Sale Price is therefore not normally distributed, which can cause issues when we train linear models. Plotting the quantiles of this distribution against those of a normal distribution in a Q-Q plot confirms this.

To address this issue, we apply a logarithmic transformation, which brings the distribution of Sale Price closer to normal and makes it better suited to training our models, as the histogram and Q-Q plot after the transformation confirm.
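The effect of the transformation can be sketched on a small sample. The prices below are hypothetical, and the helper `skewness` is an illustrative function (a simple mean of cubed z-scores), not the project's code:

```python
import math
import statistics

def skewness(values):
    """Skewness as the mean cubed z-score (population form)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)

# Hypothetical right-skewed sale prices: most are modest, a few are very high.
sale_prices = [105000, 120000, 135000, 150000, 165000,
               180000, 195000, 320000, 480000, 755000]

# The natural log compresses the long right tail.
log_prices = [math.log(price) for price in sale_prices]

raw_skew = skewness(sale_prices)
log_skew = skewness(log_prices)
```

On this sample the log-transformed prices are noticeably less skewed than the raw prices, which is the same pattern the histogram and Q-Q plot are meant to show.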

Addressing Outliers

Next, we remove outliers. The two features most helpful for identifying outliers in this exercise are General Living Area and Total Basement Square Footage.

General Living Area

Here we see two observations with substantial living areas that have Sale Prices that are not consistent with what we would expect.

When we remove these outliers, the linear relationship between Sale Price and General Living Area is a little clearer.

Total Basement Square Footage

On a similar note, we see a home with a significantly larger basement that is not in line with the linear relationship between basement size and sale price.

Removing this outlier allows us to more easily see the relationship between the variables.
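The two outlier filters described in this section can be sketched with pandas. The column names (`GrLivArea`, `TotalBsmtSF`, `SalePrice`), the sample rows, and every cutoff value below are assumptions chosen for illustration; the post does not state the thresholds that were actually used:

```python
import pandas as pd

# Hypothetical rows; column names and all thresholds are assumptions.
homes = pd.DataFrame({
    "GrLivArea":   [1500, 1800, 2200, 4700, 5600],
    "TotalBsmtSF": [800, 1000, 1200, 3100, 6100],
    "SalePrice":   [150000, 185000, 240000, 184000, 160000],
})

# Very large living area but a low sale price: inconsistent with the trend.
area_outlier = (homes["GrLivArea"] > 4000) & (homes["SalePrice"] < 300000)
# Basement far larger than anything else in the sample.
basement_outlier = homes["TotalBsmtSF"] > 5000

# Keep only the rows that are flagged by neither rule.
cleaned = homes[~(area_outlier | basement_outlier)].reset_index(drop=True)
```

In practice the cutoffs would be read off the scatter plots rather than fixed in advance.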

Missing Data / Imputation

The next step in preparing our data is accounting for missing data. There are 30 features with missing values in this data set, and all but 6 of them have fewer than 5% of observations missing, as shown below:

To remedy this issue, we must impute missing values or drop less important features with missing values. Broadly speaking, the imputation methods fall into three groups, summarized below:

  • Categorical variables (missingness implies the absence of the variable)
  • Continuous variables (missingness typically implies zero, but in some circumstances, mean or median is used)
  • Other (continuous variables where mean or median are appropriate or variables that can be dropped altogether due to full imbalance)
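The imputation rules listed above might look like this in pandas. The columns chosen here (`PoolQC`, `GarageArea`, `LotFrontage`) and the rule assigned to each are assumptions for illustration, not a record of how the project treated them:

```python
import pandas as pd

# Hypothetical sample with gaps; column choices are assumptions.
homes = pd.DataFrame({
    "PoolQC":      [None, "Gd", None, None],      # categorical
    "GarageArea":  [480.0, None, 520.0, 0.0],     # continuous, missing means none
    "LotFrontage": [60.0, None, 80.0, 70.0],      # continuous, median is sensible
})

# Categorical: a missing value means the home lacks the feature.
homes["PoolQC"] = homes["PoolQC"].fillna("None")
# Continuous: a missing value means zero.
homes["GarageArea"] = homes["GarageArea"].fillna(0)
# Continuous where zero would be misleading: fill with the median.
homes["LotFrontage"] = homes["LotFrontage"].fillna(homes["LotFrontage"].median())
```

Filling a categorical gap with an explicit "None" label keeps the information that the feature is absent, rather than discarding the row.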

Model Performance

Now that we’ve prepared our data for our models, we can train our models and evaluate which are the most effective in predicting home prices.   

Using our methods, we find that Lasso Regression is the most effective standalone model (RMSE of 0.124). However, each methodology has its own merits, such as XGBoost's capacity to capture nonlinear relationships or Random Forest's use of decorrelated trees to reduce variance, so the best overall approach is to ensemble the models with equal weight (RMSE of 0.119).
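An equal-weight ensemble simply averages the predictions of the individual models. The sketch below assumes the models are already trained and uses made-up log-price predictions; the numbers are not the project's results:

```python
import math

def rmse(actual, predicted):
    """Root Mean Squared Error between two equal-length sequences."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

# Hypothetical log sale prices and predictions from three trained models.
actual = [12.0, 12.5, 11.8, 12.2]
model_predictions = {
    "lasso":         [12.1, 12.4, 11.9, 12.1],
    "xgboost":       [11.9, 12.6, 11.7, 12.4],
    "random_forest": [12.2, 12.3, 11.6, 12.3],
}

# Equal weights: each model contributes 1/k to the blended prediction.
n_models = len(model_predictions)
ensemble = [sum(preds) / n_models for preds in zip(*model_predictions.values())]

scores = {name: rmse(actual, preds) for name, preds in model_predictions.items()}
scores["ensemble"] = rmse(actual, ensemble)
```

In this toy example the models err in different directions, so their errors partly cancel and the blend scores better than any single model. That outcome is not guaranteed in general; it depends on the models making sufficiently different mistakes.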

Feature Importance

To determine which features have the most impact on Sale Price, we can use regularization. Lasso regression adds a penalty term, scaled by a parameter alpha, that shrinks the coefficients of less important features to exactly zero, leaving only the most important features in the model.
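To show why the penalty produces exact zeros, here is a small NumPy sketch of Lasso fitted by coordinate descent on synthetic data. It is an illustration of the mechanism under stated assumptions (centered data, no intercept, hand-picked alpha), not the project's model; in practice a library implementation such as scikit-learn's `Lasso` would normally be used:

```python
import numpy as np

def soft_threshold(value, alpha):
    """Shrink value toward zero by alpha; anything within alpha becomes 0."""
    return np.sign(value) * max(abs(value) - alpha, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimize (1 / 2n) * ||y - Xw||^2 + alpha * ||w||_1, one coefficient at a time."""
    n_samples, n_features = X.shape
    coef = np.zeros(n_features)
    for _ in range(n_iter):
        for j in range(n_features):
            # Residual with feature j's current contribution added back in.
            partial_residual = y - X @ coef + X[:, j] * coef[j]
            rho = X[:, j] @ partial_residual / n_samples
            scale = X[:, j] @ X[:, j] / n_samples
            coef[j] = soft_threshold(rho, alpha) / scale
    return coef

# Synthetic data: only the first two of five features drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Center both so no intercept term is needed.
X = X - X.mean(axis=0)
y = y - y.mean()

coef = lasso_coordinate_descent(X, y, alpha=0.1)
```

The three uninformative features end up with coefficients of exactly zero, while the two informative ones stay large (slightly shrunk by the penalty). Ranking the surviving coefficients by magnitude is one way to read off feature importance.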

When we apply regularization through our Lasso Regression model, we see that the important variables that can be leveraged to generate home value are:

  • Overall Quality
  • General Living Area
  • Total Bathrooms
  • Garage Car Capacity
  • Basement Square Footage

All of these variables can serve as levers to build value and present opportunities to purchase homes at a price that is favorable.


In conclusion, when leveraging machine learning techniques to make an informed home purchasing decision, it is best to ensemble different machine learning models, all with their own merits, to extrapolate a valuation. That said, once a home is purchased and avenues to improve home value are being considered, we humbly offer the following recommendations to generate value:

  • Renovate the material and finish of the home to improve Overall Quality
  • Build a new wing in the home to expand General Living Area
  • Add additional bathrooms to the home
  • Expand car capacity in the garage
  • Consider renovating and expanding the basement

About Authors


Richard Choi

Richard Choi is a Data Science Fellow in the January 2020 cohort at NYC Data Science Academy. He has 6 years of experience working for Global 500 companies such as HSBC and Unilever and is proficient in Python,...

Marek Kwasnica

Marek is currently a data science fellow at NYC Data Science Academy. He has several years experience in biomedical engineering research. He holds a Masters of Engineering in Biological Engineering from Cornell University. Marek is passionate about applying...