Predicting Housing Prices with Machine Learning

Jose Gonzalez
Posted on Sep 5, 2019

In this project, our goal was to build accurate models to predict sale prices for houses in Ames, Iowa. We completed a full machine learning workflow, from exploring the data through to a final presentation to our class. Our team name was Training Day, a nod to the movie and a play on the model training at the heart of the project. As part of a Kaggle competition, our models were scored according to the competition guidelines, which we discuss at the end.


We began with exploratory data analysis (EDA), which helped us understand our dataset and the relationships between its features. One of the first things we wanted to understand was how the "SalePrice" variable relates to the other features, so we plotted a heatmap of the linear (Pearson) correlations between "SalePrice" and every other numeric feature.
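A correlation heatmap of this kind can be sketched as follows. This is a minimal, hypothetical example: the tiny hand-made DataFrame stands in for the real Ames data, with a few of its actual column names.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

# Tiny stand-in for the Ames data; column names mirror the real dataset
df = pd.DataFrame({
    "SalePrice":   [208500, 181500, 223500, 140000, 250000],
    "OverallQual": [7, 6, 7, 7, 8],
    "GrLivArea":   [1710, 1262, 1786, 1717, 2198],
    "YearBuilt":   [2003, 1976, 2001, 1915, 2000],
})

# Pearson correlations between every pair of numeric features
corr = df.corr()

# Render the correlation matrix as a heatmap
fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr)), corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr)), corr.columns)
fig.colorbar(im, label="Pearson correlation")

# Features most linearly related to SalePrice, strongest first
ranking = corr["SalePrice"].drop("SalePrice").sort_values(ascending=False)
print(ranking)
```

Reading down the "SalePrice" row (or column) of the heatmap shows at a glance which features are worth prioritizing.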


We also found outliers in our dataset. As the visualizations below show, we identified them visually and then removed them. We did this because extreme points distort averages and can pull a fitted model toward a handful of unrepresentative houses.
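Outlier removal of this kind can be sketched as a boolean filter. The threshold values here are hypothetical, not the exact cutoffs we used; the last row mimics a classic Ames outlier, a very large house that sold for an unusually low price.

```python
import pandas as pd

# Toy frame; the last row is a large, cheap house (an outlier)
df = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786, 2198, 5642],
    "SalePrice": [208500, 181500, 223500, 250000, 160000],
})

# Hypothetical rule of thumb: drop huge houses with low sale prices
mask = (df["GrLivArea"] > 4000) & (df["SalePrice"] < 300000)
outliers = df[mask]
df_clean = df[~mask].reset_index(drop=True)

print(f"removed {len(outliers)} outlier(s), {len(df_clean)} rows remain")
```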

Another reason it was important to understand and remove outliers is that the dataset is skewed, and outliers would have exacerbated that skew. Skewness is the asymmetry of a distribution; in our case the data had a rightward (positive) skew, which we addressed with a log transformation, as shown below.
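The log transformation can be sketched as follows, using a small hypothetical price series with one expensive house to stand in for the right-skewed "SalePrice" distribution:

```python
import numpy as np
import pandas as pd

# Hypothetical sale prices; the 755k house creates a right tail
prices = pd.Series([140000, 181500, 208500, 223500, 250000, 755000])
print(prices.skew())  # clearly positive: right-skewed

# log1p compresses the right tail; a model then predicts log-prices,
# and np.expm1 inverts the transform at prediction time
log_prices = np.log1p(prices)
print(log_prices.skew())  # much closer to symmetric
```

The key practical point is that the transform is invertible: predictions made on the log scale are mapped back to dollars with `np.expm1`.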

In addition to the findings above, a lot of data was missing. The fact that many features had very large amounts of missing data gave us an early sense of how to approach feature engineering, which we address in more detail later. This part of the EDA led naturally into missingness and imputation: we replaced the "NA" values in many columns with 0 or "None", depending on the column, since for many features "NA" means the house simply lacks that amenity. We also replaced some of the zeros with the mean or median, and dropped several features outright: "Id", "PoolQC", "Utilities", "MiscFeature", "MiscVal", and "Alley".
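The imputation strategy can be sketched like this. The toy frame and the choice of which columns get which treatment are illustrative, using a few of the real Ames column names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Id":          [1, 2, 3],
    "PoolQC":      [np.nan, np.nan, "Gd"],
    "BsmtQual":    ["Gd", np.nan, "TA"],
    "GarageArea":  [548.0, np.nan, 608.0],
    "LotFrontage": [65.0, np.nan, 68.0],
})

# For many features, NA means "the house has no such amenity":
# categorical columns get "None", numeric columns get 0
df["BsmtQual"] = df["BsmtQual"].fillna("None")
df["GarageArea"] = df["GarageArea"].fillna(0)

# For genuinely missing measurements, impute the median instead
df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].median())

# Drop features that are nearly all missing or uninformative
df = df.drop(columns=["Id", "PoolQC"])
print(df.isna().sum().sum())  # prints 0: no missing values remain
```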

With missingness and imputation handled, many of the non-numerical categories were dummified or given ordinal values. Dummification converts categorical values into 0/1 dummy variables that models can use directly, while ordinal encoding maps ordered categories (such as quality ratings) to integers that preserve their ranking. After dealing with these issues, it was much easier to complete our feature engineering.
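Both encodings can be sketched in a few lines of pandas. The quality-rating mapping below is illustrative, based on the Po/Fa/TA/Gd/Ex scale the Ames data uses:

```python
import pandas as pd

df = pd.DataFrame({
    "Neighborhood": ["NridgHt", "OldTown", "NridgHt"],
    "ExterQual":    ["Gd", "TA", "Ex"],
})

# Ordered categories get integer ranks that preserve "better than"
qual_map = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
df["ExterQual"] = df["ExterQual"].map(qual_map)

# Unordered categories become one 0/1 dummy column per level;
# drop_first avoids a redundant, perfectly collinear column
df = pd.get_dummies(df, columns=["Neighborhood"], drop_first=True)
print(df.columns.tolist())
```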

For our feature engineering, we created new features: "Total Baths", "Total SF", median price by year, a remodeled indicator (1 or 0), and a "New House" indicator. We then dropped the columns these made redundant. We also found that "Neighborhood" was one of our most important and most strongly correlated features: it had a 0.74 correlation with "SalePrice", the third strongest of any feature, so we made a box plot based on "Neighborhood" and partitioned the neighborhoods into eighths.
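A sketch of how such combined features can be built from the raw Ames columns; the exact definitions (for example, counting a half bath as 0.5) are our assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "FullBath": [2, 1], "HalfBath": [1, 0],
    "BsmtFullBath": [1, 0], "BsmtHalfBath": [0, 1],
    "TotalBsmtSF": [856, 0], "1stFlrSF": [856, 1262], "2ndFlrSF": [854, 0],
    "YearBuilt": [2003, 1976], "YearRemodAdd": [2003, 1999],
    "YrSold": [2008, 2007],
})

# Half baths conventionally count as half of a full bath
df["Total Baths"] = (df["FullBath"] + df["BsmtFullBath"]
                     + 0.5 * (df["HalfBath"] + df["BsmtHalfBath"]))

# Total livable square footage across basement and both floors
df["Total SF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]

# 1 if the house was remodeled after construction, 0 otherwise
df["Remodeled"] = (df["YearRemodAdd"] > df["YearBuilt"]).astype(int)

# 1 if the house sold in the year it was built
df["New House"] = (df["YrSold"] == df["YearBuilt"]).astype(int)

print(df[["Total Baths", "Total SF", "Remodeled", "New House"]])
```

Once the combined features exist, the source columns ("FullBath", "TotalBsmtSF", and so on) become candidates for dropping as redundant.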


When the data was ready, we built our machine learning models. We used GridSearchCV for hyperparameter tuning and then fit the following models: ridge regression, regular linear regression, and lasso; we also tried a few others, such as XGBoost, but decided to discard them. In scoring our models we followed the Kaggle competition guidelines and evaluated on root mean squared error (RMSE) taken between the logarithm of the predicted value and the logarithm of the observed sale price, measuring our prediction error between the two.
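In symbols, with $\hat{y}_i$ the predicted and $y_i$ the observed sale price of house $i$, the competition metric is:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\log(\hat{y}_i) - \log(y_i)\bigr)^2}
```

Taking logs before computing the error means a $10\%$ miss on a cheap house is penalized about as much as a $10\%$ miss on an expensive one.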

Ridge achieved an RMSE of 0.1109. Lasso, which selected 54 variables and eliminated the other 131, achieved an RMSE of 0.1083, the lowest of our models (lower RMSE is better). Regular linear regression, our simplest model, came in at 0.1308.
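The tuning step above can be sketched with GridSearchCV. This is a minimal example on synthetic data, not our actual pipeline; the feature matrix, coefficients, and alpha grid are all made up for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Synthetic stand-in: 100 houses, 5 numeric features, log-price-like target
X = rng.normal(size=(100, 5))
y = X @ np.array([0.5, -0.2, 0.3, 0.0, 0.1]) + rng.normal(scale=0.1, size=100)

# Grid-search the regularization strength, scoring by (negative) RMSE,
# with 5-fold cross-validation
grid = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
grid.fit(X, y)

best_alpha = grid.best_params_["alpha"]
best_rmse = -grid.best_score_  # sklearn reports negated RMSE
print(best_alpha, round(best_rmse, 4))
```

Swapping `Ridge()` for `Lasso()` (with its own alpha grid) reuses the same search machinery, which is what makes comparing the two models straightforward.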

For future work, given more time we would have incorporated outside datasets and sources, such as the Ames, Iowa housing price index. More domain knowledge, such as which types of roof are more valuable, would also have been very beneficial. We would have investigated whether there is a seasonality effect from students leaving during school breaks, since students make up a large share of the Ames population. Incorporating economic data, such as the series available from FRED, would also have been very interesting.

In conclusion, our models performed fairly well and we were happy with our progress: we did not use any of the more advanced models and still achieved good results overall. We learned a lot throughout the process and would like to continue developing our machine learning skills on similar projects in the future.
