Using Data to Predict Housing Prices in Ames, Iowa

Posted on Sep 5, 2019

The skills the authors demonstrated here can be learned by taking the Data Science with Machine Learning bootcamp with NYC Data Science Academy.

Data Science Introduction

In this project, we used data to build accurate models to predict sale prices for houses in Ames, Iowa. We completed an entire machine learning project workflow, from exploring the data to presenting our results to our class. Our team name was Training Day, a nod to the movie as well as a play on the training of the models we would use in the project. As part of a Kaggle competition, we scored our models according to the competition guidelines, which we discuss at the end.


 

EDA

We began with exploratory data analysis (EDA), which helped us understand the dataset and many of the relationships between its features. One of the first things we wanted to understand was how our target variable, “SalePrice”, related to the other features, so we used a heatmap to examine its linear correlations with them.

[Figure: correlation heatmap of “SalePrice” against the other features]
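As a rough sketch of this step (the exact code is not included in the post), a correlation heatmap like this can be built with pandas and seaborn; the file path and variable names below are assumptions for illustration.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Kaggle training data (path is an assumption for illustration)
train = pd.read_csv("train.csv")

# Linear correlations between the numeric features and SalePrice
corr = train.corr(numeric_only=True)

# Heatmap of the features most strongly correlated with SalePrice
top_features = corr["SalePrice"].abs().sort_values(ascending=False).head(10).index
sns.heatmap(train[top_features].corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.title("Correlation with SalePrice")
plt.show()
```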

 

Outliers

We also found outliers in our dataset. As you can see from the visualizations below, we identified the outliers visually and then removed them. We did this because outliers could misrepresent the averages in our data and cause our models to overfit to those extreme points.

[Figures: scatter plots used to visually identify outliers]
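The sketch below illustrates how such outliers can be spotted and dropped; the feature used here (“GrLivArea”) and the cutoff values are illustrative assumptions, not necessarily the exact rules used in the project.

```python
import matplotlib.pyplot as plt

# Visual check: very large houses with unusually low prices stand out
plt.scatter(train["GrLivArea"], train["SalePrice"], alpha=0.4)
plt.xlabel("GrLivArea (sq ft)")
plt.ylabel("SalePrice")
plt.show()

# Drop the visually identified outliers (thresholds are illustrative only)
outliers = train[(train["GrLivArea"] > 4000) & (train["SalePrice"] < 300000)].index
train = train.drop(outliers)
```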

Another reason it was important to understand and remove outliers is that the dataset is skewed, and outliers would have exacerbated that skew. Skewness is the asymmetry of a distribution; in this case our data had a rightward skew, and to deal with it we applied a log transformation, as shown below.

[Figure: log transformation of the right-skewed “SalePrice” distribution]
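The transformation itself is a one-liner with NumPy; the sketch below uses log1p, a common choice for right-skewed prices (the exact call used in the project is not shown).

```python
import numpy as np
from scipy.stats import skew

print("Skewness before:", skew(train["SalePrice"]))

# Log-transform the right-skewed target to make its distribution more symmetric
train["SalePrice"] = np.log1p(train["SalePrice"])

print("Skewness after:", skew(train["SalePrice"]))
```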

Missing Data

In addition to what we discovered above, there was a lot of missing data. The fact that many features had very large amounts of missing data gave us an early sense of how to approach feature engineering, which we address in more detail later. This part of EDA led naturally into missingness and imputation: we replaced “NA” in many columns with 0 or None, depending on the column, replaced some of the 0’s with the mean or median, and dropped several features entirely. The features we dropped were: “Id”, “PoolQC”, “Utilities”, “MiscFeature”, “MiscVal”, and “Alley”.
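A minimal sketch of this imputation step follows; the dropped columns are the ones listed above, but the other column choices are illustrative examples of the pattern (in the Ames data, “NA” often means the feature simply is not present).

```python
# Columns where "NA" means the feature is absent (illustrative subset)
none_cols = ["FireplaceQu", "GarageType", "GarageFinish", "BsmtQual", "BsmtCond"]
zero_cols = ["GarageArea", "GarageCars", "TotalBsmtSF", "MasVnrArea"]

for col in none_cols:
    train[col] = train[col].fillna("None")
for col in zero_cols:
    train[col] = train[col].fillna(0)

# Example of mean/median imputation for a numeric column
train["LotFrontage"] = train["LotFrontage"].fillna(train["LotFrontage"].median())

# Features dropped entirely, as listed above
train = train.drop(columns=["Id", "PoolQC", "Utilities", "MiscFeature", "MiscVal", "Alley"])
```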

Beyond missingness and imputation, many of the non-numerical categories were dummified or given ordinal values. Dummification converts categorical values into dummy (indicator) variables that are much easier to use and manipulate. After handling these issues, it was much easier to complete our feature engineering.
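For illustration, dummification can be done with pandas’ get_dummies, and ordinal quality codes can be mapped to integers; the mapping below uses the dataset’s standard Ex/Gd/TA/Fa/Po scale and is a sketch rather than the exact encoding used in the project.

```python
# Ordinal encoding for quality-type columns (Ex > Gd > TA > Fa > Po > None)
quality_map = {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1, "None": 0}
for col in ["ExterQual", "KitchenQual", "BsmtQual", "FireplaceQu"]:
    train[col] = train[col].map(quality_map)

# Dummify the remaining nominal categorical columns
train = pd.get_dummies(train, drop_first=True)
```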

For our feature engineering we created new features, including “Total Baths”, “Total SF”, median price by year, whether the house was remodeled or not (1 or 0), and a “New House” flag for newly built homes. We then dropped the redundant source columns. We also found that “Neighborhood” was one of our most important and most strongly correlated features: it had a 0.74 correlation with “SalePrice”, the third strongest of any feature, so we made a box plot by “Neighborhood” and partitioned it into eighths (a sketch of the engineered features follows the plot below).

 

[Figure: box plot of “SalePrice” by “Neighborhood”, partitioned into eighths]
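A sketch of the engineered features using the Kaggle column names is shown below; the exact formulas (for example, how half baths are weighted) are assumptions, and the post does not list which redundant source columns were dropped afterwards.

```python
# Combine bath counts into one feature (weighting half baths by 0.5 is an assumption)
train["TotalBaths"] = (train["FullBath"] + 0.5 * train["HalfBath"]
                       + train["BsmtFullBath"] + 0.5 * train["BsmtHalfBath"])

# Total square footage: basement plus first and second floors
train["TotalSF"] = train["TotalBsmtSF"] + train["1stFlrSF"] + train["2ndFlrSF"]

# Remodeled flag and "New House" flag (sold in the year it was built)
train["Remodeled"] = (train["YearRemodAdd"] != train["YearBuilt"]).astype(int)
train["NewHouse"] = (train["YrSold"] == train["YearBuilt"]).astype(int)

# Median sale price by year sold (on the log scale at this point in the sketch)
train["MedianPriceByYear"] = train.groupby("YrSold")["SalePrice"].transform("median")
```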

Machine Learning Models

With the data prepared, we built machine learning models. We used GridSearchCV for hyperparameter tuning and then fit the following models: ridge, lasso, and regular linear regression; we also tried a few others, such as XGBoost, but decided to discard them. To score our models we followed the Kaggle competition guidelines and evaluated the Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sale price. The formula is shown below:

RMSE = √( (1/n) Σᵢ ( log(ŷᵢ) − log(yᵢ) )² ), where ŷᵢ is the predicted sale price and yᵢ is the observed sale price.
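Below is a minimal sketch of tuning one of these models with GridSearchCV and scoring it with RMSE on the log scale; the alpha grid, cross-validation settings, and the final fillna are illustrative assumptions rather than the project's exact configuration.

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# X: engineered features, y: log-transformed SalePrice (see above);
# any remaining gaps are filled with 0 just to keep this sketch runnable
X = train.drop(columns=["SalePrice"]).fillna(0)
y = train["SalePrice"]

# Tune the regularization strength with 5-fold cross-validation
params = {"alpha": [0.1, 1, 10, 30, 100]}
grid = GridSearchCV(Ridge(), params,
                    scoring="neg_root_mean_squared_error", cv=5)
grid.fit(X, y)

# Because y is already on the log scale, this RMSE approximates the Kaggle metric
print("Best alpha:", grid.best_params_["alpha"])
print("CV RMSE:", -grid.best_score_)
```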

Ridge performed moderately well compared to our other models, with an RMSE of 0.1109. Lasso selected 54 variables and eliminated 131, but did not perform well in comparison, with an RMSE of 0.1083. Our best performing model was also the simplest: regular linear regression, with an RMSE score of 0.1308.

For future work, given more time we would have incorporated outside datasets and sources. We wanted to introduce the Ames, Iowa housing price index, and more domain knowledge, such as which roof types are more valuable, would also have been very beneficial. We would also have investigated whether there was a seasonality effect from students leaving during school breaks, since students make up a large percentage of the population in Ames. Finally, incorporating economic data, similar to the FRED series shown below, would have been very interesting:

[Figure: housing-related economic data series from FRED]

Conclusion

In conclusion, our models performed fairly well and we were happy with our progress. We did not use any of the higher-end or more advanced models and still felt we had good results overall. We learned a lot throughout the process and would like to continue developing our machine learning skills through similar projects in the future.
