Studying Data to Predict Housing Prices in Ames, Iowa

Posted on Sep 3, 2018

Predicting Housing Prices in Ames, Iowa

By Ariani Herrera, Erin Dugan, John Nie, Won Kang

This Kaggle competition offered a comprehensive dataset of nearly 3,000 home sales in Ames, Iowa, between 2006 and 2010, along with the features influencing the sale prices. Our motivation was to explore and analyze the housing data, find the key features that influenced the sale price, and then develop a machine learning model to predict the prices.


Exploratory Data Analysis and Data Transformation

Upon observing the data, our team wanted a systematic method for understanding the variables before building any sort of model. We sorted through a sample of the data set, which included 79 explanatory variables and the corresponding sale price for each home. Our team then explored the data types: counting the ID column and the sale price alongside the 79 explanatory variables, the 81 columns divided into 38 numerical features and 43 categorical features.


To predict the housing prices, our team explored the variables to evaluate the key features that affect housing prices in Ames, Iowa. We then wanted to understand redundancies, collinearity, and the relationships between different variables.


We began our Exploratory Data Analysis by examining the correlation between the features. The heat map below shows the correlation between all the features in the data set. The more extreme the color (that is, the closer the correlation is to 1 or -1), the stronger the relationship between the variables. From this graph, we can see which features had the greatest influence on the Sale Price and note the correlation among the features themselves. Multicollinearity, a strong correlation between features in the model, can contribute to errors in the model. For example, GarageYrBlt and YearBuilt were highly correlated with each other, so only YearBuilt was used in the prediction model.
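The redundancy check described above can be sketched as follows. This is a minimal illustration, assuming pandas; the helper name `highly_correlated_pairs`, the 0.8 threshold, and the toy values are hypothetical and are not taken from the original analysis.

```python
import pandas as pd

def highly_correlated_pairs(df, threshold=0.8):
    """Return (feature_a, feature_b, r) for numeric column pairs whose
    absolute Pearson correlation exceeds the (assumed) threshold."""
    corr = df.select_dtypes(include="number").corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skip the diagonal
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs

# Toy frame with made-up values; this is not the real Ames data.
toy = pd.DataFrame({
    "YearBuilt":   [1950, 1975, 1990, 2005, 2008],
    "GarageYrBlt": [1952, 1975, 1991, 2005, 2008],
    "LotArea":     [9000, 7200, 11000, 8000, 9500],
})
pairs = highly_correlated_pairs(toy)
for a, b, r in pairs:
    print(f"{a} vs {b}: r = {r:.3f}")
```

One feature from each flagged pair is then a candidate for removal, as with GarageYrBlt here.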

Our team also wanted to remove information that did not seem relevant or useful. If a variable was missing more than 50% of its values, we removed that column from our dataset. The graph below is a visual representation of the missing data.
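A short sketch of the 50% rule, assuming pandas. The helper name `drop_sparse_columns` and the toy values are illustrative only.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing=0.5):
    """Keep only columns whose share of missing values is at most max_missing."""
    missing_share = df.isna().mean()  # fraction of NaN values per column
    keep = missing_share[missing_share <= max_missing].index
    return df[keep]

# Toy frame with made-up values.
toy = pd.DataFrame({
    "LotArea":     [9000, 7200, 11000, 8000],
    "PoolQC":      [np.nan, np.nan, np.nan, "Gd"],  # 75% missing, dropped
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],      # 25% missing, kept
})
cleaned = drop_sparse_columns(toy)
print(list(cleaned.columns))
```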

For other variables with a significant portion of missing data, such as Lot Frontage, we made educated assumptions to impute the missing values. After some research, our team felt it was fair to impute a missing Lot Frontage value with the average frontage for that particular neighborhood, as most lots in a given neighborhood do not deviate significantly. For other features with missing values, we imputed the missing data with zero or None, depending on the feature.
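The neighborhood-average imputation can be sketched with a pandas group-wise transform. The column names `Neighborhood` and `LotFrontage` are assumed to match the Kaggle file, and the values below are made up.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values.
toy = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "OldTown", "OldTown"],
    "LotFrontage":  [70.0, np.nan, 80.0, 50.0, np.nan],
})

# Mean frontage per neighborhood, aligned row by row with the original frame.
neighborhood_mean = toy.groupby("Neighborhood")["LotFrontage"].transform("mean")

# Fill only the missing entries; observed values are left untouched.
toy["LotFrontage"] = toy["LotFrontage"].fillna(neighborhood_mean)
print(toy)
```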

Excluding the Outliers

Generally speaking, outliers are observations that deviate markedly from the overall pattern, and extreme outliers can dramatically impact the results of a prediction model. We conservatively removed two outliers: homes with more than 4,000 square feet of living area that sold for less than $300,000. After the outliers were removed, the graphs depicted a more linear relationship between sale price and square footage.

Certain numeric variables actually described categorical features. The graph below illustrates this for the month sold: while sales varied by month, the month number itself carried no numerical pattern.

Some categorical variables were ordinal. Kitchen Quality, shown below, is one example: an excellent kitchen increased the value of a house. Our team ranked the kitchen qualities so that our model would place excellent (Ex) above good (Gd), typical/average (TA), and fair (Fa).
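Both recodings can be sketched in pandas. The column names `MoSold` and `KitchenQual` are assumed to match the Kaggle file, and the integer ranks are an illustrative choice; the source gives only the ordering.

```python
import pandas as pd

# Toy frame with made-up values.
toy = pd.DataFrame({
    "MoSold":      [6, 12, 3],
    "KitchenQual": ["Ex", "TA", "Gd"],
})

# The month number has no meaningful magnitude, so store it as a label.
toy["MoSold"] = toy["MoSold"].astype(str)

# Kitchen quality has a natural order, so map it to increasing ranks (assumed scale).
quality_rank = {"Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}
toy["KitchenQual"] = toy["KitchenQual"].map(quality_rank)
print(toy)
```

The string-typed month would then be expanded into dummy variables along with the other categorical features, while the ranked kitchen quality stays a single numeric column.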

Skewed distributions can decrease the accuracy of a model's predictions, because a long tail can dominate the fit and reduce the influence of less frequent values that would carry comparable weight in normally distributed data. As shown in the graphs below, we used a log transformation to normalize the Sale Price. Other skewed features, such as Lot Area, were adjusted in the same manner.
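The effect of the log transformation can be checked on synthetic data. The sketch below assumes NumPy and pandas and uses `np.log1p` (the log of 1 + x) as one common variant; the source does not say which form of the log was applied, and the prices are simulated rather than taken from the Ames data.

```python
import numpy as np
import pandas as pd

# Simulated right-skewed "prices" (lognormal), not the real Ames sale prices.
rng = np.random.default_rng(0)
prices = pd.Series(np.exp(rng.normal(loc=12.0, scale=0.4, size=500)), name="SalePrice")

log_prices = np.log1p(prices)    # compress the long right tail
recovered = np.expm1(log_prices) # inverse transform, back to the price scale

print(f"skew before: {prices.skew():.2f}")
print(f"skew after:  {log_prices.skew():.2f}")
```

Predictions made on the log scale need the inverse transform (`np.expm1` here) before they can be read as prices.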


Machine Learning Model Development


After cleaning and transforming our data, we explored various machine learning algorithms to predict sales prices. To investigate benefits of dimension reduction and look for strong patterns in the dataset, we also evaluated the models with and without Principal Component Analysis. The various machine learning algorithms were optimized with a parameter grid search and used K-Fold cross validation to prevent overfitting. The two paths used in developing our models are shown below.

Our team explored Linear Regression, Lasso, Ridge, Elastic Net, Kernel Ridge, and Random Forest regression models, using a grid search with K-fold cross validation to select the tuning parameters for each.
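A minimal sketch of a grid search with K-fold cross validation, assuming scikit-learn and using Ridge as the example model. The alpha grid, the five folds, and the synthetic data are illustrative choices, not the settings used in the project.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic regression data standing in for the prepared feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.1, size=200)

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]  # assumed search grid
search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": alphas},
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, hence the sign
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
cv_rmse = -search.best_score_  # flip the sign to get a positive RMSE
print(f"best alpha: {best_alpha}, cross-validated RMSE: {cv_rmse:.3f}")
```

The same pattern applies to the other models by swapping the estimator and its parameter grid.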

We tested these models with and without Principal Component Analysis (PCA). After cleaning our data, performing EDA, and creating dummy variables for the categorical features, there were over 318 features in the model. 85 principal components accounted for 90% of the variance in our data. PCA allowed us to reduce the dimensionality of the data and illustrate strong patterns.
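Selecting enough components to reach a variance target can be sketched with scikit-learn, where a fractional `n_components` keeps the smallest number of components whose cumulative explained variance reaches that fraction. The standardization step and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 observed features driven by 3 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
X = latent @ rng.normal(size=(3, 20)) + rng.normal(scale=0.1, size=(300, 20))

# Standardize first so no feature dominates because of its units (assumed step).
reducer = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.90, svd_solver="full"),  # keep 90% of the variance
)
X_reduced = reducer.fit_transform(X)

pca = reducer.named_steps["pca"]
explained = pca.explained_variance_ratio_.sum()
print(f"{pca.n_components_} components explain {explained:.1%} of the variance")
```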

The chart below shows our results with Principal Component Analysis for the Kernel Ridge, Lasso, Random Forest, and Elastic Net models. With PCA, the Kernel Ridge, Lasso, and Elastic Net models performed best in predicting housing prices, with a root-mean-square error (RMSE) of approximately 0.026 to 0.027.

Unfortunately for our models, PCA has its limitations since it relies on linear assumptions. Although PCA does a great job with data that is linearly correlated, PCA might not capture the full picture if there are nonlinearities.

Models Without PCA

Our team also ran the models without PCA, with the results depicted in the graph below. These results indicate that the Ridge, Lasso, Elastic Net, and Random Forest models performed better without PCA, predicting housing prices with a root-mean-square error (RMSE) of approximately 0.024 to 0.029.

Finally, once we optimized our models based on the training data, we ran the Kaggle test set through the algorithms to predict the sale price. Overall, the ridge model provided the most accurate predictions.



From this project, we gained several key insights and ideas for improving the model in the future. While implementing PCA did not always improve predictions, results might improve if we identified and accounted for any nonlinear relationships within our data set. We believe the grid search over each model's input parameters improved accuracy, and that accuracy could be improved further with additional tuning over a broader range of parameters.


Additional regression models could also be tested, and model stacking or ensembling techniques could combine several models into a single prediction. Other considerations include investigating additional data that could influence prices, such as how long the home was on the market and whether it has a desirable layout. It would also be interesting to see how these models perform on home sales outside the dataset, for years after 2010.


About Authors

Ariani Herrera

Demonstrated passion for learning and developing solutions to complex business problems. Skilled in Sales, Business Development, Finance, Alternative Investments, and Equities. Strong analytical professional with a Bachelor’s Degree in Applied Mathematics from Columbia University in the City of...

Erin Dugan

As an engineer with a strong background in R&D and acoustics, Erin enjoys finding creative ways to interpret and communicate complex information, whether it's for product development or project design. She holds a Master of Engineering Management (MEM)...