Housing Prices in Ames, Iowa – a Machine Learning Project

Introduction

‘Machine Learning’ is the science of programming computers so they can learn from data, and its use has the potential to change the way we work and solve problems. For such a powerful tool, many people are still unaware of how it works and why it is so useful. This project, a Kaggle competition focused on predicting housing prices, is an end-to-end example of how machine learning techniques can improve our ability to harness raw data for a productive purpose. We hope this project inspires the reader to dive deeper into the subject and brainstorm other ways these methods can be used – the sky’s the limit.

What follows is the combined work of Kenneth Colangelo, Sheetal Darekar, Marissa Joy, and Merle Strahlendorf.

The Project

Kaggle is a platform for data science competitions and a great place to find datasets, solve difficult problems, and communicate about different analytical techniques. The original competition can be found here: House Prices: Advanced Regression Techniques. Our aim was to predict residential housing prices (the target variable) using 79 different explanatory variables, ranging from what street the house was on to its square footage.

We began the project by outlining our workflow:

  • Understanding our data
  • Exploratory Data Analysis (EDA) and Preprocessing
  • Feature Engineering
  • Modeling

The competition provided a dataset of 2,919 homes, 1,460 of which formed our training set (these included the sale price); the remaining 1,459 made up the test set. We grouped the 79 variables into categorical variables, which included ordinal and nominal, and numerical variables, which included continuous and discrete. To get a head start on how these might relate to housing prices, we contacted someone in the industry, a real estate salesperson, who provided us with an example of a listing. Equipped with this information, we began to dig into the data.

We graphically explored the relationship between the sale price and each of our variables. We wanted to make sure we understood how each categorical variable affected price, so that when we moved to feature engineering we would model the correct relationship (positive or negative). We also noticed that values were missing throughout the set, so in preprocessing our goal was to deliver a clean dataset.
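To give a sense of what this looked like, here is a minimal EDA sketch, assuming the training data is loaded into a pandas DataFrame named train with the Kaggle column names (Neighborhood and GrLivArea are two of the 79 variables):

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Categorical variable vs. price: boxplots show how each level shifts the price.
sns.boxplot(x="Neighborhood", y="SalePrice", data=train, ax=axes[0])
axes[0].tick_params(axis="x", rotation=90)

# Numerical variable vs. price: a scatter plot shows the direction of the relationship.
axes[1].scatter(train["GrLivArea"], train["SalePrice"], alpha=0.3)
axes[1].set_xlabel("GrLivArea")
axes[1].set_ylabel("SalePrice")

plt.tight_layout()
plt.show()
```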

We looked at the definitions of our variables to decide whether an N/A was a category in its own right or just a missing value. Once we had resolved the N/A categories, we began filling in missing values with the mean or mode of houses that had similar features. We even used the neighborhood variable to estimate missing lot sizes.
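The snippet below sketches this imputation logic; the column lists are illustrative (taken from the Kaggle data dictionary), not our full pipeline:

```python
# Assuming a DataFrame `df` with the Kaggle column names.

# 1. Where the data dictionary says NA is a real category (e.g. "no garage"),
#    keep it as its own level instead of treating it as missing.
for col in ["GarageType", "BsmtQual", "FireplaceQu"]:
    df[col] = df[col].fillna("None")

# 2. Fill remaining categorical gaps with the mode ...
df["Electrical"] = df["Electrical"].fillna(df["Electrical"].mode()[0])

# 3. ... and estimate missing lot frontage from the median of houses
#    in the same neighborhood.
df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median())
)
```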

This was raw data, flaws and all, so we knew it was important to keep as much information as possible so our models could find the true relationships between variables. We also noticed that the sale price is skewed to the right (see upper graph); to adjust the target variable, we took its log (see lower graph).

[Figure: distribution of SalePrice before (upper) and after (lower) the log transformation]
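The transformation itself is one line; log1p (log(1 + x)) is the usual choice, and predictions are mapped back to dollars with expm1:

```python
import numpy as np

# Assuming `train` holds the raw training data; the log tames the right skew.
y = np.log1p(train["SalePrice"])
```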

Once we had taken care of our missing values, we moved on to feature engineering, beginning with the correlations between our variables. What kinds of interactions did we see? Did variables overlap? Which variables were most related to the sale price? A few jumped out, including overall quality and total square footage.
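A quick way to surface those relationships is to rank features by their correlation with the sale price, as in this sketch (again assuming the train DataFrame from above):

```python
# Correlation of every numeric feature with the sale price, strongest first.
corr = train.corr(numeric_only=True)
print(corr["SalePrice"].abs().sort_values(ascending=False).head(10))
```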

These were variables we had discussed at the beginning, and sure enough they would prove important to our model. The purpose of feature engineering is to combine and enhance variables to better capture their relationship to the target. We took time to combine a number of variables, like the square-footage columns, so we could assess them not in pieces but as one.
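For example, the basement and above-ground areas can be summed into a single total-square-footage feature; the exact columns we combined are illustrative here:

```python
# One combined size feature instead of three partial ones.
df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]
```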

While feature engineering, we also took time to look at outliers and decide which data points were anomalous; in the end we removed five houses, considering them too unique to be included in our model. Once we had decided which variables we thought would structure our data, we moved on to modeling.
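The five houses were chosen by inspection, but a hypothetical rule like the one below captures the idea: drop houses with very large living areas selling at unusually low prices.

```python
# Hypothetical outlier filter (the thresholds are for illustration only).
outliers = train[(train["GrLivArea"] > 4000) & (train["SalePrice"] < 300000)].index
train = train.drop(outliers)
```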

Modeling

We split our training dataset with the train_test_split function from the sklearn model selection package and decided on a / split.
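In code this looks like the following sketch; the 80/20 ratio shown here is illustrative, since the exact split we used is not recorded above:

```python
from sklearn.model_selection import train_test_split

# X holds the engineered features, y the log-transformed sale price.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```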

The first model we fitted was a simple multiple linear regression, mostly to see how the data would respond. The R² score for this model was .9475, which means multiple linear regression explains 94.75 percent of the variance in our response variable (sale price). Nonetheless, the RMSE (root mean squared error) – the metric we used to measure model fit – was .128, the worst of all our models.
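A minimal version of this baseline, assuming the split above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lm = LinearRegression().fit(X_train, y_train)

print("R^2: ", lm.score(X_test, y_test))
# RMSE on the log-price scale, the metric Kaggle uses for this competition.
print("RMSE:", np.sqrt(mean_squared_error(y_test, lm.predict(X_test))))
```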

The next model we tested was a lasso regression. Before using Grid Search, also from the sklearn model selection package, we were interested to see whether we could write our own function to find the best alpha for different regression models: a for loop that checks at each iteration whether the RMSE is smaller than the previous one. With our own alpha function we got 0.001 as our best alpha and an RMSE of .1171. The next attempt used KFold and cross_val_score, which gave us the same alpha but an RMSE of .1154.
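The hand-rolled search might look like this sketch; the candidate alphas and the five folds are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score

kf = KFold(n_splits=5, shuffle=True, random_state=42)
best_alpha, best_rmse = None, np.inf

# Keep the alpha whose cross-validated RMSE beats the best one seen so far.
for alpha in [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05]:
    mse = -cross_val_score(
        Lasso(alpha=alpha, max_iter=10000),
        X_train, y_train,
        scoring="neg_mean_squared_error", cv=kf,
    ).mean()
    rmse = np.sqrt(mse)
    if rmse < best_rmse:
        best_alpha, best_rmse = alpha, rmse

print(best_alpha, best_rmse)
```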

Finally we used Grid Search in combination with RobustScaler, a scaler that uses statistics robust to outliers: it removes the median and scales the data according to the quantile range. The lasso regression with an alpha of 0.001 returned an RMSE of .1153. This model scored an RMSE of .12511 on Kaggle.
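A sketch of that combination; the alpha grid here is an illustrative assumption:

```python
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# RobustScaler runs before the lasso inside one pipeline, so the grid
# search scales each fold's training data independently.
pipe = make_pipeline(RobustScaler(), Lasso(max_iter=10000))
grid = GridSearchCV(
    pipe,
    param_grid={"lasso__alpha": [0.0005, 0.001, 0.005, 0.01]},
    scoring="neg_mean_squared_error",
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_)
```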

After lasso regression we tried ElasticNet. Again we searched for the best alpha and minimal RMSE with our own function, cross_val_score plus KFold, and Grid Search. An alpha of 0.001 was the best with our function as well as with cross_val_score: our function gave an RMSE of .1175, and cross_val_score gave .1144. Grid Search chose an alpha of 0.00932 and an l1_ratio of 0.01, which produced an RMSE of .1138. On Kaggle the RMSE was .12350.
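The ElasticNet grid search is analogous, now tuning alpha and l1_ratio together; the grids below are illustrative, chosen to include the reported best values:

```python
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    ElasticNet(max_iter=10000),
    param_grid={
        "alpha": [0.001, 0.005, 0.00932, 0.01],
        "l1_ratio": [0.01, 0.1, 0.5, 0.9],
    },
    scoring="neg_mean_squared_error",
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_)
```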

Last but not least we ran a Gradient Boosting Regression (GBM). We set n_estimators to 3000, the learning rate to .005, and the max depth to 20. This gave us an RMSE of .117.
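With sklearn, the model and the hyperparameters quoted above look like this:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

gbm = GradientBoostingRegressor(
    n_estimators=3000, learning_rate=0.005, max_depth=20
)
gbm.fit(X_train, y_train)
print("RMSE:", np.sqrt(mean_squared_error(y_test, gbm.predict(X_test))))
```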

Conclusion

This first group project gave us a taste of how to work together as a team. Looking back on the project flow, we spent a lot of time conducting EDA, imputing missing values, and feature engineering, which gave us a good overview of our data; however, due to the time constraints of the project, we were not able to fully flesh out our regression models. Even though R provides easier visualization tools for missingness, EDA, and regression, we decided to take on the challenge of coding purely in Python, which in retrospect took more time than expected.

You can find the code for our project here.

Thank you very much for reading!

About Authors

Merle Strahlendorf

Merle graduated with a B.S. in Business Administration from the University of Hamburg in Germany in 2015. Since then she has worked as a US Correspondent for multiple German magazines focusing on the marketing, media and communications sector....

Marissa Joy

Stony Brook applied math and economics graduate turned Data Scientist who enjoys digging deep into data of all types and sizes. She is skilled in data wrangling and data visualization using Python, R, and SQL. Some of her...

Kenneth Colangelo

Ken graduated from Cornell in 2007. After spending 10 years working in the Global Economics Department at AllianceBernstein, he enrolled in the Data Science Bootcamp to learn alternative techniques to data analysis. 'Big Data' and 'Machine Learning' have...
