Housing Prices in Ames, Iowa – A Machine Learning Project
Introduction
Machine learning is the science of programming computers so that they can learn from data, and its use has the potential to change the way we work and solve problems. Yet for such a powerful tool, many people are still unaware of how it works and why it is so useful. This project, a Kaggle competition focused on predicting housing prices, is an end-to-end example of how machine learning techniques can improve our ability to harness raw data for a productive purpose. We hope it inspires the reader to dive deeper into the subject and brainstorm other ways these methods can be used – the sky's the limit.
What follows is the combined work of Kenneth Colangelo, Sheetal Darekar, Marissa Joy, and Merle Strahlendorf.
The Project
Kaggle is a platform for data science competitions and a great place to find datasets, solve difficult problems, and discuss different analytical techniques. The original competition can be found here: House Prices: Advanced Regression Techniques. Our aim was to predict residential housing prices (the target variable) using 79 explanatory variables, ranging from the street a house was on to its square footage.
We began the project by outlining our workflow:
- Understanding our data
- Exploratory Data Analysis (EDA) and Preprocessing
- Feature Engineering
- Modeling
The competition provided a dataset of 2,919 homes, 1,460 of which formed our training set (these included the sale price). We grouped the 79 variables into categorical variables (ordinal and nominal) and numerical variables (continuous and discrete). To get a head start on how these might relate to housing prices, we contacted someone in the industry, a real estate salesperson, who provided us with an example of a listing. Equipped with this information, we began to dig into the data.
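As a first pass at that grouping, a minimal sketch in pandas (assuming the Kaggle train.csv file) might split the columns by dtype; the data dictionary is still needed to separate ordinal from nominal and continuous from discrete:

```python
import pandas as pd

df = pd.read_csv("train.csv")  # the Kaggle training file

# Split columns by dtype as a rough first cut at the grouping;
# the data dictionary is still needed for finer distinctions
# (e.g. MSSubClass is stored as a number but is really categorical).
categorical = df.select_dtypes(include="object").columns
numerical = df.select_dtypes(include="number").columns
print(len(categorical), "categorical,", len(numerical), "numerical")
```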
We graphically explored the relationship between the sale price and each of our variables. We wanted to make sure we understood how each categorical variable affected price, so that when we moved to feature engineering we would model the correct relationship (positive or negative). We also noticed that values were missing throughout the set, so in preprocessing our goal was to deliver a clean dataset.
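As an illustration (the post does not show its plotting code), one such plot could be a box plot of sale price per category, here sketched with seaborn:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# One plot of the kind described above: sale price by neighborhood.
sns.boxplot(data=df, x="Neighborhood", y="SalePrice")
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
```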
We looked at the definitions of our variables to decide whether an N/A was a category in its own right or just a missing value. Once we had resolved the N/A categories, we filled in missing values based on the mean or mode of houses with similar features; we even used the neighborhood variable to estimate missing lot sizes.
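A minimal sketch of both ideas, assuming the standard Ames column names (the exact imputation rules we used are in the full code):

```python
# Per the data dictionary, NA for basement features means "no
# basement" – a category of its own, not a missing value.
df["BsmtQual"] = df["BsmtQual"].fillna("None")

# Estimate missing lot frontage from houses in the same
# neighborhood, falling back to the overall median.
df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median())
)
df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].median())
```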
This was raw data, flaws and all, so we knew it was important to keep as much information as possible so that our models could find the true relationships between variables. We also realized that the sale price is skewed to the right (see upper graph); to adjust the target variable, we took its log (see lower graph).
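The transformation itself is one line; np.log1p (the log of 1 + x) is a safe variant:

```python
import numpy as np

# Sale price has a long right tail; taking the log pulls it toward
# a roughly normal shape, which suits linear models better.
y = np.log1p(df["SalePrice"])
```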
Once we had taken care of the missing values, we moved on to feature engineering, beginning with the correlations between our variables. What kinds of interactions did we see? Did variables overlap? Which variables were most related to the sale price? A few jumped out, including overall quality and total square footage.
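A quick way to surface such candidates is to rank the numeric features by their correlation with sale price, for example:

```python
# Rank numeric features by absolute correlation with sale price.
corr = df.corr(numeric_only=True)["SalePrice"].drop("SalePrice")
print(corr.abs().sort_values(ascending=False).head(10))
```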
These were variables we had discussed at the start, and sure enough they turned out to be important to our models. The purpose of feature engineering is to combine and enhance variables to better capture their relationship to the target. We took time to combine a number of related variables, like the various square-footage columns, so we could assess them not in pieces but as one.
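For square footage, a sketch of that combination (the exact columns summed here are an assumption based on the standard Ames fields):

```python
# Collapse the scattered area columns into a single total-square-
# footage feature so size enters the model as one signal.
df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]
```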
We also took time to look at outliers and decide which data points were anomalous; in the end we removed five houses, considering them too unique to be included in our models. Once we had decided which variables would structure our data, we moved on to modeling.
Modeling
We split our training dataset with the train_test_split function from sklearn's model_selection module and decided on a / split.
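A minimal sketch, assuming the features have already been numerically encoded and assuming an 80/20 split (the ratio above was left blank in the post):

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["SalePrice"])  # assumes X is already encoded

# test_size=0.2 is an assumption; the original ratio is not stated.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```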
The first model we fitted was a simple multiple linear regression, mostly to see how the data would respond. The R² score for this model was .9475, which means the regression explains 94.75 percent of the variance in our response variable (sale price). Nonetheless, the RMSE (root mean squared error) – the metric we used to measure model fit – was .128, the worst (i.e., highest) of all our models.
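In sklearn this baseline is only a few lines; a sketch:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lm = LinearRegression().fit(X_train, y_train)
print("R^2:", lm.score(X_train, y_train))

# RMSE on the log-transformed target, matching the Kaggle metric.
rmse = np.sqrt(mean_squared_error(y_test, lm.predict(X_test)))
print("RMSE:", rmse)
```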
The next model we tested was a lasso regression. Before using grid search (GridSearchCV, also from sklearn's model_selection module), we were interested to see whether we could write our own function to find the best alpha for different regression models: a for loop that, on each iteration, checks whether the RMSE is smaller than the previous one. With our own alpha function we got 0.001 as the best alpha and an RMSE of .1171. The next attempt used KFold and cross_val_score, which gave us the same alpha but an RMSE of .1154.
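A sketch of that search loop (the alpha grid and fold count here are assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score

best_alpha, best_rmse = None, np.inf
for alpha in [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1]:
    # 5-fold cross-validated RMSE (sklearn reports negated MSE).
    scores = cross_val_score(
        Lasso(alpha=alpha, max_iter=10000), X_train, y_train,
        scoring="neg_mean_squared_error",
        cv=KFold(n_splits=5, shuffle=True, random_state=42),
    )
    rmse = np.sqrt(-scores.mean())
    if rmse < best_rmse:  # keep the alpha with the lowest error so far
        best_alpha, best_rmse = alpha, rmse

print(best_alpha, best_rmse)
```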
Finally, we used GridSearchCV in combination with RobustScaler, a scaler that uses statistics robust to outliers: it removes the median and scales the data according to the quantile range. The lasso regression with an alpha of 0.001 returned an RMSE of .1153. This model scored an RMSE of .12511 on Kaggle.
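A sketch of that pipeline, with an assumed alpha grid built around the value we found:

```python
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# RobustScaler centers on the median and scales by the IQR, so a
# handful of extreme houses cannot dominate the coefficients.
pipe = make_pipeline(RobustScaler(), Lasso(max_iter=10000))

grid = GridSearchCV(
    pipe,
    param_grid={"lasso__alpha": [0.0005, 0.001, 0.005, 0.01]},
    scoring="neg_mean_squared_error",
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_)
```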
After lasso regression we tried ElasticNet. Again we searched for the best alpha and minimal RMSE with our own function, with cross_val_score plus KFold, and with GridSearchCV. An alpha of 0.001 was the best with our own function as well as with cross_val_score; our function gave an RMSE of .1175, and cross_val_score gave .1144. GridSearchCV returned an alpha of 0.00932 and an l1_ratio of 0.01, with which we got an RMSE of .1138. On Kaggle the RMSE was .12350.
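A sketch of the ElasticNet grid search (the grids are assumptions built around the values reported above):

```python
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Search over both the penalty strength and the L1/L2 mix.
grid = GridSearchCV(
    ElasticNet(max_iter=10000),
    param_grid={
        "alpha": [0.001, 0.005, 0.00932, 0.01],
        "l1_ratio": [0.01, 0.1, 0.5, 0.9],
    },
    scoring="neg_mean_squared_error",
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_)
```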
Last but not least, we ran a gradient boosting regression (GBM). We set n_estimators to 3000, the learning rate to .005, and the max depth to 20. This gave us an RMSE of .117.
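The post does not name the implementation; with sklearn's GradientBoostingRegressor, those settings would translate to:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

gbm = GradientBoostingRegressor(
    n_estimators=3000,    # many small corrective trees
    learning_rate=0.005,  # each tree contributes a small step
    max_depth=20,
)
gbm.fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_test, gbm.predict(X_test)))
print("RMSE:", rmse)
```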
Conclusion
This first group project gave us a taste of how to work together as a team. Looking back on the project flow, we spent a lot of time on EDA, imputing missing values, and feature engineering, which gave us a good overview of our data; however, due to the project's time constraints, we were not able to fully flesh out our regression models. Even though R provides many easier visualization tools for missingness, EDA, and regression, we decided to take on the challenge of coding purely in Python, which in retrospect took more time than expected.
You can find the code for our project here.
Thank you very much for reading!