Predicting Housing Prices with Machine Learning
In this project, our goal was to build accurate models to predict sale prices for houses in Ames, Iowa. We completed an entire machine learning workflow, from exploring the data through to a final presentation to our class. Our team name was Training Day, after the movie and as a play on the model training we would do in the project. Because this was part of a Kaggle competition, we scored our models according to the competition guidelines, which we discuss at the end.
We began with exploratory data analysis (EDA), which helped us understand our dataset and many of the relationships between its features. One of the first things we wanted to understand was the relationship between our “SalePrice” variable and the other features. During EDA we used a heatmap to get a better view of the linear correlations between “SalePrice” and those features.
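A minimal sketch of how such a correlation ranking can be computed with pandas. The data here is synthetic and the column names other than “SalePrice” are assumptions for illustration; this is not the code from our project.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data (values are made up).
rng = np.random.default_rng(0)
n = 200
gr_liv_area = rng.normal(1500, 400, n)
overall_qual = rng.integers(1, 11, n)
lot_area = rng.normal(10000, 3000, n)
sale_price = 50 * gr_liv_area + 15000 * overall_qual + rng.normal(0, 20000, n)

df = pd.DataFrame({
    "GrLivArea": gr_liv_area,
    "OverallQual": overall_qual,
    "LotArea": lot_area,
    "SalePrice": sale_price,
})

# Pearson (linear) correlation matrix over the numeric columns.
corr = df.select_dtypes("number").corr()

# Rank every other feature by its correlation with SalePrice.
ranked = corr["SalePrice"].drop("SalePrice").sort_values(ascending=False)
print(ranked)
```

The `corr` matrix is the kind of input a heatmap function such as seaborn's `heatmap` expects, while `ranked` gives the same information for “SalePrice” alone as a sorted list.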
We also found outliers in our dataset. As the visualizations below show, we identified them visually and then removed them. We did this because outliers can distort averages in the data, which could cause our models to overfit.
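Once outliers have been spotted on a plot, removing them can be a simple filter. The sketch below assumes the outliers are very large houses with unusually low prices; the column name “GrLivArea”, the toy rows, and both cutoffs are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Toy data: the last two rows are large houses with low prices.
df = pd.DataFrame({
    "GrLivArea": [1200, 1500, 1710, 2100, 4700, 5600],
    "SalePrice": [140000, 175000, 208500, 260000, 185000, 160000],
})

# Hypothetical cutoffs, as if read off a scatter plot by eye.
AREA_CUTOFF = 4000
PRICE_CUTOFF = 300000

# Flag rows that are both very large and cheap, then keep everything else.
is_outlier = (df["GrLivArea"] > AREA_CUTOFF) & (df["SalePrice"] < PRICE_CUTOFF)
cleaned = df.loc[~is_outlier].reset_index(drop=True)

print(len(df), "->", len(cleaned))
```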
Understanding and removing outliers was also important because the dataset is skewed, and outliers would have exacerbated this skew. Skewness is the asymmetry of a distribution; in this case, our data had a rightward skew. To deal with this, we applied a log transformation, as shown below.
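The effect of a log transformation on a right-skewed variable can be checked numerically. This sketch uses synthetic lognormal prices, and `np.log1p` is an assumed choice of transform (plain `np.log` would behave almost identically at these magnitudes).

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices (lognormal draws, not real data).
rng = np.random.default_rng(0)
prices = pd.Series(np.exp(rng.normal(12.0, 0.4, 1000)), name="SalePrice")

# log1p computes log(1 + x); expm1 is its inverse.
log_prices = np.log1p(prices)

skew_before = prices.skew()      # clearly positive: long right tail
skew_after = log_prices.skew()   # close to zero: roughly symmetric

print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Predictions made on the log scale can be mapped back to prices with `np.expm1`.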
In addition to the findings above, there was a lot of missing data. The fact that many features had very large amounts of missing data gave us an early understanding of how to approach feature engineering, which we address in more detail later. This part of EDA led naturally into missingness and imputation, where we replaced the “NA” values in many columns with 0 or None, depending on the column. We also replaced some of the 0’s with the mean or median, and we dropped several features: “Id”, “PoolQC”, “Utilities”, “MiscFeature”, “MiscVal”, and “Alley”.
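A minimal sketch of these imputation steps in pandas. The dropped columns are the ones listed above; the remaining columns (“GarageType”, “GarageArea”, “LotFrontage”), their toy values, and which fill rule each one gets are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "PoolQC": [np.nan, np.nan, "Ex", np.nan],
    "Utilities": ["AllPub"] * 4,
    "MiscFeature": [np.nan, "Shed", np.nan, np.nan],
    "MiscVal": [0, 400, 0, 0],
    "Alley": [np.nan, np.nan, "Grvl", np.nan],
    "GarageType": ["Attchd", np.nan, "Detchd", "Attchd"],
    "GarageArea": [548.0, np.nan, 608.0, 480.0],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0],
})

# Drop the features listed in the text.
dropped = ["Id", "PoolQC", "Utilities", "MiscFeature", "MiscVal", "Alley"]
df = df.drop(columns=dropped)

# Assumed rule: NA in a categorical column means the feature is absent.
df["GarageType"] = df["GarageType"].fillna("None")

# Assumed rule: NA in the matching numeric column means zero.
df["GarageArea"] = df["GarageArea"].fillna(0)

# Assumed rule: a genuinely unknown measurement gets the column median.
df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].median())

print(df)
```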
Beyond missingness and imputation, many of the non-numerical categories were dummified or given ordinal values. Dummification converts each value of a categorical feature into its own binary (0 or 1) dummy variable, which a model can use directly. After dealing with these issues, it was much easier to complete our feature engineering.
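The two encodings can be sketched as follows. The columns “ExterQual” and “MSZoning” and the numeric values in `quality_scale` are assumptions used for illustration, not necessarily the mapping we used.

```python
import pandas as pd

df = pd.DataFrame({
    "ExterQual": ["Gd", "TA", "Ex", "Fa"],   # ordered quality ratings
    "MSZoning": ["RL", "RM", "RL", "FV"],    # unordered categories
})

# Ordinal encoding: the categories have a natural order, so map them
# to integers that preserve it (assumed scale).
quality_scale = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
df["ExterQual"] = df["ExterQual"].map(quality_scale)

# Dummification: no natural order, so create one 0/1 column per category.
df = pd.get_dummies(df, columns=["MSZoning"], dtype=int)

print(df)
```

For linear models with an intercept, passing `drop_first=True` to `pd.get_dummies` is a common way to avoid perfectly collinear dummy columns.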
For our feature engineering, we created several new features: “Total Baths”, “Total SF”, median price by year, whether the house had been remodeled (1 or 0), and a “New House” flag for whether the house was new. Redundant columns were then dropped. We also found that “Neighborhood” was one of our most important and strongly correlated features. Because it had a 0.74 correlation with “SalePrice”, the third strongest of any feature, we made a box plot by “Neighborhood” and partitioned it into eighths.
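A sketch of how some of these features could be derived. The source column names and the exact definitions (half baths counted as 0.5, “new” meaning sold in the year it was built, “remodeled” meaning the remodel year differs from the build year) are assumptions, since the definitions are not spelled out above.

```python
import pandas as pd

df = pd.DataFrame({
    "FullBath": [2, 1, 2],
    "HalfBath": [1, 0, 0],
    "BsmtFullBath": [1, 0, 1],
    "BsmtHalfBath": [0, 1, 0],
    "TotalBsmtSF": [856, 1262, 920],
    "1stFlrSF": [856, 1262, 920],
    "2ndFlrSF": [854, 0, 866],
    "YearBuilt": [2003, 1976, 2008],
    "YearRemodAdd": [2003, 1990, 2008],
    "YrSold": [2008, 2007, 2008],
})

# Total baths: full baths count as 1, half baths as 0.5 (assumed weighting).
df["TotalBaths"] = (
    df["FullBath"] + df["BsmtFullBath"]
    + 0.5 * (df["HalfBath"] + df["BsmtHalfBath"])
)

# Total square footage: basement plus both floors (assumed definition).
df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]

# Remodeled flag: 1 if the remodel year differs from the build year.
df["Remodeled"] = (df["YearRemodAdd"] != df["YearBuilt"]).astype(int)

# New-house flag: 1 if the house was sold in the year it was built.
df["NewHouse"] = (df["YearBuilt"] == df["YrSold"]).astype(int)

print(df[["TotalBaths", "TotalSF", "Remodeled", "NewHouse"]])
```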
With feature engineering finished, we built our machine learning models. We used GridSearchCV for hyperparameter tuning and fit the following models: ridge, regular linear regression, and lasso. We also tried a few others, such as XGBoost, but decided to discard them. To score our models we followed the Kaggle competition guidelines and evaluated them on Root-Mean-Squared-Error (RMSE). Specifically, the score is the RMSE between the logarithm of the predicted value and the logarithm of the observed sale price, so it measures our prediction error on the log scale. The formula is shown below:
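Written out from the description above, with n the number of houses, \hat{y}_i the predicted price, and y_i the observed sale price of house i (the symbols are our own notation):

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \log \hat{y}_i - \log y_i \right)^2}
```

Because the error is taken on logarithms, mispredicting an expensive house and a cheap house by the same percentage affects the score equally.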
Ridge performed moderately well compared to our other models, with an RMSE of 0.1109. Lasso selected 54 variables and eliminated 131, but did not perform well in comparison, with an RMSE of 0.1083. Our best-performing model was also the simplest: regular linear regression had an RMSE score of 0.1308.
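The tuning and scoring steps can be sketched with scikit-learn as follows. The data is synthetic, the alpha grid and the 5-fold setting are assumptions, and the numbers this produces are unrelated to the scores reported above. Because the target is already a log price, the RMSE computed here corresponds to the log-scale metric described earlier.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem: 20 features, only the first 5 matter.
rng = np.random.default_rng(0)
n_samples, n_features = 200, 20
X = rng.normal(size=(n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:5] = [0.30, -0.20, 0.15, 0.10, -0.10]
y_log = 12.0 + X @ true_coef + rng.normal(scale=0.1, size=n_samples)

# Assumed grid of regularization strengths to search over.
alphas = {"alpha": [0.001, 0.01, 0.1, 1.0]}

results = {}
for name, model in [("ridge", Ridge()), ("lasso", Lasso(max_iter=10000))]:
    # scikit-learn maximizes scores, so RMSE is exposed as a negative value.
    search = GridSearchCV(
        model, alphas, scoring="neg_root_mean_squared_error", cv=5
    )
    search.fit(X, y_log)
    results[name] = {
        "alpha": search.best_params_["alpha"],
        "cv_rmse": -search.best_score_,
        "model": search.best_estimator_,
    }

# Lasso drives some coefficients exactly to zero, which acts as
# feature selection: count what it kept and what it eliminated.
lasso_coef = results["lasso"]["model"].coef_
n_selected = int(np.sum(lasso_coef != 0))
n_eliminated = n_features - n_selected

for name, res in results.items():
    print(name, res["alpha"], round(res["cv_rmse"], 4))
print("lasso kept", n_selected, "and eliminated", n_eliminated)
```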
For future work, given more time, we would have included outside datasets and sources. We wanted to introduce the Ames, Iowa housing price index. More domain knowledge, such as which types of roof are more valuable, would also have been very beneficial. We would also have investigated whether there was a seasonality effect from students leaving during school breaks, since they make up a large percentage of the population in Ames. Incorporating economic data would have been very interesting as well; the graph below, obtained from FRED, is an example of this:
In conclusion, our models seemed to perform fairly well and we were happy with our progress. We did not use any of the higher-end or more advanced models and still thought we had good results overall. We learned a lot throughout the process and would like to continue developing our machine learning skills on similar projects in the future.