Machine Learning Project III ~ Ames, Iowa Housing Prices
The skills I demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.
By Aaron Festinger, Xingwang Chen, Yan Mu & Alex Guyton
Buying or selling a home is something many of us will face at some point in our lives. While we may have an intuitive feel for how much a house is worth, based on past experience, the number of bedrooms and bathrooms, or the neighborhood it is in, it would be useful for buyers and sellers to have a systematic, mathematical approach to determining how much a house is worth. The Ames, Iowa Kaggle competition offers a widely used data set with a good sample size.
We are presented with two data sets: a training set and a testing set. For this project, we trained a number of machine learning models that use the features and attributes of a house to predict the target, the price of the house. We examined the features to determine which are important and which are not, developed multiple machine learning models, and combined the results to take advantage of the strongest points of the different models.
Exploratory Data Analysis
To use the data in any meaningful way, we must first understand it. The data set included a short description of each feature, but additional knowledge was needed to determine whether a feature would be important or necessary for training our model. The exploratory phase involves examining the raw data to understand the features and the target, and to gain insight into the relationship between the two.
The first objective is to understand the target variable, which is the sale price of the house. A histogram shows a right-skewed distribution (see fig below), with most of the houses sold in the $100,000 to $200,000 range and a long tail of more expensive homes. To bring the sale price closer to a normal distribution, the log of the sale price was used in its place (see fig below).
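The effect of the log transform can be sketched as follows. This is only an illustration on a synthetic, skewed stand-in for the sale price (the lognormal parameters are assumptions, not values from the Ames data), comparing sample skewness before and after the transform.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for SalePrice with a long tail of expensive houses.
# The lognormal parameters are illustrative assumptions, not fitted values.
rng = np.random.default_rng(0)
sale_price = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000), name="SalePrice")

# Taking the log compresses the long tail of high prices.
log_price = np.log(sale_price)

skew_before = sale_price.skew()
skew_after = log_price.skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")

# Predictions made on the log scale can be mapped back to dollars with exp.
recovered = np.exp(log_price)
```

A model trained on the log target predicts log prices, so its output needs the inverse transform before it is reported in dollars.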
The features can be separated into three categories: numerical, ordinal categorical, and nominal categorical. We plotted a heatmap of the correlations between the numerical features and the sale price to visualize both the strong and the weak correlators. Among the highest correlations were ‘OverallQual’, ‘GrLivArea’, and ‘GarageCars’. The correlations can also indicate features that might contribute multicollinearity, such as ‘GarageCars’ and ‘GarageArea’, as well as ‘YearBuilt’ and ‘YearRemodAdd’.
Multicollinear features can contribute to overfitting, depending on the model. Next we looked at a number of features that showed high correlation with the sale price, such as ‘GrLivArea’, ‘GarageCars’, and ‘YearRemodAdd’. All three showed a linear relationship with the sale price.
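As a rough sketch of the correlation analysis, the snippet below ranks features by their correlation with the target and flags feature pairs that are strongly correlated with each other. The data is synthetic, the column "LotNoise" is a made-up irrelevant feature, and the 0.8 cutoff is an arbitrary assumption rather than a threshold we used.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: GarageArea is built to track GarageCars closely.
rng = np.random.default_rng(1)
n = 500
garage_cars = rng.integers(0, 4, size=n)
df = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 400, size=n),
    "GarageCars": garage_cars,
    "GarageArea": garage_cars * 250 + rng.normal(0, 40, size=n),
    "LotNoise": rng.normal(0, 1, size=n),  # hypothetical irrelevant feature
})
df["SalePrice"] = (
    100 * df["GrLivArea"] + 15000 * df["GarageCars"] + rng.normal(0, 20000, size=n)
)

corr = df.corr()

# Features ranked by correlation with the target.
target_corr = corr["SalePrice"].drop("SalePrice").sort_values(ascending=False)

# Feature pairs whose mutual correlation exceeds the (assumed) 0.8 cutoff.
features = corr.drop(index="SalePrice", columns="SalePrice")
collinear_pairs = [
    (a, b, features.loc[a, b])
    for i, a in enumerate(features.columns)
    for b in features.columns[i + 1:]
    if abs(features.loc[a, b]) > 0.8
]

print(target_corr)
print(collinear_pairs)
```

The `corr` matrix is also what a heatmap would display; the list of pairs is just a compact way to read off the candidates for multicollinearity.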
Only a few columns had any missing values, but of those, some were missing 90% or more of their entries. Looking at the three columns with the most missing values, we can see that the high rates of missingness are due to the feature in question not being relevant or present. Most houses do not have a swimming pool, so swimming pool quality is expected to be blank. The same goes for fences and alleys.
These missing entries merely need a placeholder value to represent “none,” so we imputed zeroes. On further analysis, this prescription was found to be appropriate for all of the columns with missingness but for “LotFrontage,” which is common to all properties, but also missing nearly 1 time in 7. This required a careful choice of imputation technique so as to preserve the statistical integrity of the variable.
Because lot frontage was found to be characteristic of the neighborhoods, and many neighborhoods were found to have either a single very common value or a narrow distribution of common values, we resolved this by imputing the mean lot frontage value for the neighborhood of the property.
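The neighborhood-based imputation can be sketched with a pandas group-wise transform. The toy frame below is invented for illustration; only the column names "Neighborhood" and "LotFrontage" come from the data set.

```python
import numpy as np
import pandas as pd

# Toy data: two neighborhoods, each with one missing LotFrontage value.
df = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "OldTown", "OldTown", "OldTown"],
    "LotFrontage": [70.0, np.nan, 80.0, 50.0, 60.0, np.nan],
})

# Mean lot frontage per neighborhood, aligned row by row with the original frame.
neighborhood_mean = df.groupby("Neighborhood")["LotFrontage"].transform("mean")

# Fill only the missing entries; observed values are left untouched.
df["LotFrontage"] = df["LotFrontage"].fillna(neighborhood_mean)
print(df)
```

When a separate test set is involved, it is usually safer to compute the neighborhood means on the training data only and reuse them for the test rows, so that no information leaks from the test set.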
Processing the Data
Lasso, Ridge, ElasticNet, Random Forest, Gradient Boosting, and XGBoost models were trained, and the trained models were used to create a stacked model. The hyperparameters of each model were tuned using GridSearchCV and RandomizedSearchCV from the scikit-learn package in Python. The randomized search trains many models with cross-validation, using a limited number of random combinations drawn from selected ranges of hyperparameters.
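A minimal sketch of a randomized search with scikit-learn is shown below. The Ridge estimator, the synthetic regression data, the search range for alpha, and the number of iterations are all assumptions chosen for illustration, not the settings we used.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression problem standing in for the housing features.
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

search = RandomizedSearchCV(
    estimator=Ridge(),
    # Sample alpha on a log scale between 1e-3 and 1e2 (assumed range).
    param_distributions={"alpha": loguniform(1e-3, 1e2)},
    n_iter=20,                               # number of random combinations tried
    scoring="neg_root_mean_squared_error",   # scikit-learn maximizes, hence the sign
    cv=5,                                    # 5-fold cross-validation
    random_state=0,
)
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
best_rmse = -search.best_score_  # flip the sign back to a positive RMSE
print(best_alpha, best_rmse)
```

GridSearchCV follows the same pattern but takes a `param_grid` of explicit values and evaluates every combination, which becomes expensive as the number of hyperparameters grows.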
Tuning hyperparameters is a very important component of improving model performance. A poor choice of hyperparameters may cause the model to over-fit or under-fit the data. We can adopt three different methods in tuning hyperparameters.
Elastic-net
Elastic-net is a form of regularised linear regression that naturally selects for significant features. It relies on an α parameter, which directly penalises estimator coefficients, and an L1-ratio parameter, which determines the balance between the elimination of non-significant features and the mere reduction of coefficient size. Tuning elastic-net gave us an RMSE of 0.1124.
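To illustrate the role of the two parameters, the sketch below fits scikit-learn's ElasticNet twice on synthetic data in which only the first three of ten features carry signal. The data and the specific alpha and l1_ratio values are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first three features influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(scale=0.5, size=300)

# Penalised regression is sensitive to feature scale, so standardise first.
X_scaled = StandardScaler().fit_transform(X)

# Same overall penalty strength (alpha), different L1/L2 balance (l1_ratio).
lasso_like = ElasticNet(alpha=0.1, l1_ratio=0.9).fit(X_scaled, y)  # mostly L1
ridge_like = ElasticNet(alpha=0.1, l1_ratio=0.1).fit(X_scaled, y)  # mostly L2

# Count the coefficients each fit has set exactly to zero.
n_zero_lasso_like = int(np.sum(lasso_like.coef_ == 0))
n_zero_ridge_like = int(np.sum(ridge_like.coef_ == 0))
print(n_zero_lasso_like, n_zero_ridge_like)
```

A higher l1_ratio pushes the model toward dropping weak features entirely, while a lower one tends to keep them with shrunken coefficients.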
XGBoost
Taking our progress with decision trees a step further, we implemented XGBoost, which introduces a regularising γ parameter to control the complexity of tree partitioning. The higher the γ, the larger the minimum loss reduction required to split a leaf node. XGBoost gave an RMSE of 0.0628. This far superior result raised our suspicions and prompted us to apply the model to the test set, where it produced an RMSE of 0.1154. The large difference in error made a clear case of severe overfitting.
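The kind of check that exposes this gap can be sketched by comparing the error on the training data with the error on held-out data. To keep the example dependent on scikit-learn only, it uses GradientBoostingRegressor as a stand-in for XGBoost; the synthetic data and the deliberately aggressive tree settings are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic noisy regression problem (not the housing data).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 8))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400)

X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Deep trees and many boosting rounds make it easy to memorise the training set.
model = GradientBoostingRegressor(n_estimators=300, max_depth=5, random_state=0)
model.fit(X_train, y_train)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


train_rmse = rmse(y_train, model.predict(X_train))
holdout_rmse = rmse(y_hold, model.predict(X_hold))
print(f"train RMSE: {train_rmse:.4f}, held-out RMSE: {holdout_rmse:.4f}")
```

A training error far below the held-out error is the signature of overfitting; stronger regularisation, shallower trees, or fewer boosting rounds are the usual levers for narrowing the gap.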
The results showed that the linear models were more robust than the non-linear models, consistently producing better results.
In order to get the most out of each of our models, we need to combine them. The simplest way to do this would be to average the predictions, giving them each equal weight. This approach has the advantage of simplicity, but it fails to use the varying accuracy of the models to the greatest advantage. In order to get the best results from our combined models, we want to be able to tune the relative weights so as to get the greatest possible accuracy.
If we set up the problem as a linear combination with varying weights, it becomes a simple linear regression problem, with the features being the models' predictions and the dependent variable being the actual sale prices. This approach enables us to combine several models with given RMSEs and extract a prediction with a lower RMSE than any one of the input models could achieve. This works best when we have several models with different strengths and weaknesses, rather than having all of the models be of one type. In our case, we have both tree-based and linear models to work with.
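The weighted combination can be sketched as a linear regression fitted on the base models' predictions. In the example below the two sets of predictions are simulated by adding independent noise to a synthetic target; the names `pred_linear` and `pred_tree` and the noise levels are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
y_true = rng.normal(loc=12.0, scale=0.4, size=n)  # stand-in for log sale prices

# Simulated base-model predictions with independent errors (assumed noise levels).
pred_linear = y_true + rng.normal(scale=0.12, size=n)
pred_tree = y_true + rng.normal(scale=0.15, size=n)
base_preds = np.column_stack([pred_linear, pred_tree])


def rmse(y, p):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y - p) ** 2)))


# The meta-model learns one weight per base model, plus an intercept.
meta = LinearRegression().fit(base_preds, y_true)
stacked = meta.predict(base_preds)
weights = meta.coef_

rmse_linear = rmse(y_true, pred_linear)
rmse_tree = rmse(y_true, pred_tree)
rmse_average = rmse(y_true, base_preds.mean(axis=1))
rmse_stacked = rmse(y_true, stacked)
print(weights, rmse_linear, rmse_tree, rmse_average, rmse_stacked)
```

Here the weights are fitted and evaluated on the same rows, which flatters the stacked model. In practice the meta-model is usually fitted on out-of-fold predictions from cross-validation so that the weights do not simply reward the base model that overfits the most.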
We found that model selection was essential in finding the proper algorithm for exploiting our data set in this project. A close second was the importance of data cleaning and feature selection, where different methods strongly affected the accuracy of our machine learning predictions. With the data set given, we found that after data cleaning, ridge regression yielded the best results for our model.
- Data preprocessing (skew transformation, outlier removal, feature engineering, etc.) was key in training an accurate model.
- Linear models tended to work best for prediction purposes, but were easily influenced by outliers.
- Tree models were less sensitive to outliers, but had a tendency toward overfitting.
- Stacking reduced the aforementioned issues and yielded more accurate results in this case.
We exercised a variety of statistical learning techniques, examined their efficacies, and designed a stacked model to boost predictive performance. Still, there are a few areas we can think of for improvement.
- Spending more time on feature engineering is always welcome. Given more time, we’d enjoy creating more variables and data subsets to improve our models even further, including feature scaling.
- In addition, we would like to further explore the neighborhood effects on a couple of the important features we identified, using hierarchical linear regression. We had a theory that the clustering of the neighborhoods played a bigger role in the house price than we initially thought, and hierarchical linear regression would have helped prove our theory right or wrong.
- Another future improvement we had in mind was the inclusion of time series event data in our dataset.
- The selection and tuning of a meta-model and its constituents is more art than science. We could experiment with a greater breadth of algorithms to ensemble more diverse base models, each chosen to capture decorrelated characteristics of our data, continue to tune the models, and implement a wider palette of stacked models.