Predicting House Price in Ames, Iowa Using Machine Learning
Origin of dataset, categorical vs. numerical variables, objective
The Ames Housing Dataset was introduced by Professor Dean De Cock in 2011 as an alternative to the Boston Housing Dataset. It contains 2,919 observations of housing sales in Ames, Iowa between 2006 and 2010.
There are 20 continuous (numerical), 14 discrete (numerical), 23 nominal (categorical), and 23 ordinal (categorical) features describing each house’s size, quality, area, age, and other miscellaneous attributes.
For this project, our objective was to apply machine learning techniques to predict the sale price of houses based on their features. This is part of a data science competition on Kaggle, in which participants are challenged to minimize the prediction error on a test set whose target values are withheld from the publicly available data. The competition scores submissions by the root mean squared error (RMSE) between the logarithm of the predicted price and the logarithm of the observed sale price.
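To make the scoring concrete, here is a minimal sketch of the root of the mean squared log error. The function name `rmsle` and the toy prices are illustrative assumptions, not part of the competition code.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared error between log(actual) and log(predicted) prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# Toy prices, invented for illustration (not taken from the dataset).
actual = [200_000, 150_000, 320_000]
predicted = [210_000, 140_000, 300_000]
score = rmsle(actual, predicted)
```

Because the error is measured on a log scale, a 10% miss on a cheap house costs about as much as a 10% miss on an expensive one.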
II. Data Exploration
Missing values, skewness, multicollinearity, feature importance
We took six distinct steps covering data exploration, cleaning, transformation, and engineering to get the dataset ready for modeling.
Item 1: The sale price (our response variable) has outliers, and its distribution is highly skewed.
Solution: (1) We removed the outliers, and (2) we took the log transformation of the sale price and used log(sale price) as our new response variable.
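The two steps above could look roughly like the following sketch. The outlier rule used here (very large living area combined with a low price) and its thresholds are assumptions for illustration only, since the exact criterion is not stated. The toy data frame is invented as well.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the training set (values are invented).
train = pd.DataFrame({
    "GrLivArea": [1500, 1800, 2100, 5200, 1200],
    "SalePrice": [180_000, 220_000, 260_000, 160_000, 140_000],
})

# Assumed outlier rule (hypothetical thresholds): very large houses
# that sold for unusually little.
is_outlier = (train["GrLivArea"] > 4000) & (train["SalePrice"] < 300_000)
train = train.loc[~is_outlier].reset_index(drop=True)

# Use the logarithm of the price as the new response variable.
train["LogSalePrice"] = np.log(train["SalePrice"])

# Predictions made on the log scale are mapped back to dollars with np.exp.
recovered = np.exp(train["LogSalePrice"])
```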
Item 2: Understand correlation and multicollinearity among predictive variables.
Solution: (1) First, we calculated the correlation between sale price and each variable and listed the results in descending order:
(2) Next, we created a correlation heatmap to understand the correlations among all variables (only selected variables are shown here due to file size):
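As a hedged illustration of both correlation steps (not the code behind the original figures), the sketch below ranks predictors by their correlation with the response and then flags highly correlated predictor pairs. The synthetic data and the 0.8 cutoff are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic data, invented for illustration.
rng = np.random.default_rng(0)
n = 200
area = rng.normal(1500, 400, n)
garage = 0.3 * area + rng.normal(0, 50, n)   # deliberately tied to area
year = rng.integers(1950, 2010, n)
price = 100 * area + 500 * (year - 1950) + rng.normal(0, 20_000, n)
df = pd.DataFrame({"GrLivArea": area, "GarageArea": garage,
                   "YearBuilt": year, "SalePrice": price})

corr = df.corr()

# (1) Correlation of each predictor with the response, in descending order.
target_corr = corr["SalePrice"].drop("SalePrice").sort_values(ascending=False)

# (2) Predictor pairs with high mutual correlation (multicollinearity).
predictors = corr.drop(index="SalePrice", columns="SalePrice")
mask = np.triu(np.ones(predictors.shape, dtype=bool), k=1)  # upper triangle only
pairs = predictors.where(mask).stack()
high_pairs = pairs[pairs.abs() > 0.8]
```

The same `corr` matrix is what a heatmap would visualize, for example with `seaborn.heatmap(corr)`.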
Item 3: Dealing with (lots of) missing values. Large quantities of missing data are accounted for in the documentation associated with the data set.
While a standard missing value indicates that the feature is not present in the house, there could be other reasons why certain values are missing. It is important to identify the reason for missingness and fill in missing values to the best of our judgment.
Solution: We grouped our handling of missing values into the following buckets:
(a) NaN = None (categorical variables)
Based on the data description, we found that NaN in these variables means the house does not have the feature. We therefore filled in the NaNs with None.
'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', etc.
(b) NaN = 0 (numerical variables)
Based on the data description, NaN in these numerical variables likewise means the feature is absent, so we filled in the NaNs with 0.
(c) NaN = Assume it is typical
Based on the data description, NaN means the house belongs to the category “typical”.
(d) Missingness does not imply none/0, and the data description does not help to infer its exact meaning, but we can impute from existing data and information.
'LotFrontage' depends on the zoning district in which a property is located (i.e. 'Neighborhood'). Therefore, we grouped by Neighborhood & LotArea and filled in missing values with the median LotFrontage of each neighborhood.
(e) Value still missing:
Next, after filling in and/or imputing missing values, we checked how many NaNs still existed. Luckily, each variable that still had missing values was missing very few (<5%). Given the small percentage, we filled in these missing values with the most common value of each variable.
No missing values now! Here is a recap of our solutions to missingness:
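Buckets (a) through (e) can be sketched with pandas as follows. Only 'GarageType', 'LotFrontage' and 'Neighborhood' are named in the text. The assignment of 'GarageArea', 'Functional' and 'Electrical' to buckets (b), (c) and (e) is an assumption made for illustration, and the toy rows are invented.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "OldTown", "OldTown"],
    "LotFrontage": [70.0, np.nan, 80.0, 50.0, np.nan],
    "GarageType": ["Attchd", np.nan, "Detchd", np.nan, "Attchd"],
    "GarageArea": [480.0, np.nan, 400.0, np.nan, 300.0],
    "Functional": ["Typ", "Typ", np.nan, "Min1", "Typ"],
    "Electrical": ["SBrkr", "SBrkr", np.nan, "FuseA", "SBrkr"],
})

# (a) Categorical: NaN means the feature is absent.
df["GarageType"] = df["GarageType"].fillna("None")

# (b) Numerical: NaN means the feature is absent, so the quantity is 0.
df["GarageArea"] = df["GarageArea"].fillna(0)

# (c) NaN is assumed to mean "typical" (coded "Typ" in this toy column).
df["Functional"] = df["Functional"].fillna("Typ")

# (d) Impute from related data: median LotFrontage within each neighborhood.
df["LotFrontage"] = (
    df.groupby("Neighborhood")["LotFrontage"]
      .transform(lambda s: s.fillna(s.median()))
)

# (e) Few remaining gaps: fill with the most common value of the column.
df["Electrical"] = df["Electrical"].fillna(df["Electrical"].mode()[0])

remaining = int(df.isna().sum().sum())
```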
Item 4: Some numerical variables are highly skewed. We checked for numerical features where abs(skewness) > 0.5 and found 25 moderately to highly skewed numerical variables:
Solution: By convention, abs(skewness) > 1 implies highly skewed, and 0.5 < abs(skewness) < 1 implies moderately skewed. We took a conservative approach and applied a Box-Cox transformation to the numerical variables considered moderately to highly skewed. This reduced the number of skewed numerical variables from 25 to 16:
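A rough sketch of the skewness check and Box-Cox step is shown below. The helper `boxcox1p` is written out in NumPy here (SciPy offers `scipy.special.boxcox1p` for the same transform). The value of `LAMBDA` is an assumption, since the parameter actually used is not stated, and the data are synthetic.

```python
import numpy as np
import pandas as pd

def boxcox1p(x, lam):
    """Box-Cox transform of 1 + x; lam = 0 reduces to log(1 + x)."""
    x = np.asarray(x, dtype=float)
    return np.log1p(x) if lam == 0 else ((1.0 + x) ** lam - 1.0) / lam

# Synthetic columns, invented for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "LotArea": rng.lognormal(mean=9.0, sigma=0.6, size=500),  # right-skewed
    "YearBuilt": rng.uniform(1950, 2010, size=500),           # roughly symmetric
})

# Flag features that are at least moderately skewed.
skewness = df.skew()
skewed_cols = skewness[skewness.abs() > 0.5].index

LAMBDA = 0.15  # assumed value, not taken from the write-up
transformed = df.copy()
transformed[skewed_cols] = boxcox1p(df[skewed_cols], LAMBDA)

new_skewness = transformed.skew()
```

Comparing `skewness` with `new_skewness` shows which features fall back under the 0.5 threshold. Features that remain above it would account for the variables still counted as skewed.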
Item 5: Understand feature importance.
Solution: Different models rank feature importance differently, depending on how each model captures the relationship between the predictors and the response variable. Below is the feature importance computed from a random forest model on the engineered data set:
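As a hedged illustration of how such a ranking can be produced (not the authors' code), the sketch below fits scikit-learn's `RandomForestRegressor` on synthetic data and sorts its impurity-based importances. The feature names, coefficients and hyperparameters are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: two informative features and one pure-noise feature.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "OverallQual": rng.integers(1, 11, n),
    "GrLivArea": rng.normal(1500, 400, n),
    "Noise": rng.normal(0, 1, n),
})
# Response on a log-price-like scale, built so OverallQual matters most.
y = 0.12 * X["OverallQual"] + 0.0003 * X["GrLivArea"] + rng.normal(0, 0.05, n)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Importances sum to 1; a larger value means the feature reduced more variance.
importance = (
    pd.Series(model.feature_importances_, index=X.columns)
      .sort_values(ascending=False)
)
```

Impurity-based importances can favor high-cardinality features, so a ranking like this is best read as a guide rather than a definitive ordering.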
Item 6: Data Engineering.
Solution: We observed high correlations between variables related to area/SF (square footage) and decided to create 3 new area/SF-related variables:
We ended up dropping these 3 newly engineered variables from our final submitted model, as on their own they did not improve our score or reduce our RMSE.
This concludes our data pre-processing part and now our data is ready for modeling.
III. Predictive Models
Ridge, Lasso, ElasticNet, Random Forest, Gradient Boosting, LightGBM, XGBoost
(1) Machine learning algorithms candidates:
We selected both linear and non-linear based machine learning algorithms as our model candidates:
Each model comes with its inherent advantages and disadvantages:
(2) Individual results & parameter tuning (auto vs. fine tuned parameters)
Here, we compared each model's performance (using mean cross-validation score, model score, and RMSE as criteria) under different tuning methods. We found that, on average, for the same model, manually fine-tuned parameters performed better than auto parameters. Notably, manual fine-tuning takes significantly more computing power and resources.
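The comparison can be sketched as follows, under the assumption that "auto" means the library's default hyperparameters and that fine-tuning means a cross-validated grid search. Both readings, the alpha grid, and the synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic regression problem, invented for illustration.
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
SCORING = "neg_root_mean_squared_error"

def cv_rmse(model):
    """Mean cross-validated RMSE (lower is better)."""
    return float(-cross_val_score(model, X, y, scoring=SCORING, cv=cv).mean())

# "Auto": default hyperparameters (alpha=1.0 for Lasso).
default_rmse = cv_rmse(Lasso(max_iter=10_000))

# "Fine-tuned": search a hand-picked alpha grid with the same folds.
search = GridSearchCV(Lasso(max_iter=10_000),
                      {"alpha": [0.01, 0.1, 1.0, 10.0]},
                      scoring=SCORING, cv=cv)
search.fit(X, y)
tuned_rmse = float(-search.best_score_)
```

The grid search fits one model per fold for every candidate value, which is where the extra computing cost of tuning comes from.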
An additional observation is that Lasso alone performed extremely close to ElasticNet. This made us wonder whether it makes sense to remove ElasticNet, as it seemed to pick Lasso alone over Ridge instead of striking a balance between the two.
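One way to check this observation is to look at the L1/L2 mixing parameter that ElasticNet ends up with. In scikit-learn, `ElasticNet` with `l1_ratio=1.0` is the same penalty as `Lasso`, so a tuned `l1_ratio` near 1 means the model has effectively become Lasso. The sketch below uses synthetic data and an assumed grid of ratios.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, ElasticNetCV, Lasso

# Synthetic regression problem, invented for illustration.
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)

# With l1_ratio=1.0 the ElasticNet penalty is pure L1, i.e. Lasso.
alpha = 0.5
lasso = Lasso(alpha=alpha).fit(X, y)
enet_as_lasso = ElasticNet(alpha=alpha, l1_ratio=1.0).fit(X, y)
same_coefs = np.allclose(lasso.coef_, enet_as_lasso.coef_)

# Let cross-validation choose the mix; values near 1.0 favor Lasso,
# values near 0 favor Ridge-like shrinkage.
l1_grid = [0.1, 0.5, 0.9, 1.0]
enet_cv = ElasticNetCV(l1_ratio=l1_grid, cv=5).fit(X, y)
chosen_l1_ratio = enet_cv.l1_ratio_
```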
(3) Model stacking (different stacking methods)
There are different methods to stack models. We experimented with the following 3 methods:
(i) Stacking using StackingCVRegressor (an ensemble-learning meta-regressor for stacking regression);
(ii) Simply averaging across all models; and
(iii) Weighted average: assigning more weight to those models with lower RMSE
We applied each of the 3 stacking methods above with and without the ElasticNet model, and we measured stacking performance by RMSE on both the training data and the test data.
Using StackingCVRegressor without the ElasticNet model gave us the best test score and became our final stacking method.
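The three combination methods can be compared with a sketch like the one below. It swaps in scikit-learn's `StackingRegressor` for the stacking class named in method (i). It uses only three base models on synthetic data, and it weights models by the inverse of their cross-validated RMSE. All of these choices are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = make_regression(n_samples=300, n_features=20, n_informative=8,
                       noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

def rmse(y_true, y_pred):
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

base = {
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
    "rf": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Cross-validated RMSE on the training data and test-set predictions per model.
cv_rmse, test_preds = {}, {}
for name, model in base.items():
    cv_rmse[name] = rmse(y_tr, cross_val_predict(model, X_tr, y_tr, cv=5))
    test_preds[name] = model.fit(X_tr, y_tr).predict(X_te)
P = np.column_stack(list(test_preds.values()))

# (ii) Simple average across all models.
simple_avg = P.mean(axis=1)

# (iii) Weighted average: lower RMSE gets a larger weight.
weights = 1.0 / np.array(list(cv_rmse.values()))
weights /= weights.sum()
weighted_avg = P @ weights

# (i) Stacking: a meta-regressor learns from out-of-fold base predictions.
stack = StackingRegressor(estimators=list(base.items()),
                          final_estimator=Ridge(), cv=5)
stacked = stack.fit(X_tr, y_tr).predict(X_te)

results = {
    "stacking": rmse(y_te, stacked),
    "simple_average": rmse(y_te, simple_avg),
    "weighted_average": rmse(y_te, weighted_avg),
}
```

Dropping a model such as ElasticNet from the comparison amounts to removing its entry from `base` and rerunning the same code.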
IV. Results and Conclusion
Stacking models, Kaggle results, Thoughts
Our final submission achieved an RMSE of 0.11582 on Kaggle's test dataset, which placed us in the top 15% of all participants as of July 30, 2019.