Kaggle Competition: House Price Prediction 2017
Introduction
Exploratory Data Analysis
Let's first do some EDA to gain insights from our data, starting with the distribution of the sale price (our target). The figure shows that only a few houses are worth more than $500,000.
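As a minimal sketch of this first plot (assuming the training data is loaded from a file named "train.csv" and that the target column is "SalePrice", as in the Kaggle data dictionary):

    import pandas as pd
    import matplotlib.pyplot as plt

    # Load the training data (file name assumed).
    train = pd.read_csv("train.csv")

    # Histogram of the target; only a few houses sell for more than $500,000.
    train["SalePrice"].hist(bins=50)
    plt.xlabel("Sale price ($)")
    plt.ylabel("Number of houses")
    plt.title("Distribution of sale price")
    plt.show()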
This covers the basic EDA for our house price data set. In the next section, we perform feature engineering to prepare our training and test sets for machine learning.
Feature Engineering
We consider numerical and categorical features separately. The numerical features of our data set do not directly lend themselves to a linear model: they violate some of the assumptions required for regression, such as linearity, constant variance, and normality. Therefore, we apply a log(x + 1) transformation to each numerical feature x to make its distribution closer to normal.
Also, it is a good idea to scale our numerical features.
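A minimal sketch of both steps is shown below; the selection of numerical columns is illustrative, and in practice some numeric-looking columns (IDs, ordinal codes) would be excluded.

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # Illustrative selection of numerical columns (the target is excluded).
    numerical_features = train.select_dtypes(include=[np.number]).columns.drop("SalePrice")

    # log(x + 1) transform to reduce skewness.
    train[numerical_features] = np.log1p(train[numerical_features])
    test[numerical_features] = np.log1p(test[numerical_features])

    # Scale to zero mean and unit variance; fit on the training set only to avoid leakage.
    scaler = StandardScaler()
    train[numerical_features] = scaler.fit_transform(train[numerical_features])
    test[numerical_features] = scaler.transform(test[numerical_features])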
For categorical features, we perform several transformations as summarized below (a short code sketch of these steps follows the list).
-- Fill NA using zero.
https://gist.github.com/Wann-Jiun/b9d371f150fbd591356cba7d08da652d
-- For the linear feet of street connected to the property (lot frontage), group by neighborhood and fill NAs with the median of each group (neighborhood).
https://gist.github.com/Wann-Jiun/dd5f183af39023c43f760f24befb4e12
-- Transform "Yes"/"No" features, such as whether the property has central air conditioning, to one and zero, respectively.
https://gist.github.com/Wann-Jiun/1e965ec9e6501495758355fe20edc2cd
-- Use "map" to transform quality measurements into ordinal numerical features.
https://gist.github.com/Wann-Jiun/ed15a3f4e9a3af50a0c967480f25c9ab
-- Perform one-hot encoding on nominal features.
https://gist.github.com/Wann-Jiun/040d5d0762202e995e32b5b97e658760
-- Sharan's three strategies. [Add the strategies here]
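A short sketch of these categorical transformations is given below. The column names (MasVnrArea, LotFrontage, Neighborhood, CentralAir, ExterQual, MSZoning, SaleCondition) follow the Kaggle data dictionary; the exact columns handled in the gists above may differ.

    import pandas as pd

    # Fill NA with zero (illustrative column).
    train["MasVnrArea"] = train["MasVnrArea"].fillna(0)

    # Fill lot frontage NAs with the median of each neighborhood group.
    train["LotFrontage"] = train.groupby("Neighborhood")["LotFrontage"] \
                                .transform(lambda s: s.fillna(s.median()))

    # Map "Yes"/"No" style features to 1/0.
    train["CentralAir"] = train["CentralAir"].map({"Y": 1, "N": 0})

    # Map quality ratings to ordinal numbers.
    quality_map = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
    train["ExterQual"] = train["ExterQual"].map(quality_map)

    # One-hot encode the remaining nominal features.
    train = pd.get_dummies(train, columns=["MSZoning", "SaleCondition"])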
We also generate several new features, summarized below (a short sketch follows the list).
-- Generate several "Is..." or "Has..." features based on whether a property "is..." or "has...". For example, since most properties have standard circuit breakers, we create a column "Is_SBrkr" to characterize those properties having standard circuit breakers.
https://gist.github.com/Wann-Jiun/a538c1f9df20cf7cd0a50d2c7862d4bb
-- Generate some aggregated quality measures to simplify the existing quality features. We aggregate those features into three broad classes, bad/average/good, and encode them to values 1/2/3, respectively.
https://gist.github.com/Wann-Jiun/fc9f33c64f124e41e6b6505dccbd2c9f
-- Generate features related to time. For example, we generate a "New_House" column by considering if the house was built and sold in the same year.
https://gist.github.com/Wann-Jiun/707c2d215928c890c1ee5b28aea1aae4
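A sketch of these generated features is shown below; the column names and thresholds are illustrative, not necessarily the ones used in the gists above.

    import pandas as pd

    # Indicator feature: property has a standard circuit breaker.
    train["Is_SBrkr"] = (train["Electrical"] == "SBrkr").astype(int)

    # Aggregate a 1-10 quality score into bad/average/good encoded as 1/2/3.
    train["Overall_Qual_Simple"] = pd.cut(train["OverallQual"],
                                          bins=[0, 3, 6, 10],
                                          labels=[1, 2, 3]).astype(int)

    # Time-related feature: house built and sold in the same year.
    train["New_House"] = (train["YearBuilt"] == train["YrSold"]).astype(int)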
To deal with outliers, we filter out properties with more than 4,000 square feet of above-grade (ground) living area.
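This filter amounts to a single line, assuming the above-grade living area is stored in the GrLivArea column as in the Kaggle data:

    # Drop the few extreme properties with more than 4,000 sq ft above grade.
    train = train[train["GrLivArea"] <= 4000]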
A few other minor features are also considered. After these steps, the data set has 389 features, with 1,456 training samples and 1,459 test samples. Now, let's do machine learning.
Ensemble Methods
We consider six machine learning models: XGBoost, Lasso, Ridge, Extra Trees, Random Forest, and GBM.
For each model, we perform a grid search with cross-validation to find the best parameters. For example, for Kernel Ridge:
https://gist.github.com/Wann-Jiun/77498ab48990883a22812cd0c1ab7036
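A sketch of such a grid search is shown below; the parameter grid is illustrative, and X_train / y_train stand for the engineered training features and the (log-transformed) sale price.

    from sklearn.kernel_ridge import KernelRidge
    from sklearn.model_selection import GridSearchCV

    param_grid = {
        "alpha": [0.05, 0.1, 0.5, 1.0],
        "kernel": ["polynomial"],
        "degree": [2, 3],
        "coef0": [0.5, 1.0, 2.5],
    }

    # 5-fold cross-validated grid search over the parameter grid.
    grid = GridSearchCV(KernelRidge(), param_grid,
                        scoring="neg_mean_squared_error", cv=5)
    grid.fit(X_train, y_train)
    print(grid.best_params_, grid.best_score_)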
We found that Random Forest, GBM, and Extra Trees suffer from serious overfitting.
Finally, we use an ensemble of Lasso, Ridge, and XGBoost with equal weights as our final model.
https://gist.github.com/Wann-Jiun/67b39c77e8bbb6f37926877f51b56328
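A minimal version of the equal-weight average, assuming lasso, ridge, and xgb_model are the three fitted models and the target was trained on a log(x + 1) scale:

    import numpy as np

    # Average the three predictions with equal weights.
    preds = (lasso.predict(X_test) +
             ridge.predict(X_test) +
             xgb_model.predict(X_test)) / 3.0

    # Undo the log(x + 1) transform of the target before submitting.
    final_prediction = np.expm1(preds)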
Stacking
We consider out-of-fold stacking. At the first level, we use XGBoost, Random Forest, Lasso, and GBM. At the second level, we use the out-of-fold predictions of the first-level models as new features and train XGBoost as the combiner. We perform cross-validation for each model to find the best set of parameters.
https://gist.github.com/Wann-Jiun/d5e6f55682eb5ef21f8cd2e46c0b6cc1
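A condensed sketch of the out-of-fold scheme is shown below. Here base_models is a list of first-level regressors, X_train / X_test are NumPy arrays, and the hyperparameters are illustrative rather than the ones found by cross-validation.

    import numpy as np
    import xgboost as xgb
    from sklearn.model_selection import KFold

    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    oof_train = np.zeros((X_train.shape[0], len(base_models)))
    oof_test = np.zeros((X_test.shape[0], len(base_models)))

    for i, model in enumerate(base_models):
        test_preds = np.zeros((X_test.shape[0], kf.get_n_splits()))
        for j, (tr_idx, val_idx) in enumerate(kf.split(X_train)):
            model.fit(X_train[tr_idx], y_train[tr_idx])
            # Out-of-fold predictions become the new second-level features.
            oof_train[val_idx, i] = model.predict(X_train[val_idx])
            test_preds[:, j] = model.predict(X_test)
        oof_test[:, i] = test_preds.mean(axis=1)

    # Second level: XGBoost trained on the out-of-fold predictions.
    combiner = xgb.XGBRegressor(n_estimators=500, learning_rate=0.05)
    combiner.fit(oof_train, y_train)
    stacked_prediction = combiner.predict(oof_test)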
Feature Selection
We use the feature importances provided by XGBoost to select the most important features.
https://gist.github.com/Wann-Jiun/8daf47909b620a48a5d3cda17462f610
https://gist.github.com/Wann-Jiun/b1121ab43b29235cb795099ec79a18cc
We use a loop to see how the score varies with the number of features included in the training set, and set a threshold to determine which features to drop from the data set.
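A sketch of this loop is shown below, using XGBoost's feature_importances_ attribute; the step size and scoring are illustrative, and X_train is assumed to be a NumPy array.

    import numpy as np
    import xgboost as xgb
    from sklearn.model_selection import cross_val_score

    model = xgb.XGBRegressor(n_estimators=300, learning_rate=0.05)
    model.fit(X_train, y_train)

    # Rank features from most to least important.
    order = np.argsort(model.feature_importances_)[::-1]

    # Cross-validated score while keeping only the top-k features.
    for k in range(50, X_train.shape[1] + 1, 50):
        cols = order[:k]
        score = cross_val_score(model, X_train[:, cols], y_train,
                                scoring="neg_mean_squared_error", cv=5).mean()
        print(k, score)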
Conclusions
In two weeks (two people, part-time), we performed EDA, feature engineering, ensembling, stacking, and feature selection. We observed the largest score jump after adding feature engineering, and a second jump after adding ensembling. Out-of-fold stacking did not improve the score much, possibly because the models are already statistically equivalent. Since the data set is very small, to further improve the prediction score we could consider additional feature engineering, such as using different distributions to create new features, or using feature interactions to generate new features automatically.