House Prices: Advanced Regression Techniques
The skills I demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Intro
In this post, I walk through my approach to the Kaggle competition House Prices: Advanced Regression Techniques. The goal of the competition is to predict the sale price of houses in Ames, Iowa, given 79 explanatory variables, which are described here.
The full code and data for this article are available on my GitHub.
Read Data
Concatenate training & test features
I concatenated the training and test features so that I would only have to impute missing values, transform features, and so on once, rather than separately for each set. I also removed houses with a ground living area greater than 4,500 square feet from the training set.
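A minimal sketch of this step, assuming two tiny stand-in DataFrames in place of the Kaggle files. The column names GrLivArea and SalePrice come from the dataset; the variable names and values are illustrative only.

```python
import pandas as pd

# Toy stand-ins for the Kaggle train/test files (values are made up).
train = pd.DataFrame({
    "Id": [1, 2, 3],
    "GrLivArea": [1500, 5000, 2000],
    "SalePrice": [200000, 160000, 250000],
})
test = pd.DataFrame({"Id": [4, 5], "GrLivArea": [1200, 1800]})

# Drop training houses with more than 4,500 sq ft of ground living area.
train = train[train["GrLivArea"] <= 4500].reset_index(drop=True)

# Separate the target, then stack the train and test features.
y_train = train.pop("SalePrice")
n_train = len(train)
features = pd.concat([train, test], ignore_index=True)

# ... impute and transform `features` once here ...

# Split back into train and test rows by position.
X_train = features.iloc[:n_train]
X_test = features.iloc[n_train:]
```

Remembering the number of training rows before concatenating makes it easy to split the combined frame again after preprocessing.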
SalePrice Distribution
Impute missing values
The plot below shows the number of missing values in columns with at least one missing value.
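The counts behind such a plot can be computed along these lines. This is a sketch on a toy frame; the column names are real dataset columns, but the values are invented.

```python
import numpy as np
import pandas as pd

# Toy feature frame with a few gaps (values are made up).
features = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan],
    "GarageType": ["Attchd", None, "Detchd", "Attchd"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Count missing values per column and keep only columns that have any.
missing = features.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
```

With matplotlib installed, calling missing.plot.bar() would draw the counts as a bar chart.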
Engineer features
I created new features for the dataset. The features were TotalSF, TotalPorchSF, and TotalBath. The code is shown below:
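As a rough sketch of how features like these can be built: the exact formulas below are assumptions (sums of the related area columns and half-weighted half baths), not necessarily the ones used in the original notebook. The input column names are from the Ames dataset; the values are invented.

```python
import pandas as pd

# Two toy houses (values are made up).
features = pd.DataFrame({
    "TotalBsmtSF": [856, 1262],
    "1stFlrSF": [856, 1262],
    "2ndFlrSF": [854, 0],
    "OpenPorchSF": [61, 0],
    "EnclosedPorch": [0, 0],
    "3SsnPorch": [0, 0],
    "ScreenPorch": [0, 0],
    "WoodDeckSF": [0, 298],
    "FullBath": [2, 2],
    "HalfBath": [1, 0],
    "BsmtFullBath": [1, 0],
    "BsmtHalfBath": [0, 1],
})

# Assumed definition: basement plus both floors.
features["TotalSF"] = (
    features["TotalBsmtSF"] + features["1stFlrSF"] + features["2ndFlrSF"]
)

# Assumed definition: all porch-like areas, including the wood deck.
porch_cols = ["OpenPorchSF", "EnclosedPorch", "3SsnPorch", "ScreenPorch", "WoodDeckSF"]
features["TotalPorchSF"] = features[porch_cols].sum(axis=1)

# Assumed definition: half baths count as 0.5 of a full bath.
features["TotalBath"] = (
    features["FullBath"]
    + 0.5 * features["HalfBath"]
    + features["BsmtFullBath"]
    + 0.5 * features["BsmtHalfBath"]
)
```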
Categorize MSSubClass and YrSold
Looking at the description below, the levels of MSSubClass don't seem to have a natural order, so I decided to represent MSSubClass as a categorical feature rather than a numerical one. I also decided to represent YrSold as a categorical feature because that allows for a more flexible relationship with SalePrice.
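One simple way to do this in pandas is to cast the columns to strings so that later encoding treats each level as its own category. This is a sketch on toy values, not necessarily the exact call used in the original code.

```python
import pandas as pd

# Toy values; both columns start out as integers.
features = pd.DataFrame({
    "MSSubClass": [20, 60, 120],
    "YrSold": [2008, 2007, 2010],
})

# Cast to strings so the codes are treated as labels, not magnitudes.
for col in ["MSSubClass", "YrSold"]:
    features[col] = features[col].astype(str)
```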
Transform features
To better highlight the recurring patterns in SalePrice, I transformed the MoSold feature using the code below:
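A possible version of such a transform is sketched here, assuming a cosine with a 12-month period so that the value rises toward mid-year and falls back toward December. The specific formula is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Toy months of sale.
features = pd.DataFrame({"MoSold": [3, 6, 12]})

# Assumed transform: map the month onto one full cosine cycle per year.
# June (6) maps to +1 and December (12) maps to -1.
features["MoSold"] = -np.cos(2 * np.pi * features["MoSold"] / 12)
```

A periodic encoding like this places December next to January on the scale, which a raw 1 to 12 encoding does not.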
I also transformed the highly skewed features using the code below:
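A sketch of a skew correction, under two assumptions that may differ from the original code: a feature counts as highly skewed when its absolute skewness exceeds 0.5, and the correction is log1p. A Box-Cox transform would be a common alternative.

```python
import numpy as np
import pandas as pd

# Toy data: LotArea has one extreme value, OverallQual is symmetric.
features = pd.DataFrame({
    "LotArea": [1000, 1200, 1100, 1300, 50000],
    "OverallQual": [4, 5, 6, 7, 8],
})

SKEW_THRESHOLD = 0.5  # assumed cutoff for "highly skewed"

# Measure the skewness of every numeric column.
numeric_cols = features.select_dtypes(include="number").columns
skewness = features[numeric_cols].skew()
skewed_cols = skewness[skewness.abs() > SKEW_THRESHOLD].index

# Apply log(1 + x) to the skewed columns only.
for col in skewed_cols:
    features[col] = np.log1p(features[col])
```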
Finally, I used pd.get_dummies to convert all categorical features into dummy variables.
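For illustration, here is pd.get_dummies applied to a toy frame with one categorical and one numeric column (MSZoning and LotArea are dataset columns; the values are made up).

```python
import pandas as pd

features = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL"],
    "LotArea": [8450, 9600, 11250],
})

# Numeric columns pass through; each category level becomes its own column.
encoded = pd.get_dummies(features)
```

The result keeps LotArea as is and replaces MSZoning with the indicator columns MSZoning_RL and MSZoning_RM.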
Removing outliers from training data
I fitted a linear model to the training data and removed examples with a studentized residual greater than 3.
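The idea can be sketched with plain NumPy on synthetic data. The assumptions here are internally studentized residuals and a cutoff applied to the absolute value; the original code may have used a library routine or a different variant.

```python
import numpy as np

# Synthetic linear data with one planted outlier.
rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, size=n)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=n)
y[10] += 15.0  # planted outlier

# Ordinary least squares fit with an intercept column.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# Leverage of each point: diagonal of the hat matrix X (X^T X)^-1 X^T.
hat_diag = np.einsum("ij,ji->i", X, np.linalg.pinv(X))

# Internally studentized residuals: e_i / (s * sqrt(1 - h_ii)).
p = X.shape[1]
sigma = np.sqrt(residuals @ residuals / (n - p))
studentized = residuals / (sigma * np.sqrt(1.0 - hat_diag))

# Keep only the rows within the cutoff.
keep = np.abs(studentized) <= 3
X_clean, y_clean = X[keep], y[keep]
```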
Define random search
The code below defines random search as a function, which I used to optimize the hyperparameters of each of my models. I scored each iteration with 5-fold cross-validation.
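A minimal sketch of such a function, built on scikit-learn's RandomizedSearchCV with 5-fold cross-validation. The function name, the RMSE scoring choice, the Ridge example, and the synthetic data are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV


def random_search(model, param_distributions, X, y, n_iter=20, seed=0):
    """Run a random hyperparameter search scored by 5-fold CV RMSE."""
    search = RandomizedSearchCV(
        estimator=model,
        param_distributions=param_distributions,
        n_iter=n_iter,
        cv=5,
        scoring="neg_root_mean_squared_error",
        random_state=seed,
    )
    search.fit(X, y)
    # scikit-learn maximizes scores, so the RMSE is stored negated.
    return search.best_estimator_, -search.best_score_


# Synthetic regression problem with low noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, size=100)

# Sample 10 candidate alphas from a log-spaced grid.
param_distributions = {"alpha": np.logspace(-3, 2, 50).tolist()}
best_model, best_rmse = random_search(Ridge(), param_distributions, X, y, n_iter=10)
```

Wrapping the search in a function keeps the cross-validation settings identical across models, so their scores stay comparable.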
Model Scores
Overall, the models did well, with Gradient Boosting performing best. The score for each model (lower is better) is listed below:
- Ridge: 0.0778
- Lasso: 0.0796
- SVR: 0.0712
- LGBM: 0.0640
- GBM: 0.0436
Creating Predictions and RMSE
I stored the predictions of the base learners and the stacked ensemble in a list. Then I took a weighted average of the predictions, giving a weight of 0.13 to each base learner and 0.35 to the stacked ensemble (with five base learners, the weights sum to 1). My RMSE score was 0.3848232.
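The blending step can be sketched as follows, assuming five base learners so that 5 × 0.13 + 0.35 = 1. The prediction arrays, the true values, and the rmse helper are toy placeholders, not the competition outputs.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))


# Toy predictions: five base learners and one stacked ensemble.
base_preds = [np.full(3, 12.0 + 0.1 * i) for i in range(5)]
stack_pred = np.full(3, 12.2)

# One weight per prediction vector, in the same order as the list.
all_preds = base_preds + [stack_pred]
weights = [0.13] * 5 + [0.35]

# Weighted average across models for each house.
blended = np.average(np.vstack(all_preds), axis=0, weights=weights)

# Score against toy ground truth.
y_true = np.array([12.0, 12.2, 12.4])
score = rmse(y_true, blended)
```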
Conclusion
Overall, the models seemed to perform well, but the RMSE seemed a little high, which might be due to an error in the code. In the future, I will look for ways to improve the RMSE.