Studying Data to Predict Housing Prices
The skills the author demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.
The Ames Housing dataset was compiled by Dean De Cock for use in data science education. He proposed this dataset to data scientists as an alternative to the often cited Boston Housing dataset.
The Kaggle competition challenges (aspiring) data scientists to predict the final price of each home. This is not going to be an easy task. The data set has 70 explanatory variables that describe different aspects of a residential home in Ames, Iowa.
Predicting house prices is a challenging endeavor as there are a multitude of factors and variables the need to be taken into account when it comes to real estate valuation. In this challenge, creativity and ingenuity is key to be able to find the important variables that can give the most accurate prediction of house prices.
Predicting housing prices is a challenge; the task involves many variables and takes some creative thinking to pinpoint the features that actually matter in order to arrive at an accurate prediction. A recent competition on Kaggle (an online data science platform) featured housing price predictions with data from Ames, Iowa.
Visit our repository for all related codes.
Data cleaning and processing
The Data Set
The Kaggle competition provided the data as as train.csv (with house prices), and test.csv (new observations with all the same features as train.csv, but missing the Sale Price).
We processed the train.csv dataset, performed feature engineering by a number of methods and optimized model hyperparameters using K-fold splitting, performed a stacking algorithm of different models based on train/test optimization.
We predicted the house prices for the houses listed in the test.csv file and saved it as the required format for Kaggle submission.
- Top columns with null values:
With first glance we noticed that a lot of features have None value. However, with further analysis it shows that most cells with “NA” values actually mean “No”.
Example : PoolQC - Pool Quality - “NA” means “No Pool”.
With this finding, we updated all meaningful “NA” values to “No” or numeric 0
Two houses with the most SF(square feet) had average sale price. Based on the data publisher suggestion we decided to remove these two outliers.
Dummify Categorical Features
- Nominal category features:
- Dummy with removal of most common value
- Ordinal category features:
- Example: For KitchenQual (Kitchen Quality) feature, the original data values are Po(Poor), Fa (Fair), TA (Typical/Avg), Gd (Good), Ex (Excellent). We converted the values to numeric 1,2,3,4,5
Based on our understanding of the dataset , we have also implemented the following strategy to create some meaningful new features:
- Create BsmtScore
- (BsmtType1 * SF1 + BsmtType2 * SF2) / (SF1 + SF2)
- Create GarageScore (GarageQual/GarageCond/GarageFinish too correlated)
- (GarageCond + GarageQual + GarageFinish) / 3
- Define TotalBath
- Convert “Year Built” to “Years Since” concept
- YearsAgoBuilt, YearsSinceRemodel, YearsSinceSale
- Convert SaleMonth (1-12) to Seasons
- Dec-Feb = Winter, Mar-May = Spring, Jun-Aug = Summer, Sep-Nov = Autumn
- Features with less than 4 values:
- After dummy all the nominal category features, there are 158 columns. 19 features have less than 4 values in the training dataset and were removed to reduce statistical noise
- Check Correlation of each feature. Used 0.70 as the threshold to remove features that may cause multicollinearity problem. Features removed are:
- Other special features need to be removed:
- GarageYrBuilt-- Age of garage: Impossible to reconcile homes without a garage, as this is a numeric, not categorial, value.
- Misc Feature -- MiscFeature and MiscValue (value of Misc Features)
- MSSubClass -- Can be directly inferred by combination of HouseType, BldgType, and YearBuilt. Often highly (>0.9) correlated with HouseType
Data type checking
After all data cleaning work. We are ready to move to the feature engineering work. The last thing is to make sure all data now are in numeric type.
Data on Feature Engineering
Three models were used for feature engineering:
- AIC - forward selection, we got a 64-feature list
- Lasso - lambda was decided by using 5-fold cross validation. With penalty we got a 110-feature list
- Boost - based on Gradient Boost, we selected the most important 70-feature
In the next step, we manually compared the 3 feature lists. We kept the features which are listed as important or significant in all lists. For features not exist in all three lists, we trained the model with/without the features. Then compared the training result (RMSLE).
We finalized a 61-feature list. For details please refer to appendix table 1
Modeling and hyperparameter optimization:
Data was training using five models to get the prediction of the house prices:
- Linear - K-Fold testing of multiple linear regression
- Elasticnet - 5-fold Cross Validation was used to get the optimal alpha and rho hyperparameters balancing Ridge/Lasso regularization algorithms
- Ridge - 5-fold Cross Validation was used for lambda hyperparameter choosing, numeric features are all standardized to avoid large number impacts of penalization
- RandomForest -- K-Fold testing and hyperparameter optimization using the RandomForest Regressor algorithm from sklearn
- GBoost_AIC - K-Fold hyperparameter optimization and modeling using GradientBoosting Regressor from sklearn
Model stacking and train/test comparison:
Stacking model, where a 24% XGBoost, 36% Gradient Boost, 20% Random Forest, and 20% linear stacked model was found to be best against the test set.
Based on the training and test result. We noticed that there is an overfitting issue.
Because during feature engineering phase, a lot of manually comparison work was conducted to optimized the training score (training RMSLE). This can be the main reason of the overfitting.
We decided to simplify the feature selection process. Instead of using 3 models we focused the feature engineering work with AIC model only. In order to get a competitive feature list, we decided to use “both selection” procedure ( Begin with either a model that only has an intercept term or a model that includes all parameter terms, then sequentially add or remove the predictor that has the most/least impact on the model, respectively).
In this way, we finalized a 71-feature list (Appendix table 2)
Data was still trained with the same five models mentioned in the modeling section. We used the same model stacking strategy. Stacked grid search: 40% gBoost, 25% Linear, 35% RandomForest.
This time, the training score was a little higher than our last round training score. However, we noticed that the test RMSLE score is much lower. The training and test score are closer to each other.
To be continued:
- For feature removing, instead of randomly removing one of the highly related features (high correlation value), we could do a comparison test to see which feature may contribute better in the model.
- For some numeric features we didn’t consider the distribution of the data. We could do some further check to consider features which may not be normally distributed. According to the result, we could use log or square strategies to normalize data.
Table 1: 61-feature list selected by using AIC, Lasso, Boost
Tabel 2 : 71-feature list selected by AIC both selection procedure