What drives the price of a house? Is it the neighborhood, the size of the house, the amenities, or something else? We set out to find answers using advanced machine learning techniques, working with the Ames Housing data set from Kaggle to predict house prices in Iowa.
In this blog post, we outline our approach to exploratory data analysis (EDA), data cleaning, feature engineering, and machine learning modeling, which enabled us to obtain the top Kaggle score (out of 12 competing groups at the NYC Data Science Academy boot camp).
The Data set and Competition
The Ames Housing dataset was compiled by Dean De Cock and is commonly used in data science education. The train dataset has 1460 observations with 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa. The test dataset comprises 1459 observations with the same 79 explanatory variables. This dataset is part of an ongoing Kaggle housing price prediction competition that challenges you to predict the final price of each home.
The goal of our project was to use supervised machine learning techniques to predict the price of each home in the dataset. It was clear that, with numerous predictors and a heterogeneous dataset, accurately predicting the response variable was going to be a non-trivial task. Our steps towards creating a highly accurate model were as follows:
Exploratory Data Analysis (EDA)
Data cleaning
Missingness imputation
Outlier removal
Dummification
Feature engineering
Add new features
Scaling
Pre Modeling
Cross-Validation (Hyperparameter tuning)
Modeling
Exploratory Data Analysis (EDA)
Categorization:
We started by exploring and understanding the dataset. We divided the variables into categories: continuous, nominal categorical, ordinal categorical, and the target variable.
Target Variable:
Sale price is the value we want to predict, so it made sense to examine this variable first. The sale price exhibited a right-skewed distribution, which we corrected by taking the log. After the log transformation, the target was much closer to normally distributed, which better suits the normality assumption of linear regression.
Sale Price before log transformation:
The distribution of price is right-skewed. A log transformation is used to make the distribution closer to normal.
The quantile-quantile (Q-Q) plot also shows that the price is not normally distributed.
Sale Price after log transformation:
After the log transformation, the price distribution looks much closer to a normal distribution.
After the log transformation, the points in the quantile-quantile plot fall much closer to a straight line, which indicates an approximately normal price variable.
This transformation can help improve the linear model’s performance.
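As a rough illustration of the effect described above, here is a minimal sketch that log-transforms a right-skewed target and compares skewness before and after. The data is synthetic (a lognormal sample standing in for SalePrice), not the Ames data, and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed stand-in for SalePrice (assumption: not the real data).
rng = np.random.default_rng(0)
sale_price = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000), name="SalePrice")

# Take the natural log of the target.
log_price = np.log(sale_price)

# Sample skewness: positive values mean a long right tail, values near 0 mean symmetry.
skew_before = sale_price.skew()
skew_after = log_price.skew()
print(f"skew before: {skew_before:.3f}, after: {skew_after:.3f}")

# Predictions made on the log scale are mapped back to prices with exp.
recovered = np.exp(log_price)
```

A model trained on the log target predicts log prices, so its output needs the inverse transform before it is reported as a price.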
Missingness and Imputation:
Next, we looked at missing values by feature in the train and test datasets. As you can see below, there was significant missingness by feature across both datasets. Most of the missing data corresponded to the absence of a feature. For example, the Garage features, listed in the table below, showed up as "NA" if the house did not have a garage. These were imputed as 0 or "None" depending on the feature type.
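The "absence of a feature" imputation can be sketched as follows. The tiny frame and the two column lists are illustrative assumptions; only the rule itself (0 for numeric features, "None" for categorical ones) comes from the text above.

```python
import numpy as np
import pandas as pd

# Toy rows: the second house has no garage. pandas reads "NA" in a CSV as NaN by default.
df = pd.DataFrame({
    "GarageType": ["Attchd", np.nan, "Detchd"],
    "GarageArea": [548.0, np.nan, 280.0],
    "GarageCars": [2.0, np.nan, 1.0],
})

# Hypothetical column groupings, chosen by feature type.
garage_categorical = ["GarageType"]
garage_numeric = ["GarageArea", "GarageCars"]

# A missing value here means "no garage", so fill with an explicit level or zero.
df[garage_categorical] = df[garage_categorical].fillna("None")
df[garage_numeric] = df[garage_numeric].fillna(0)

print(df)
```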
We identified missingness type (MAR: Missing at random, MNAR: Missing not at random, MCAR: Missing completely at random) to decide upon the imputed value. We wanted to test multiple imputation methods on our model, so we designed two imputation methods for a few features. Below is a breakdown of how we handled imputation across all the features.
Correlation Levels:
Another graphical view of our analysis is the following correlation plot that indicates levels of correlation amongst continuous variables, and between continuous features and the response variable (SalePrice).
It definitely aided in the exploration of the data. We found that the sale price is strongly correlated with the continuous variables listed below, so we focused on finding outliers in these predictors.
Predictor: Correlation with Price
GrLivArea: 0.708624
GarageCars: 0.640409
GarageArea: 0.623431
TotalBsmtSF: 0.613581
1stFlrSF: 0.605852
FullBath: 0.560664
TotRmsAbvGrd: 0.533723
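A ranking like the one above can be produced with a few lines of pandas. This is a minimal sketch on synthetic data; the column "MiscNoise" and the generated values are assumptions, and the printed numbers will not match the table.

```python
import numpy as np
import pandas as pd

# Synthetic data: one feature that drives price and one unrelated feature.
rng = np.random.default_rng(1)
n = 500
gr_liv_area = rng.normal(1500, 500, n)
misc_noise = rng.normal(0, 1, n)  # hypothetical column with no relation to price
sale_price = 100 * gr_liv_area + rng.normal(0, 30000, n)

df = pd.DataFrame({
    "GrLivArea": gr_liv_area,
    "MiscNoise": misc_noise,
    "SalePrice": sale_price,
})

# Pearson correlation of every numeric column with the target, strongest first.
corr_with_price = (
    df.corr()["SalePrice"]
    .drop("SalePrice")
    .sort_values(ascending=False)
)
print(corr_with_price)
```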
Categorical variable visualization:
There are more than 40 categorical variables in this data set. Here we show some plots of how these variables affect the price. We imputed missing data for the LotFrontage variable with the mean value of its neighborhood because, most of the time, houses in the same neighborhood are of a similar structure and size.
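The neighborhood-mean imputation of LotFrontage can be sketched with a grouped transform. The five rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: one missing LotFrontage value in each neighborhood.
df = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "OldTown", "OldTown"],
    "LotFrontage": [70.0, np.nan, 80.0, 50.0, np.nan],
})

# transform("mean") returns, for each row, the mean of its own neighborhood
# (NaN values are skipped when the mean is computed).
neighborhood_mean = df.groupby("Neighborhood")["LotFrontage"].transform("mean")
df["LotFrontage"] = df["LotFrontage"].fillna(neighborhood_mean)

print(df)
```

When the same rule is applied to the test set, one reasonable choice is to reuse the means computed on the training data so that no test information leaks into the features.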
Below is the boxplot of the neighborhood against house prices.
Since the neighborhood is a nominal variable, we do not expect to see a pattern in this boxplot, but it shows that different neighborhoods have different median values and price distributions.
Central air conditioning is an amenity that can increase the price of a house. The boxplot below shows that a house with central air conditioning is generally more expensive than one without.
Multicollinearity:
We wanted to see whether there is multicollinearity among the numeric columns. The plot below shows that some columns have extremely high multicollinearity.
Highly multicollinear columns are:
BsmtFinSF1
BsmtFinSF2
BsmtUnfSF
TotalBsmtSF
1stFlrSF
2ndFlrSF
LowQualFinSF
GrLivArea
BsmtFullBath
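One common way to quantify multicollinearity is the variance inflation factor (VIF). The sketch below is our own illustration and not necessarily the diagnostic behind the plot; the function name and the synthetic data are assumptions. It mimics a situation that seems to apply to several of the columns listed above, where one area column is nearly the sum of others.

```python
import numpy as np
import pandas as pd

def variance_inflation_factors(X: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on all other columns."""
    vifs = {}
    for col in X.columns:
        y = X[col].to_numpy(dtype=float)
        others = X.drop(columns=col).to_numpy(dtype=float)
        A = np.column_stack([np.ones(len(X)), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        residuals = y - A @ coef
        r2 = 1.0 - (residuals @ residuals) / ((y - y.mean()) ** 2).sum()
        vifs[col] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(vifs)

# Synthetic data: GrLivArea is (almost) the sum of the two floor areas.
rng = np.random.default_rng(2)
n = 500
first_flr = rng.normal(1000, 300, n)
second_flr = rng.normal(400, 200, n)
df = pd.DataFrame({
    "1stFlrSF": first_flr,
    "2ndFlrSF": second_flr,
    "GrLivArea": first_flr + second_flr + rng.normal(0, 10, n),
    "LotArea": rng.normal(10000, 3000, n),  # unrelated to the other columns
})

vif = variance_inflation_factors(df)
print(vif.sort_values(ascending=False))
```

A frequently used rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic collinearity.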
Outliers:
We wrote a function to detect outliers as shown below:
We first used the interquartile range (IQR) to identify rows that contain outliers. For each feature, we subtract the first quartile from the third quartile to get the IQR and multiply it by 1.5 to get the step; the lower bound is the first quartile minus the step, and the upper bound is the third quartile plus the step. After flagging the outliers, we set a threshold N so that the function returns the indices of the rows that contain N outliers.
The following outliers identified in ‘GrLivArea’ were removed.
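The procedure can be sketched as below. This is a reconstruction from the description, not the original function: the name `detect_outliers`, its signature, and the reading of the threshold as "at least N outlying values in a row" are all assumptions.

```python
from collections import Counter

import pandas as pd

def detect_outliers(df: pd.DataFrame, n: int, features: list) -> list:
    """Return the index labels of rows with at least n outlying values across `features`."""
    outlier_counts = Counter()
    for col in features:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        step = 1.5 * (q3 - q1)  # 1.5 times the interquartile range
        mask = (df[col] < q1 - step) | (df[col] > q3 + step)
        # Count one outlier for every row flagged in this column.
        outlier_counts.update(df.index[mask.to_numpy()])
    # Assumption: a row is returned when its outlier count reaches the threshold n.
    return sorted(idx for idx, count in outlier_counts.items() if count >= n)

# Toy data: row 5 is extreme in both columns, row 7 only in column "A".
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 100, 3, 90],
    "B": [10, 11, 12, 13, 14, 500, 12, 12],
})

rows_with_two = detect_outliers(df, 2, ["A", "B"])
rows_with_one = detect_outliers(df, 1, ["A", "B"])
cleaned = df.drop(index=rows_with_two)
print(rows_with_two, rows_with_one)
```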
Before Removing: After Removing:
Transformation of Predictors:
We then checked for skewness in the predictor variables, with the idea of applying a logarithm or Box-Cox transformation to highly skewed variables. The following plot helped us visualize and identify them (we transformed the ones tagged with a star):
We applied logarithm transformation, which helped in fixing the skewness.
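A minimal sketch of this step is shown below. We picked the variables to transform from the plot; the numeric cutoff of 0.75, the use of `log1p` (which tolerates zero-valued areas), and the synthetic columns are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Synthetic predictors: one heavily right-skewed, one roughly symmetric.
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "LotArea": rng.lognormal(mean=9.0, sigma=1.0, size=1000),
    "YearBuilt": rng.normal(1970, 20, size=1000),
})

skew_before = df.skew()

# Hypothetical rule: transform columns whose absolute skewness exceeds 0.75.
skew_threshold = 0.75
skewed_cols = skew_before[skew_before.abs() > skew_threshold].index

# log1p(x) = log(1 + x), so a value of 0 stays 0 instead of becoming -inf.
df[skewed_cols] = np.log1p(df[skewed_cols])

skew_after = df.skew()
print(pd.DataFrame({"before": skew_before, "after": skew_after}))
```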
Before Transformation:
After Transformation:
Feature Engineering
We added new variables to provide enriched information to our model.
Added boolean columns that indicate whether the house has a basement, garage, second floor, etc.
Added the total area of the house, the total number of full baths, the total number of half baths, and the total number of bathrooms above ground.
Remod: a boolean column that depends on whether 'Yearremod' is 0.
6. Scaled data using Scaler:
MinMaxScaler
StandardScaler
RobustScaler - Less sensitive to Outliers
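The feature engineering and scaling steps can be sketched together as follows. The new column names and the exact formulas (for example, which areas are summed into the total) are assumptions made for illustration, and the five rows are made up.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

df = pd.DataFrame({
    "TotalBsmtSF": [856, 0, 920, 756, 1145],
    "1stFlrSF": [856, 1262, 920, 961, 1145],
    "2ndFlrSF": [854, 0, 866, 756, 1053],
    "GarageArea": [548, 460, 0, 642, 836],
    "FullBath": [2, 2, 2, 1, 2],
    "BsmtFullBath": [1, 0, 1, 1, 1],
    "HalfBath": [1, 0, 1, 0, 1],
    "BsmtHalfBath": [0, 1, 0, 0, 0],
})

# Boolean indicators (hypothetical names): does the house have the feature at all?
df["HasBasement"] = (df["TotalBsmtSF"] > 0).astype(int)
df["HasGarage"] = (df["GarageArea"] > 0).astype(int)
df["Has2ndFloor"] = (df["2ndFlrSF"] > 0).astype(int)

# Aggregated totals (assumed formulas).
df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]
df["TotalFullBath"] = df["FullBath"] + df["BsmtFullBath"]
df["TotalHalfBath"] = df["HalfBath"] + df["BsmtHalfBath"]
df["TotalBathAbvGrd"] = df["FullBath"] + df["HalfBath"]

# RobustScaler centers each column on its median and divides by its IQR,
# so a few extreme houses influence the scaling less than with mean and variance.
area_cols = ["TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "GarageArea", "TotalSF"]
scaled = RobustScaler().fit_transform(df[area_cols])
df_scaled = pd.DataFrame(scaled, columns=area_cols, index=df.index)
print(df_scaled.round(2))
```

MinMaxScaler and StandardScaler expose the same `fit_transform` interface, so they can be swapped in to compare results.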
Modeling
We applied six different models to our data sets:
Ridge model
Lasso model
Elastic Net model
Basic Tree model
Random Forest model
XGBoost model
All models were tuned with a grid-search cross-validation function to find the best hyperparameters (for the penalized linear models, the regularization strength lambda).
In the pre-modeling phase, the train data set was further divided into training and testing subsets, so that we could use the known price labels to calculate the RMSE and R^2 score.
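The pre-modeling workflow can be sketched as below: hold out part of the labeled data, tune the penalty by grid-search cross-validation on the rest, then score on the held-out part. The data is synthetic, and the grid, split size, and fold count are assumptions. Note that scikit-learn calls the regularization strength `alpha` rather than lambda.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression problem standing in for the prepared housing features.
X, y = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=0)

# Split the labeled data so that some rows are never seen during tuning.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Grid of candidate penalties (assumed range), scored by cross-validated RMSE.
param_grid = {"alpha": np.logspace(-3, 3, 13)}
search = GridSearchCV(
    Ridge(), param_grid, scoring="neg_root_mean_squared_error", cv=5
)
search.fit(X_train, y_train)
best_alpha = search.best_params_["alpha"]

# By default GridSearchCV refits the best model on the whole training split.
predictions = search.predict(X_holdout)
rmse = np.sqrt(mean_squared_error(y_holdout, predictions))
r2 = r2_score(y_holdout, predictions)
print(f"best alpha: {best_alpha:.4g}, RMSE: {rmse:.3f}, R^2: {r2:.4f}")
```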
Our pre-modeling results show how different imputation methods and feature engineering steps can affect the outcome of the prediction. There are several ways of imputing and feature engineering the data set. We compared combinations of imputation methods and feature engineering steps using 3 different linear models. The best RMSE in the pre-modeling phase was 0.1201, obtained with the Ridge model, imputation method 2, and feature engineering steps 1, 2, 3.
For the final result, we found that adding the remaining feature engineering steps 4, 5, 6 and changing to imputation method 1 improved the result. The record is shown in the final result section.
We also tried a reduced model, built by selecting the variables with the best chi-square scores.
Compared with the full model, the R^2 decreased from 0.935592 to 0.848292.
Coefficients of Ridge and Lasso model:
After a cross-validation grid search, the best lambda is 22.222 for the Ridge model and 1e-10 for the Lasso model.
Coefficient of Ridge model as hyperparameter lambda increases.
Notice that none of the coefficients becomes exactly 0; they only get closer and closer to 0 as lambda increases.
Coefficient of Lasso model as hyperparameter lambda increases.
In the Lasso model, many coefficients became 0 within a lambda range of 0 to 1e-5. What is left are the coefficients of the important features. The pink and brown lines are two of these important features.
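The contrast between the two penalties can be checked numerically. The sketch below fits both models over a few penalty values and counts the coefficients that are exactly zero. The data is synthetic and the penalty values are assumptions, so the scale does not match the lambda values reported above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only 5 of the 30 features actually influence the target.
X, y = make_regression(
    n_samples=200, n_features=30, n_informative=5, noise=5.0, random_state=0
)

alphas = [0.01, 1.0, 10.0]  # scikit-learn's name for lambda
ridge_zero_counts = []
lasso_zero_counts = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X, y)
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    # Ridge (L2) shrinks coefficients toward zero; Lasso (L1) can set them exactly to zero.
    ridge_zero_counts.append(int(np.sum(ridge.coef_ == 0)))
    lasso_zero_counts.append(int(np.sum(lasso.coef_ == 0)))

for alpha, r_zeros, l_zeros in zip(alphas, ridge_zero_counts, lasso_zero_counts):
    print(f"alpha={alpha:>5}: ridge zeros={r_zeros}, lasso zeros={l_zeros}")
```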
Model complexity analysis:
Ridge Model:
As expected, the RMSE decreases with the increased number of features.
Lasso Model:
Compared with the Ridge model, the Lasso model shows a smaller gap between train and test RMSE, which suggests that the Ridge model is more prone to overfitting. The RMSE of the Lasso model also decreases much faster than that of the Ridge model. On this basis, we expected the Lasso model to perform better. However, in our project, the Ridge model had a lower RMSE than the Lasso model.
Final Results: Kaggle Submission
In the final submission, we submitted six different model predictions to Kaggle. Compared with the RMSEs calculated in our pre-modeling phase, the Kaggle RMSEs are much higher, which suggests that our models are overfitting.
The final RMSE is visualized in the following graph (the lower, the better):
The Ridge model has the lowest RMSE at 0.11694, while the Tree model has the worst at 0.19273. XGBoost is the second lowest. The Lasso and Elastic Net models have similar results, at around 0.137. From this graph, we infer that the relationship between the features and the log-transformed price in this data set is close to linear.
Conclusion
This was definitely a rewarding project. Our participation in this kaggle.com competition exposed us to the challenges of machine learning projects and the mindset needed to approach data science problems.
For data cleaning and imputation, the most important thing was to correctly identify the categorical and numeric variables. A variable like MS SubClass has a numerical data type, but it is actually a categorical variable. For feature engineering, the normality of both the features and the target variable is important to prediction accuracy. For modeling, linear models tended to outperform tree-based models on this regression problem in terms of both speed and score. Lasso helped with feature selection because it shrinks the coefficients of relatively unimportant features to zero.
Another consideration that would actually expand the scope of the problem and its solution is to include and analyze external data involving local policy changes and economic trends in the housing market specific to Ames, Iowa. Perhaps, adding even more data such as school zoning or transportation and commercial information would produce models with more predictive power.
Additionally, as we hone our craft and expand our skills, one aspect we would have liked to explore is the use of more models and different approaches to identify the best solution for this problem. We chose to keep our methods simple and robust in order to learn and ensure our understanding, but applying newer methods and models may yield better results.
In the future, we would like to apply a stacking technique to improve our model’s score.
Thank you for taking the time to learn about our work. We welcome constructive feedback.
About Authors
Priya Srivastava
Priya Srivastava is an analytical thinker with business acumen. Her first love was STEM, which she pursued in earning a bachelor’s degree in Engineering and building a career as a software engineer and data warehousing consultant in the technology...
Zhuoyi is an aspiring data scientist who likes the challenge of drawing on creative solutions to problems. Alongside completing a Master's degree at New York University (expected Dec. 2019), he is also a fellow at the NYC data science...