Using Data to Predict House Prices in Ames, Iowa
The skills the author demonstrated here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Team: Lan Mond, Yuqin Xu, Fred Zeng
Data Science Background
This project aims to develop models to predict the price of a given house in Ames, Iowa. The models will serve as a tool for real estate investment firms to assess whether a house's asking price is higher or lower than its actual value, so the firms can make investment decisions accordingly. In this post, we use data to predict house prices in Ames, Iowa.
Ames had a population of 55,647 in 2007 and grew steadily from 1990 to 2015 at about 2.7%. Iowa State University enrolled 36,321 students in Fall 2017, accounting for roughly 50% of Ames's population. Ames is also an agricultural town, famous for its potato production. Year-round temperatures in Ames range from about -30 to 100 degrees Fahrenheit.
Datasets
The datasets include a training set of 1,460 homes with 79 features and a test set of 1,459 homes with the same 79 features, recording real home sales from 2006 to 2010.
EDA
1. Using Data to Analyze Price Distribution:
| Count | Mean | Std | Min | Median (50%) | Max |
| --- | --- | --- | --- | --- | --- |
| 1460 | $180,921.20 | $79,442.50 | $34,900 | $163,000 | $755,000 |
The sale prices of the 1,460 homes in the training dataset follow a slightly right-skewed distribution with an average of $180,921; the testing dataset shows a similar distribution.
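A minimal sketch of how these summary statistics can be computed, assuming the Kaggle training file is available locally as train.csv (the file and variable names are ours):

```python
import pandas as pd
from scipy.stats import skew

train = pd.read_csv("train.csv")

# Summary statistics and skewness of the training-set sale prices.
print(train["SalePrice"].describe())
print("skewness:", skew(train["SalePrice"]))
```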
2. Using Data to Analyze Price vs GrLivArea:
From the price distribution and the scatter of price against above ground living area, we can see the following (a plotting sketch follows the list):
- the price distribution is right skewed
- a linear relationship exists between price and above ground living area, especially for homes priced below roughly $400,000
- the two homes over 4,500 square feet that sold for less than $200,000 clearly should be removed as outliers
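A minimal plotting sketch of the scatter described above, under the same file-name assumptions as before:

```python
import matplotlib.pyplot as plt
import pandas as pd

train = pd.read_csv("train.csv")

# Scatter of sale price against above ground living area; the two oversized,
# under-priced homes stand out in the lower-right corner of this plot.
plt.scatter(train["GrLivArea"], train["SalePrice"], alpha=0.4)
plt.xlabel("GrLivArea (sq ft)")
plt.ylabel("SalePrice ($)")
plt.show()
```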
3. Price vs Neighborhood
4. Price vs Overall Quality
The above shows the distribution of house prices at each overall-quality level. The majority of houses score between 4 and 9. Since the number of houses at each quality level differs, model accuracy may vary across quality levels. Also, the neighborhood a house is located in clearly affects its sale price.
Data Cleaning
Missing Value Imputation
- Imputation for non-existent features
- Mode imputation
- Random imputation
- Other imputation
The two datasets are derived from the national Multiple Listing Service (MLS). MLS collects listing data mostly through drop-down options (except for numeric fields), which ensures the formality and completeness of the data. 'NA' is a legitimate option for some categorical features, but 'NA' values are read in as missing in Python. The first step in dealing with missing values is therefore to examine each categorical feature's value list, compare the related features, and replace the blanks with either 0 or 'No'. For example, "GarageType", "GarageFinish", "GarageQual", "GarageCond", and "GarageYrBlt" all show 5.55% missing values, which occur for houses that have no garage; consequently, they should all be filled with either 'No' or 0.
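A minimal sketch of this "non-existent feature" imputation for the garage columns, assuming the training data is loaded into a pandas DataFrame named train (names are ours, not the original code):

```python
import pandas as pd

train = pd.read_csv("train.csv")

# A missing value in these columns means the house has no garage,
# so fill the categorical columns with an explicit 'No' label ...
for col in ["GarageType", "GarageFinish", "GarageQual", "GarageCond"]:
    train[col] = train[col].fillna("No")

# ... and the numeric garage year with 0.
train["GarageYrBlt"] = train["GarageYrBlt"].fillna(0)
```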
Lot frontage is likely to reflect the property size in a given neighborhood. We imputed the 17.74% missing values with the median of the associated neighborhood.
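A sketch of the neighborhood-median imputation for LotFrontage, under the same assumptions as above:

```python
import pandas as pd

train = pd.read_csv("train.csv")

# Fill each missing LotFrontage with the median LotFrontage of houses
# in the same neighborhood, which tracks the typical lot size there.
train["LotFrontage"] = train.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median())
)
```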
We imputed several categorical features with the mode, since the distribution of existing values in those features was dominated by a single value. Other categorical features, whose values were more evenly distributed, were imputed randomly in proportion to the observed value frequencies.
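A sketch of the mode and frequency-proportional random imputation; the column names below are illustrative, not necessarily the ones the team treated this way:

```python
import numpy as np
import pandas as pd

train = pd.read_csv("train.csv")
rng = np.random.default_rng(0)

# Mode imputation: one value dominates, so use it for the few missing entries.
for col in ["Electrical", "KitchenQual", "Functional"]:  # illustrative columns
    train[col] = train[col].fillna(train[col].mode()[0])

# Random imputation: draw replacements in proportion to the observed frequencies.
col = "MasVnrType"  # illustrative column
freq = train[col].value_counts(normalize=True)
missing = train[col].isna()
train.loc[missing, col] = rng.choice(freq.index.to_numpy(), size=missing.sum(), p=freq.values)
```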
1. Dropping Useless Variables
Upon examination, we found that all but one observation in the Utilities column share the same value, and the test dataset shows no variation at all. Utilities is therefore an uninformative feature and should be dropped from both datasets.
2. Removing Outliers
The two largest houses sold well below normal price were likely abnormal sales between related family members. It is reasonable to treat them as outliers and remove them from the training dataset.
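A sketch of the outlier removal; the exact thresholds are ours and simply isolate the two points visible in the scatter plot:

```python
import pandas as pd

train = pd.read_csv("train.csv")

# Drop the two very large houses that sold far below what their size suggests.
outliers = train[(train["GrLivArea"] > 4500) & (train["SalePrice"] < 300000)].index
train = train.drop(outliers)
```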
Feature Engineering
The dataset has 79 features and we can classify them into four categories based on the nature of their values:
| Total | Continuous | Discrete | Nominal | Ordinal |
| --- | --- | --- | --- | --- |
| 79 | 19 | 14 | 23 | 23 |
1. Casting numerical features to categorical
MSSubClass, OverallQual, OverallCond, YrSold, MoSold, GarageYrBlt, YearBuilt, YearRemodAdd
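A minimal sketch of the cast, again assuming a DataFrame named train:

```python
import pandas as pd

train = pd.read_csv("train.csv")

# These features are stored as numbers but behave like categories
# (class codes, quality levels, years/months), so cast them to strings.
train["GarageYrBlt"] = train["GarageYrBlt"].fillna(0)  # filled as in the imputation step
to_categorical = ["MSSubClass", "OverallQual", "OverallCond", "YrSold",
                  "MoSold", "GarageYrBlt", "YearBuilt", "YearRemodAdd"]
train[to_categorical] = train[to_categorical].astype(str)
```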
2. Adding variables
The datasets provide the square-footage features TotalBsmtSF, 1stFlrSF, and 2ndFlrSF, all of which show a positive linear relationship with price.
Their sum, Total Square Feet = TotalBsmtSF + 1stFlrSF + 2ndFlrSF, shows a 0.833 correlation with price, so we decided to add Total Square Feet as a new feature.
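A sketch of the new feature; the column name TotalSF is ours:

```python
import pandas as pd

train = pd.read_csv("train.csv")

# Combine basement, first-floor, and second-floor square footage into one feature.
train["TotalSF"] = train["TotalBsmtSF"] + train["1stFlrSF"] + train["2ndFlrSF"]

# Sanity check: correlation of the new feature with sale price (~0.83 reported above).
print(train["TotalSF"].corr(train["SalePrice"]))
```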
Inflation plays a big role in the economy, so we believe it is reasonable to use an annual inflation index to normalize house prices and reflect year-to-year economic change: 2006: 240.85; 2007: 246.07; 2008: 247.23; 2009: 247.03; 2010: 245.29.
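One way such a normalization could look, using the index values quoted above; the adjusted-price column name and the choice of base year are ours:

```python
import pandas as pd

train = pd.read_csv("train.csv")

# Annual inflation index values quoted above, keyed by year of sale.
cpi = {2006: 240.85, 2007: 246.07, 2008: 247.23, 2009: 247.03, 2010: 245.29}

# Normalize each sale price to a common base year (2010 here) so that prices
# are comparable across the 2006-2010 window.
train["SalePriceAdj"] = train["SalePrice"] * cpi[2010] / train["YrSold"].map(cpi)
```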
3. Scaling and Encoding Variables
Numeric Conversion
Some numerical features, such as YrSold, behave as categories rather than quantities: in some years a given house is more appealing to buyers than in others, and the numeric value itself carries no linear meaning. To prepare these features for encoding, we converted them into string values.
Label Encoding
For ordinal categorical features, we replace the labels with numeric rankings 0, 1, 2, ... to reflect the relative order represented by the feature values.
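A sketch of this ordinal encoding for a few quality-type columns; the exact ranking below is one reasonable ordering, not necessarily the one the team used:

```python
import pandas as pd

train = pd.read_csv("train.csv")

# Map ordered quality labels to numeric ranks (higher = better).
quality_rank = {"No": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
for col in ["ExterQual", "ExterCond", "KitchenQual", "HeatingQC"]:
    train[col] = train[col].map(quality_rank)
```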
One-Hot Encoding (Dummies)
For categorical features without an inherent ordering, one-hot encoding was applied. After encoding, we have 221 features in total.
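A sketch of the one-hot step with pandas; after the casting and label encoding above, every remaining object-typed column gets one 0/1 indicator per level:

```python
import pandas as pd

train = pd.read_csv("train.csv")

# Dummy-encode the remaining unordered categorical features.
train = pd.get_dummies(train)
print(train.shape)  # the post reports 221 features after encoding
```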
4. Transforming Variables
Using the Fisher-Pearson coefficient of skewness, we identified the features with a distributional skewness greater than 1.5 and applied a Box-Cox transformation to normalize them and improve model performance.
After log transformation, the price distribution is approximately normal, which helps train better models.
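A sketch of both transformations; the skewness cutoff of 1.5 comes from the text, while the Box-Cox lambda of 0.15 is an illustrative value:

```python
import numpy as np
import pandas as pd
from scipy.special import boxcox1p
from scipy.stats import skew

train = pd.read_csv("train.csv")

# Identify strongly right-skewed numeric features (Fisher-Pearson skewness > 1.5) ...
numeric_cols = train.select_dtypes(include=[np.number]).columns.drop(["Id", "SalePrice"])
skewness = train[numeric_cols].apply(lambda s: skew(s.dropna()))
skewed = skewness[skewness > 1.5].index

# ... and normalize them with a Box-Cox transformation.
for col in skewed:
    train[col] = boxcox1p(train[col], 0.15)

# Log-transform the target so its distribution is approximately normal.
train["SalePrice"] = np.log1p(train["SalePrice"])
```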
Modeling
1. Elastic-Net
Elastic-net is a regularized linear regression. Its two most important parameters are the α-parameter, which sets the strength of the penalty on the estimator coefficients, and the l1-ratio parameter, which determines the balance between the Lasso (L1) and Ridge (L2) penalties.
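A minimal sketch of such a model with scikit-learn; X and y stand for the encoded feature matrix and log sale prices produced above, and the parameter grids are illustrative rather than the tuned values:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Cross-validated elastic-net: l1_ratio balances Lasso vs. Ridge, alpha sets
# the overall penalty strength on the coefficients.
enet = make_pipeline(
    RobustScaler(),
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0],
                 alphas=np.logspace(-4, 0, 50),
                 cv=5, max_iter=10000),
)
# enet.fit(X, y)
```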
The result confirms the strong linearity in the data but misses potential non-linear effects.
2. Kernel Ridge Regression(KRR)
The attempt was to employ a flexible set of nonlinear prediction functions, modulated by a penalty term to avoid overfitting. The kernel defines a high-dimensional inner-product space in which linear methods can capture relationships that are nonlinear in the original features.
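A sketch of a kernel ridge model with scikit-learn; the polynomial kernel and the parameter values are illustrative starting points, not the tuned configuration:

```python
from sklearn.kernel_ridge import KernelRidge

# A degree-2 polynomial kernel implicitly maps the features into a higher-
# dimensional inner-product space; alpha is the ridge penalty strength.
krr = KernelRidge(alpha=0.6, kernel="polynomial", degree=2, coef0=2.5)
# krr.fit(X, y)  # X, y: encoded features and log prices from the earlier steps
```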
Unfortunately, the new model did not perform better, producing an RMSE of 0.1115.
3. Stochastic Gradient Boosting
Gradient boosted decision trees (GBDT) iterate over sequential subtrees to minimize the loss function. The method sorts through the hierarchy of the feature space to learn linear, nonlinear, and interaction effects.
Stochastic gradient boosting: at each iteration a subsample of the training data is drawn at random (without replacement) from the full training dataset. The randomly selected subsample is then used, instead of the full sample, to fit the base learner. The benefit of the stochastic process over normal gradient boosting is a reduction in variance, as subtrees are de-correlated from one another.
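A sketch with scikit-learn's GradientBoostingRegressor; setting subsample below 1.0 gives the stochastic variant described above. Hyper-parameter values are illustrative, not the tuned ones:

```python
from sklearn.ensemble import GradientBoostingRegressor

# subsample < 1.0: each tree is fit on a random fraction of the training rows,
# which de-correlates the trees and lowers the variance of the ensemble.
gbdt = GradientBoostingRegressor(
    n_estimators=3000, learning_rate=0.05,
    max_depth=4, subsample=0.8,
    loss="huber", random_state=42,
)
# gbdt.fit(X, y)
```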
Stochastic gradient boosting improved RMSE to 0.1086.
4. XGBoost
XGBoost introduces a regularizing γ-parameter to control the complexity of the tree partitions. The higher the γ, the larger the minimum loss reduction required to split the tree at each leaf node.
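A sketch with the xgboost package; all hyper-parameter values here are illustrative, not the tuned ones:

```python
from xgboost import XGBRegressor

# gamma is the minimum loss reduction required to make a further split on a
# leaf node: a larger gamma yields more conservative, simpler trees.
xgb = XGBRegressor(
    n_estimators=2000, learning_rate=0.05,
    max_depth=3, gamma=0.05,
    subsample=0.8, colsample_bytree=0.8,
    random_state=42,
)
# xgb.fit(X, y)
```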
XGBoost gave an RMSE of 0.0487. This much better result raised our suspicions, so we applied the model to the test set and obtained an RMSE of 0.1156: overfitting.
Given the breadth of the hyper-parameter space to search, the high computational cost, and the time constraints on the project, we decided to leave further improvement of XGBoost for future work.
5. Model Ensembling
Model ensembling is a powerful technique that stacks multiple base models to achieve superior predictive performance.
The general idea is to incorporate the predictions of the base models as additional features in the training set; the stacking model is then fit on this augmented set.
Stacking is an ensemble learning technique that uses predictions from multiple models (for example, decision trees, KNN, or SVM) to build a new model, which is then used to make predictions on the test set. A sketch of a simple stacked ensemble appears after the next paragraph.
This meta-model weighs the predictions in order to highlight the strengths of the respective base models and smooth out their weaknesses. A diverse selection of base models, sufficiently decorrelated as to capture linear, hierarchical, and nonlinear relationships in our data, is therefore desirable.
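One way to express this stack with scikit-learn's StackingRegressor (the original implementation may have been hand-rolled): out-of-fold predictions from the base models become the inputs to the Lasso meta-model. All hyper-parameter values are illustrative:

```python
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import ElasticNet, Lasso

stack = StackingRegressor(
    estimators=[
        ("enet", ElasticNet(alpha=0.0005, l1_ratio=0.9)),
        ("krr", KernelRidge(alpha=0.6, kernel="polynomial", degree=2, coef0=2.5)),
        ("gbdt", GradientBoostingRegressor(n_estimators=3000, subsample=0.8)),
    ],
    final_estimator=Lasso(alpha=0.0005),
    cv=5,  # out-of-fold predictions are generated with 5-fold cross-validation
)
# stack.fit(X, y); predictions = stack.predict(X_test)
```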
Results & Feature Importance
We chose Lasso regression as the meta-model, with elastic-net, KRR, and stochastic GBDT as base models. Altogether we achieved an improved training RMSE of 0.0689 and a test RMSE of 0.1086.
The top ten features from our model:
Lessons & Improvement
The training and testing datasets are relatively clean and tidy, with 79 features each, but some categorical features take different sets of values in the two files. These minor structural differences were missed at the beginning and caused quite some trouble when we tested the first model. We corrected this by stacking the two datasets together and splitting them back into training and testing sets after data cleaning and feature engineering. Lesson learned: it is always more efficient to conduct a thorough data investigation before any other action.
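A sketch of the concatenate-then-split workaround, assuming the two Kaggle files are available locally:

```python
import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Stack train and test so cleaning and encoding see every category level once,
# then split back using the original row count.
n_train = len(train)
target = train.pop("SalePrice")  # keep the target aside; test has no SalePrice
full = pd.concat([train, test], ignore_index=True)

# ... data cleaning, feature engineering, encoding on `full` ...

train_clean = full.iloc[:n_train].copy()
test_clean = full.iloc[n_train:].copy()
```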
Further improvement:
- Feature engineering
- Hyper-parameter tuning
- Bayesian optimization