Machine Learning - House Price Prediction - Ames, Iowa

Posted on Aug 4, 2019

Team:  Lan Mond,  Yuqin Xu, Fred Zeng

Background

This project aims to develop models that predict the price of a given house in Ames, Iowa. The models will serve as a tool for real estate investment firms to assess whether the asking price of a given house is higher or lower than its true value, so that the firms can make investment decisions accordingly.

Ames had a population of 55,647 in 2007 and grew steadily at 2.7% from 1990 to 2015. Iowa State University enrolled 36,321 students in Fall 2017, equivalent to roughly two-thirds of Ames's population. Ames is also an agricultural town. Its year-round temperature ranges roughly from -30 to 100 degrees Fahrenheit.

Datasets

The data include a training set of 1,460 homes and a test set of 1,459 homes, each with 79 features, recording real home sales from 2006 to 2010.

EDA

1. Price distribution:

| count | mean ($) | std ($) | min ($) | 50% ($) | max ($) |
| --- | --- | --- | --- | --- | --- |
| 1,460 | 180,921.20 | 79,442.50 | 34,900 | 163,000 | 755,000 |

In the training dataset, the sale prices of the 1,460 homes follow a slightly right-skewed distribution with an average of $180,921. The test dataset is similar.

2. Price vs GrLivArea:

From the price distribution and price scattered over the above ground living area, we can see the following:

  • the price distribution is right-skewed
  • a linear relationship exists between price and above-ground living area, especially for prices below $400,000
  • two homes over 4,500 square feet sold for less than $200,000 and clearly should be removed as outliers

3. Price ~ Neighborhood

4. Price vs Overall Quality

The above shows the distribution of house prices for each overall-quality level. The majority of houses score between 4 and 9. Since the number of houses at each level differs, model accuracy may vary across quality levels. The neighborhood of a house also clearly affects its sale price.

Data Cleaning

Missing Value Imputation

  1. 'Non-existent' value imputation
  2. Mode imputation
  3. Random imputation
  4. Other imputation

The two datasets are derived from the national Multiple Listing Service (MLS). MLS collects listing data mostly through drop-down options (except for numeric fields), which ensures the formality and completeness of the data. 'NA' is a legitimate option for some categorical features, but it is read as a missing value in Python. The first step in dealing with missing values is therefore to examine each categorical feature's value list, compare the related features, and replace the blanks with either 0 or 'No'. For example, "GarageType", "GarageFinish", "GarageQual", "GarageCond", and "GarageYrBlt" all show 5.55% missing values, which occur for houses that have no garage; consequently, they should all be replaced with either 'No' or 0.
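A minimal sketch of this step in pandas, assuming the raw files follow the usual Kaggle naming ('train.csv', 'test.csv'); the variable names are illustrative, not our exact code:

```python
import pandas as pd

# Illustrative file and variable names; adjust paths to your setup.
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
df = pd.concat([train.drop('SalePrice', axis=1), test], sort=False)

# For these garage features, a blank means "no garage", not an unknown value.
garage_cat = ['GarageType', 'GarageFinish', 'GarageQual', 'GarageCond']
df[garage_cat] = df[garage_cat].fillna('No')
df['GarageYrBlt'] = df['GarageYrBlt'].fillna(0)
```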

Lot frontage is likely to reflect the property size in a given neighborhood. We imputed the 17.74% missing values with the median of the associated neighborhood.
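Continuing the sketch above, the neighborhood-median imputation could look like this:

```python
# Fill missing LotFrontage with the median frontage of the same neighborhood.
df['LotFrontage'] = (df.groupby('Neighborhood')['LotFrontage']
                       .transform(lambda s: s.fillna(s.median())))
```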

We imputed several categorical features with the mode, because the distribution of existing values in those features was dominated by a single value. Other categorical features, whose values were more evenly distributed, were imputed randomly in proportion to the observed frequencies.
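A sketch of the mode and proportional random imputation; the feature lists below are examples, not the exact set used in the project:

```python
import numpy as np

# Mode imputation for features dominated by a single value.
for col in ['Electrical', 'KitchenQual', 'Functional']:
    df[col] = df[col].fillna(df[col].mode()[0])

# Random imputation in proportion to the observed value frequencies.
rng = np.random.default_rng(0)
for col in ['MSZoning']:
    freq = df[col].value_counts(normalize=True)   # ignores NaN by default
    missing = df[col].isna()
    df.loc[missing, col] = rng.choice(freq.index, size=missing.sum(), p=freq.values)
```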

Dropping Useless Variables

Upon examination, we found that only one observation in the Utilities column had a value different from the rest, and the test dataset showed no variation at all. Utilities is therefore an uninformative feature and should be dropped from both datasets.

Removing Outliers

The two largest houses, which sold well below the normal price, were likely abnormal sales between related parties. It is reasonable to treat them as outliers and remove them from the training dataset.
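A sketch of the outlier removal, using the thresholds suggested by the EDA above:

```python
# Drop homes with more than 4,500 sq ft of above-ground living area
# that nevertheless sold for under $200,000.
outlier_idx = train[(train['GrLivArea'] > 4500) & (train['SalePrice'] < 200000)].index
train = train.drop(outlier_idx)
```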

Feature Engineering

The dataset has 79 features and we can classify them  into four categories based on the nature of their values:

| Total | Continuous | Discrete | Nominal | Ordinal |
| --- | --- | --- | --- | --- |
| 79 | 19 | 14 | 23 | 23 |

1. Casting numerical features to categorical

‘MSSubClass’, ‘OverallQual’, ‘OverallCond’, ‘YrSold’, ‘MoSold‘, ‘GarageYrBlt’ ,‘YearBuilt’, ‘YearRemodAdd’

2. Adding variables

The datasets provide the square-footage features TotalBsmtSF, 1stFlrSF, and 2ndFlrSF, which all show a positive linear relationship with price.


We therefore added a new feature, Total Square Feet = TotalBsmtSF + 1stFlrSF + 2ndFlrSF, which shows a 0.833 correlation with price.

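A sketch of the new feature; 'TotalSF' is an illustrative column name:

```python
# Aggregate basement, first-floor, and second-floor square footage.
df['TotalSF'] = df['TotalBsmtSF'] + df['1stFlrSF'] + df['2ndFlrSF']

# Check its correlation with price on the training rows (about 0.83 in our EDA).
train['TotalSF'] = train['TotalBsmtSF'] + train['1stFlrSF'] + train['2ndFlrSF']
print(train['TotalSF'].corr(train['SalePrice']))
```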

Inflation plays a big role in the economy. We believe it is reasonable to use an annual inflation index to normalize house prices so they reflect year-to-year economic change: 2006: 240.85; 2007: 246.07; 2008: 247.23; 2009: 247.03; 2010: 245.29.
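A sketch of the adjustment, using the index values quoted above; the adjusted column name and the choice of 2010 as the base year are assumptions:

```python
# Annual inflation index values from the paragraph above.
cpi = {2006: 240.85, 2007: 246.07, 2008: 247.23, 2009: 247.03, 2010: 245.29}
base_year = 2010

# Rescale each sale price to base-year dollars.
train['SalePriceAdj'] = train['SalePrice'] * train['YrSold'].map(
    lambda year: cpi[base_year] / cpi[int(year)])
```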

3. Scaling and Encoding Variables

Numeric-to-String Conversion

Some numerical features, such as YrSold, carry categorical rather than purely numeric information: in some years a given house is more appealing to buyers than in others. We therefore converted these features to string values to prepare them for encoding.
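A minimal sketch of the conversion for the features listed earlier:

```python
# Treat these numeric codes and dates as categories rather than magnitudes.
to_categorical = ['MSSubClass', 'OverallQual', 'OverallCond', 'YrSold',
                  'MoSold', 'GarageYrBlt', 'YearBuilt', 'YearRemodAdd']
for col in to_categorical:
    df[col] = df[col].astype(str)
```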

LabelEncoding

For ordinal categorical features, we replace the labels with numeric rankings 0, 1, 2, ... to reflect the relative order represented by the feature values.
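A sketch of the idea with an explicit ordinal map (rather than scikit-learn's LabelEncoder, which ignores the natural order); the feature list and scale are illustrative and assume the 'No' imputation above:

```python
# Quality scale from the data description: none < Po < Fa < TA < Gd < Ex.
quality_scale = {'No': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
for col in ['ExterQual', 'ExterCond', 'HeatingQC', 'KitchenQual',
            'GarageQual', 'GarageCond']:
    df[col] = df[col].map(quality_scale)
```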


OneHot Encoding (Dummies)

For categorical features without an inherent ordering, one-hot encoding was applied. After encoding, we have 221 features in total.
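A one-line sketch with pandas:

```python
# One-hot encode the remaining nominal categorical features.
df = pd.get_dummies(df)
print(df.shape)   # should be about 221 columns, matching the count reported above
```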

4. Transforming Variables

Using the Fisher-Pearson coefficient of skewness, we identified the features with a distributional skewness greater than 1.5. We applied a Box-Cox transformation to these features to normalize them and enhance the performance of the models.

After a log transformation, the price distribution is approximately normal, which helps train better models.
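A sketch of both transformations; the fixed Box-Cox lambda of 0.15 is a common choice, not a value taken from our tuning:

```python
import numpy as np
from scipy.stats import skew          # Fisher-Pearson coefficient of skewness
from scipy.special import boxcox1p

# Find numeric features with skewness above 1.5 (in practice this is applied
# to the original numeric columns, before dummy encoding).
numeric_cols = df.select_dtypes(include=[np.number]).columns
skewness = df[numeric_cols].apply(lambda s: skew(s.dropna()))
skewed = skewness[skewness.abs() > 1.5].index

for col in skewed:
    df[col] = boxcox1p(df[col], 0.15)

# Log-transform the target so its distribution is approximately normal.
y = np.log1p(train['SalePrice'])
```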

Modeling

1. Elastic-Net

RMSE: 0.1110

Elastic-net is a regularized linear regression. Its two most important hyperparameters are the α parameter, which controls how strongly the estimator's coefficients are penalized, and the L1 ratio, which determines the balance between the Lasso (L1) and Ridge (L2) penalties.

The result confirms the strong linearity in the data but misses potential non-linear effects.
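A minimal cross-validated sketch, assuming `X_train` holds the fully encoded training rows and `y` the log-transformed sale prices from the steps above (hypothetical names); the alpha grid and L1 ratios are illustrative:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# X_train, y: assumed to come from the preprocessing sketches above.
enet = make_pipeline(
    RobustScaler(),
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0],
                 alphas=np.logspace(-4, 0, 50), cv=5))
rmse = -cross_val_score(enet, X_train, y, cv=5,
                        scoring='neg_root_mean_squared_error')
print(rmse.mean())   # about 0.111 in the result reported above
```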

2. Kernel Ridge Regression (KRR)

The attempt was to employ a flexible set of nonlinear prediction functions, modulated by a penalty term to avoid overfitting. The kernel implicitly defines a high-dimensional inner-product space in which nonlinear relationships among the original features become linear and easier to identify.

Unfortunately, the new model failed to provide better performance, producing an RMSE of 0.1115.

A linear fit in the kernel space corresponds to a non-linear fit in the original feature space. Hyperparameters tuned: alpha, kernel (polynomial), and degree.
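A sketch with scikit-learn's KernelRidge, reusing the same `X_train` and `y`; the alpha, degree, and coef0 values are illustrative, not the tuned values:

```python
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import cross_val_score

# Polynomial-kernel ridge regression: linear in the induced kernel space,
# non-linear in the original feature space.
krr = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)
rmse = -cross_val_score(krr, X_train, y, cv=5,
                        scoring='neg_root_mean_squared_error')
print(rmse.mean())   # about 0.1115 in the result reported above
```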

3. Stochastic Gradient Boosting

Gradient boosting decision trees (GBDT) iterate over sequential subtrees to minimize the loss function. The method works through the hierarchy of the feature space to learn linear, nonlinear, and interaction effects.

Stochastic gradient boosting: at each iteration, a subsample of the training data is drawn at random (without replacement) from the full training dataset and used, instead of the full sample, to fit the base learner. The benefit of the stochastic process over ordinary gradient boosting is a reduction in variance, as the subtrees are de-correlated from one another.

Stochastic gradient boosting improved the RMSE to 0.1087.


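A sketch with scikit-learn's GradientBoostingRegressor, reusing the same `X_train` and `y`; the hyperparameter values are illustrative, not the tuned ones:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# subsample < 1.0 makes the boosting stochastic: each tree is fit on a random
# fraction of the training rows drawn without replacement.
gbdt = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
                                 max_depth=4, subsample=0.8,
                                 loss='huber', random_state=42)
rmse = -cross_val_score(gbdt, X_train, y, cv=5,
                        scoring='neg_root_mean_squared_error')
print(rmse.mean())   # about 0.1087 in the result reported above
```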

4. XGBoost

XGBoost introduces a regularizing γ-parameter to control the complexity of tree partitioning: the higher the γ, the larger the minimum loss reduction required to split the tree at a leaf node.

XGBoost gave an RMSE of 0.0487. This much superior result raised our suspicions, so we applied the model to the test set, which yielded an RMSE of 0.1156: overfitting.
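A sketch with the xgboost package, reusing the same `X_train` and `y`; the hyperparameter values, including γ, are illustrative:

```python
from xgboost import XGBRegressor
from sklearn.model_selection import cross_val_score

# gamma is the minimum loss reduction required to make a further split on a
# leaf node; larger values make the trees more conservative.
xgb = XGBRegressor(n_estimators=2000, learning_rate=0.05, max_depth=3,
                   gamma=0.05, subsample=0.7, colsample_bytree=0.7,
                   random_state=42)
rmse = -cross_val_score(xgb, X_train, y, cv=5,
                        scoring='neg_root_mean_squared_error')
print(rmse.mean())
```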

Given the breadth of the hyperparameter space to search, the high computational cost, and the time constraints on the project, we decided to leave further improvement of XGBoost to future work.

5. Model Ensembling

Model ensembling is a powerful technique for improving predictions: multiple base models are stacked to achieve superior performance.

The general idea is to incorporate the predictions of the base models as additional features in the training set, on which the stacking model is then fit.

Stacking is an ensemble learning technique that uses predictions from multiple models (for example a decision tree, KNN, or SVM) to build a new model, which is then used to make predictions on the test set. A sketch of a simple stacked ensemble appears at the end of this section.

This meta-model weights the predictions in order to highlight the strengths of the respective base models and smooth out their weaknesses. A diverse selection of base models, sufficiently decorrelated so as to capture linear, hierarchical, and nonlinear relationships in our data, is therefore desirable.

Base models:
1) Elastic-Net (0.1110)
2) Stochastic Gradient Boosting (0.1087)
3) Kernel Ridge Regression (0.1115)

Meta-model: Lasso
Train RMSE: 0.0689
Test RMSE: 0.1086
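A sketch of the stack using scikit-learn's StackingRegressor with the same three base models and a Lasso meta-model; the hyperparameter values are illustrative, and our own implementation may stack out-of-fold predictions manually:

```python
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.model_selection import cross_val_score

# Base models feed their predictions as features to the Lasso meta-model.
stack = StackingRegressor(
    estimators=[
        ('enet', ElasticNet(alpha=0.0005, l1_ratio=0.9)),
        ('gbdt', GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
                                           max_depth=4, subsample=0.8)),
        ('krr', KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)),
    ],
    final_estimator=Lasso(alpha=0.0005),
    cv=5,
)
rmse = -cross_val_score(stack, X_train, y, cv=5,
                        scoring='neg_root_mean_squared_error')
print(rmse.mean())   # about 0.1086 in the result reported above
```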

Results & Feature Importance

We chose Lasso regression as the meta-model, with elastic-net, KRR, and stochastic GBDT as base models. Altogether we achieved an improved train RMSE of 0.0689 and a test RMSE of 0.1086.

The top ten features from our model:

Lessons & Improvement

The training and test datasets are relatively clean and tidy, with 79 features each. However, some categorical features take different sets of values in the two datasets. These minor structural differences were missed at the beginning and caused quite some trouble when we tested the first model. We corrected this by stacking the two datasets together and splitting them back into training and test sets after data cleaning and feature engineering. Lesson learned: it is always more efficient to conduct a thorough data investigation before taking any action.

Further improvements:

  • Feature engineering
  • Hyper-parameter tuning with Bayesian optimization

