Using Data to Predict House Prices in Ames, Iowa

Posted on Aug 4, 2019



Team: Lan Mond, Yuqin Xu, Fred Zeng

Data Science Background

This project aims to develop one or more models to predict the price of a given house in Ames, Iowa. The models will serve as a tool for real estate investment firms to assess whether the price of a given house is higher or lower than its actual value, so that the firms can make investment decisions accordingly.

Ames had a population of 55,647 in 2007 and grew steadily from 1990 to 2015 at a rate of 2.7%. Iowa State University had 36,321 students in Fall 2017, which makes up to 50% of Ames's population. Ames is also an agricultural town, famous for its potato production. Year-round temperatures in Ames range from -30 to 100 degrees Fahrenheit.

Datasets

The data consist of a training dataset of 1,460 homes and a test dataset of 1,459 homes, each with 79 features, recording real home sales from 2006 to 2010.

EDA

1. Using Data to Analyze Price Distribution:

count   mean        std        min     50%      max
1460    180921.20   79442.50   34900   163000   755000

The sale prices of the 1,460 homes in the training dataset have a slightly right-skewed distribution with a mean of $180,921. The testing dataset is similar.

2. Using Data to Analyze Price vs GrLivArea:


From the price distribution and the scatter of price against above-ground living area, we can see the following:

  • the price distribution is right skewed
  • a linear relationship exists between price and above-ground living area, especially when the price is lower than $400,000
  • two homes of over 4,500 square feet sold for less than $200,000 and clearly should be taken out as outliers

 

3. Price vs Neighborhood


4. Price vs Overall Quality


The above shows the distribution of house prices for each quality level. The majority of houses score between 4 and 9. Since the number of houses at each level differs, model accuracy may vary across quality levels. Also, the neighborhood of a house clearly affects its sale price.

Data Cleaning

Missing Value Imputation

  1. Non-existent data imputation
  2. Mode imputation
  3. Random imputation
  4. Other imputation

The two datasets are derived from the national Multiple Listing Service (MLS). MLS collects listing data mostly through drop-down options (except for numeric information), which ensures the formality and completeness of the data. 'NA' is a valid option for some categorical features. Since 'NA' values are treated as missing when the data are loaded in Python (pandas reads 'NA' as NaN by default), the first step in dealing with missing values is to examine each categorical feature's list of valid values. We then compare the related features and replace the blanks with either 0 or 'No'. For example, "GarageType", "GarageFinish", "GarageQual", "GarageCond", and "GarageYrBlt" all show 5.55% missing values, which occur for the houses that have no garage. Consequently, they should all be replaced with either 'No' or 0.
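The garage example can be sketched roughly as follows. This is a minimal illustration that assumes pandas; the three-row frame is made-up stand-in data, not the actual Ames file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Ames data (illustrative values); the second house has no garage.
df = pd.DataFrame({
    "GarageType": ["Attchd", np.nan, "Detchd"],
    "GarageFinish": ["Fin", np.nan, "Unf"],
    "GarageYrBlt": [2003.0, np.nan, 1976.0],
})

# Categorical garage features: a missing value means "no garage", so label it 'No'.
garage_categorical = ["GarageType", "GarageFinish"]
df[garage_categorical] = df[garage_categorical].fillna("No")

# Numeric garage feature: use 0 when there is no garage.
df["GarageYrBlt"] = df["GarageYrBlt"].fillna(0)
```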

Lot frontage is likely to reflect the property size in a given neighborhood. We imputed the 17.74% missing values with the median lot frontage of the associated neighborhood.
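A neighborhood-wise median fill can be written with a pandas group-by. The sketch below uses a small made-up frame; only the column names follow the Kaggle data.

```python
import numpy as np
import pandas as pd

# Illustrative data: two neighborhoods, each with one missing LotFrontage.
df = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "OldTown", "OldTown", "OldTown"],
    "LotFrontage": [70.0, np.nan, 80.0, 50.0, 60.0, np.nan],
})

# Fill each missing value with the median of its own neighborhood.
df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median())
)
```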

We imputed several categorical features via the mode, as the distribution of existing values in each of those features was dominated by a single element. Other categorical features, with more evenly distributed values, were imputed randomly in proportion to the frequencies of the candidate values.
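Both strategies are sketched below under stated assumptions: the helper names `impute_mode` and `impute_random` are hypothetical, the two example series are made up, and which features received which treatment is not stated in the text.

```python
import numpy as np
import pandas as pd

def impute_mode(s):
    """Fill missing values with the most frequent existing value."""
    return s.fillna(s.mode().iloc[0])

def impute_random(s, rng):
    """Fill missing values by sampling existing values in proportion to their frequency."""
    freqs = s.dropna().value_counts(normalize=True)
    out = s.copy()
    mask = out.isna()
    out[mask] = rng.choice(freqs.index.to_numpy(), size=int(mask.sum()), p=freqs.to_numpy())
    return out

rng = np.random.default_rng(0)

# Dominated by one value: mode imputation.
dominated = pd.Series(["SBrkr", "SBrkr", "SBrkr", np.nan])
filled_mode = impute_mode(dominated)

# More evenly spread values: random imputation in proportion to observed frequencies.
spread = pd.Series(["VinylSd", "HdBoard", "MetalSd", np.nan, "VinylSd", "HdBoard"])
filled_random = impute_random(spread, rng)
```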

1. Dropping Useless Variables

Upon examination, we found that only one observation in the Utilities column has a value that differs from the rest, and the test dataset shows no variation at all. Utilities is therefore an uninformative feature and should be dropped from the datasets.

 

2. Removing Outliers


The two largest houses, which sold well below the normal price, were likely unusual sales between related parties such as family members. It is reasonable to treat them as outliers and remove them from the training dataset.
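Dropping such rows amounts to a boolean filter. In the sketch below the frame is illustrative, and the cutoffs (living area above 4,500 square feet, price below $200,000) are assumptions chosen to match the description of very large houses with low sale prices.

```python
import pandas as pd

# Illustrative rows; the third and fourth are very large houses with low prices.
train = pd.DataFrame({
    "GrLivArea": [1710, 2198, 4676, 5642, 1262],
    "SalePrice": [208500, 250000, 184750, 160000, 181500],
})

# Assumed cutoffs: very large living area combined with a low sale price.
is_outlier = (train["GrLivArea"] > 4500) & (train["SalePrice"] < 200000)
train = train.loc[~is_outlier].reset_index(drop=True)
```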

 

 

Feature Engineering

The dataset has 79 features, which we can classify into four categories based on the nature of their values:

Total   continuous   discrete   nominal   ordinal
79      19           14         23        23

1. Casting numerical features to categorical

'MSSubClass', 'OverallQual', 'OverallCond', 'YrSold', 'MoSold', 'GarageYrBlt', 'YearBuilt', 'YearRemodAdd'

3. Adding variables

The datasets provide three square-footage features, TotalBsmtSF, 1stFlrSF, and 2ndFlrSF, which all show a positive linear relationship with price.


We define Total Square Feet = TotalBsmtSF + 1stFlrSF + 2ndFlrSF, which shows a 0.833 correlation with price, and we decided to add it as a new feature.
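A sketch of the new feature, assuming pandas and the Kaggle column names (the second-floor column is named 2ndFlrSF there). The rows are illustrative, so the correlation computed here will not reproduce 0.833, and the name `TotalSF` is our own choice.

```python
import pandas as pd

# Illustrative rows only.
df = pd.DataFrame({
    "TotalBsmtSF": [856, 1262, 920, 756, 1145],
    "1stFlrSF": [856, 1262, 920, 961, 1145],
    "2ndFlrSF": [854, 0, 866, 756, 1053],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# Total Square Feet = basement + first floor + second floor.
df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]

# Pearson correlation between the new feature and price.
corr = df["TotalSF"].corr(df["SalePrice"])
```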

Figure: Total Square Feet

The inflation index plays a big role in the economy. We believe it is reasonable to add an annual inflation index to normalize house prices so that they reflect annual economic change. The index values are: '2006': 240.85; '2007': 246.07; '2008': 247.23; '2009': 247.03; '2010': 245.29.
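One way to attach the index is a lookup on the sale year. The text does not spell out how the index entered the model, so the sketch below is an assumption: it adds the index as a column and, as one possible normalization, rescales prices to 2010 index terms. The column names `InflationIndex` and `PriceAdj` are hypothetical.

```python
import pandas as pd

# Annual index values quoted in the text.
inflation_index = {2006: 240.85, 2007: 246.07, 2008: 247.23, 2009: 247.03, 2010: 245.29}

# Illustrative rows.
df = pd.DataFrame({"YrSold": [2006, 2008, 2010], "SalePrice": [200000.0, 180000.0, 150000.0]})

# Attach the index of the sale year as a feature.
df["InflationIndex"] = df["YrSold"].map(inflation_index)

# Assumed normalization: express every price in 2010 index terms.
df["PriceAdj"] = df["SalePrice"] * inflation_index[2010] / df["InflationIndex"]
```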

 

4. Scaling and Encoding Variables

Numeric Conversion

Some numerical features, such as YrSold, behave more like categories than quantities: in some years a given house is more appealing to buyers than in others. Consequently, we converted these features into string values to prepare them for encoding.
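In pandas this conversion is a single cast. A minimal sketch with a few of the listed columns and made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    "MSSubClass": [60, 20, 70],
    "YrSold": [2008, 2007, 2010],
    "MoSold": [2, 5, 9],
})

# Cast number-coded categories to strings so later encoders treat them as labels.
to_categorical = ["MSSubClass", "YrSold", "MoSold"]
df[to_categorical] = df[to_categorical].astype(str)
```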

Label Encoding

For ordinal categorical features, we replace the labels with numeric rankings 0, 1, 2, ... to reflect the relative order represented by the feature values.
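A sketch of rank encoding with an explicit order. The quality scale below ('No' for absent, then Po, Fa, TA, Gd, Ex) is an assumption for illustration. An explicit mapping is used because scikit-learn's `LabelEncoder` assigns codes in sorted label order, which would not preserve the intended ranking.

```python
import pandas as pd

# Assumed ordering from worst to best; 'No' marks an absent feature.
quality_order = ["No", "Po", "Fa", "TA", "Gd", "Ex"]
rank = {label: i for i, label in enumerate(quality_order)}

df = pd.DataFrame({"GarageQual": ["TA", "No", "Gd", "Ex"]})

# Replace each label with its rank 0, 1, 2, ...
df["GarageQual"] = df["GarageQual"].map(rank)
```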


One-Hot Encoding (Dummies)

For categorical features without ordering, one-hot encoding was applied. After encoding, we have 221 features in total.
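With pandas, one-hot encoding can be done with `get_dummies`. A small sketch on made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    "Neighborhood": ["NAmes", "OldTown", "NAmes"],
    "GrLivArea": [1710, 1262, 1786],
})

# Each distinct Neighborhood value becomes its own indicator column.
encoded = pd.get_dummies(df, columns=["Neighborhood"])
```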


 

5. Transforming Variables

Using the Fisher-Pearson coefficient of skewness, we identified the features with a distributional skewness greater than 1.5. We applied a Box-Cox transformation to these features and normalized them to enhance the performance of the models.
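A sketch of this step using SciPy. `scipy.stats.skew` computes the Fisher-Pearson coefficient by default. The Box-Cox parameter `lam = 0.15` and the use of `boxcox1p` (Box-Cox applied to 1 + x, which tolerates zeros) are assumptions, since the text does not state them; the data are synthetic and the column names are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.special import boxcox1p
from scipy.stats import skew

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "LotArea": rng.lognormal(mean=9.0, sigma=1.0, size=500),  # strongly right-skewed
    "HouseAge": rng.uniform(0, 100, size=500),                # roughly symmetric
})

# Fisher-Pearson skewness of each numeric feature.
skewness = df.apply(lambda s: skew(s.to_numpy()))
skewed_cols = list(skewness[skewness.abs() > 1.5].index)

lam = 0.15  # assumed Box-Cox parameter, not taken from the text
for col in skewed_cols:
    df[col] = boxcox1p(df[col].to_numpy(), lam)
```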


After log transformation: the log-transformed price follows an approximately normal distribution, which is better suited for training models.
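A sketch of the target transformation. Whether log or log(1 + x) was used is not stated; `np.log1p` is assumed here, with `np.expm1` to map predictions back to dollars. The `rmse` helper is hypothetical and is meant to be applied to log-scale values.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Illustrative prices spanning the range in the summary table.
prices = np.array([34900.0, 163000.0, 180921.0, 755000.0])

log_prices = np.log1p(prices)     # train on this scale
recovered = np.expm1(log_prices)  # invert to get prices back
```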


 

 

Modeling

1. Elastic-Net

RMSE: 0.11103479383295042

Elastic-net is a regularized linear regression. Its two most important parameters are the α parameter, which penalizes the size of the estimator coefficients, and the L1-ratio parameter, which determines the balance between the Lasso (L1) and Ridge (L2) penalties.

The result confirms the strong linearity in the data, but a linear model misses potential nonlinear features.
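For reference, an elastic-net fit with cross-validated RMSE might look like the sketch below. It assumes scikit-learn; the data are synthetic, and the values of `alpha` and `l1_ratio` are placeholders rather than the tuned values behind the RMSE above.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import cross_val_score

# Synthetic linear data standing in for the engineered features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=200)

# alpha scales the overall penalty; l1_ratio balances Lasso (1.0) against Ridge (0.0).
model = ElasticNet(alpha=0.001, l1_ratio=0.5, max_iter=10000)

scores = cross_val_score(model, X, y, scoring="neg_root_mean_squared_error", cv=5)
rmse = float(-scores.mean())
```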

 

2. Kernel Ridge Regression (KRR)

We attempted to employ a flexible set of nonlinear prediction functions, modulated by a penalty term to avoid overfitting. The kernel defines a high-dimensional inner-product space in which linear relationships can be identified within nonlinear neighborhoods.

Unfortunately, the new model failed to provide better performance and produced an RMSE of 0.1115.

A fit that is linear in kernel space is non-linear in the original space.
RMSE: 0.11156861825
Tuned parameters: 1. Alpha 2. Kernel: polynomial 3. Degree
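A sketch of tuning alpha and degree for a polynomial-kernel ridge model with scikit-learn. The grid values, `coef0=1.0`, and the synthetic data are assumptions for illustration only.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

# Synthetic data with a quadratic and an interaction term.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = X[:, 0] ** 2 + X[:, 0] * X[:, 1] + rng.normal(scale=0.05, size=200)

# Search over the penalty strength and the polynomial degree.
param_grid = {"alpha": [0.01, 0.1, 1.0], "degree": [2, 3]}
search = GridSearchCV(
    KernelRidge(kernel="polynomial", coef0=1.0),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X, y)
best_rmse = float(-search.best_score_)
```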

3. Stochastic Gradient Boosting

Gradient boosted decision trees (GBDT) iterate over sequential subtrees to minimize the loss function. The method sorts through the hierarchy of the feature space to learn linear, nonlinear, and interaction effects.

Stochastic gradient boosting: at each iteration a subsample of the training data is drawn at random (without replacement) from the full training dataset. The randomly selected subsample is then used, instead of the full sample, to fit the base learner. The benefit of the stochastic process over normal gradient boosting is a reduction in variance, as subtrees are de-correlated from one another.

Stochastic gradient boosting improved RMSE to 0.1087.


RMSE: 0.108662258056165
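In scikit-learn, the stochastic variant is obtained by setting `subsample` below 1.0 in `GradientBoostingRegressor`, so that each tree is fit on a random fraction of the rows drawn without replacement. The sketch below uses synthetic data and placeholder hyper-parameters, not the settings behind the RMSE above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic nonlinear data with an interaction term.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 5))
y = np.sin(3 * X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(scale=0.05, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# subsample=0.8: each boosting iteration sees a random 80% of the training rows.
gbr = GradientBoostingRegressor(
    n_estimators=300, learning_rate=0.05, max_depth=3, subsample=0.8, random_state=0
)
gbr.fit(X_tr, y_tr)
rmse = float(np.sqrt(mean_squared_error(y_te, gbr.predict(X_te))))
```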

4. XGBoost

XGBoost introduced a regularising γ-parameter to control the complexity of tree partitioning. The higher the γ, the larger the minimum loss reduction required to split a leaf node.
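The role of γ can be seen in the split-gain formula from XGBoost's objective, sketched below in plain Python. The function name `split_gain` is hypothetical; `lam` is XGBoost's L2 penalty on leaf weights, which the text above does not discuss, and the gradient sums in the example are made up.

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Loss reduction from splitting a leaf into left and right children.

    g_* and h_* are the sums of first- and second-order gradients of the loss
    over the samples in each child. The split is worthwhile only if the
    returned gain is positive, so gamma acts as a minimum loss reduction.
    """
    def score(g, h):
        return g * g / (h + lam)

    children = score(g_left, h_left) + score(g_right, h_right)
    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (children - parent) - gamma

# Same candidate split, evaluated with and without a complexity penalty.
gain_no_penalty = split_gain(-4.0, 5.0, 6.0, 5.0, gamma=0.0)
gain_with_penalty = split_gain(-4.0, 5.0, 6.0, 5.0, gamma=5.0)
```

With γ = 0 the candidate split has positive gain and would be accepted; with γ = 5 the gain turns negative and the split would be rejected.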

XGBoost gave an RMSE of 0.0487. This far superior result raised our suspicions, so we applied the model to the test set, which resulted in an RMSE of 0.1156: a sign of overfitting.

Given the breadth of the hyper-parameter space to search through, the high computational costs, and the time constraints on the project, we decided to leave further tuning of XGBoost to future work.

5. Model Ensembling

Model ensembling is a powerful technique for improving prediction by stacking multiple base models to achieve superior performance.

The general idea is to incorporate the predictions of the base models as additional features in the training set. The stacking model is then fit on this augmented set.

Stacking is an ensemble learning technique that uses predictions from multiple models (for example decision tree, KNN, or SVM) to build a new model, the meta-model. This model is used for making predictions on the test set.

This meta-model weighs the predictions in order to highlight the strengths of the respective base models and smooth out their weaknesses. A diverse selection of base models, sufficiently decorrelated as to capture linear, hierarchical, and nonlinear relationships in our data, is therefore desirable.
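A stacked ensemble of this shape can be sketched with scikit-learn's `StackingRegressor`, which fits the meta-model on out-of-fold predictions of the base models. This is not necessarily how the team implemented it; the data are synthetic and all hyper-parameters are placeholders. `passthrough=True` keeps the original features alongside the base predictions, matching the idea of adding predictions as extra features.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data with a linear and a quadratic component.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 5))
y = 2 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.05, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Base models chosen to be diverse: linear, kernel-based, and tree-based.
base_models = [
    ("enet", ElasticNet(alpha=0.001, l1_ratio=0.9)),
    ("krr", KernelRidge(kernel="polynomial", degree=2, alpha=0.1)),
    ("gbr", GradientBoostingRegressor(subsample=0.8, random_state=0)),
]

# Lasso meta-model fit on out-of-fold base predictions plus the original features.
stack = StackingRegressor(
    estimators=base_models,
    final_estimator=Lasso(alpha=0.0005, max_iter=10000),
    cv=5,
    passthrough=True,
)
stack.fit(X_tr, y_tr)
test_rmse = float(np.sqrt(mean_squared_error(y_te, stack.predict(X_te))))
```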

 

Base models:
1) Elastic Net (0.1110)
2) Stochastic Gradient Boosting (0.1087)
3) Kernel Ridge Regression (0.1115)

Meta model: Lasso
Train RMSE: 0.0689
Test RMSE: 0.1086

Results & Feature Importance

We chose Lasso regression as the meta-model and elastic-net, KRR, and stochastic GBDT as the base models. Altogether we achieved an improved train RMSE of 0.0689 and a test RMSE of 0.1086.

The top ten features from our model:


 

Lessons & Improvement

The training and testing datasets are relatively clean and neat, with 79 features. However, some categorical features have different sets of values in the two datasets. We missed these minor structural differences at the beginning, and they caused considerable trouble when we tested the first model. We corrected the problem by stacking the two datasets and splitting them back into training and testing sets after data cleaning and feature engineering. Lesson learned: it is always more efficient to conduct a thorough data investigation before taking any action.
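The stack-then-split fix can be sketched as follows, assuming pandas. The two tiny frames are illustrative: each contains a category value the other lacks, which is what makes separately encoded columns disagree.

```python
import pandas as pd

# Each frame has a category the other does not ('Stone' vs 'AsphShn').
train = pd.DataFrame({"Exterior1st": ["VinylSd", "Stone"], "SalePrice": [200000, 150000]})
test = pd.DataFrame({"Exterior1st": ["VinylSd", "AsphShn"]})

n_train = len(train)
y_train = train["SalePrice"]

# Stack the feature rows so cleaning and encoding see every category at once.
combined = pd.concat([train.drop(columns="SalePrice"), test], ignore_index=True)
combined = pd.get_dummies(combined)

# Split back by position; both parts now share identical columns.
X_train = combined.iloc[:n_train]
X_test = combined.iloc[n_train:]
```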

Further improvement:

  • Feature engineering
  • Hyper-parameter tuning
  • Bayesian optimization

 

 

