Predicting House Price in Ames, Iowa Using Machine Learning

Posted on Sep 8, 2019

I. Background 

Origin of dataset, categorical vs. numerical variables, objective

The Ames Housing Dataset was introduced by Professor Dean De Cock in 2011 as an alternative to the Boston Housing Dataset. It contains 2,919 observations of housing sales in Ames, Iowa between 2006 and 2010.

There are 20 continuous (numerical), 14 discrete (numerical), 23 nominal (categorical), and 23 ordinal (categorical) features describing each house’s size, quality, area, age, and other miscellaneous attributes.

For this project, our objective was to apply machine learning techniques to predict the sale price of each house from its features. The project is part of an ongoing data science competition on Kaggle, in which users are challenged to minimize the Root Mean Squared Log Error (RMSLE) on a test set whose target values are withheld from the publicly available data set.

II. Data Exploration

Missing values, skewness, multicollinearity, feature importance

We took 6 distinct steps to explore, clean, transform, and engineer the data in order to get the dataset ready for modeling.

Item 1: The sale price (our response variable) has outliers, and its distribution is highly skewed.

Solution: (1) We removed the outliers, and (2) we took the log transformation of sale price and used log(SalePrice) as our new response variable.
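A minimal sketch of this step on a toy frame, assuming the common Ames convention of dropping large houses that sold unusually cheaply via 'GrLivArea' (the exact outlier rule and thresholds are our assumption, not stated in the post):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Ames training data.
df = pd.DataFrame({
    "GrLivArea": [1500, 1800, 2100, 4800, 5200],
    "SalePrice": [150000, 180000, 210000, 170000, 160000],
})

# (1) Drop the outliers: very large houses that sold unusually cheaply.
df = df[~((df["GrLivArea"] > 4000) & (df["SalePrice"] < 300000))]

# (2) Replace the skewed target with its log. log1p is numerically safer
# than log near zero and is inverted with expm1 at prediction time.
df["SalePrice"] = np.log1p(df["SalePrice"])
```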

Item 2: Understand correlation and multicollinearity among predictive variables.

Solution: (1) First, we calculated the correlation between sale price and each variable and listed the results in descending order:

(2) Next, we created a correlation heatmap to understand the correlations among all variables (only selected variables are shown here due to file size):
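Both views come from the same correlation matrix. A sketch on synthetic data (the column names mimic Ames features; with seaborn installed, `sns.heatmap(df.corr())` draws the heatmap itself):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy numeric frame; in the project this was the full Ames training set.
df = pd.DataFrame(rng.normal(size=(100, 3)),
                  columns=["OverallQual", "GrLivArea", "YearBuilt"])
df["SalePrice"] = 2 * df["OverallQual"] + df["GrLivArea"] + rng.normal(size=100)

# Correlation of every numeric feature with SalePrice, in descending order.
corr_with_target = df.corr()["SalePrice"].drop("SalePrice").sort_values(ascending=False)
print(corr_with_target)

# Pairwise correlation matrix, the input to the heatmap.
corr_matrix = df.corr()
```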

Item 3: Dealing with (lots of) missing values.

Large quantities of missing data are accounted for in the documentation associated with the data set. While a standard missing value indicates that the feature is not present in the house, there can be other reasons why a value is missing. It is important to identify the reason for the missingness and fill in missing values to the best of our judgment.

Solution: We categorized our handling of missing values into the following buckets:

(a) NaN = None (categorical variables)

Based on the data description, we found that NaN actually means the feature is absent. We therefore filled these NaNs with None.

Example variable(s):
'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', etc.

(b) NaN = 0 (numerical variables)

Based on the data description, we found that NaN actually means a quantity of zero. We therefore filled these NaNs with 0.

Example variable(s):
'GarageYrBlt', 'GarageArea', 'GarageCars', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', etc.
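Buckets (a) and (b) can be sketched together on a toy slice of the data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "GarageType": ["Attchd", np.nan, "Detchd"],
    "BsmtQual":   [np.nan, "Gd", "TA"],
    "GarageArea": [548.0, np.nan, 460.0],
    "BsmtFinSF1": [706.0, 0.0, np.nan],
})

# Bucket (a): NaN means the house has no such feature -> fill with the string "None".
none_cols = ["GarageType", "BsmtQual"]
df[none_cols] = df[none_cols].fillna("None")

# Bucket (b): NaN means a quantity of zero -> fill with 0.
zero_cols = ["GarageArea", "BsmtFinSF1"]
df[zero_cols] = df[zero_cols].fillna(0)
```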

(c) NaN = Assume it is typical

Based on the data description, NaN actually means the observation belongs to the category “typical”.

Example variable(s):
'Functional'
(d) Missingness does not imply none/0, and the data description does not reveal the exact meaning of the missingness, but we can impute from the existing data.

'LotFrontage' depends on the zoning district in which a property is located (i.e., 'Neighborhood'). Therefore, we grouped the data by 'Neighborhood' and filled each missing value with the median LotFrontage of that neighborhood.

Example variable(s):
'LotFrontage'
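The neighborhood-median imputation described above can be sketched as:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "OldTown", "OldTown"],
    "LotFrontage":  [60.0, 80.0, np.nan, 50.0, np.nan],
})

# Impute each missing LotFrontage with the median of its own neighborhood.
df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median())
)
```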
(e) Values still missing:

After filling in and imputing as above, we checked how many NaNs remained. Fortunately, each variable that still had missing values was missing fewer than 5% of its entries. Given the small percentage, we filled in the remaining missing values with each variable's most common value.

Example variable(s):

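A sketch of this final fallback; the two column names here are illustrative guesses, not necessarily the variables the project actually had left over:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Electrical": ["SBrkr", "SBrkr", np.nan, "FuseA"],
    "MSZoning":   ["RL", np.nan, "RL", "RM"],
})

# For the few remaining NaNs (<5% per column), fall back to the mode.
for col in df.columns:
    df[col] = df[col].fillna(df[col].mode()[0])
```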
No missing values remain! Here is a recap of our solutions to missingness:

Item 4: Some numerical variables are highly skewed. We checked for numerical features with absolute skewness greater than 0.5 and found 25 moderately to highly skewed numerical variables:

25 moderately to highly skewed numerical variables

Solution: By convention, an absolute skewness above 1 indicates high skew, and a value between 0.5 and 1 indicates moderate skew. We took a conservative approach and applied the Box-Cox transformation to all numerical variables that were moderately to highly skewed. This effectively reduced the number of skewed numerical variables from 25 to 16:


After the Box-Cox transformation, 16 numerical variables remain moderately to highly skewed
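A sketch of the flag-and-transform step on one toy right-skewed feature. The inline `boxcox1p` mirrors `scipy.special.boxcox1p`, and the fixed λ = 0.15 is a common choice in Ames kernels, not a value stated in the post:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# One toy right-skewed feature; the project checked all numeric columns.
df = pd.DataFrame({"LotArea": rng.lognormal(mean=9.0, sigma=0.5, size=500)})

def boxcox1p(x, lam):
    """Box-Cox transform of 1 + x (scipy.special.boxcox1p does the same)."""
    return ((1.0 + x) ** lam - 1.0) / lam  # lam -> 0 reduces to log1p

# Flag features with |skewness| > 0.5, then transform them.
skewness = df.skew().abs()
skewed_cols = skewness[skewness > 0.5].index
for col in skewed_cols:
    df[col] = boxcox1p(df[col], 0.15)
```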

Item 5: Understand feature importance.

Solution: Different models rank feature importance based on how well they capture the relationship between the predictors and the response variable. Below are the feature importances computed from a random forest model on the engineered data set:
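A minimal sketch of computing random forest importances, on synthetic data where one feature dominates by construction:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in for the engineered Ames features.
X = pd.DataFrame(rng.normal(size=(200, 3)),
                 columns=["OverallQual", "GrLivArea", "YearBuilt"])
y = 3 * X["OverallQual"] + X["GrLivArea"] + rng.normal(scale=0.1, size=200)

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X, y)

# Impurity-based importances, highest first; they sum to 1.
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)
```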

Item 6: Data Engineering.

Solution: We observed high correlations among variables related to areas/SF (square footage) and decided to create 3 new area/SF-related variables:
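As a sketch, here are three totals commonly engineered from the Ames square-footage and bath columns; the names and formulas are our guesses, not necessarily the exact variables the project created:

```python
import pandas as pd

# Toy slice of the relevant raw columns.
df = pd.DataFrame({
    "TotalBsmtSF": [800, 0], "1stFlrSF": [900, 1100], "2ndFlrSF": [700, 0],
    "OpenPorchSF": [40, 0], "EnclosedPorch": [0, 120],
    "FullBath": [2, 1], "HalfBath": [1, 0],
})

# Aggregate the correlated SF columns into single totals.
df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]
df["TotalPorchSF"] = df["OpenPorchSF"] + df["EnclosedPorch"]
df["TotalBath"] = df["FullBath"] + 0.5 * df["HalfBath"]
```

Dropping them again is a one-liner, e.g. `df.drop(columns=["TotalSF", "TotalPorchSF", "TotalBath"])`.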

We ended up dropping these 3 newly engineered variables from our final submitted model, as they neither improved our score nor reduced our RMSE.

This concludes our data pre-processing; the data is now ready for modeling.

III. Predictive Models

Ridge, Lasso, ElasticNet, Random Forest, Gradient Boosting, LightGBM, XGBoost

(1) Machine learning algorithms candidates:

We selected both linear and non-linear machine learning algorithms as our model candidates:

Each model comes with its inherent advantages and disadvantages:

(2) Individual results & parameter tuning (auto vs. fine tuned parameters)

Here, we compared each model's performance (using mean cross-validation score, model score, and RMSE as criteria) under different tuning methods. We found that, on average, for the same model, manually fine-tuned parameters performed better than default ("auto") parameters. Notably, manual fine-tuning takes significantly more computing power and time.
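A sketch of the comparison for one model, assuming tuning was done with scikit-learn's `GridSearchCV` (the grid values here are illustrative). Because the default alpha is included in the grid, the tuned score can never be worse than the default on the same folds:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Default ("auto") parameters: mean 5-fold cross-validation score.
default_score = cross_val_score(Lasso(), X, y, cv=5).mean()

# Manual fine-tuning: grid search over alpha on the same folds.
grid = GridSearchCV(Lasso(), {"alpha": [0.001, 0.01, 0.1, 1.0]}, cv=5)
grid.fit(X, y)
tuned_score = grid.best_score_
```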

An additional observation is that Lasso alone performed extremely close to ElasticNet, making us wonder whether it made sense to drop ElasticNet, since it seemed to favor Lasso over Ridge rather than strike a balance between the two.

(3) Model stacking (different stacking methods)

There are different ways to stack models. We experimented with the following 3 methods:

(i) Stacking using StackingCVRegressor, an ensemble-learning meta-regressor for stacking regression;

(ii) Simply averaging across all models; and

(iii) Weighted average: assigning more weight to those models with lower RMSE
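Methods (ii) and (iii) can be sketched directly; the inverse-RMSE weighting scheme and the numbers below are our illustration of "more weight to lower RMSE", not the post's exact weights:

```python
import numpy as np

# Toy predictions from three base models (rows = houses, on the log-price scale).
preds = {
    "lasso": np.array([12.1, 11.8, 12.5]),
    "ridge": np.array([12.0, 11.9, 12.4]),
    "xgb":   np.array([12.2, 11.7, 12.6]),
}
# Illustrative validation RMSEs for each model.
rmse = {"lasso": 0.110, "ridge": 0.115, "xgb": 0.120}

# (ii) Simple average across all models.
simple_avg = np.mean(list(preds.values()), axis=0)

# (iii) Weighted average: weight by inverse RMSE, normalized to sum to 1.
inv = {m: 1.0 / r for m, r in rmse.items()}
total = sum(inv.values())
weights = {m: v / total for m, v in inv.items()}
weighted_avg = sum(weights[m] * preds[m] for m in preds)
```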

We ran each of the 3 stacking methods above with and without the ElasticNet model, and measured stacking performance by RMSE on both the training and test data.

Using StackingCVRegressor without the ElasticNet model gave us the best test score and became our final stacking method.

IV. Results and Conclusion

Stacking models, Kaggle results, Thoughts

Our final submission achieved an RMSE of 0.11582 on Kaggle's own test dataset, which placed us in the top 15% of all participants as of July 30, 2019.
