Predicting House Prices Using Machine Learning Models
House prices are affected by many features, such as home functionality, floor area, kitchen condition, and garage quality. Purchasing a house is a lifetime investment that requires enough research to make the right decision at the right time. From the customer's point of view, this project aims to provide a tool for deciding which houses are undervalued or overvalued on the basis of these features, so that a buyer can save some money on the purchase. From the company's point of view, it asks which conditions could be improved to sell a house at a better price and to satisfy customers making a lifetime investment. In addition, real-world problems, including housing problems, are posted as machine learning competitions on sites such as Kaggle and Zigbang, and these techniques are in high demand in the market. Another goal of our group was to master these machine learning techniques. The main goal of this project was to predict house prices in Iowa using various machine learning techniques.
Data: For this project we used the Ames Housing dataset introduced by Professor Dean De Cock in 2011. Altogether, there are 2919 observations (training and test sets combined) of housing sales in Ames, Iowa between 2006 and 2010, with 79 features provided.
Data pipeline: Our group did not treat this only as a student project; we tried to build a process that is as automated as possible for predicting sale prices, so that it can be reused in the future. To build a simple and efficient co-op environment, each of us researched the data independently, and when an idea matured, it was deployed to a specific directory and controlled (as shown in the figure below). We built two pipelines: a data-transforming pipeline made of transformers such as a NaN remover, a scaler, and an outlier remover, and a model-building pipeline for predicting house prices. The general schematic of our pipelines for fast iteration is shown below:
Simple but efficient co-op environment
Pipelines for fast iteration
EDA visualization: To better understand the data, we started with exploratory data analysis. Here are some examples of our data visualizations. For more detailed EDA visualization, please check our GitHub link below. We first checked a correlation plot to identify features that are highly correlated with one another.
Correlation map between various household features
Here is an example of two box plots: SalesPrice vs OverallQuality, where the relationship is linear, and SalesPrice vs GarageCars, where it is not linear because of outliers.
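As a rough illustration of the correlation check mentioned in the EDA step, the sketch below computes Pearson correlations between numeric columns and lists the pairs above a cutoff. The toy values, the 0.8 cutoff, and the helper names `pearson` and `highly_correlated_pairs` are assumptions made for illustration; they are not taken from the project code.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def highly_correlated_pairs(columns, cutoff=0.8):
    """Return (name_a, name_b, r) for every column pair with |r| >= cutoff."""
    pairs = []
    for a, b in combinations(sorted(columns), 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= cutoff:
            pairs.append((a, b, r))
    return pairs


# Toy data only: the column names follow the Kaggle data set, the values are made up.
toy = {
    "GarageCars": [1, 2, 2, 3, 0, 2],
    "GarageArea": [240, 480, 500, 720, 0, 460],
    "YrSold": [2008, 2006, 2010, 2007, 2009, 2006],
}
pairs = highly_correlated_pairs(toy)
```

In this toy example only the two garage columns exceed the cutoff, which is the kind of redundancy a correlation map makes visible at a glance.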
Box plot of SalesPrice vs OverallQuality (right) and SalesPrice vs GarageCars
Automatic data transformer for outliers: To remove outliers, we first measured the Z score, devised an effective removal strategy, and finally made a pipeline element to remove outliers, as shown below.
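For readers without access to the original figure, a minimal sketch of a Z-score-based outlier remover written as a pipeline element might look like the following. It assumes a cutoff of 3 standard deviations and a `fit`/`transform` interface modeled loosely on scikit-learn transformers; the class name `ZScoreOutlierRemover`, the cutoff, and the toy rows are hypothetical, and the project's actual strategy may differ.

```python
from statistics import mean, pstdev


class ZScoreOutlierRemover:
    """Drop rows whose value in `column` is more than `cutoff` std devs from the mean."""

    def __init__(self, column, cutoff=3.0):
        self.column = column
        self.cutoff = cutoff

    def fit(self, rows):
        # Learn the mean and population standard deviation from the training rows.
        values = [row[self.column] for row in rows]
        self.mean_ = mean(values)
        self.std_ = pstdev(values)
        return self

    def transform(self, rows):
        # A constant column has no outliers by this definition.
        if self.std_ == 0:
            return list(rows)
        return [
            row for row in rows
            if abs(row[self.column] - self.mean_) / self.std_ <= self.cutoff
        ]


# Toy rows: 19 ordinary living areas plus one extreme value (all made up).
rows = [{"GrLivArea": 1400 + 10 * i} for i in range(19)] + [{"GrLivArea": 5600}]
kept = ZScoreOutlierRemover("GrLivArea").fit(rows).transform(rows)
```

Keeping the statistics in `fit` and the filtering in `transform` lets the same element be reused inside a larger chain of transformers.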
Here is an example of how we handled NaN values in the garage-related data. We tried to remove NaN values using the automatic outlier remover; however, it did not improve our accuracy, so 31 of the 35 NaN-containing columns were handled manually.
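One possible way to handle garage-related NaN values by hand is sketched below. It assumes that a missing garage value means the house has no garage, so categorical columns get the label "None" and numeric columns get 0. This rule, the selected columns, and the helper names `is_missing` and `fill_garage_nan` are assumptions for illustration, not the project's confirmed logic.

```python
import math

# Assumed split of a few garage columns from the Kaggle data set.
GARAGE_CATEGORICAL = ["GarageType", "GarageQual"]
GARAGE_NUMERIC = ["GarageCars", "GarageArea"]


def is_missing(value):
    """True for None and for float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fill_garage_nan(row):
    """Return a copy of `row` with missing garage values replaced."""
    filled = dict(row)
    for name in GARAGE_CATEGORICAL:
        if is_missing(filled.get(name)):
            filled[name] = "None"  # assumed meaning: no garage
    for name in GARAGE_NUMERIC:
        if is_missing(filled.get(name)):
            filled[name] = 0
    return filled


house = {
    "GarageType": float("nan"),
    "GarageQual": None,
    "GarageCars": float("nan"),
    "GarageArea": 0.0,
}
clean = fill_garage_nan(house)
```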
To build our house model, we started with Lasso regression, which was fast and easy to apply and gave us good results. We then tested other supervised learning methods, such as Ridge, Elastic Net, SVR, random forest, and boosting, and compared the results. Finally, we ensembled various models into a meta-model by stacking and averaging. The figure below illustrates our final house model.
Stacking: Stacking is a method that combines the predictions of several different models. With this method, various ML algorithms can be combined to produce better predictions. It is a powerful approach because it can incorporate models of many different types, so the weaknesses of one model can be compensated by the strengths of another. The models are combined using a meta-regressor, an algorithm that learns how to combine the predictions of the base models.
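The idea can be sketched in a few lines of dependency-free Python. This is a simplified illustration, not the project's model: the two base models (a least-squares line and a nearest-neighbor mean), the toy data, and the meta-regressor, which is restricted here to a single blend weight fitted on out-of-fold predictions, are all assumptions. Libraries such as scikit-learn offer more general tools for this, for example its `StackingRegressor` class.

```python
from statistics import mean


def fit_line(xs, ys):
    """Base model 1: ordinary least-squares line y = a + b * x."""
    mx, my = mean(xs), mean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return lambda x: a + b * x


def fit_nearest(xs, ys, k=2):
    """Base model 2: mean target of the k nearest training points."""
    def predict(x):
        nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x))[:k]
        return mean(y for _, y in nearest)
    return predict


def out_of_fold(fit, xs, ys, n_folds=5):
    """Predict each point with a model trained on the other folds only."""
    preds = [0.0] * len(xs)
    for fold in range(n_folds):
        train = [i for i in range(len(xs)) if i % n_folds != fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in range(fold, len(xs), n_folds):
            preds[i] = model(xs[i])
    return preds


def fit_blend_weight(p1, p2, ys):
    """Meta-regressor: least-squares w for the blend w * p1 + (1 - w) * p2."""
    num = sum((a - b) * (y - b) for a, b, y in zip(p1, p2, ys))
    den = sum((a - b) ** 2 for a, b in zip(p1, p2))
    return num / den if den else 0.5


def mse(preds, ys):
    return mean((p - y) ** 2 for p, y in zip(preds, ys))


# Toy data: living area (hundreds of sq ft) vs price (thousands); values are made up.
xs = [8, 10, 11, 12, 14, 15, 16, 18, 20, 22, 24, 25]
ys = [95, 118, 125, 140, 150, 171, 175, 190, 230, 236, 262, 300]

oof_line = out_of_fold(fit_line, xs, ys)
oof_near = out_of_fold(fit_nearest, xs, ys)
w = fit_blend_weight(oof_line, oof_near, ys)
oof_stack = [w * a + (1 - w) * b for a, b in zip(oof_line, oof_near)]

# Refit the base models on all data and combine them with the learned weight.
line, near = fit_line(xs, ys), fit_nearest(xs, ys)


def predict(x):
    return w * line(x) + (1 - w) * near(x)
```

Fitting the meta-regressor on out-of-fold predictions, rather than on predictions for data the base models have already seen, keeps it from simply rewarding whichever base model over-fits the most.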
Averaged Model: We also tried another way of combining models, assigning a weight to each model and averaging their predictions. However, the averaged model did not perform better than the stacking model, so we used the stacking approach for our house model.
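Weighted averaging is simple enough to show directly. The sketch below assumes each model has already produced a list of predictions; the function name `weighted_average`, the weights, and the numbers are made up for illustration.

```python
def weighted_average(predictions, weights):
    """Combine per-model prediction lists using one weight per model."""
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, column)) / total
        for column in zip(*predictions)
    ]


# Made-up predictions from two hypothetical models for two houses.
model_a_preds = [100.0, 200.0]
model_b_preds = [300.0, 400.0]
blended = weighted_average([model_a_preds, model_b_preds], [0.25, 0.75])
```

Unlike stacking, the weights here are chosen by hand rather than learned from the data.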
Hyperparameter Tuning: Tuning hyperparameters is an important step in making a model work well, because a poor choice of hyperparameters leads to over-fitting or under-fitting. There are three common ways to tune hyperparameters: random search, grid search, and Bayesian optimization. For our house model, we used grid search because many references were available, it was easier to use, and the data set was small. We tuned the hyperparameters using grid search with 5-fold cross-validation.
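To make the procedure concrete, here is a small dependency-free sketch of grid search with 5-fold cross-validation. It tunes the penalty `alpha` of a one-feature ridge regression without intercept; the model, the grid values, the toy data, and the helper names are assumptions chosen to keep the example short, not the project's actual setup.

```python
from math import sqrt
from statistics import mean


def fit_ridge(xs, ys, alpha):
    """One-feature ridge without intercept: w = sum(x*y) / (sum(x*x) + alpha)."""
    w = sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)
    return lambda x: w * x


def rmse(model, xs, ys):
    return sqrt(mean((model(x) - y) ** 2 for x, y in zip(xs, ys)))


def cv_rmse(xs, ys, alpha, n_folds=5):
    """Mean validation RMSE over n_folds folds for one value of alpha."""
    scores = []
    for fold in range(n_folds):
        train = [i for i in range(len(xs)) if i % n_folds != fold]
        valid = [i for i in range(len(xs)) if i % n_folds == fold]
        model = fit_ridge([xs[i] for i in train], [ys[i] for i in train], alpha)
        scores.append(rmse(model, [xs[i] for i in valid], [ys[i] for i in valid]))
    return mean(scores)


def grid_search(xs, ys, alphas):
    """Score every alpha in the grid and return (best_alpha, scores_by_alpha)."""
    scores = {alpha: cv_rmse(xs, ys, alpha) for alpha in alphas}
    return min(scores, key=scores.get), scores


# Toy data that roughly follows y = 2 * x (values are made up).
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]
best_alpha, scores = grid_search(xs, ys, alphas=[0.01, 0.1, 1.0, 10.0, 100.0])
```

Grid search evaluates every candidate exhaustively, which is affordable when the data set and the grid are both small.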
Results: The final results of various models that we used for predicting sales price are summarized in the table below:
|Model||Train RMSE||Kaggle Score|
For our final house model, however, we did not use XGBoost or LightGBM, as they did not improve accuracy. The results for our final house model are presented in the table below.
|Model||Train RMSE||Kaggle Score|
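For context on the score columns: the Kaggle House Prices competition scores submissions by the root mean squared error between the logarithm of the predicted price and the logarithm of the observed price. A minimal sketch of that metric follows. The function name `rmse_log` and the sample prices are made up for illustration, and whether the Train RMSE column was computed on the same log scale is not stated here.

```python
import math


def rmse_log(predicted, observed):
    """RMSE between log(predicted) and log(observed) sale prices."""
    squared = [(math.log(p) - math.log(o)) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up prices: every prediction is 10% too high.
score = rmse_log([110_000, 220_000], [100_000, 200_000])
```

Because the error is taken on the log scale, a 10% miss costs the same on a cheap house as on an expensive one.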
Kaggle result: We placed in the top 3% (as of 11/18/2018) of the Kaggle competition and achieved the highest Kaggle score in the September 2018 cohort. Here is our Kaggle score distribution graph with the number of submissions.
Kaggle score distribution
Conclusions: For our house model, we used stacking, which performed better than the averaging model. As a result, it was not possible to analyze the effect of individual features on the house sale price. From this model, we conclude the following points:
- All of the choices, such as skewness handling, Z score, variable selection, and model hyperparameter tuning, are very important for predicting the house sale price
- Feature engineering, such as filling NA values (with 0, mode, or median), binning, and using domain knowledge, was most important for predicting the house sale price
- GBM/XGBoost, though powerful for producing good benchmark solutions, is not always the best choice for model fitting
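As an illustration of the skewness point above, the sketch below measures the skewness of a right-skewed price sample before and after a log transform. The use of `log1p` is an assumption (it is a common choice, but the transform used in the project is not stated here), and the function name `skewness` and the prices are made up.

```python
import math
from statistics import mean


def skewness(values):
    """Population skewness: third central moment over the 1.5 power of the second."""
    m = mean(values)
    m2 = mean((v - m) ** 2 for v in values)
    m3 = mean((v - m) ** 3 for v in values)
    return m3 / m2 ** 1.5


# Made-up right-skewed sale prices.
prices = [100_000, 120_000, 130_000, 150_000, 160_000,
          180_000, 200_000, 250_000, 400_000, 900_000]
log_prices = [math.log1p(p) for p in prices]

raw_skew = skewness(prices)
log_skew = skewness(log_prices)
```

On this sample the log transform pulls in the long right tail, so `log_skew` is smaller than `raw_skew` while both remain positive.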
Future directions: To make this model work better, we would consider the following improvements in the future:
- Tune the parameters of the XGBoost and LightGBM models to obtain better accuracy
- Apply clustering analysis to create new features
- Investigate more feature engineering
- It would be useful to get time-series event data, study the effects of the 2008 recession on house sale prices, and predict the effect of a future recession
Our team: Basant Dhital, Jiwon Chan, and SangYon Choi completed this project. All code for this project is available at the following GitHub link: https://github.com/chazzy1/nycdsaML.