Using Data Science to Predict Ames House Prices
Thanks to advances in technology, people can explore almost anywhere online. In real estate, many people still prefer to see a property in person, but they often start with a virtual visit. Pictures alone don't provide everything buyers need to know, such as the actual measurements of the rooms. Buyers also weigh other factors carefully, such as the year the house was built, the size of the lot, the number of bathrooms, and the quality of the basement.
The purpose of this project was to study house features and find out which of them affect the price of a house. The house price dataset from Kaggle contains 2930 observations of houses in Ames, Iowa, with 79 explanatory variables and one target variable, the sale price.
Looking at the dataset in more detail, the variables break down into 23 nominal, 23 ordinal, 14 discrete, and 20 continuous variables. The categorical variables have between 2 and 28 classes: STREET, for example, has only gravel and paved, while NEIGHBORHOOD covers all areas in Ames. The discrete variables are counts, such as the number of rooms in the house. The continuous variables hold the square footage of each area of a house, which can be tricky because some of them express the same measurement in different terms.
Before fitting a model, I had to investigate the target variable, sale price, and then do feature engineering. I also had to remove outlier observations and duplicated variables, and create dummy variables from the categorical variables.
First, I made a histogram of the sale price. It was right-skewed, so I applied a logarithm transformation, as shown in Figure 1. I then calculated the correlation between the sale price and the dummified features from each categorical variable, and kept the features with a moderate correlation. Next, I examined the correlation of each variable with the rest to detect multicollinearity and remove it where present. Figure 2 shows the correlations among all variables after dummification.
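The log transformation of a right-skewed target can be sketched as follows. This is only an illustration on synthetic data, not the Ames dataset; the column name `SalePrice` and the use of `np.log1p` (rather than a plain `np.log`) are assumptions, since the post does not say which variant was applied.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for sale prices (NOT the Ames data).
rng = np.random.default_rng(0)
sale_price = pd.Series(
    rng.lognormal(mean=12.0, sigma=0.4, size=1000), name="SalePrice"
)

# Log-transform the target. log1p computes log(1 + x), which is safe at zero;
# using it here is an assumption, plain np.log would work for positive prices.
log_price = np.log1p(sale_price)

# Skewness near 0 indicates a roughly symmetric distribution.
skew_before = sale_price.skew()
skew_after = log_price.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Predictions made on the log scale can be mapped back to prices with `np.expm1`, the inverse of `np.log1p`.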
After handling the categorical variables, I checked each discrete variable to decide whether it behaves more like a categorical variable, which requires dummification, or more like a continuous one.
For example, overall house quality was treated as continuous, while the number of full bathrooms was treated as categorical. Moreover, some of the continuous variables included "none" among their values, which could make the model inaccurate without appropriate processing. For example, Figure 3 shows the year the garage was built; a house with no garage would register as if its garage had been built in year 0 (zero). I therefore converted variables of this kind to categorical ones as well, with classes such as none, older, and newer.
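A minimal sketch of that none/older/newer grouping is shown below. The column name `GarageYrBlt`, the sample values, and the cutoff year of 1980 are all assumptions made for illustration; the post does not state where the line between older and newer was drawn.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; NaN marks a house without a garage.
df = pd.DataFrame({"GarageYrBlt": [1975.0, 2005.0, np.nan, 1950.0, 2010.0]})

CUTOFF_YEAR = 1980  # assumed threshold, not taken from the original project


def garage_age_group(year):
    """Map a garage build year to 'none', 'older', or 'newer'."""
    if pd.isna(year):
        return "none"
    return "newer" if year >= CUTOFF_YEAR else "older"


# Replace the numeric year with a three-class categorical feature.
df["GarageAge"] = df["GarageYrBlt"].apply(garage_age_group)

# Dummify the new categorical feature, one indicator column per class.
garage_dummies = pd.get_dummies(df["GarageAge"], prefix="Garage")
print(df["GarageAge"].tolist())
```

Grouping this way keeps "no garage" as its own class instead of letting a placeholder year distort a numeric feature.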
In the end, I chose 46 features whose correlation with the sale price was higher than 0.2.
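Selecting features by their correlation with the target might look like the sketch below. The data is synthetic and the column names (`OverallQual`, `Noise`, `Street`, `LogSalePrice`) are placeholders; taking the absolute value of the correlation is also an assumption, since the post only says the score was higher than 0.2.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the Ames data).
rng = np.random.default_rng(42)
n = 500
quality = rng.integers(1, 11, size=n).astype(float)
df = pd.DataFrame({
    "OverallQual": quality,                         # drives the target
    "Noise": rng.normal(size=n),                    # unrelated to the target
    "Street": rng.choice(["Pave", "Grvl"], size=n), # unrelated categorical
    "LogSalePrice": 11.0 + 0.1 * quality + rng.normal(scale=0.1, size=n),
})

# Dummify the categorical column so every feature is numeric.
features = pd.get_dummies(
    df.drop(columns="LogSalePrice"), columns=["Street"], dtype=float
)

# Absolute Pearson correlation of each feature with the target.
corr = features.corrwith(df["LogSalePrice"]).abs()

THRESHOLD = 0.2  # threshold quoted in the post
selected = corr[corr > THRESHOLD].index.tolist()
print(selected)
```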
In this project, I compared LinearRegression, Lasso, Ridge, ElasticNet, SGDRegressor, RandomForestRegressor, SVR, KernelRidge, and XGBoost to find the best model. I used grid search to obtain the best estimator for each model; the resulting R2 score and RMSE (Root Mean Square Error) are shown in the table below.
Model | RMSE | R2 |
LinearRegression | 0.158 | 0.835 |
Lasso | 0.158 | 0.834 |
Ridge | 0.157 | 0.837 |
ElasticNet | 0.157 | 0.836 |
SVR | 0.198 | 0.742 |
KernelRidge | 0.158 | 0.834 |
SGDRegressor | 2.3e12 | -4.968 |
RandomForestRegressor | 0.155 | 0.845 |
XGBoost | 0.150 | 0.856 |
The table shows that SGDRegressor did not work well: its R2 score is negative and its RMSE is unreasonably high. The other models have a decent R2 and a low RMSE. Since XGBoost had the highest R2 and the lowest RMSE, I chose it for house price prediction, with the parameters shown in Figure 4.
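The grid search and scoring step described above can be sketched for a single model as follows. This uses scikit-learn's `GridSearchCV` with Ridge on synthetic data; the parameter grid, the 80/20 split, and the 5-fold cross-validation are assumptions for illustration and are not the settings used in the project.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data (NOT the Ames data).
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Assumed grid of regularization strengths to search over.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validated grid search, choosing the alpha with the lowest RMSE.
search = GridSearchCV(
    Ridge(), param_grid, scoring="neg_root_mean_squared_error", cv=5
)
search.fit(X_train, y_train)

# Score the best estimator on the held-out test set.
y_pred = search.best_estimator_.predict(X_test)
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
r2 = r2_score(y_test, y_pred)
print(search.best_params_, round(rmse, 3), round(r2, 3))
```

Repeating the same pattern with each estimator and its own parameter grid would yield one RMSE and one R2 per model, which is the shape of the comparison table.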
To sum up, if you are a seller in Ames, you may want to focus on the quality of features such as the kitchen and basement to increase the price. Other features are important as well, but it is difficult to change things like the total square footage of living area or the number of full bathrooms unless you undertake major renovations.
One possible way to improve the predictions would be to try different sets of features.
The skills I demonstrated here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.