Predicting Housing Tax with Machine Learning Models
Californians who buy a house often experience sticker shock when they get their property tax bill. The reason for the dramatic spike in real estate tax is California's Proposition 13 of 1978, which equates the tax assessed value with the purchase price. In a rising market, that results in a steep increase of the real estate tax value at the moment of sale, especially when the house has not changed ownership for an extended period of time. While equating the tax value with the market value may seem fair, the lack of yearly assessments creates disadvantages for multiple parties.
- New property owners have to pay a higher effective tax rate as the real estate tax value is assessed on the latest market value, giving homeowners that have owned their house for a longer period of time a tax advantage.
- In declining markets, the real market value of the real estate might be lower than the real estate tax value, generating an artificially higher tax load as the tax value is not adjusted.
- In an increasing real estate market, the real estate tax value is not adjusted yearly, creating a loss in tax revenue for the governing body.
As indicated by these disadvantages, the problem is a double-edged sword that hurts both the real estate owner and the governing body. A possible solution could be to re-assess all real estate every year in order to ensure that the tax value equals the market value; however, such a process would be very cumbersome and time-consuming, given that there are 3.5 million houses in Los Angeles alone. Consequently, to assist in determining the market value of real estate with the aim of setting the tax value equal to the market value, the following research question is posed: Can machine learning algorithms help in the prediction of real estate tax value and thereby reduce the imbalance in tax load and governmental loss of tax revenue?
Methodology
In order to create a model that might assist in the prediction of the real estate tax value, market information is required. A current Kaggle competition provides information on 2.9 million houses in three counties in the United States, one of which is Los Angeles (https://www.kaggle.com/c/zillow-prize-1). That makes it an excellent source of information, though we must begin by addressing the generalizability of the sample.
Generalization of results
The concept of generalization, from an academic point of view, implies that a sample does not differ significantly from the population on a number of characteristics. Additionally, generalization implicitly assumes that the data is randomly sampled, which statistical tests require. To compare the characteristics of the sample and the population (US Census Bureau), chi-square tests were performed on the building construction year and the estimated value of the house (the house tax value); both test statistics were found to be insignificant. However, even if these tests indicate that the data is comparable to the population, it’s not a perfect match. The data is a convenience sample, because transaction information was used to select the houses, and this reduces the generalizability of the data to transferability. The difference is only slight: conclusions drawn from the sample can still be applied to the population, but increasing the sample size would not make the sample resemble the population more closely.
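The comparison between sample and population described above can be sketched as a chi-square goodness-of-fit test. The snippet below is a minimal, dependency-free illustration; the bin counts and population shares are made-up placeholders, not figures from the Kaggle data or the Census Bureau.

```python
def chi_square_statistic(observed_counts, population_shares):
    """Goodness-of-fit statistic: sum of (O - E)^2 / E over all bins."""
    total = sum(observed_counts)
    statistic = 0.0
    for observed, share in zip(observed_counts, population_shares):
        # Count we would expect in this bin if the sample matched the population.
        expected = total * share
        statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical example: houses per construction-year bin in the sample
# versus the share of each bin in the population.
sample_counts = [260, 240, 255, 245]
census_shares = [0.25, 0.25, 0.25, 0.25]

stat = chi_square_statistic(sample_counts, census_shares)

# With 4 bins there are 3 degrees of freedom; the 5% critical value is 7.815.
CRITICAL_VALUE_DF3 = 7.815
print(f"chi-square = {stat:.3f}, significant = {stat > CRITICAL_VALUE_DF3}")
```

A statistic below the critical value means the test cannot distinguish the sample distribution from the population distribution, which is the "insignificant" outcome reported above.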
Cleaning the dataset
The quality of the input data is one of the key factors in accurate machine learning prediction. To ensure that the data quality is sufficient, the following cleaning workflow was performed:
- Calculate the amount (percentage) of missing data in each variable (feature)
- Impute missing data
- Check if each variable has the correct data type
- Detect and winsorize outliers
Overall, 41.3% of the values in the dataset were missing. The variables with the highest number of missing values are the building class type, the story type, the size of the basement and the size of the garden. In most of these variables, the value is not missing at random, as the house could simply not have a garden or basement. However, in other cases, like the building class type, the value is clearly missing, as every house can be classified by a certain building type. To process the information within machine learning models, these missing values must be imputed, as most machine learning algorithms cannot handle missing values.
To make appropriate and reasonable decisions on the imputation methodology and data type, each variable was compared with the description provided by Kaggle. Additionally, logical reasoning was used for imputation. The following methods were used for some of the manipulations:
- For variables with over 99% missing data, the variable was deleted due to lack of useful information.
- For numeric variables, the median was imputed.
- For factor (categorical) variables, zero or the mode was imputed.
- For variables such as the pool type and basement size, zero was imputed. The assumption is that the data is not recorded because the property does not have such a feature.
- For variables that are missing completely at random, the mode was imputed. This technique was chosen in order to prevent the generation of additional levels within factor variables.
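The imputation rules listed above can be sketched in a few lines. This is a minimal illustration using only the standard library; the column names and values are hypothetical placeholders, and a real pipeline would more likely use a data-frame library.

```python
from statistics import median, mode


def missing_share(column):
    """Fraction of entries in a column that are missing (None)."""
    return sum(value is None for value in column) / len(column)


def impute(column, strategy):
    """Fill missing entries with the median, the mode, or zero."""
    present = [value for value in column if value is not None]
    if strategy == "median":
        fill = median(present)
    elif strategy == "mode":
        fill = mode(present)
    elif strategy == "zero":
        fill = 0
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if value is None else value for value in column]


# Hypothetical columns; names and values are placeholders.
bathrooms = [1, 2, None, 3, 2]                  # numeric: median
heating_type = ["gas", None, "gas", "solar"]    # categorical: mode
pool_count = [1, None, None, 1]                 # feature assumed absent: zero

# A column would be dropped entirely when more than 99% of it is missing.
drop_pool_count = missing_share(pool_count) > 0.99

print(impute(bathrooms, "median"))
print(impute(heating_type, "mode"))
print(impute(pool_count, "zero"))
```

Imputing the mode for categorical columns keeps the set of levels unchanged, which matches the reasoning given in the last bullet.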
Next, winsorization was applied to all numerical variables. This transformation is performed to eliminate the potential adverse effect that extreme values (outliers) may have on the prediction model. Observations that fall below the 2.5th percentile or above the 97.5th percentile were replaced with the mean value.
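As an illustration of the outlier treatment, the sketch below implements textbook winsorization, which clips extreme values to the 2.5th and 97.5th percentile bounds. Note that this differs from the replacement value described above (the mean); swapping the clipping step for the mean would reproduce that variant. The data and the interpolation rule for the percentiles are assumptions made for the example.

```python
def quantile(sorted_values, q):
    """Quantile by linear interpolation between neighbouring order statistics."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def winsorize(values, lower_q=0.025, upper_q=0.975):
    """Clip values outside the [lower_q, upper_q] quantile range to the bounds."""
    ordered = sorted(values)
    low = quantile(ordered, lower_q)
    high = quantile(ordered, upper_q)
    return [min(max(value, low), high) for value in values]


# Hypothetical column: forty ordinary values plus one extreme outlier.
living_area = list(range(1, 41)) + [1000]
cleaned = winsorize(living_area)
print(min(cleaned), max(cleaned))
```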
Data Exploration
As part of any data analysis, exploratory data analysis must be performed. This analysis ensures that the researcher has a good understanding of the data and can use this understanding, and any findings, as input for machine learning modelling. One could therefore even state that exploration is a prerequisite for machine learning modelling. The exploratory data analysis can be split into two sections: (1) analyzing the dependent variable, (2) analyzing the independent variables.
Dependent Variable
Investigating the distribution of the dependent variable, the real estate tax value, reveals two distinct peaks, which might imply that multiple processes are driving the outcome. The first peak is the land tax, which differs strongly from the house tax (second peak) and might even be described as zero-inflated. This most probably occurs because the land tax is in many cases equal to zero, which is true, for example, for apartments or houses with a very low land value. Attempting to predict the real estate tax value in its current shape could cause problems, as the data (log transformed) is not normally distributed. Therefore, it is advisable to split the dependent variable, the real estate tax, into land tax and house tax and predict these individually. However, here we’ll focus on the house tax prediction in order to reduce the complexity and length of this article.
Independent Variables
By reviewing the independent variables, it is possible to get a glimpse of the real estate market within Los Angeles. With over 50 independent variables, only a small selection will be presented within this blog post. First, the real estate total tax value is concentrated between $150,000 and $200,000. Second, most of the houses within the dataset were built before 1959, with a decline in the total number of constructions in later years.
Another interesting variable is the available perimeter. The data indicates that the majority of the houses have 1500 square feet of living space. Additionally, the data indicates that the house perimeter ranges between 200 and 2400, excluding outliers in the data.
Based on these insights and insights from other exploratory analysis not described in this blog post, hypotheses can be formulated.
Creating new variables based on Clustering
In order to predict the real estate tax value, a dataset is required that captures the influential factors which, combined, result in an accurate prediction. However, as indicated in the previous sections, the large number of missing values has resulted in the loss of a large number of columns that might have held valuable information. In order to retain some of this lost information, a K-Means clustering analysis was executed in which eight groups were identified. The number of groups was determined by observing how the within-cluster variation decreases as the number of clusters grows, and where that decrease flattens. As this process can be arbitrary in the absence of a strong inflection point, the overall reduction of the within-cluster variation was observed for different K, resulting in a K of 8. Further analysis of this new variable indicated that the groups differ significantly with regard to the tax value and could therefore provide additional information that was not available within the existing variables. This new variable can now be used within the real estate tax prediction models.
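The choice of K described above is often called the elbow method. The following is a minimal sketch of it with a plain-Python version of Lloyd's algorithm; the two-dimensional points, the seed and the number of iterations are assumptions for illustration, and the actual analysis presumably used a library implementation on many more features.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means_wss(points, k, iterations=20, seed=0):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: squared_distance(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return sum(min(squared_distance(point, c) for c in centroids) for point in points)


# Hypothetical houses described by two scaled features, forming three groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (2, 9), (1, 9)]

wss = {k: k_means_wss(points, k) for k in range(1, 6)}
for k in range(2, 6):
    print(f"K={k}: WSS={wss[k]:.2f}, reduction vs K={k - 1}: {wss[k - 1] - wss[k]:.2f}")
```

The K at which the reduction becomes small is the candidate "elbow"; when no clear elbow exists, as noted above, the choice involves judgment.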
Hypotheses formulation and testing
In the presence of a robust data set that has been cleaned and investigated for uncovered patterns, hypotheses formulation and testing can be performed. Within daily practice, this process is mostly done based on exploratory data analysis; however, from a statistical standpoint, it should be performed based on causal relationships, which can then be investigated through correlation studies. Therefore, a set of hypotheses was defined, and consequently tested through bivariate analysis.
In order to evaluate these hypotheses, Pearson’s product-moment correlation, the Welch two-sample t-test and the Kruskal-Wallis rank sum test were used.
Variable | Test | Statistic | P value | Correlation | Alternative hypothesis | Conclusion |
airconditioningtypeid | Welch Two Sample t-test | 54.269 | < 2.2e-16 | n/a | Positive | Reject H0 |
bathroomcnt | Pearson's product-moment correlation | 179.33 | < 2.2e-16 | 0.5962435 | Positive | Reject H0 |
bedroomcnt | Pearson's product-moment correlation | 80.7 | < 2.2e-16 | 0.3170245 | Positive | Reject H0 |
buildingqualitytypeid | Pearson's product-moment correlation | -32.534 | < 2.2e-16 | -0.1335384 | Positive | Cannot reject |
calculatedfinishedsquarefeet | Pearson's product-moment correlation | 175.53 | < 2.2e-16 | 0.5880057 | Positive | Reject H0 |
poolcnt | Welch Two Sample t-test | 32.77 | < 2.2e-16 | n/a | Positive | Reject H0 |
yearbuilt | Pearson's product-moment correlation | 111.12 | < 2.2e-16 | 0.4180533 | Positive | Reject H0 |
unitcnt | Pearson's product-moment correlation | -0.11402 | 0.9092 | -0.0004722 | Positive | Cannot reject |
propertylandusetypeid | Kruskal-Wallis rank sum test | 2231.5 | < 2.2e-16 | n/a | Difference | Reject H0 |
heatingorsystemtypeid | Kruskal-Wallis rank sum test | 11822 | < 2.2e-17 | n/a | Difference | Reject H0 |
It is interesting to note that the building quality type, which was hypothesised to be positively correlated with the structural tax, turned out to be negatively correlated. Consequently, we were not able to reject the H0 hypothesis and did not add this variable to the model, because the relation was not logical from a causal perspective. Except for the unit count, which showed insignificant results, the other hypotheses indicated significant relations that allowed for the rejection of the H0 hypotheses. Consequently, these variables will be added to the various machine learning models.
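To make the Pearson rows of the table above more concrete, the sketch below computes a correlation coefficient and the t statistic commonly used to test it against zero. It assumes that the reported statistics for those rows are t values of this kind; the bathroom counts and tax values are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(var_x * var_y)


def correlation_t_statistic(r, n):
    """t statistic for H0: no correlation, with n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)


# Hypothetical data: number of bathrooms and tax value (in thousands of dollars).
bathrooms = [1, 2, 2, 3, 4]
tax_value = [100, 150, 170, 240, 310]

r = pearson_r(bathrooms, tax_value)
t = correlation_t_statistic(r, len(bathrooms))
print(f"r = {r:.3f}, t = {t:.2f}")
```

Because the t statistic grows with the square root of the sample size, even a modest correlation becomes highly significant on a dataset with tens of thousands of rows, which is why the sign and size of the correlation matter as much as the p-value.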
Machine Learning to Predict the Housing Tax
To construct a machine learning model, a researcher has a large number of options, and therefore an initial selection must be made. This selection process first focuses on the type of variable one aims to predict, which in this case is a numeric variable, making this a regression problem. Within the regression family various models are available, including linear regression, lasso regression, random forests and boosted random forests. With the aim of constructing a parsimonious model that can predict the real estate tax as accurately as possible, these four machine learning models will be investigated. It is important to note that for all the applied techniques, training and testing were performed with a K-fold split of the data with an 80/20 ratio, respectively.
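An 80/20 split within K-fold cross-validation corresponds to five folds, where each fold serves once as the 20% test set. The sketch below shows one way to generate such splits; the seed and the row count are arbitrary assumptions.

```python
import random


def k_fold_indices(n_rows, k=5, seed=42):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold


splits = list(k_fold_indices(100))
for fold_number, (train, test) in enumerate(splits, start=1):
    print(f"fold {fold_number}: {len(train)} training rows, {len(test)} test rows")
```

Every row ends up in the test set exactly once, so the averaged test error uses the whole dataset without ever scoring a model on rows it was trained on.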
Multiple Linear Regression
A multiple linear regression model aims to find the best linear unbiased estimators under the Gauss-Markov assumptions. Within this model multiple variables can be combined to predict one particular outcome, where the relationship between the independent variables and the dependent variable is assumed to be linear. Initial results from a linear regression model, where the real estate tax value is determined based on the size of the house, the construction year, the property land type, the number of bathrooms and bedrooms, the type of air conditioning, the number of pools and the earlier introduced clustering variable, indicate that this model can explain 66.4% of the variance within the data, which can be considered a medium fit. However, verification of the Gauss-Markov assumptions indicates that the assumption of constant variance is violated, which makes the estimators’ significance unreliable and raises the risk of creating an overfitted model. Consequently, to correct for this violation of equal variance, which is driven by the violation of normality, the Box-Cox transformation is performed.
The Box-Cox transformation aims to reduce the level of skewness within the dependent variable. Reducing the level of skewness should reduce the level of unequal variance within the model. The model result indicates that the R^{2} decreases to 60.05%, in comparison to 66.4% without the transformation. This decrease can be explained by certain variables losing significance and no longer contributing to explaining variance within the model. Consequently, the Box-Cox transformed model can be considered more parsimonious than the model without transformation.
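For reference, the Box-Cox transformation itself is a one-line formula. The sketch below applies it for a given lambda; the lambda values and tax amounts are assumptions for illustration, since the value used in the analysis is not stated (in practice lambda is typically estimated from the data).

```python
from math import log


def box_cox(values, lam):
    """Box-Cox transform: (y^lam - 1) / lam, or log(y) when lam is 0."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


# Hypothetical right-skewed tax values in dollars.
tax_values = [50_000, 120_000, 180_000, 900_000]

print(box_cox(tax_values, 0))    # log transform
print(box_cox(tax_values, 0.5))  # square-root-like transform
```

Both choices compress the long right tail, which is what reduces the skewness of the dependent variable.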
In a further attempt to create the best parsimonious model, automatic variable selection can be performed. This modelling technique is based on a multiple linear regression model, where the Bayesian information criterion (BIC) is used to determine the most parsimonious model out of all possible model combinations. The downside of this modelling technique is that the variables used within the model are no longer driven by underlying causal relations, but are chosen only for their contribution to the reduction of the residual. This can result in models that are parsimonious but prone to overfitting the data. Nonetheless, the results from this model indicate an R^{2} of 60.08%, making it as strong as the Box-Cox transformed model, though it is based on a different set of variables, which in most cases do not have any causal relation. Consequently, of these three models, the Box-Cox transformed model is the most reliable and parsimonious.
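The BIC trades goodness of fit against model size. A minimal sketch for Gaussian linear models, using the common form n * ln(RSS / n) + k * ln(n), is shown below; the candidate models, their residual sums of squares and the sample size are invented for the example.

```python
from math import log


def bic(rss, n_obs, n_params):
    """BIC for a Gaussian linear model, up to an additive constant."""
    return n_obs * log(rss / n_obs) + n_params * log(n_obs)


# Hypothetical candidates: (residual sum of squares, number of parameters).
n_obs = 100
candidates = {
    "size only": (520.0, 2),
    "size + bathrooms": (400.0, 3),
    "all variables": (398.0, 8),
}

scores = {name: bic(rss, n_obs, k) for name, (rss, k) in candidates.items()}
best = min(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: BIC = {score:.2f}")
print("selected:", best)
```

The largest model barely lowers the residual but pays a penalty for five extra parameters, so the criterion prefers the mid-sized model. This also illustrates the caveat above: the selection is driven purely by the residual, not by causal reasoning.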
Lasso Regression
In the previous section, variable selection was performed through the BIC; however, there are other options available for the selection of variables for a model, such as Lasso regression. In Lasso regression, shrinkage (regularization) is used for variable selection: the method attempts to minimize the error while also minimizing the number of variables used for prediction. The balance between the goodness of fit and the prevention of overfitting is determined by lambda. To determine this tuning parameter, 10-fold cross validation was performed (see the figures below). Through this technique, it is possible to determine the lambda that minimizes the mean square error, which indicates the prediction error. Through Lasso regression, it was possible to improve the prediction to an R^{2} of 68.1%, in comparison to roughly 60% from the Box-Cox transformed model, which is an increase of about 8 percentage points.
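The role of lambda can be illustrated with a small coordinate-descent solver for the Lasso objective (1/2n) * ||y - Xb||^2 + lambda * ||b||_1. This is a sketch under simplifying assumptions (centred data, no intercept, a fixed number of sweeps) on a tiny invented dataset; it is not the implementation used in the analysis.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam; anything within [-lam, lam] becomes 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, sweeps=100):
    """Minimise (1/2n)*||y - Xb||^2 + lam*||b||_1 for centred data (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][m] * beta[m] for m in range(p) if m != j))
                for i in range(n)
            ) / n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam) / scale
    return beta


# Hypothetical centred data: y depends on the first feature only.
X = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
y = [2, -2, 2, -2]

for lam in (0.0, 0.5, 3.0):
    print(f"lambda = {lam}: coefficients = {lasso_coordinate_descent(X, y, lam)}")
```

With lambda at zero the solver returns the least-squares fit; as lambda grows the coefficients shrink and eventually all become zero, which is the variable-selection effect described above. Cross-validation then picks the lambda with the lowest held-out error.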
Random Forest
In the last two machine learning approaches, the focus lay on using numerical variables for linear prediction, where categorical variables are used as dummies.
However, as the dataset contains a multitude of categorical variables, the Random Forest machine learning method is introduced. Random Forest models are considered an important statistical pattern recognition tool for prediction with categorical variables. Like Lasso regression, the Random Forest machine learning algorithm requires cross validation in order to determine its tuning parameters. The tuning parameters for Random Forest are the number of variables tried at each tree split and the total number of trees. Cross validation indicated that trying 4 variables at each split provides the best fit while reducing the computation time, and that a total of 100 trees is sufficient to capture the total reduction in the prediction error. Based on these tuning parameters the Random Forest model predicts with an R^{2} of 01.66%, which is comparable to the Lasso regression model.
Boosting
With the aim of constructing a model that predicts the real estate tax value as closely as possible, the boosting machine learning model is used. Like tree bagging, which is used to reduce the prediction variance, boosting combines many trees, but in addition it uses the last model in order to construct the next model. This technique will enhance the prediction power on the training data set but is prone to overfitting it. In order to fit a boosting machine learning model, three tuning parameters must be determined: the shrinkage, the tree depth and the number of trees. Through cross validation, the calculation of the mean square error and the Boosting test error plot (presented below), the tuning parameters are determined to be a shrinkage of 0.001 and a depth of 4. Based on these tuning parameters the boosting model’s R^{2} is 88.1%, which is a strong improvement in comparison to the random forest model. However, as indicated earlier, the boosting model is prone to overfitting the training data, which implies that the model is weak in out-of-sample prediction. Consequently, validation on the testing dataset only indicates an R^{2} of 37.8%, which is a very strong decline in prediction power.
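The gap between training and testing performance is what signals overfitting. The sketch below computes R^{2} for both sets; all values are invented to show the pattern and do not reproduce the figures reported above.

```python
def r_squared(actual, predicted):
    """Share of variance explained: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical predictions from a model that fits its training data closely.
train_actual, train_predicted = [1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8]
test_actual, test_predicted = [1, 2, 3, 4], [1.5, 3.0, 2.5, 3.0]

train_r2 = r_squared(train_actual, train_predicted)
test_r2 = r_squared(test_actual, test_predicted)
print(f"train R^2 = {train_r2:.2f}, test R^2 = {test_r2:.2f}")
```

A model whose R^{2} drops sharply from the training set to the testing set has learned patterns specific to the training rows, so the testing figure is the one to report.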
Conclusions and limitations
Within this project, a multitude of machine learning algorithms were used with the aim of predicting the real estate tax value in order to automate the real estate valuation process and reduce the bias within the California tax system. Overall it can be concluded that the models are able to predict the real estate tax value with medium accuracy, as indicated in the discussion of the machine learning models, where the Random Forest machine learning model presents the best results. This medium fit is the result of the poor quality of the dataset used within this analysis. If better information, in particular with fewer missing values, becomes available, higher levels of accuracy can be reached. However, analysis of the California tax system revealed one of the underlying problems which prevent accurate prediction of the tax value. The tax value of a house is determined at the moment a house is sold, as indicated in the introduction. This implies that two identical properties of equal value can differ greatly in their assessed value, even if they are next to each other. Consequently, with this dataset, it is impossible to capture 100% of the variance, even when models are overfitted.
Overall, this research project can serve as a proof of value. It indicates multiple shortcomings within the California tax system and suggests that predicting the real estate tax value might be a good approach to automate the real estate tax evaluation process. Nonetheless, further research would benefit from a complete dataset that contains information on the actual market value of the house, in order to prevent the misclassification of a household's real estate tax value due to the time dimension in the tax value assessment.