A Home Pricing Model for Educated Sellers
Introduction
Real estate marketplace pricing models have been known to be erratic. As Sam Chandan, Dean of NYU’s Schack Institute of Real Estate, observed in the WSJ article “What Went Wrong With Zillow? A Real-Estate Algorithm Derailed Its Big Bet,” a number of factors can affect house prices. One specific example Chandan mentioned was the difficulty of putting a price on the layout of a house.
The complete aesthetic of a single-family home cannot be captured by a simple data point. In Kris Frieswick’s WSJ article “Zillow’s Zestimate Is the Algorithm We Love to Hate. Why Can’t We Quit It?”, the algorithm is described as so finicky and prone to oscillation that people have even sued Zillow on the grounds that the company misrepresented the prices of their homes. Even though Zillow’s Zestimate and Redfin’s pricing tool may be used by hundreds of millions of people to purchase homes across the country, no pricing model can be 100% accurate, and the oscillations are evidence of those inaccuracies.
Objective
Instead of adjusting the price every day or whenever a new feature becomes available, the aim of this project is to set a baseline price that does not fluctuate often or drastically. A stable baseline price will ultimately allow sellers to be more confident about how much their home is worth. The fluctuations produced by other pricing models can then be read against that baseline as something like confidence intervals: price increases from other models indicate the potential return on investment, while decreases gauge the current condition of the market. The eventual aim is to use selective feature engineering to expand this pricing model to the rest of the country.
Method
DATASET
The dataset (81 features, including ID and sale price, across 1,460 rows) and the validation set (80 features, including ID, across 1,459 rows) were downloaded from Kaggle. In the initial analysis, two rows (bottom right of Graph 1) were removed as outliers: judging by price per square foot, those two homes sold for far less than expected.
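As a minimal sketch of that cleaning step, assuming the Kaggle train.csv with its GrLivArea and SalePrice columns (the cutoffs below are illustrative; the two rows were identified visually in Graph 1):

```python
import pandas as pd

# Load the Kaggle training data
train = pd.read_csv("train.csv")

# The two outliers sit at the bottom right of Graph 1: very large living
# area but an unusually low sale price. The thresholds are illustrative.
outliers = (train["GrLivArea"] > 4000) & (train["SalePrice"] < 300000)
train = train[~outliers].reset_index(drop=True)
```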

DISTRIBUTION
The pricing distribution for the homes sold from 2006 to 2010 (Graph 2) shows that, although most homes were priced around $150K, the distribution has a long right tail, so the overall average price in the market was higher. To reduce this skew, the logarithm of the prices was taken to create a more even distribution (Graph 3), which allows for more accurate predictions on the validation set.
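A one-line sketch of that transform (the post does not say whether log or log1p was used; log1p is assumed here because it inverts cleanly with np.expm1):

```python
import numpy as np

# Log-transform the target to reduce right skew (Graph 2 -> Graph 3);
# predictions can be mapped back to dollars with np.expm1.
train["LogSalePrice"] = np.log1p(train["SalePrice"])
```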


SIMPLE ANALYSIS
Some simple analysis was done to determine whether any features had an obvious impact on home prices. Graph 4, shown below, is an example of this analysis. Based on the trend in the boxplot, we can expect the median price of a home to be at least somewhat affected by the neighborhood in which it is located.
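A sketch of how such a boxplot could be produced, assuming seaborn (the post does not name its plotting library):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# One box per neighborhood, ordered by median sale price (cf. Graph 4)
order = (train.groupby("Neighborhood")["SalePrice"]
              .median().sort_values().index)
plt.figure(figsize=(14, 6))
sns.boxplot(data=train, x="Neighborhood", y="SalePrice", order=order)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```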

NULL VALUE IMPUTATION
The null values were then imputed. A function named my_fillna was written to fill nulls with the median or the mode, depending on whether the feature was numerical or categorical, respectively. Some imputations also accounted for the possibility that certain columns depend on another column. One such example is Masonry Veneer Type (MasVnrType), which may depend on Neighborhood: with some MasVnrType values missing, my_fillna imputed them using the most common MasVnrType found in each neighborhood (Table 1). This kind of dependence shows up where new cookie-cutter neighborhoods are built with a shared veneer type. The features were then dummified and scaled.
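The post does not show my_fillna itself; the following is a minimal reconstruction consistent with its description (median/mode fill, with an optional group-wise fill):

```python
import pandas as pd

def my_fillna(df, col, groupby=None):
    """Fill nulls in `col` with the median (numeric) or mode (categorical).
    If `groupby` is given, the fill value is computed within each group."""
    numeric = pd.api.types.is_numeric_dtype(df[col])
    if groupby is None:
        fill = df[col].median() if numeric else df[col].mode()[0]
        df[col] = df[col].fillna(fill)
    else:
        # assumes every group has at least one non-null value
        stat = "median" if numeric else (lambda s: s.mode()[0])
        df[col] = df[col].fillna(df.groupby(groupby)[col].transform(stat))
    return df

# e.g. impute MasVnrType with the most common veneer type per neighborhood
train = my_fillna(train, "MasVnrType", groupby="Neighborhood")
```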

MODEL TRAINING FOR BASELINE
After the null values for each feature were imputed, model training was performed using Multilinear, Lasso, ElasticNet, Random Forest, Gradient Boosting, and XGBoost regression to determine a baseline Root Mean Squared Error (RMSE) for each model. Graph 5 is an example from this first round of training, using ElasticNet regression: the blue dots compare predicted vs. actual log price on the train portion of the train-test split, and the red dots do the same for the test portion.
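A sketch of that baseline loop; the hyperparameters shown are illustrative defaults rather than the ones actually used, and scaling is omitted for brevity:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor

# Dummified features and log-price target from the earlier steps
X = pd.get_dummies(train.drop(columns=["Id", "SalePrice", "LogSalePrice"]))
y = train["LogSalePrice"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Multilinear": LinearRegression(),
    "Lasso": Lasso(alpha=0.001),
    "ElasticNet": ElasticNet(alpha=0.001, l1_ratio=0.5),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "XGBoost": XGBRegressor(random_state=42),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    print(f"{name}: test RMSE = {rmse:.4f}")
```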

INITIAL FEATURE ENGINEERING AND MODEL TRAINING ROUND 2
After the initial model training, any features that could be treated as ordinal were converted to ordered numeric values, and some string values were rewritten for consistency across features that use the same scale. The features were then dummified and scaled again; a sketch of the ordinal conversion follows.
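The post does not list which columns were converted; the Ames quality/condition columns below are typical candidates, since they share the Ex/Gd/TA/Fa/Po scale:

```python
# Map the shared quality scale to ordered integers; missing values
# (e.g. no fireplace or garage) are treated as 0.
qual_map = {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1}
ordinal_cols = ["ExterQual", "ExterCond", "BsmtQual", "BsmtCond",
                "HeatingQC", "KitchenQual", "FireplaceQu",
                "GarageQual", "GarageCond"]
for col in ordinal_cols:
    train[col] = train[col].map(qual_map).fillna(0).astype(int)
```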
Model training was then performed again with the 6 previously mentioned regression models to check whether any significant increases in RMSE would point toward removing certain engineered features. Test-set performance improved for Multilinear, Gradient Boosting, and XGBoost regression.
FEATURE ENGINEERING USING PERMUTATION OF POWERSETS
To find the best feature engineering for the dataset, a somewhat unusual method was used. First, 9 functions were written, each applying some kind of feature engineering to the dataset, and collected into a list. Then 9 more functions were written, each removing the source features used by one of the first 9 functions (if such features exist), and collected into a second list.
A powerset (the list of all possible combinations of a list’s contents) was created for each of those lists. The two powersets were then paired off (the “permutation” of the two lists) to create a list of 92,378 combinations of functions. Since these functions apply actual feature engineering, each combination was applied to a copy of the dataset in a loop.
CLARIFICATION OF PERMUTATION OF POWERSETS
To clarify the permutation of powersets, Figure 1 gives a simplified example in which ‘a’, ‘b’, ‘c’, ‘d’, ‘e’, and ‘f’ are functions: list 1 contains a, b, and c, and list 2 contains d, e, and f. From the first list, a new list was created containing all possible combinations of its contents (the powerset), and the same was done for the second list.
A list was then created by pairing the two powersets, which allowed the function combinations to be executed without constant manual input. The full list was used to engineer features on copies of the dataset, and the RMSEs for the 6 models were calculated. The outcome pointed toward improved scores across the models. A minimal sketch of this machinery appears below.
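This sketch mirrors Figure 1 with three toy add-functions and three toy remove-functions (the names and engineered features are placeholders, not the project’s actual 9 + 9 functions); itertools builds the powersets, and their pairing is a Cartesian product. With 3 functions per list, each powerset has 2^3 = 8 subsets, so 8 × 8 = 64 pairs:

```python
from itertools import chain, combinations, product

def powerset(fns):
    """All combinations of the functions in fns, including the empty one."""
    return list(chain.from_iterable(
        combinations(fns, r) for r in range(len(fns) + 1)))

# Toy stand-ins for Figure 1: a-c engineer a feature, d-f remove its sources.
a = lambda df: df.assign(TotalSF=df["1stFlrSF"] + df["2ndFlrSF"])
b = lambda df: df.assign(HasPool=(df["PoolArea"] > 0).astype(int))
c = lambda df: df.assign(Age=df["YrSold"] - df["YearBuilt"])
d = lambda df: df.drop(columns=["1stFlrSF", "2ndFlrSF"], errors="ignore")
e = lambda df: df.drop(columns=["PoolArea"], errors="ignore")
f = lambda df: df.drop(columns=["YearBuilt"], errors="ignore")

# Pair every subset of list 1 with every subset of list 2
pairs = list(product(powerset([a, b, c]), powerset([d, e, f])))

for add_fns, drop_fns in pairs:
    df = train.copy()
    for fn in chain(add_fns, drop_fns):
        df = fn(df)
    # ... retrain the 6 models on df and record each RMSE ...
```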

MODEL TUNING
The models were tuned after the best datasets had been engineered. The WSJ Prime Lending Rate and the seasons of the year were also added as features during the tuning process, but only Gradient Boosting’s RMSE improved when those features were added.
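The post does not say how the tuning was performed; a grid search is one plausible approach, sketched here for Gradient Boosting with an illustrative search space:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor

# Illustrative grid; the actual search space is not given in the post.
param_grid = {
    "n_estimators": [500, 1000, 2000],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [2, 3, 4],
}
search = GridSearchCV(GradientBoostingRegressor(random_state=42),
                      param_grid,
                      scoring="neg_root_mean_squared_error",
                      cv=5)
search.fit(X_tr, y_tr)  # train split from the baseline step
print(search.best_params_, -search.best_score_)
```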
Results
Table 2 shows the RMSEs for each model after imputation, after the permutation of powersets, and after tuning, along with the Kaggle score. One interesting observation: in ElasticNet’s initial training (Graph 5 above), the train and test RMSEs were the most similar of any model, 0.1025 versus 0.1029, a gap of only 0.0004, making it one of the least overfit models.

Another observation is that although Gradient Boosting had a worse post-tuning RMSE than Lasso regression, it ended up with the best Kaggle score. Graphs 6a and 6b show that the Kaggle score does not quite line up with scikit-learn’s RMSE. The difference in scores across models is clear, and the RMSEs for the tree-based regressions are ever so slightly lower than those of the linear models; the Kaggle scores reflect the same trend.
Although we were not able to obtain a perfect Kaggle score, the information gathered here should help us improve our scores in the future using ElasticNet, Lasso, GBM, and XGBoost.


FUTURE IMPROVEMENTS
In the future, we will improve the pricing model’s scores and aim for a smaller RMSE. We could use KNN imputation to fill null values rather than the hand-written my_fillna function (a sketch follows below). We might also improve scores by correcting the skew of each feature rather than just the skew of the label, and combining the predictions from several models by averaging them may also help.
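A sketch of the KNN-based alternative, using scikit-learn’s KNNImputer; note that it operates on numeric arrays, so categorical columns would need encoding first:

```python
from sklearn.impute import KNNImputer

# Impute numeric columns from the 5 most similar rows; categorical
# columns would need to be encoded before this step.
num_cols = train.select_dtypes(include="number").columns
imputer = KNNImputer(n_neighbors=5)
train[num_cols] = imputer.fit_transform(train[num_cols])
```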
We will also look at adjusting prices based on inflation but still provide the baseline price as a comparison so that sellers will know the comparative value of their homes. We will gather data from other housing markets to determine which features actually impact the prices of homes in those areas and use that information to create pricing models similar to this one.
The skills demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.