Don't Underestimate OLS & Intuition!
Among all the modeling tools available, Ordinary Least Squares (OLS) is often overlooked, overshadowed by more glamorous and seemingly more comprehensive models. Yet, depending on the application, OLS can offer better performance with much less complexity, making it both more efficient and more interpretable. By applying domain knowledge and critical thinking to feature selection, you can get results that outperform more complex models and produce greater business insight. To demonstrate this, we applied various modeling techniques to the housing dataset from Ames, Iowa and analyzed the results.
Methods
While pre-processing and reviewing the data, we added the following features to the dataset: time since the home was built, time since the garage was built, and time since the home was remodeled. We optimized various modeling techniques, including Ordinary Least Squares (OLS), Feasible Generalized Least Squares (FGLS), Ridge and Lasso regression, Random Forests, and XGBoost. We first tried OLS to establish a benchmark and to see its limitations under the assumption that the errors were homoskedastic. We found evidence of heteroskedasticity but could not identify which features were responsible, or whether any omitted variables were important, so we applied FGLS to account for this. The penalized models, Ridge and Lasso regression, helped us understand which features mattered. To further understand feature importance, we applied Random Forests and XGBoost.
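As a rough illustration of the FGLS idea described above (not the code used for this analysis), the sketch below runs a two-step FGLS on synthetic data with NumPy. It assumes the error variance follows exp(X @ gamma), which is one common choice and not necessarily the variance model we fitted. The function names ols and fgls and all of the data are hypothetical.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for y ~ X (X already contains an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def fgls(X, y):
    """Two-step FGLS, ASSUMING Var(error_i) = exp(X_i @ gamma)."""
    # Step 1: plain OLS to obtain residuals.
    beta_ols = ols(X, y)
    resid = y - X @ beta_ols
    # Step 2: regress log squared residuals on X to estimate each row's variance.
    gamma = ols(X, np.log(resid ** 2 + 1e-12))
    var_hat = np.exp(X @ gamma)
    # Step 3: weighted least squares, scaling every row by 1 / estimated sd.
    w = 1.0 / np.sqrt(var_hat)
    beta_fgls = ols(X * w[:, None], y * w)
    return beta_ols, beta_fgls

# Synthetic data (not the Ames dataset): the noise grows with x.
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(1.0, 10.0, n)
X = np.column_stack([np.ones(n), x])
noise_sd = 0.2 * np.exp(0.25 * x)
y = 2.0 + 0.5 * x + rng.normal(0.0, noise_sd, n)

beta_ols, beta_fgls = fgls(X, y)
print("OLS  coefficients:", beta_ols)
print("FGLS coefficients:", beta_fgls)
```

Both estimators target the same coefficients. The difference is that FGLS down-weights the noisier observations, which typically gives tighter estimates when the variance model is roughly right.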
We assessed feature importance using correlations, coefficients, statistical significance (p-values), and the features selected by the penalized models and the tree-based models. However, we also carefully assessed the features from a more subjective and intuitive perspective based on domain knowledge.
Analysis & Results
Often cited in real estate, “location, location, location” is said to be the most important factor in the value of a home or property. However, our models assigned much less importance to the neighborhood of the home. We knew intuitively that this feature was important, yet the models were not picking it up, so we decided to explore further.
Below are three graphs that illustrate the relationship between the sale price, neighborhood and three other features: Kitchen Quality, Basement Quality, and Zoning.
FIGURE 1: SALE PRICE VS NEIGHBORHOOD & KITCHEN QUALITY
As shown in Figure 1 above, homes in the more expensive neighborhoods generally had excellent-quality kitchens.
FIGURE 2: SALE PRICE VS NEIGHBORHOOD & BASEMENT QUALITY
As shown in Figure 2 above, homes in the more expensive neighborhoods generally had excellent-quality basements.
FIGURE 3: SALE PRICE VS NEIGHBORHOOD & ZONING
As shown in Figure 3 above, homes in the more expensive neighborhoods were generally in “Residential Low Density” zones.
These graphs, along with similar patterns in other variables, suggest that neighborhood plays a bigger role than our models indicated. We therefore treated neighborhood as one of the most important features.
FIGURE 4: FEATURE IMPORTANCE
As shown above, neighborhood was not among the most important features in the Random Forest and XGBoost models. However, when the neighborhood feature was included in a five-variable OLS model, that model outperformed the Random Forest on test set RMSE, as shown below in Figure 5. It also performed relatively well compared to other models with many more variables. This five-variable model predicted log(sale price) from:
1) overall quality, 2) neighborhood, 3) log(above ground living area SF), 4) basement quality, 5) time since remodel. This five-variable model explains 84.4% of the variation in sale price.
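To make the structure of such a model concrete, here is a minimal sketch of a five-predictor OLS fit on log price, using NumPy and synthetic data only. The variable names, the four coded neighborhoods, the numeric 0-4 basement quality scale, the 300/100 split, and every coefficient used to generate the data are assumptions for illustration; none of them come from the Ames dataset or from our fitted model.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 400

# Synthetic stand-ins for the five predictors (hypothetical, not the Ames data).
overall_qual = rng.integers(1, 11, n).astype(float)          # rating from 1 to 10
neighborhood = rng.integers(0, 4, n)                         # 4 neighborhoods coded 0-3
living_area = rng.uniform(600.0, 3500.0, n)                  # above ground living area, sq ft
bsmt_qual = rng.integers(0, 5, n).astype(float)              # ASSUMED ordinal scale 0-4
years_since_remodel = rng.integers(0, 60, n).astype(float)

# Made-up data-generating process for log(sale price).
nbhd_effect = np.array([0.0, 0.10, 0.25, 0.40])
log_price = (10.0 + 0.08 * overall_qual + nbhd_effect[neighborhood]
             + 0.5 * np.log(living_area) + 0.03 * bsmt_qual
             - 0.002 * years_since_remodel + rng.normal(0.0, 0.12, n))

# Design matrix: intercept, quality, neighborhood dummies (level 0 is the
# baseline), log living area, basement quality, years since remodel.
nbhd_dummies = (neighborhood[:, None] == np.arange(1, 4)).astype(float)
X = np.column_stack([np.ones(n), overall_qual, nbhd_dummies,
                     np.log(living_area), bsmt_qual, years_since_remodel])

# Simple holdout split: first 300 rows for training, last 100 for testing.
train, test = slice(0, 300), slice(300, n)
beta, *_ = np.linalg.lstsq(X[train], log_price[train], rcond=None)

# R^2 on the training rows.
fitted = X[train] @ beta
ss_res = np.sum((log_price[train] - fitted) ** 2)
ss_tot = np.sum((log_price[train] - log_price[train].mean()) ** 2)
r2_train = 1.0 - ss_res / ss_tot

# RMSE on the held-out rows, measured on the log scale.
rmse_test = np.sqrt(np.mean((log_price[test] - X[test] @ beta) ** 2))
print(f"train R^2: {r2_train:.3f}, test RMSE (log scale): {rmse_test:.3f}")
```

Because neighborhood enters as dummy variables, each of its coefficients reads directly as a shift in log price relative to the baseline neighborhood, which is part of what keeps a model like this easy to interpret.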
FIGURE 5: MODEL RESULTS
As shown above in Figure 5, our best performing model was still a linear regression model, Feasible Generalized Least Squares (FGLS), with 27 variables. For this model, we used a stepAIC function together with manual variable selection to reduce AIC. FGLS was used because it improved upon the OLS model by accounting for heteroskedasticity and potential outliers in the data and by limiting the effects of omitted features. This model outperformed the penalized models as well as the tree-based models. Nevertheless, the five-variable OLS model is simpler, more interpretable, and potentially transferable to a comparable city, and it performed relatively close to the FGLS model. Therefore, you should never underestimate OLS and domain knowledge when optimizing models as a data scientist!