Don't Underestimate OLS & Intuition!

Mikolaj Wilk and Ted Dogan
Posted on Nov 22, 2019

Among all the modeling tools available, Ordinary Least Squares (OLS) is often overlooked, overshadowed by more glamorous and seemingly more comprehensive models. In fact, depending on the application, OLS offers better performance with much less complexity, making it both more efficient and more interpretable. With domain knowledge and critical thinking applied to feature selection, you can get results that outperform more complex models and produce greater business insight. To demonstrate this, we applied various modeling techniques to the housing dataset from Ames, Iowa, and analyzed the results.

 

Methods

 

While pre-processing and reviewing the data, we added the following features to the dataset: time since the home was built, time since the garage was built, and time since the home was remodeled. We then tuned a range of modeling techniques: Ordinary Least Squares (OLS), Feasible Generalized Least Squares (FGLS), Ridge and Lasso regression, Random Forests, and XGBoost. We first fit OLS to establish a benchmark and to see its limitations under the assumption that the data was homoskedastic. Since we identified that at least some of the features were in fact heteroskedastic, but could not identify which ones or whether omitted variables were important, we applied FGLS to account for this. The penalized models, Ridge and Lasso regression, helped us get an understanding of the features; to probe feature importance further, we applied Random Forests and XGBoost.
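As a rough illustration, the engineered "time since" features and the OLS benchmark might look like the following in R. This is a minimal sketch assuming Kaggle-style Ames column names; the file name and cleaning steps are placeholders for our fuller pre-processing.

```r
# Sketch of the engineered "time since" features and the OLS benchmark.
library(dplyr)

ames <- read.csv("ames_housing.csv")  # hypothetical file name

ames <- ames %>%
  mutate(
    TimeSinceBuilt   = YrSold - YearBuilt,
    TimeSinceGarage  = YrSold - GarageYrBlt,   # NA when there is no garage
    TimeSinceRemodel = YrSold - YearRemodAdd
  )

# OLS benchmark on log(sale price), assuming homoskedastic errors
ols_fit <- lm(log(SalePrice) ~ OverallQual + log(GrLivArea) +
                Neighborhood + TimeSinceRemodel, data = ames)
summary(ols_fit)
```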

We assessed feature importance across the models using correlations, coefficient magnitudes, p-value significance, the feature selection performed by the penalized models, and the importance scores from the tree-based models. However, we also carefully assessed the features from a more subjective, intuitive perspective grounded in domain knowledge.
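To illustrate how these importance signals can be pulled from different model families, here is a sketch using the glmnet and randomForest packages (the post does not name the exact packages; `ames` is the pre-processed data frame from the sketch above, with missing values assumed already handled):

```r
# Sketch: comparing importance signals across model families.
library(glmnet)
library(randomForest)

X <- model.matrix(~ . - SalePrice - 1, data = ames)  # expand factors
y <- log(ames$SalePrice)

# Lasso: features whose coefficients survive the L1 penalty
lasso_fit <- cv.glmnet(X, y, alpha = 1)
coef(lasso_fit, s = "lambda.1se")

# Random Forest: permutation-based importance scores
rf_fit <- randomForest(x = X, y = y, importance = TRUE)
imp <- importance(rf_fit)
head(imp[order(-imp[, "%IncMSE"]), ])
```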

 

Analysis & Results

 

Often cited in real estate, “location, location, location” is said to be the most important factor in the value of a home or property. However, our models assigned much less importance to the neighborhood of the home. This was one major feature that we intuitively knew was important but that the models were not picking up, so we decided to explore further.

Below are three graphs that illustrate the relationship between sale price, neighborhood, and three other features: kitchen quality, basement quality, and zoning.

 

FIGURE 1: SALE PRICE VS NEIGHBORHOOD & KITCHEN QUALITY

As shown in Figure 1 above, the homes in the more expensive neighborhoods generally had excellent-quality kitchens.

 

FIGURE 2: SALE PRICE VS NEIGHBORHOOD & BASEMENT QUALITY

As shown in Figure 2 above, the homes in the more expensive neighborhoods generally had excellent-quality basements.

 

FIGURE 3: SALE PRICE VS NEIGHBORHOOD & ZONING  

As shown in Figure 3 above, the homes in the more expensive neighborhoods were generally in “Residential Low Density” zones.

These graphs, together with other variables we examined, demonstrate that neighborhood actually plays a bigger role than our models suggested. Therefore, neighborhood was certainly one of the most important features.

 

FIGURE 4: FEATURE IMPORTANCE 

As shown above, neighborhood was not among the most important features in the Random Forest and XGBoost models. However, when the neighborhood feature was included in a 5-variable OLS model, that model outperformed the Random Forest on test-set RMSE, as shown in Figure 5 below. It also performed relatively well compared to models with many more variables. This 5-variable model predicted log(sale price) from:

1) overall quality, 2) neighborhood, 3) log(above ground living area SF), 4) basement quality, and 5) time since remodel. This 5-variable model explains 84.4% of the variation in sale price.
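For concreteness, a minimal sketch of this 5-variable fit, assuming Kaggle-style Ames column names, the engineered TimeSinceRemodel feature, and a train/test split already stored in `train` and `test` (the post does not include the exact code):

```r
# The 5-variable OLS model on log(sale price)
five_var <- lm(log(SalePrice) ~ OverallQual + Neighborhood +
                 log(GrLivArea) + BsmtQual + TimeSinceRemodel,
               data = train)

summary(five_var)$r.squared  # ~0.844 in our run

# Test-set RMSE, the metric used to compare models in Figure 5
pred <- predict(five_var, newdata = test)
sqrt(mean((log(test$SalePrice) - pred)^2))
```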

 

FIGURE 5: MODEL RESULTS

As shown above in Figure 5, our best-performing model was still a linear regression model: Feasible Generalized Least Squares (FGLS) with 27 variables. For this model, we used the stepAIC function and manually selected variables to reduce AIC. We used FGLS because it improved on the OLS model by accounting for heteroskedasticity and potential outliers in the data and by limiting the effects of omitted features. It outperformed the penalized models as well as the tree-based models. Nevertheless, the 5-variable OLS model is simpler, more interpretable, and more transferable to a comparable city, and it performed relatively close to the FGLS model. So never underestimate OLS and domain knowledge when optimizing models as a data scientist!
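A rough sketch of this recipe in R, using MASS::stepAIC for selection and one common two-stage FGLS weighting scheme (regressing the log squared OLS residuals on the fitted values); our actual variance model and manual selection steps differed in detail:

```r
library(MASS)

# Stepwise AIC selection starting from a full OLS model
full_ols <- lm(log(SalePrice) ~ ., data = train)
step_fit <- stepAIC(full_ols, direction = "both", trace = FALSE)

# Stage 2 of FGLS: estimate the error variance from the OLS residuals,
# then re-fit with inverse-variance weights to account for
# heteroskedasticity
var_fit <- lm(log(residuals(step_fit)^2) ~ fitted(step_fit))
w <- 1 / exp(fitted(var_fit))
fgls_fit <- lm(formula(step_fit), data = train, weights = w)
summary(fgls_fit)
```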

About Authors

Mikolaj Wilk

Data Science Fellow at NYCDSA with a knack for pattern recognition and for unlocking value through data and effective collaboration. A published scholar, a scoutmaster, and a physicist by training, Mikolaj has a unique, well-rounded skill set.
Ted Dogan

With the right approach and tools, it is possible to create a digital footprint of behavior. Actions leave traces of preferences, a raw form of demand. Converting them into a meaningful business story is an art that requires...
