Don't Underestimate OLS & Intuition!

Posted on Nov 22, 2019

Among the many modeling tools available, Ordinary Least Squares (OLS) is often overlooked, overshadowed by more glamorous and seemingly more comprehensive models. Yet, depending on the application, OLS can offer better performance with much less complexity, making it both more efficient and more interpretable. By applying domain knowledge and critical thinking to feature selection, you can get results that outperform more complex models and produce greater business insight. To demonstrate this, we applied various modeling techniques to the housing dataset from Ames, Iowa and analyzed the results.




While pre-processing and reviewing the data, we added three features to the dataset: time since the home was built, time since the garage was built, and time since the home was remodeled. We then tuned several modeling techniques: Ordinary Least Squares (OLS), Feasible Generalized Least Squares (FGLS), Ridge and Lasso regression, Random Forests, and XGBoost. We first fit OLS to establish a benchmark and to see its limitations under the assumption that the errors were homoskedastic. Because we found heteroskedasticity associated with at least some of the features, but could not identify which ones or whether omitted variables were important, we applied FGLS to account for it. The penalized models, Ridge and Lasso regression, helped us understand which features mattered. To further understand feature importance, we applied Random Forests and XGBoost.
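The three derived age features described above might be computed roughly as follows. This is a minimal sketch, not the authors' code: the column names (YearBuilt, YearRemodAdd, GarageYrBlt, YrSold) follow the commonly distributed version of the Ames data and are an assumption here, as is the choice to measure each age relative to the year of sale rather than the present day.

```python
def add_age_features(home):
    """Return a copy of `home` with three derived age columns, in years.

    Assumes Ames-style column names and measures age at the time of sale;
    both are illustrative assumptions, not confirmed by the post.
    """
    sold = home["YrSold"]
    out = dict(home)  # leave the input row untouched
    out["HouseAge"] = sold - home["YearBuilt"]
    out["YearsSinceRemodel"] = sold - home["YearRemodAdd"]
    # Homes without a garage have no garage year; keep that as missing.
    garage_year = home.get("GarageYrBlt")
    out["GarageAge"] = None if garage_year is None else sold - garage_year
    return out


# Made-up row for illustration only; not taken from the Ames data.
example = add_age_features(
    {"YearBuilt": 1995, "YearRemodAdd": 2003, "GarageYrBlt": None, "YrSold": 2008}
)
```

How to treat homes with no garage (missing, zero, or a separate indicator) is a modeling choice the post does not describe; the sketch simply leaves the value missing.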

We assessed feature importance using correlations, coefficients, the statistical significance of the p-values, and the feature selection performed by the penalized and tree-based models. However, we also carefully assessed the features from a more subjective and intuitive perspective based on domain knowledge.


Analysis & Results


“Location, location, location” is often cited in real estate as the most important factor in the value of a home or property. However, our models assigned much less importance to the home's neighborhood. We knew intuitively that this feature was important, yet the models were not picking it up, so we decided to explore further.

Below are three graphs that illustrate the relationship between the sale price, neighborhood and three other features: Kitchen Quality, Basement Quality, and Zoning. 



As shown in Figure 1 above, homes in the more expensive neighborhoods generally had excellent-quality kitchens.



As shown in Figure 2 above, homes in the more expensive neighborhoods generally had excellent-quality basements.



As shown in Figure 3 above, homes in the more expensive neighborhoods were generally in “Residential Low Density” zones.

These graphs, together with similar patterns for other variables, suggest that neighborhood plays a bigger role than our models indicated: it is closely tied to several of the quality features that the models did rank highly. We therefore treated neighborhood as one of the most important features.
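A comparison like the one in the kitchen-quality graph can also be checked numerically. The sketch below groups homes by neighborhood and reports the median sale price alongside the share of kitchens rated excellent. The column names and the "Ex" quality code follow the commonly distributed Ames data and are assumptions here, and the rows are made up purely for illustration.

```python
from collections import defaultdict
from statistics import median


def summarize_by_neighborhood(homes):
    """Median sale price and share of 'Ex' kitchens per neighborhood.

    Returns (name, median_price, share_excellent) tuples sorted from the
    most to the least expensive neighborhood.
    """
    groups = defaultdict(list)
    for home in homes:
        groups[home["Neighborhood"]].append(home)

    summary = []
    for name, rows in groups.items():
        prices = [r["SalePrice"] for r in rows]
        share_ex = sum(r["KitchenQual"] == "Ex" for r in rows) / len(rows)
        summary.append((name, median(prices), share_ex))
    return sorted(summary, key=lambda item: item[1], reverse=True)


# Made-up rows for illustration only; not taken from the Ames data.
homes = [
    {"Neighborhood": "A", "SalePrice": 300000, "KitchenQual": "Ex"},
    {"Neighborhood": "A", "SalePrice": 340000, "KitchenQual": "Ex"},
    {"Neighborhood": "B", "SalePrice": 150000, "KitchenQual": "TA"},
    {"Neighborhood": "B", "SalePrice": 170000, "KitchenQual": "Gd"},
]
summary = summarize_by_neighborhood(homes)
```

If the share of excellent kitchens rises with the median price, part of the neighborhood effect is likely being absorbed by the quality features, which is one possible reason the models ranked neighborhood lower.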



As shown above, neighborhood was not among the most important features in the Random Forest and XGBoost models. However, when the neighborhood feature was included in a five-variable OLS model, that model outperformed the Random Forest on test-set RMSE, as shown below in Figure 5. It also performed relatively well compared to other models with many more variables. This five-variable model predicts log(sale price) from:

1) overall quality, 2) neighborhood, 3) log(above ground living area SF), 4) basement quality, 5) time since remodel. This five-variable model explains 84.4% of the variation in sale price.
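To make the specification concrete, here is a minimal sketch of a log-linear OLS fit on those five predictors. It uses only the Python standard library and synthetic, noise-free rows, so it shows the mechanics rather than reproducing any result from the post. The field names, the numeric coding of basement quality, the placeholder neighborhood labels, and the coefficients used to generate the data are all illustrative assumptions.

```python
import math
import random


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(X, y):
    """OLS coefficients from the normal equations (X'X) beta = X'y."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


def design_row(home, levels):
    """Intercept, four numeric predictors, then neighborhood dummies.

    The first neighborhood level is the baseline and gets no dummy column.
    """
    dummies = [1.0 if home["neighborhood"] == lv else 0.0 for lv in levels[1:]]
    return [
        1.0,
        float(home["overall_qual"]),
        math.log(home["liv_area"]),  # log(above ground living area SF)
        float(home["bsmt_qual"]),    # assumed ordinal coding, 0 to 5
        float(home["years_since_remodel"]),
    ] + dummies


def rmse(y_true, y_pred):
    """Root mean squared error; here on the log(sale price) scale."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


# Synthetic data generated from arbitrary coefficients; not Ames values.
rng = random.Random(0)
levels = ["A", "B", "C"]  # placeholder neighborhood labels
true_beta = [10.0, 0.10, 0.50, 0.05, -0.002, 0.20, -0.10]
homes = [
    {
        "overall_qual": rng.randint(1, 10),
        "neighborhood": rng.choice(levels),
        "liv_area": rng.uniform(800, 3000),
        "bsmt_qual": rng.randint(0, 5),
        "years_since_remodel": rng.randint(0, 60),
    }
    for _ in range(60)
]
X = [design_row(h, levels) for h in homes]
log_price = [sum(b * x for b, x in zip(true_beta, row)) for row in X]

beta_hat = fit_ols(X, log_price)
fitted = [sum(b * x for b, x in zip(beta_hat, row)) for row in X]
```

Because the synthetic target contains no noise, the fit recovers the generating coefficients. On real data one would more likely use a statistics library, hold out a test set, and compare RMSE across models as the post describes.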



As shown above in Figure 5, our best-performing model was still a linear regression model: Feasible Generalized Least Squares (FGLS) with 27 variables. For this model, we used a stepAIC function and then manually selected variables to reduce AIC. We used FGLS because it improved upon the OLS model by accounting for heteroskedasticity and potential outliers in the data and by limiting the effects of omitted features. This model outperformed the penalized models as well as the tree-based models. Nevertheless, the five-variable OLS model is simpler, more interpretable, and potentially transferable to a comparable city, and it performed relatively close to the FGLS model. Therefore, you should never underestimate OLS and domain knowledge when optimizing models as a data scientist!
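For readers unfamiliar with FGLS, the sketch below shows one common textbook variant for heteroskedastic errors: fit OLS, model the log of the squared residuals as a linear function of the predictors, and refit by weighted least squares with the inverse of the estimated variances as weights. The post does not say which variance model the authors used, so this particular form is an assumption, and the data are synthetic. It uses only the Python standard library.

```python
import math
import random


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_wls(X, y, weights):
    """Weighted least squares from the normal equations (X'WX) beta = X'Wy."""
    k = len(X[0])
    xtwx = [
        [sum(w * row[i] * row[j] for row, w in zip(X, weights)) for j in range(k)]
        for i in range(k)
    ]
    xtwy = [sum(w * row[i] * yi for row, yi, w in zip(X, y, weights)) for i in range(k)]
    return solve(xtwx, xtwy)


def predict(X, beta):
    """Linear predictions X beta."""
    return [sum(b * x for b, x in zip(beta, row)) for row in X]


def fit_fgls(X, y):
    """Two-step FGLS with a log-linear variance model (an assumed form)."""
    ones = [1.0] * len(y)
    beta_ols = fit_wls(X, y, ones)  # step 1: plain OLS (all weights equal)
    resid = [yi - fi for yi, fi in zip(y, predict(X, beta_ols))]
    # Step 2: regress log squared residuals on X; floor avoids log(0).
    log_sq = [math.log(max(e * e, 1e-12)) for e in resid]
    gamma = fit_wls(X, log_sq, ones)
    # Step 3: weights are the inverse of the estimated error variances.
    weights = [1.0 / math.exp(v) for v in predict(X, gamma)]
    # Step 4: refit by weighted least squares.
    return fit_wls(X, y, weights), weights


# Synthetic data whose noise grows with x; not Ames values.
rng = random.Random(0)
xs = [rng.uniform(1.0, 10.0) for _ in range(200)]
X = [[1.0, x] for x in xs]
y = [2.0 + 3.0 * x + rng.gauss(0.0, 0.1 * x) for x in xs]

beta_fgls, weights = fit_fgls(X, y)
```

Observations with larger estimated variance receive smaller weights, so the noisier part of the data has less influence on the coefficients than it does under OLS.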

The skills the authors demonstrated here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

About Authors

Mikolaj Wilk

Data Scientist, Developer, Scholar and Scoutmaster with a background in Physics, determined to unlock business value and solve problems with analytical and data-driven methods.

Ted Dogan
