
Creating an interpretable model of the Ames Dataset

Zach Stone
Posted on Dec 13, 2022

This work was done with the guidance of the Data Science with Machine Learning bootcamp at NYC Data Science Academy. Sample code used for the research can be found on GitHub.

Background

The Ames Housing Dataset is a feature-rich collection of home listings in Ames, IA, along with their sale prices. The dataset is commonly used to demonstrate the need for feature selection, model tuning, and other techniques in supervised learning. Linear models and decision tree models are considered the most appropriate and best-performing models for sale price prediction. While out-of-the-box models can account for up to 92% of the variance in sale price on the full dataset, accuracy can be improved by removing outliers. This research, however, primarily trained and tested models on the full dataset.

Goals & results

While more powerful supervised methods were tested as well (XGBoost, Random Forest, Lasso), the primary goal here was to use feature selection to create an interpretable, comparably performing linear model. Compared to random forests or other ensemble models, a linear model is much more interpretable: the coefficients represent specific rates - in this case, dollar values - associated with the features.

All models were trained and tested on a cleaned and engineered version of the dataset. Some observations were adjusted to account for missing values or values that were logically inconsistent (e.g., garages listed with car capacity but no square footage, or remodel dates before the build date). Additionally, some categorical features were binned based on domain knowledge and exploratory analysis. To increase confidence in the specific values assigned to the selected features, the feature selection techniques were specifically chosen to reduce the standard error of the coefficients without severely impacting accuracy. The final models had the following advantages:

  • Standard linear regression and lasso models on the selected feature set were much less overfit than other models. The reduced linear model obtained train/test R² scores of 92.6%/91.8% and an average cross-validation score of 91.1% on the full dataset.
  • The final model used a subset of features chosen as a compromise between the reliability of the coefficients and AIC/BIC scores. This model improved the MAE by $40k on test data compared to the null model.
  • Feature selection led to a model with drastically smaller confidence intervals around its coefficients, dropping from an average of over 530% relative standard error per feature in the full model to 23% in the reduced model. Having reliable coefficients shows the contribution of each feature to the final appraisal.

The first point refers to the linear models performing similarly on training and test sets and in cross-validation, while more advanced models, like XGBoost and Random Forests, showed a significant difference between their training and test scores. The third point refers to the improved reliability of the estimated value that each feature contributes to a home's sale price. These estimates can be used to evaluate the benefits of potential improvements when preparing a house for sale, compare investments, or evaluate the reliability of various inspections of the house.

Accuracy

[Figure: predicted vs. actual price on a test set; the listings with the 10 largest residuals are highlighted.]

The plot above shows the predicted vs. actual price on a test set consisting of 30% of the dataset when the model is trained on the remaining 70%, giving an R² of 91.8% and a mean absolute error (MAE) of $14,418.43. For comparison, the null model, which guesses the average sale price of the training set, has an MAE of $55,061.91, showing that the linear model improves over the null model by over $40,000 per listing on average. Of the listings with the 10 largest residuals, four are the four most expensive houses in the test set. When these outliers are removed, the MAE drops slightly to $13,678.69, raising the R² to 92.1%. Additionally, using the same subset of features on the full dataset, a standard linear model had an average R² score of 91.1% in a 5-fold cross-validation test.
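A minimal sketch of this comparison, assuming the cleaned, dummified data lives in a pandas DataFrame df with a SalePrice column (the variable names here are illustrative, not the author's code):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# df: cleaned and engineered Ames data (assumed to be loaded already).
X, y = df.drop(columns="SalePrice"), df["SalePrice"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Linear model fit on the 70% training split, scored on the held-out 30%.
lm = LinearRegression().fit(X_train, y_train)
pred = lm.predict(X_test)

# Null model: always guess the average sale price of the training set.
null = DummyRegressor(strategy="mean").fit(X_train, y_train)

print("test R^2:", r2_score(y_test, pred))
print("MAE (linear):", mean_absolute_error(y_test, pred))
print("MAE (null):", mean_absolute_error(y_test, null.predict(X_test)))
```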

The out-of-the-box linear model on the full set of (cleaned and engineered) features had slightly higher train/test scores of 93.2%/92.0%, though the test score is comparable to the reduced model's. Additionally, the average cross-validation score on the full dataset was 87.4%. An F-test does reveal that the full model performs better than the reduced model on the full dataset. However, the larger gap between the train and test scores and the significantly lower cross-validation score indicate that the full model is slightly overfitting compared to the reduced model.
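The F-test for nested linear models can be run directly in statsmodels; a hedged sketch, assuming fitted OLS results full_res and reduced_res for the two feature sets on the same observations:

```python
# full_res / reduced_res: statsmodels OLS results for the full and reduced
# models (assumed to have been fit earlier on the same data).
f_value, p_value, df_diff = full_res.compare_f_test(reduced_res)
print(f"F = {f_value:.2f}, p = {p_value:.3g}, df difference = {df_diff}")
```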

Comparing models

By comparison, an out-of-the-box Gradient Boosting Regressor on the full set of features had a train/test score of 96.2%/91.9% and an average score of 90.0% on a 5-fold cross-validation test. (It should be noted that ordinals were encoded as numeric for both tree models, while individual values were dummified in the linear models.) While this model has higher accuracy overall, the test and cross-validation scores indicate that it is overfitting more than the linear models (reduced and full). Tuning the number of estimators and the tree depth exacerbated the problem, raising the training accuracy to nearly 100% while the test accuracy remained the same.
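A sketch of this comparison, reusing the train/test split from the earlier sketch (the numeric encoding of ordinals is assumed to have happened upstream):

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

gbr = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
print("train R^2:", gbr.score(X_train, y_train))
print("test R^2:", gbr.score(X_test, y_test))
print("5-fold CV R^2:", cross_val_score(gbr, X, y, cv=5).mean())
```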

Similarly, a tuned Random Forest Regressor was also overfit, with 98.2%/88.3% train/test scores, and cross-validation confirmed a score close to the test score. However, a properly tuned CatBoost model can reach an average 92.6% cross-validation score on the full dataset with all observations and features. The metrics above show that the interpretable linear model, on an appropriately selected subset of features, suffers only a slight loss in R² score compared to these more powerful models on test and cross-validation sets (at least when they are trained and tested on the full feature set).

The tuned Lasso model is the least overfit of the models and performed comparably on the cross-validation test. When only the subset of features from the reduced linear model is used, tuning the Lasso model by cross-validation pushes it towards the regular linear model: the tuned alpha becomes so small that effectively no penalty is applied, and the model becomes almost identical to an unpenalized one.

Interpretability

The feature selection methods used were intended to balance reducing the standard error of the coefficients against maintaining model accuracy. Since each coefficient has units of $/unit (or just $, as with the constant term), they are all measured on a ratio scale, so it makes sense to compute their relative standard error (RSE). This is defined as the standard error divided by the value of the coefficient itself, and it is a unitless ratio. The confidence interval of a coefficient is roughly the value of the coefficient ± 2 * the standard error. If this interval contains 0, we cannot be sure that the feature contributes meaningfully to the model, since the coefficient is not statistically distinguishable from 0. Hence, the following are equivalent:

  • The feature associated with a coefficient is a statistically significant predictor.
  • The absolute value of the coefficient is larger than 2 * standard error.
  • The RSE of the coefficient is less than 50%.

Features were selected such that all coefficients in the reduced model had RSE less than 50%, so that all of them were significant. Moreover, the average RSE in the reduced model was 23%, compared to over 530% in the full model, meaning that in the full model the average standard error was over five times the value of the coefficient itself.
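A minimal sketch of this check with statsmodels, reusing the training split from the earlier sketch (variable names are assumptions):

```python
import statsmodels.api as sm

# Fit OLS with an intercept; res.bse holds each coefficient's standard error.
res = sm.OLS(y_train, sm.add_constant(X_train)).fit()

# Relative standard error: standard error divided by the coefficient itself.
rse = (res.bse / res.params).abs()

# RSE < 50% is equivalent to |coef| > 2 * SE, i.e. the ~95% confidence
# interval excludes 0, so the feature is a significant predictor.
print("average RSE:", rse.mean())
print("insignificant features:", rse[rse >= 0.5].index.tolist())
```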

Numeric features

Thirteen numeric features persisted through the selection process. Eight corresponded to area. Since the total finished interior square footage is the sum of the square footage of each room type, one has to choose either the total areas or the individual room areas -- but not both -- in order to avoid linear dependence. Only the total interior square footage was used, here separated into 1st and 2nd floor square footage, both valued at about 50 $/sq.ft. This performed better than using the areas of the individual types of finished rooms. Other areas, including outdoor, garage, finished basement, and low-quality areas, were also appraised by the model, as was the surface area of masonry veneer. These values could be used, e.g., to estimate the ROI of constructing certain extensions of a home or remodeling unfinished areas.

The age of the house and the time since remodeling account for two more significant numeric features. We can see that houses depreciate at a rate of over 300 $/year, while the advantage from a remodel depreciates at a rate of about 100 $/year.

Additionally, certain counts were treated as numeric features. These deserve some explanation.

  • Most homes have at most one fireplace, whose valuation possibly acts as a proxy for the 'grandeur' of the home.
  • The number of full basement bathrooms acts as a proxy for whether the basement is also a livable area, with most listings having 0, 1, or 2 basement full bathrooms.
  • The last is the most surprising, with more bedrooms corresponding to a decrease in sale price. However, the relationship between sale price and bedroom count makes more sense in light of the dwelling types.

The average number of bedrooms per listing is 2.85, so many homes near this average will incur a comparable penalty. Many of the listings with a higher number of bedrooms, incurring a higher penalty, are one of two types: more modern 2-story homes, or duplexes. In those homes having four or more bedrooms, the lower end of the price distribution is mostly occupied by duplexes - which generally sell for lower than other dwelling types - while the higher end is occupied by the modern 2-story homes. The penalty incurred by the bedroom count for those homes may be offset by their other advantages, namely being newer and having larger square footage than other single-family homes.

Categorical features

Many of the remaining features are dummified categorical variables. While we will not cover all of them here, most of them fell into one of the following categories: (1) neighborhood, (2) exterior type, (3) dwelling type, (4) condition and quality inspection ratings, or (5) data about the functionality of the basement. Neighborhood is, as expected, one of the strongest predictors.

Certain neighborhoods were identified by the selection process as contributing significantly to the house price -- either as an advantage or a penalty. These advantages/penalties are relative to houses in all neighborhoods not selected. For listings in other areas, the neighborhood does not significantly contribute to the price of the house.

Another strong predictor was the dwelling type. All dwelling types which came out as significant through the selection process were penalties, corresponding to different alternatives to the traditional single-family dwelling. In particular, duplexes, 2-family conversions, and planned unit developments all incur penalties. The dwelling types not deemed significant - i.e., incurring no penalty - were exactly the single-family homes, regardless of type (1, 1.5, or 2 story, split-level, etc.). This supports the earlier argument that duplexes sell for less than other types of dwellings, which partially accounts for the negative coefficient on bedroom count.

Finally, the overall quality and condition inspections were also strong predictors, both rated on a scale of 1-10. For condition, 5 was the median value, and all values except this median and the extreme values of 1 and 10 were reliable predictors. Moreover, significant scores below the median (i.e., 2-4) were all penalties, while significant scores above the median (i.e., 6-9), were all advantages. As expected, the contribution to price changes monotonically with the score among the significant scores.

Similarly, 6 was the median quality score, and all scores 4-10 except the median were significant. Again, the significant values below the median (4 and 5) were penalties, while those above the median (7-10) were advantages. The contribution of quality to price is nearly monotonic, with the penalties associated with scores 4 and 5 lying within each other's confidence intervals.

These estimates of the value of various ratings demonstrate that such inspections can be reliable metrics, whose results contribute significantly to the appraisal. However, most other inspection results did not reliably contribute to the sale price.

Data preparation

The Ames dataset is incredibly rich, with 79 features covering a range of numeric, ordinal, and nominal categorical information about each listing. The numeric data includes square footage of various types of rooms, the length of the lot perimeter touching a street, and information about the age of the house and renovations; the ordinal data contains different room counts and inspection results; and the nominal data includes the neighborhood names, materials used in various house features, and information about the nearby environment.

Other than encoding the various features, e.g., converting Likert-like ratings to integers and dummifying categorical features, minimal changes were made to the data. Outliers were kept, and the features used in the linear model were not rescaled. A few logical inconsistencies were adjusted based on available information - for example, one remodeling year predated the house's build year. Additionally, a few listings were missing a small amount of information, such as the type of electrical system used in the house, which could generally be imputed with the majority value without drastically changing the dataset.
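For instance, two of these adjustments might look like the following sketch; the clamping rule for the remodel year is an assumption about how the fix was made, while YearBuilt, YearRemodAdd, and Electrical are the dataset's actual column names:

```python
# Logical fix: a remodel year cannot precede the build year.
bad = df["YearRemodAdd"] < df["YearBuilt"]
df.loc[bad, "YearRemodAdd"] = df.loc[bad, "YearBuilt"]

# Mode imputation for a sparsely missing categorical (electrical system type).
df["Electrical"] = df["Electrical"].fillna(df["Electrical"].mode()[0])
```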

Otherwise, only a few features were engineered: (1) many categorical features had a high number of categories, which were binned based on domain knowledge (e.g., various types of veneer were collapsed into meta-categories such as 'brick', 'wood', or 'manufactured') to avoid an explosion of dummy features, and (2) the product of the unfinished basement square footage with the basement quality inspection gave a more reliable metric than either feature independently. The idea behind this second metric is that there may be a gradation among types of unfinished basement, though it does introduce the naive assumption that the value of a square foot of unfinished basement varies linearly with the quality inspection.
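A sketch of both steps, with an illustrative (not the author's exact) grouping for the veneer types:

```python
# (1) Bin a high-cardinality categorical into meta-categories.
veneer_bins = {"BrkFace": "brick", "BrkCmn": "brick",
               "Stone": "stone", "CBlock": "manufactured"}
df["VeneerGroup"] = df["MasVnrType"].map(veneer_bins).fillna("other")

# (2) Interaction feature: unfinished basement sq.ft. weighted by the
# basement quality rating (assumes BsmtQual is already integer-encoded).
df["BsmtUnfValue"] = df["BsmtUnfSF"] * df["BsmtQual"]
```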

Feature selection methods

A number of feature selection methods were used together to select a set of features whose individual values were reliable, while also maintaining model accuracy and generalizability. Several factors can contribute to the (relative) standard error of the coefficients. To name two:

  • If too many predictive features are used, it can be difficult for the model to determine the contribution of each.
  • If there is a linear combination of a set of features f1, f2, ... which is sufficiently close to zero across all observations (i.e., the features are nearly linearly dependent), then arbitrary multiples of that combination can be added to the model coefficients without changing the predictions much. Hence, a huge range of coefficients is possible, each producing a model making nearly identical predictions, so we cannot rely on the particular values a model decides on.

The first is especially relevant, since dummification of the many categorical features in the dataset explodes the dimension of the data. The second is sometimes referred to as multicollinearity; note that this refers to collinearity in the space whose coordinates represent different observations (of the same feature), not the space whose coordinates represent different feature values (of the same observation), even though data is typically visualized in the latter.

Lasso

One technique which can address both is using the coefficients of a lasso model to determine the significance of each feature. Such models penalize large coefficients in a way which coerces some of them very close to zero. A lasso model was tuned using cross-validation on the training set, and those features with small coefficients were removed. A histogram of the coefficients showed that a large majority of features could be excluded this way, with many coefficients very close to 0.
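A sketch of this screen with scikit-learn; the standardization step and the near-zero cutoff are assumptions, since the post does not specify either:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Lasso is scale-sensitive, so features are standardized for the screen only.
Z = StandardScaler().fit_transform(X_train)
lasso = LassoCV(cv=5, random_state=0).fit(Z, y_train)

threshold = 1e-2   # near-zero cutoff; the exact value used is an assumption
keep = X_train.columns[np.abs(lasso.coef_) > threshold]
print(f"kept {len(keep)} of {X_train.shape[1]} features")
```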

Linear dependence

However, the remaining features were still heavily linearly dependent in the sense described above. For example, certain totals (like square footage) are by definition the sum of other features (like the square footage of each room type), and so a linear combination of them equaling exactly zero exists. Removing a feature which is (close to) a linear combination of the remaining features will not reduce the information available to the model. However, as the way in which families of features may depend on each other can be complex, it is not always clear which feature to remove to reduce the linear dependence.

To handle this, an iterative method was used, consisting of the following steps:

  • See how much each feature is linearly dependent on the others by checking the R² of the linear model predicting that feature from all the other features.
  • Loop through the features, starting with the most predictable, and check whether removing the feature improves (or at least retains) the accuracy of the model under cross-validation.
  • If removing a feature retains or improves the cross-validation score, drop it and repeat the process; otherwise, continue down the list until such a feature is found.
  • If removing any feature reduces the cross-validation score, terminate the process.
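A minimal sketch of this loop, continuing from the lasso screen above (helper names are illustrative):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def by_predictability(X):
    # Features sorted by the R^2 of predicting them from all other features.
    r2 = {col: LinearRegression()
               .fit(X.drop(columns=col), X[col])
               .score(X.drop(columns=col), X[col])
          for col in X.columns}
    return sorted(r2, key=r2.get, reverse=True)

def cv_score(X, y):
    return cross_val_score(LinearRegression(), X, y, cv=5).mean()

X_sel = X[keep].copy()               # features surviving the lasso screen
baseline = cv_score(X_sel, y)
while True:
    for col in by_predictability(X_sel):     # most predictable first
        trial = X_sel.drop(columns=col)
        score = cv_score(trial, y)
        if score >= baseline:                # drop only if CV holds or improves
            X_sel, baseline = trial, score
            break                            # restart from the updated set
    else:
        break                                # no removable feature remains
```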

This method heavily reduced the linear dependence among the features. While the features remaining after lasso selection were on average 86.7% linearly predictable from the other features, this number dropped to 56.6% after iterative selection. Despite eliminating many features, the cross-validation score held at 90.4%, as the information in the removed features is largely captured by the remaining ones.

Finally, additional features were considered for elimination based on high predictability from the remaining features and high standard errors. Various combinations were tested based on AIC/BIC scores. Those which reduced AIC/BIC scores the most without reducing cross-validation scores were removed, resulting in the final selection of features.
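A sketch of this last screen, where `flagged` is a hypothetical list of high-RSE, highly predictable candidates built from the diagnostics above:

```python
import statsmodels.api as sm

def fit_ols(X, y):
    return sm.OLS(y, sm.add_constant(X)).fit()

base = fit_ols(X_sel, y)
for col in flagged:          # flagged: hypothetical candidate list (assumed)
    trial = fit_ols(X_sel.drop(columns=col), y)
    # Lower AIC/BIC is better; a drop was kept only if the cross-validation
    # score also held up (CV check omitted here for brevity).
    print(col, trial.aic - base.aic, trial.bic - base.bic)
```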

Summary

This research tackled an alternative goal with the Ames Housing Dataset: constructing, through feature selection, an interpretable model competitive with more advanced models. This allows us to estimate the contribution of each feature to the sale price in addition to automating the appraisal process. Though it takes a slight hit to accuracy, the coefficients in the reduced model were significantly more reliable than in the full model, reducing the average RSE from over 530% to 23%. Additionally, the reduced linear model was less overfit than the more advanced models.

While the reliability of the coefficients was significantly improved, it is important to remark that the values are relative to other houses (or the same house upon changing a feature), as they are offset by a constant. However, such estimates could be used to determine which improvements to make on a house, the estimated annual loss due to the age of the home and/or renovations, and other values important to investors. While Lasso is otherwise the most competitive model, it collapses to regular linear regression upon restricting to the selected features, indicating that the choice of features is appropriate.

About Author

Zach Stone

I am a data scientist with a background in linguistics research and math. I love to make it easier to analyze and draw insights from complex patterns using a combination of research, code, and modeling.
