Data Analysis on the Ames Housing Dataset
The skills I demoed here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
The Ames Housing dataset, basis of an ongoing Kaggle competition and assigned to bootcamp students globally, is a modern classic. It presents data on 81 features of houses -- mostly single family suburban dwellings -- that were sold in Ames, Iowa in the period 2006-2010, which encompasses the housing crisis.
The goal is to build a machine learning model to predict the selling price for the home, and in the process learn something about what makes a home worth more or less to buyers.
An additional goal I gave myself was to design an app using the Plotly Dash library, which is like R/Shiny for Python. Having been a homeowner more than once in my life, I wanted a simple app that would give me a sense of how much value I could add to my home by making improvements like boosting the quality of the kitchen, basement or exterior, or by adding a bathroom.
Conceptually, I figured I could build a regression model on the data and then use the coefficients of the key features, which would probably include the interesting variables (e.g., kitchen quality, finished basement). Each coefficient estimates what a unit change in that feature would do to the target variable, which here is the price of the house.
One advantage of the Ames dataset is that it's intuitive. The housing market isn't particularly hard to understand, and we all have an idea of what the key features would be. Ask a stranger what impacts the price of a house, and they'll probably say: overall size, number of rooms, quality of the kitchen or other remodeling, size of the lot. Neighborhood. Overall economy.
To set a baseline, I ran a simple regression on a single feature: overall living space. My R^2 was over 0.50 -- meaning that more than half of the variance in house price could be explained by changes in size. So I didn't want to overcomplicate the project.
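A minimal sketch of that kind of baseline, using synthetic data rather than the real Ames file. The variable names and every number below are invented for illustration, so the R^2 it prints is not the one from my model:

```python
import numpy as np

# Synthetic stand-in for the Ames data: living area (sq ft) and sale price ($).
rng = np.random.default_rng(0)
living_area = rng.uniform(800, 3000, size=500)
sale_price = 20_000 + 100 * living_area + rng.normal(0, 40_000, size=500)

# Fit price = intercept + slope * living_area by ordinary least squares.
slope, intercept = np.polyfit(living_area, sale_price, deg=1)
predicted = intercept + slope * living_area

# R^2 = 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((sale_price - predicted) ** 2)
ss_tot = np.sum((sale_price - sale_price.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope = {slope:.1f} $/sq ft, R^2 = {r_squared:.2f}")
```

The slope here plays the role of the coefficient: the estimated change in price for one extra square foot.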
The data had a lot of missing values. Many of these were not missing at random but rather likely indicated the feature was not present in the house. For example, Alley, PoolQC and Fence were mostly missing; I imputed a 'None' here. Other values needed special treatment.
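For the "missing means absent" columns, the imputation can be sketched as below. The tiny DataFrame is made up; only the column names come from the dataset:

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the real data; NaN here means "house has no such feature".
df = pd.DataFrame({
    "Alley": [np.nan, "Grvl", np.nan, "Pave"],
    "PoolQC": [np.nan, np.nan, "Gd", np.nan],
    "Fence": ["MnPrv", np.nan, np.nan, np.nan],
})

# Replace NaN with an explicit 'None' category so it survives dummification later.
absent_means_none = ["Alley", "PoolQC", "Fence"]
df[absent_means_none] = df[absent_means_none].fillna("None")

print(df)
```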
LotFrontage was 17% missing. This feature turned out to be closely related to LotShape, with IR3 (irregular lots) having much higher LotArea values. So I imputed values based on LotShape.
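One way to impute by group is sketched below. Using the group median is an assumption made for this example (a mean or a small regression would also be reasonable), and the rows are invented:

```python
import numpy as np
import pandas as pd

# Toy rows; the real data has many more lots per LotShape category.
df = pd.DataFrame({
    "LotShape": ["Reg", "Reg", "Reg", "IR1", "IR1", "IR3", "IR3"],
    "LotFrontage": [60.0, 70.0, np.nan, 80.0, np.nan, 150.0, np.nan],
})

# Median LotFrontage within each LotShape, aligned row by row with df.
group_median = df.groupby("LotShape")["LotFrontage"].transform("median")

# Fill only the missing entries with their group's median.
df["LotFrontage"] = df["LotFrontage"].fillna(group_median)

print(df)
```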
There were a lot of categorical values in the data which I knew I'd have to dummify. Before going there, I looked into simplifying some of them. 'Neighborhood' seemed ripe for rationalization. It was clear (and intuitive) that neighborhoods varied in house price and other dimensions. There didn't seem to be a pattern of sale price and sale volume by neighborhood (except for NorthAmes), but Neighborhood and Sale Price were related.
Now I know it isn't best practice to create a composite variable based only on the target, so I created a new feature called "QualityxSalePrice" and split it into quartiles, grouping Neighborhoods into four types.
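A rough sketch of that grouping follows. It assumes the composite is OverallQual multiplied by SalePrice, averaged per neighborhood; that reading, the neighborhood labels A-D and the NeighType column name are all assumptions for illustration:

```python
import pandas as pd

# Toy data with four made-up neighborhoods.
df = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "OverallQual": [4, 5, 5, 6, 7, 7, 9, 8],
    "SalePrice": [100_000, 110_000, 140_000, 150_000,
                  200_000, 210_000, 350_000, 330_000],
})

# Composite score per house, then averaged per neighborhood.
df["QualityxSalePrice"] = df["OverallQual"] * df["SalePrice"]
score = df.groupby("Neighborhood")["QualityxSalePrice"].mean()

# Split neighborhoods into quartiles: 1 = lowest score, 4 = highest.
neigh_type = pd.qcut(score, q=4, labels=False) + 1

# Attach the neighborhood type back onto each house.
df["NeighType"] = df["Neighborhood"].map(neigh_type)

print(df[["Neighborhood", "NeighType"]].drop_duplicates())
```

After dummifying, a column like this would yield indicators along the lines of NeighType_4 for the top quartile.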
There were many features related to house size: GrLivArea (above-grade, i.e. above-ground, living area), TotalBsmtSF (basement size), 1stFlrSF, even TotRmsAbvGrd (number of rooms). These were -- of course -- correlated with each other as well as with the target variable. Likewise, there were numerous features related to the basement and the garage which seemed to overlap.
To simplify features without losing information, I created some new (combined) features such as total bathrooms (full + half baths), and dropped others that were obviously duplicative (GarageCars & GarageArea).
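In pandas, that kind of consolidation might look like the sketch below. The TotalBath name, the choice to drop GarageArea rather than GarageCars, and the toy values are assumptions for illustration:

```python
import pandas as pd

# Toy rows using real Ames column names.
df = pd.DataFrame({
    "FullBath": [1, 2, 2, 3],
    "HalfBath": [0, 1, 0, 1],
    "GarageCars": [1, 2, 2, 3],
    "GarageArea": [280, 520, 480, 840],
})

# Combine the bath counts into one feature.
df["TotalBath"] = df["FullBath"] + df["HalfBath"]

# Check how much two garage features overlap before dropping one of them.
garage_corr = df["GarageCars"].corr(df["GarageArea"])
print(f"GarageCars vs GarageArea correlation: {garage_corr:.2f}")

# Keep the combined feature and only one of the near-duplicate garage features.
df = df.drop(columns=["FullBath", "HalfBath", "GarageArea"])
```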
Looking across both continuous and categorical features, there were a number that could be dropped since they contained little or no information; they were almost all in one category (e.g., PoolArea, ScreenPorch, 3SsnPorch and LowQualFinSF).
Finally, I dummified the remaining categorical features and normalized the continuous ones, including the target. The sale price (target) itself was transformed via log, to adjust a skew caused by a relatively small number of very expensive houses.
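Those three steps can be sketched as follows. Z-score scaling and drop_first=True are assumptions here, since "normalized" and "dummified" could be done in other ways, and the rows are invented:

```python
import numpy as np
import pandas as pd

# Toy rows: one continuous feature, one categorical feature and the target.
df = pd.DataFrame({
    "GrLivArea": [900.0, 1500.0, 2100.0, 3000.0],
    "KitchenQual": ["TA", "Gd", "Gd", "Ex"],
    "SalePrice": [120_000.0, 180_000.0, 240_000.0, 600_000.0],
})

# Log-transform the target to pull in the long right tail of expensive houses.
df["LogSalePrice"] = np.log(df["SalePrice"])

# Standardize continuous columns (including the logged target) to mean 0, std 1.
for col in ["GrLivArea", "LogSalePrice"]:
    df[col] = (df[col] - df[col].mean()) / df[col].std()

# One-hot encode the categorical column, dropping one level as the reference.
model_df = pd.get_dummies(
    df.drop(columns="SalePrice"), columns=["KitchenQual"], drop_first=True
)

print(model_df)
```

Keeping the mean and standard deviation of each column around is what makes it possible to de-normalize coefficients later.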
Data Exploration & Modeling
Everybody knows (or thinks they know) that it's easier to sell a house at certain times of year, so I looked at sale price by month and year. These plots showed a peak in the fall (surprisingly, to me) as well as the impact of the 2008 housing crisis.
Correlation for the continuous variables showed a number aligned with the target variable, and so likely important to the final model. These included the size-related features, as well as features related to the age of the house and/or year remodeled.
Because my target variable was continuous, I focused on linear regression and tree-based models. I ran OLS, Ridge, Lasso and ElasticNet regressions, using grid search and cross-validation to determine the best parameters and to minimize overfitting.
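A sketch of that tuning loop with scikit-learn, run on synthetic data. The parameter grids, the 5-fold setting and the data itself are assumptions for the example, not the values I settled on:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression problem: 3 informative features, 7 pure-noise features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
true_coef = np.array([3.0, 2.0, 1.0, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_coef + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Each model gets its own (illustrative) grid of penalty strengths.
alphas = {"alpha": [0.001, 0.01, 0.1, 1.0]}
candidates = {
    "Ridge": (Ridge(), alphas),
    "Lasso": (Lasso(max_iter=10_000), alphas),
    "ElasticNet": (ElasticNet(max_iter=10_000),
                   {**alphas, "l1_ratio": [0.2, 0.5, 0.8]}),
}

test_r2 = {}
for name, (model, grid) in candidates.items():
    # 5-fold cross-validated grid search on the training split only.
    search = GridSearchCV(model, grid, cv=5, scoring="r2")
    search.fit(X_train, y_train)
    # Score the refit best model on the held-out test split.
    test_r2[name] = search.score(X_test, y_test)
    print(name, search.best_params_, round(test_r2[name], 3))
```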
The error metrics for these models on the test data set were all similar. The R^2 values for Ridge, Lasso and ElasticNet were all in the range 0.89-0.91. The significance tests for the OLS coefficients indicated that a number of them were very significant, while others were not.
I also ran two tree-based models, Gradient Boosting (GBM) and XGB (also gradient boosting), and looked at Feature Importances. The tree-based models performed similarly to the linear regression models, and pointed to similar features as being significant. Of course, this is what we'd expect.
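Pulling feature importances out of a gradient boosting model can be sketched like this, using scikit-learn's GradientBoostingRegressor. The data is synthetic, with a deliberately useless "Noise" column, and the hyperparameters are arbitrary:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data: two features drive the price, one is pure noise.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "OverallQual": rng.integers(1, 11, size=400),
    "GrLivArea": rng.uniform(800, 3000, size=400),
    "Noise": rng.normal(size=400),
})
y = (20_000 * X["OverallQual"] + 50 * X["GrLivArea"]
     + rng.normal(0, 10_000, size=400))

gbm = GradientBoostingRegressor(n_estimators=200, max_depth=3, random_state=0)
gbm.fit(X, y)

# Importances are normalized to sum to 1; sort so the strongest come first.
importances = pd.Series(
    gbm.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(importances)
```

Unlike regression coefficients, these importances carry no sign or unit, so they rank features but don't say how much a one-step change is worth.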
In the end, the most important features across all the models were: OverallQual (quality of the house), GrLivArea (above-ground size), YearRemodAdd (when it was remodeled), NeighType_4 (high-end neighborhood), and the quality of the kitchen, finished basement and exterior. If nothing else, the model fits our intuition.
Feeding the features through the model one by one, additively, it became obvious that the most oomph came from 20-25 key features, with the rest more like noise.
Plotly Dash App
Settling on an ordinary multiple linear regression as the most intuitive and better-performing choice, I ended up with an array of 23 features with coefficients and an intercept. To use them in my app, I had to de-normalize them as well as the target.
The app was aimed at a homeowner who wanted to know the value of certain common improvements. The coefficients in my linear model were the link here: each coefficient represented the impact of a UNIT CHANGE in that feature on the TARGET (price), assuming all other variables stayed the same. So I just had to come up with sensible UNITS for the user to toggle and the impact on the TARGET (price) was as simple as multiplying the UNIT by the COEFFICIENT.
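The arithmetic is simple enough to show in a few lines. The coefficient values below are hypothetical dollar amounts chosen for illustration, not the ones from my model:

```python
# Hypothetical de-normalized coefficients: dollars per one-unit change.
coefficients = {
    "KitchenQual": 8_000.0,   # per quality step
    "TotalBath": 5_500.0,     # per added bathroom
    "WoodDeckSF": 25.0,       # per square foot of deck
}

# Units the user toggled in the app (made-up scenario).
unit_changes = {"KitchenQual": 1, "TotalBath": 1, "WoodDeckSF": 200}

# Impact of each improvement = units changed * coefficient, all else held equal.
impact = {name: unit_changes[name] * coef for name, coef in coefficients.items()}
total_impact = sum(impact.values())

for name, dollars in impact.items():
    print(f"{name}: ${dollars:,.0f}")
print(f"Total: ${total_impact:,.0f}")
```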
Since I was building an app and not trying to win a data science prize, I focused on the significant features that would be most interesting to a remodeler. Some things you can't change: Neighborhood and Year Built, for example.
But some things you can: in particular, features related to Quality would be in the app. These were Basement, Kitchen and Exterior Quality. Other features could be changed with significant investment: Baths (can add a bath), Wood Deck (can add a deck or expand one), and even house size (since you can -- if you're crazy and rich -- tack on an addition somewhere).
I also included a couple Y/N features since they could affect the price: Garage and Central Air.
Plotly Dash: I knew Plotly as a way to create interactive plots in Python, but the Dash framework was new to me. One of our instructors introduced me to it, and an active online community and plenty of examples also came to my aid.
Dash is similar to R/Shiny in that it has two key components: a front-end layout definer, and a back-end interactivity piece. These can be combined into the same file or separated (as they can in Shiny).
The general pseudo-code outline of a Dash App starts with importing the libraries and components. It's written in Python, so you'll need numpy and pandas as well as Plotly components for the graphs.
A couple of things took some adjustment. Dash renders components as HTML elements when they are written after "html.", which is straightforward. I included Div tags, H2 (headers) and P (paragraph) tags. Elements nested inside a Div are called its "children" (as they are in HTML). I used "children" here because I wanted to have two columns -- one for the inputs and the second for the graph of adjusted house prices.
The rows and columns can be done in a couple different ways, but I used simple CSS style sheet layouts (indicated by "className").
Interactivity is enabled by the layout and the callbacks. In the layout, as in Shiny, I could specify different types of inputs such as sliders and radio buttons. The selection isn't as large or as attractive as what you find in Shiny, but it gets the job done. I used sliders and buttons, as well as an input box for the average starting price.
The functional part of the app starts with "@app.callback," followed by a list of Outputs (such as figures/charts to update) and Inputs (from the layout above). The way Dash works, these Inputs are continually monitored, as via a listener, so that any change to a slider or other input will immediately run through the function and change the Outputs.
Inputs and Outputs
Callbacks are easy enough to get, but what follows took some practice. Right after the @app.callback section comes the function it decorates. The parameters of the function represent the Inputs from @app.callback, but they can have any name. The names don't matter because the values are passed positionally, in the same order as the callback Input list.
This function includes any calculations you need to do on the Inputs to arrive at the Outputs, and if the Output is a graph or plot -- as it almost always is -- there's also a plot-rendering piece.
In my case, I wanted the output to be a normal distribution of potential new values for the sale price, given the changes in the Inputs. For example, if someone moved the "Overall Quality" slider up a notch, I needed to generate a new average saleprice, a difference (from the baseline) and a normal distribution to reflect the range.
I did this by turning each coefficient into a factor, adding up the factors and multiplying by current sale price. I then generated a random normal distribution using Numpy with the new target as the mean and a standard deviation based on the original model.
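A sketch of that calculation follows. It reads "factor" as a fractional change in price per unit of a feature, which is an assumption, and the factor values and the standard deviation are invented rather than taken from my model:

```python
import numpy as np

start_price = 200_000.0

# Hypothetical fractional price change per one-unit change in each feature.
factors_per_unit = {"OverallQual": 0.06, "KitchenQual": 0.03}

# Scenario: the user moved the Overall Quality slider up one notch.
unit_changes = {"OverallQual": 1, "KitchenQual": 0}

# Hypothetical spread, standing in for the model's error on the price scale.
model_std = 0.12 * start_price

# Add up the factors and apply them to the starting price.
total_factor = sum(factors_per_unit[k] * unit_changes[k] for k in factors_per_unit)
new_price = start_price * (1 + total_factor)
difference = new_price - start_price

# Draw a normal distribution of plausible sale prices around the new mean.
rng = np.random.default_rng(0)
simulated_prices = rng.normal(loc=new_price, scale=model_std, size=10_000)

print(f"New average price: ${new_price:,.0f} (change: ${difference:,.0f})")
```

The simulated_prices array is what would be handed to a histogram or density plot in the callback.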
The final dashboard looked like this:
Granted, there are a lot of caveats around the app. It's based only on the Ames housing dataset, so other areas of the country at different times might see different results. It requires estimating a 'starting price' that assumes the default values are true, and this estimate might be difficult to produce. But as a proof of concept, it has potential.
I think there's definitely a data-driven way to estimate the value of some common improvements. This would help a lot of us who are thinking of selling into the housing boom and wonder how much we can reasonably spend on a kitchen remodel before it turns into red ink.