Data Analysis on Housing Prices in Ames
The skills demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
AMES Housing Project
- A minimalist approach to predicting housing prices in Ames through data analysis.
Real estate is a tricky game for investors. Finding the next hotspot, the next trendy city, or the hottest neighborhood comes down to a balance between intuition and data. But what happens once you find a neighborhood that is growing in popularity, where rising prices suggest an investment could be fruitful? How do you evaluate a house and decide whether it is undervalued or overvalued? The depth and breadth of the data compiled by Dean De Cock on houses in Ames, Iowa give us an opportunity to explore just how to do that.
Exploratory Data Analysis
The Ames dataset is large and covers a wide range of variables. It includes over 80 features: 20 ordinal, 25 categorical, and 36 numerical. To build a predictive model, it is important to understand which of these features are more relevant, which are less relevant, and how strongly the features are correlated with one another (multicollinearity).
We begin with an overview of how each feature correlates with the feature we would like to predict, SalePrice. Here is a diagram of the most highly correlated features:
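A ranking like the one described above can be produced with a few lines of pandas. This is a minimal sketch, not the original analysis code: the helper name `top_correlations` is hypothetical, and the small synthetic frame only borrows Ames column names so the example runs on its own.

```python
import numpy as np
import pandas as pd

def top_correlations(df, target="SalePrice", n=10):
    """Return the n numeric features most correlated with the target."""
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    # Rank by absolute value so strong negative relationships are not hidden
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order].head(n)

# Synthetic stand-in for the Ames data (all values are made up)
rng = np.random.default_rng(0)
n_rows = 500
area = rng.normal(1500, 400, n_rows)
qual = rng.integers(1, 11, n_rows)
demo = pd.DataFrame({
    "GrLivArea": area,
    "OverallQual": qual,
    "MiscVal": rng.normal(0, 1, n_rows),  # unrelated noise column
    "SalePrice": 60 * area + 20000 * qual + rng.normal(0, 20000, n_rows),
})

ranked = top_correlations(demo)
print(ranked)
```

On the real data, the same function would be called on the loaded Ames DataFrame in place of `demo`.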
We can see that the features most highly correlated with SalePrice are OverallQual and GrLivArea (living area in sq ft). This is intuitive, since a larger house or a higher-quality house would be more expensive. However, OverallQual is a vague feature: we don't really know what it means or how it is constructed.
Let's take a look at how OverallQual relates to SalePrice with a box plot:
One thing we notice right off the bat is that there is a positive relationship between OverallQual and SalePrice. However, this relationship becomes less clear as OverallQual increases: the spread of sale prices widens at the higher quality levels, so OverallQual becomes a weaker predictor there.
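The widening spread that the box plot shows can also be checked numerically by summarizing SalePrice within each OverallQual level. The sketch below is an illustration under assumptions: `spread_by_level` is a hypothetical helper, and the synthetic data is deliberately generated with noise that grows with quality, so it demonstrates the check rather than reproducing the Ames result.

```python
import numpy as np
import pandas as pd

def spread_by_level(df, by="OverallQual", target="SalePrice"):
    """Median, interquartile range, and count of the target for each level."""
    grouped = df.groupby(by)[target]
    return pd.DataFrame({
        "median": grouped.median(),
        "iqr": grouped.quantile(0.75) - grouped.quantile(0.25),
        "count": grouped.size(),
    })

# Synthetic data: price rises with quality, and so does the noise (assumption)
rng = np.random.default_rng(0)
qual = rng.integers(1, 11, size=2000)
price = 20000 * qual + rng.normal(0, 3000 * qual)
demo = pd.DataFrame({"OverallQual": qual, "SalePrice": price})

summary = spread_by_level(demo)
print(summary)
```

A rising median alongside a rising interquartile range is the numerical counterpart of boxes that climb and grow taller from left to right.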
Next, we look at how GrLivArea relates to SalePrice:
We can see there is a linear relationship, and when we remove some outliers, the relationship strengthens.
While GrLivArea is a moderately strong predictor, with a correlation of ~.75 after the outliers are removed, let's see if we can do better by using more features to predict SalePrice and applying regularization to limit the effects of multicollinearity among the variables.
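The effect of trimming outliers on the correlation can be checked along these lines. The post does not state which rule was used to remove outliers, so the 4,000 sq ft cutoff below is an assumption, and the data is synthetic with three large, cheaply sold houses added by hand.

```python
import numpy as np
import pandas as pd

# Synthetic linear relationship between living area and price (made-up values)
rng = np.random.default_rng(1)
area = rng.uniform(600, 3000, 300)
price = 100 * area + rng.normal(0, 40000, 300)
demo = pd.DataFrame({"GrLivArea": area, "SalePrice": price})

# Add a few very large houses that sold far below the trend
outliers = pd.DataFrame({
    "GrLivArea": [4700, 5600, 5100],
    "SalePrice": [180000, 160000, 190000],
})
demo = pd.concat([demo, outliers], ignore_index=True)

AREA_CUTOFF = 4000  # assumed threshold, not taken from the original analysis

before = demo["GrLivArea"].corr(demo["SalePrice"])
trimmed = demo[demo["GrLivArea"] <= AREA_CUTOFF]
after = trimmed["GrLivArea"].corr(trimmed["SalePrice"])

print(f"correlation before trimming: {before:.3f}")
print(f"correlation after trimming:  {after:.3f}")
```

A handful of points far from the trend can noticeably lower a Pearson correlation, which is why it rises once they are filtered out.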
Preprocessing The Data
We first look at a distribution of the SalePrice:
We see that SalePrice is right-skewed. Linear models tend to perform better when the target is closer to a normal distribution, so we apply a log transformation using NumPy:
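A minimal sketch of this step follows. It assumes `np.log1p` as the transform (the post does not say whether `np.log` or `np.log1p` was used) and draws a synthetic right-skewed price series in place of the real column.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices standing in for the real SalePrice column
rng = np.random.default_rng(2)
sale_price = pd.Series(rng.lognormal(mean=12, sigma=0.4, size=1000), name="SalePrice")

# log1p computes log(1 + x); expm1 is its exact inverse
log_price = np.log1p(sale_price)
recovered = np.expm1(log_price)

skew_before = sale_price.skew()
skew_after = log_price.skew()
print(f"skew before: {skew_before:.2f}, skew after: {skew_after:.2f}")
```

Because the model is then trained on the log scale, its predictions need `np.expm1` applied to bring them back to dollar amounts.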
Next, we look at how much skew the other features have and aim to reduce it with the same log transformation. Using Python, we identify that the following features have skewness values greater than .6:
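One way to find such features and apply the correction is sketched below. The .6 threshold comes from the text; the helper `log_transform_skewed`, the choice of `np.log1p`, and the restriction to non-negative columns are assumptions made so the example is safe to run, and the data is synthetic.

```python
import numpy as np
import pandas as pd

def log_transform_skewed(df, threshold=0.6, exclude=()):
    """Apply log1p to non-negative numeric columns whose skewness exceeds threshold."""
    numeric = df.select_dtypes(include="number").columns.difference(list(exclude))
    skewness = df[numeric].skew()
    nonneg = df[numeric].min() >= 0  # log1p is only safe for values above -1
    skewed = skewness[(skewness > threshold) & nonneg].index.tolist()
    out = df.copy()
    if skewed:
        out[skewed] = np.log1p(out[skewed])
    return out, skewed

# Synthetic frame with one skewed feature and one roughly symmetric feature
rng = np.random.default_rng(4)
n_rows = 500
demo = pd.DataFrame({
    "LotArea": rng.lognormal(mean=9, sigma=0.8, size=n_rows),
    "YearBuilt": rng.integers(1900, 2010, size=n_rows),
    "Neighborhood": rng.choice(["NAmes", "OldTown"], size=n_rows),
    "SalePrice": rng.lognormal(mean=12, sigma=0.4, size=n_rows),
})

# The target is excluded here because it is transformed separately
transformed, skewed = log_transform_skewed(demo, exclude=("SalePrice",))
print(skewed)
```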
We correct these skews using log transformations. However, some features cannot be normalized and some are considered irrelevant to our model, so we drop the following:
Finally, for missing values related to the numerical variables, we fill them in using the mean of those columns.
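Mean imputation of the numerical columns can be sketched as follows. The helper name `fill_numeric_with_mean` is hypothetical and the tiny frame is made up; only the column names are borrowed from the Ames data.

```python
import numpy as np
import pandas as pd

def fill_numeric_with_mean(df):
    """Fill missing values in numeric columns with each column's mean."""
    out = df.copy()
    numeric = out.select_dtypes(include="number").columns
    out[numeric] = out[numeric].fillna(out[numeric].mean())
    return out

demo = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0],
    "GarageArea": [400.0, 500.0, np.nan],
    "Alley": ["Pave", None, "Grvl"],  # non-numeric columns are left untouched
})

filled = fill_numeric_with_mean(demo)
print(filled)
```

If the data is split into training and test sets, computing the means on the training portion only avoids leaking information from the test set.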
Our preprocessing is complete, so let's move on to the model!
Because of the high degree of multicollinearity in the data set, we will apply Ridge regularization. The disadvantage is that regularization introduces some bias into the model's coefficient estimates in exchange for lower variance.
Using Python, we gather some evidence that Ridge regularization has improved our linear regression.
Using linear regression without Ridge regularization, our RMSE was .25. Therefore, we conclude that linear regression with Ridge regularization is a solid predictor! As minimalist data scientists, we are satisfied and move forward!
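A comparison of this kind can be set up with scikit-learn as sketched below. This is an illustration under assumptions, not the code behind the numbers above: the data is synthetic (five latent factors, each copied ten times with small noise to create multicollinearity), `alpha=10.0` is an arbitrary choice, and `cv_rmse` is a hypothetical helper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def cv_rmse(model, X, y, folds=5):
    """Mean cross-validated RMSE (scikit-learn reports it negated)."""
    scores = cross_val_score(model, X, y, scoring="neg_root_mean_squared_error", cv=folds)
    return float(-scores.mean())

# Synthetic, highly collinear features: 5 latent factors, 10 noisy copies each
rng = np.random.default_rng(3)
n_rows = 150
latent = rng.normal(size=(n_rows, 5))
X = np.repeat(latent, 10, axis=1) + 0.1 * rng.normal(size=(n_rows, 50))
y = latent.sum(axis=1) + 0.25 * rng.normal(size=n_rows)

# Scaling first puts every coefficient under the same penalty
ols = make_pipeline(StandardScaler(), LinearRegression())
ridge = make_pipeline(StandardScaler(), Ridge(alpha=10.0))

rmse_ols = cv_rmse(ols, X, y)
rmse_ridge = cv_rmse(ridge, X, y)
print(f"OLS RMSE:   {rmse_ols:.3f}")
print(f"Ridge RMSE: {rmse_ridge:.3f}")
```

In practice the penalty strength would be tuned, for example with `RidgeCV`, rather than fixed in advance.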
Data Takeaways and Further Enhancements
Additional important steps are to extract the features that most influence our model and to understand how unit changes in these features affect SalePrice.
We would also like to try Lasso regression and see how it compares to Ridge. We would like to analyze the multicollinearity of the features further and understand which variables influence each other the most, in order to mitigate the bias this introduces.
Finally, we would like to explore more advanced methods of filling in missing values instead of defaulting to the mean.