Data Analysis on Housing Prices in Ames

Abhi Singh

Posted on Aug 21, 2021

The skills I demoed here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

AMES Housing Project

- A minimalist approach to predicting housing prices in Ames through data analysis.

Intro

Real estate is a tricky game for investors. Finding the next hotspot, the next trendy city, or the hottest neighborhood comes down to a balance between intuition and data. But what happens when you do find a neighborhood that is growing in popularity, where rising housing prices suggest an investment could be fruitful? How do you evaluate a house and decide whether it is undervalued or overvalued? The depth and breadth of the data compiled by Dean De Cock on houses in Ames, Iowa give us an opportunity to explore just that.

Exploratory Data Analysis

The Ames dataset is large and covers a wide range of variables. It includes over 80 features: 20 ordinal, 25 categorical, and 36 numerical. To build a predictive model, it is important to understand which of these features are more relevant, which are less relevant, and how strongly the variables are correlated with one another (multicollinearity).

We begin with an overview of how each feature correlates with the feature we would like to predict, SalePrice. Here is a diagram of the most highly correlated features:

[Figure: features most highly correlated with SalePrice]

We can see that the features most highly correlated with SalePrice are OverallQual and GrLivArea (above-ground living area in sq ft). This is intuitive: a larger house or a higher-quality house should be more expensive. However, OverallQual is a vague feature, as we don't really know what it means or how it is constructed.
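A ranking like the one in the diagram can be sketched with pandas. This is a minimal illustration rather than the code behind the post: the DataFrame below is synthetic (made-up values that only borrow Ames column names), and Pearson correlation is assumed.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Ames data; all values are made up for illustration.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 400, n),
    "OverallQual": rng.integers(1, 11, n),
    "YrSold": rng.integers(2006, 2011, n),
})
df["SalePrice"] = (
    100 * df["GrLivArea"] + 20000 * df["OverallQual"] + rng.normal(0, 20000, n)
)

# Pearson correlation of every numeric feature with the target.
corr = df.select_dtypes("number").corr()["SalePrice"].drop("SalePrice")

# Order by absolute strength while keeping the sign of each correlation.
corr_with_price = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(corr_with_price)
```

On the real data, `df` would be the loaded Ames table, and the top of `corr_with_price` would list the features worth a closer look.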

Let's take a look at how OverallQual measures up against SalePrice with a box plot:

[Figure: box plot of SalePrice by OverallQual]

One thing we notice right off the bat is that there is a positive relationship between OverallQual and SalePrice. However, this relationship becomes less clear as OverallQual increases: the spread of sale prices widens, and OverallQual becomes a weaker predictor.
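The widening spread can also be checked numerically, without a plot, by comparing the median and interquartile range of SalePrice at each quality level. The sketch below uses synthetic prices whose noise is deliberately made to grow with quality; the `iqr` helper is our own, not a pandas built-in.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
qual = rng.integers(1, 11, 500)
# Made-up prices whose noise grows with quality, mimicking widening boxes.
price = 30000 * qual + rng.normal(0, 1, 500) * 5000 * qual
df = pd.DataFrame({"OverallQual": qual, "SalePrice": price})

def iqr(s):
    """Interquartile range: the height of the box in a box plot."""
    return s.quantile(0.75) - s.quantile(0.25)

# One row per quality level: the typical price and how widely prices vary.
summary = df.groupby("OverallQual")["SalePrice"].agg(median="median", iqr=iqr)
print(summary)
```

A rising `median` column reflects the positive relationship, while a rising `iqr` column is the numerical counterpart of the taller boxes at high quality levels.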

Next, we look at how GrLivArea relates to SalePrice:

[Figure: GrLivArea vs. SalePrice]

We can see there is a linear relationship, and when we remove some outliers, the relationship strengthens.

[Figure: GrLivArea vs. SalePrice with outliers removed]
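The effect of outliers on the correlation can be sketched as follows. The data is synthetic, and the 4,000 sq ft cutoff is an assumption for illustration only; the post does not say which rule was used to pick the outliers.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
area = rng.uniform(600, 3000, 300)
price = 110 * area + rng.normal(0, 30000, 300)
df = pd.DataFrame({"GrLivArea": area, "SalePrice": price})

# Add a few very large houses sold cheaply as artificial outliers.
outliers = pd.DataFrame({
    "GrLivArea": [4700, 5600, 5100],
    "SalePrice": [180000, 160000, 190000],
})
df = pd.concat([df, outliers], ignore_index=True)

before = df["GrLivArea"].corr(df["SalePrice"])

# Hypothetical rule: keep only houses of at most 4,000 sq ft.
trimmed = df[df["GrLivArea"] <= 4000]
after = trimmed["GrLivArea"].corr(trimmed["SalePrice"])

print(f"correlation before: {before:.3f}, after: {after:.3f}")
```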

GrLivArea is a moderately strong predictor, with a correlation of ~0.75 even after the outliers are removed. Still, let's see if we can do better by using more features to predict SalePrice and applying regularization to limit the effect of multicollinearity among the variables.

Preprocessing The Data

We first look at a distribution of the SalePrice:

We see that SalePrice is right-skewed. Linear models tend to perform better when the target is approximately normally distributed, so we apply a log transformation using NumPy:
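A minimal sketch of that transformation, assuming `np.log1p` (the post only says a NumPy log transform was used, so plain `np.log` is equally plausible) and using synthetic lognormal prices in place of the real column:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# Lognormal draws give a right-skewed, price-like series (synthetic).
sale_price = pd.Series(np.exp(rng.normal(12, 0.4, 1000)), name="SalePrice")

# log1p computes log(1 + x), which stays defined even if a value is 0.
log_price = np.log1p(sale_price)

print("skew before:", sale_price.skew())
print("skew after: ", log_price.skew())
```

Predictions made on the log scale can be mapped back to dollars with `np.expm1`, the inverse of `np.log1p`.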

Next, we look at how skewed the other features are and aim to reduce their skew with the same log transformation. Using Python, we identify the following features as having skewness values greater than 0.6:

MSubClass
LotFrontage
LotArea
OverallCond
YearBuilt
YearRemodAdd
MasVnrArea
BsmtFinSF1
BsmtFinSF2
BsmtUnfSF
TotalBsmtSF
1stFlrSF
2ndFlrSF
LowQualFinSF
GrLivArea
BsmtFullBath
BsmtHalfBath
HalfBath
KitchenAbvGr
TotRmsAbvGrd
Fireplaces
GarageYrBlt
WoodDeckSF
OpenPorchSF
EnclosedPorch
3SsnPorch
ScreenPorch
PoolArea
MiscVal
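A list like the one above can be produced with pandas' `skew()`. This is a sketch on synthetic columns, not the original code. Comparing the absolute skewness against 0.6 is an assumption: the post only says "greater than 0.6", but taking the absolute value would also catch left-skewed columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 1000
# Synthetic columns borrowing Ames names; values are made up.
df = pd.DataFrame({
    "LotArea": np.exp(rng.normal(9, 0.5, n)),  # right-skewed
    "MiscVal": rng.exponential(500, n),        # right-skewed
    "YrSold": rng.integers(2006, 2011, n),     # roughly symmetric
})

# Sample skewness of each numeric column.
skewness = df.select_dtypes("number").skew()

# Keep the columns whose absolute skewness exceeds the threshold.
skewed_cols = skewness[skewness.abs() > 0.6].index.tolist()
print(skewed_cols)
```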

We correct these skews using the log transformation. However, some features cannot be normalized this way and some are considered irrelevant to our model, so we drop the following:

ScreenPorch
GarageYrBlt
PoolArea
GarageArea
Fireplaces
MasVnrArea
2ndFlrSF
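Transforming some columns and dropping others might look like the sketch below. The four-row table and the two short column lists are illustrative only; in practice `to_log` and `to_drop` would hold the full lists given above, and `np.log1p` is again an assumption about which log function was used.

```python
import numpy as np
import pandas as pd

# Tiny illustrative table; the values are not taken from the post.
df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 215245],
    "GrLivArea": [1710, 1262, 1786, 2036],
    "PoolArea": [0, 0, 0, 512],
    "GarageArea": [548, 460, 608, 513],
})

to_log = ["LotArea", "GrLivArea"]     # skewed features to transform
to_drop = ["PoolArea", "GarageArea"]  # features removed from the model

# Replace each skewed column with its log1p-transformed version.
for col in to_log:
    df[col] = np.log1p(df[col])

df = df.drop(columns=to_drop)
print(df)
```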

Finally, for missing values related to the numerical variables, we fill them in using the mean of those columns.
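Mean imputation of the numerical columns can be sketched in a couple of lines of pandas. The small table is made up; note that the non-numeric column is deliberately left untouched.

```python
import numpy as np
import pandas as pd

# Made-up rows with gaps in two numerical columns and one categorical column.
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 95.0],
    "TotalBsmtSF": [856.0, 1262.0, np.nan, 756.0],
    "Neighborhood": ["CollgCr", None, "Veenker", "Crawfor"],
})

# Fill each numerical column's missing values with that column's own mean.
num_cols = df.select_dtypes("number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
print(df)
```

If the model is evaluated on held-out data, computing the means on the training rows only avoids leaking information from the test set.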

Our preprocessing is complete, so let's move on to the model!

Model

Because of the high degree of multicollinearity in the data set, we apply Ridge regularization. The disadvantage is that Ridge introduces some bias into the model's coefficient estimates in exchange for lower variance.

Using Python, we gather some evidence that Ridge regularization has improved our linear regression.

Using linear regression without Ridge regularization, our RMSE was 0.25. We therefore conclude that linear regression with Ridge is a solid predictor! As minimalist data scientists, we are satisfied and move forward!
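To make the comparison concrete, here is a minimal sketch of ordinary least squares versus Ridge on two nearly identical synthetic features. It is not the model from this post: it uses the closed-form Ridge solution in plain NumPy instead of a library estimator, fits no intercept, and the data, the `alpha` value, and the helper names `fit_ridge` and `rmse` are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 60
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # nearly a copy of x1: multicollinearity
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=n)

def fit_ridge(X, y, alpha):
    """Closed-form ridge solution; alpha=0 reduces to ordinary least squares."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Simple holdout split: first 40 rows for training, last 20 for testing.
X_train, X_test = X[:40], X[40:]
y_train, y_test = y[:40], y[40:]

w_ols = fit_ridge(X_train, y_train, alpha=0.0)
w_ridge = fit_ridge(X_train, y_train, alpha=1.0)

print("OLS coefficients:  ", w_ols, "test RMSE:", rmse(y_test, X_test @ w_ols))
print("Ridge coefficients:", w_ridge, "test RMSE:", rmse(y_test, X_test @ w_ridge))
```

With near-duplicate features, the unregularized coefficients can become large and offset each other, while the penalty pulls the Ridge coefficients toward smaller, more stable values. Whether that lowers the test RMSE depends on the data and on `alpha`, which is usually chosen by cross-validation.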

Data Takeaways and further enhancements

Additional important steps are to extract the features that most influence our model and to understand how unit changes in these features affect SalePrice.
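Because the model is fitted to the log of SalePrice, a coefficient is easiest to read as a percentage change. As a small worked example, assuming a feature that was not itself log-transformed and a hypothetical coefficient value (not a result from this project):

```python
import numpy as np

# Hypothetical coefficient of a feature in a model fitted to log(SalePrice).
beta = 0.08

# A one-unit increase multiplies the predicted price by exp(beta),
# so the percentage change is (exp(beta) - 1) * 100.
pct_change = (np.exp(beta) - 1) * 100
print(f"{pct_change:.1f}% change in price per unit")  # prints 8.3% ...
```

For a feature that was also log-transformed, the coefficient reads instead as an elasticity: roughly the percentage change in price for a 1% change in the feature.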

We would also like to try Lasso regression and see how it compares to Ridge. We would like to analyze the multicollinearity of the features further and understand which variables influence each other the most, to mitigate the bias this introduces.

Finally, we would like to explore more advanced methods of filling in missing values instead of relying on the column mean.
