Data Analysis on Housing Prices in Ames

Posted on Aug 21, 2021
The skills I demoed here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

AMES Housing Project

- A minimalist approach to predicting housing prices in Ames through data analysis.

Intro

Real estate is a tricky game for investors. Finding the next hotspot, the next trendy city, or the hottest neighborhood comes down to a balance between intuition and data. But what happens when you do find a neighborhood that is growing in popularity, where housing prices are rising and an investment could be fruitful? How do you evaluate whether a given house is undervalued or overvalued? The depth and breadth of the data compiled by Dean De Cock on houses in Ames, Iowa, gives us an opportunity to explore just that.

Exploratory Data Analysis

The Ames dataset is large and covers a wide range of variables. It includes over 80 features: 20 ordinal, 25 categorical, and 36 numerical. To build a predictive model, it is important to understand which of these features are more relevant, which are less relevant, and how much multicollinearity exists among them.

We begin with an overview of how each feature correlates with the feature we would like to predict, which is SalePrice. Here is a diagram of the most highly correlated features:

[Figure: features most highly correlated with SalePrice]

We can see that the features most highly correlated with SalePrice are OverallQual and GrLivArea (sq ft). This is intuitive, as a larger or higher-quality house would be more expensive. However, OverallQual is a vague and ambiguous feature, as we don't really know what it means or how it is constructed.
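As a rough sketch of how such a ranking can be produced, the snippet below sorts numeric features by the absolute value of their correlation with the target. The helper name top_correlations and the synthetic data are assumptions for illustration only; they are not the code or the numbers behind the diagram above.

```python
import numpy as np
import pandas as pd


def top_correlations(df: pd.DataFrame, target: str, n: int = 10) -> pd.Series:
    """Return the n numeric features most correlated (by absolute value) with target."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order].head(n)


# Synthetic stand-in for the Ames data (all values are made up).
rng = np.random.default_rng(0)
n_rows = 200
area = rng.normal(1500, 400, n_rows)
qual = rng.integers(1, 11, n_rows)
toy = pd.DataFrame({
    "GrLivArea": area,
    "OverallQual": qual,
    "MiscVal": rng.normal(0, 1, n_rows),  # pure noise, unrelated to price
    "SalePrice": 50 * area + 15000 * qual + rng.normal(0, 20000, n_rows),
})

ranked = top_correlations(toy, "SalePrice", n=3)
print(ranked)
```

On the real data, the same call would be made on the full training DataFrame with SalePrice as the target.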

Let's take a look at how OverallQual measures against SalePrice with a box plot:

[Figure: box plot of SalePrice by OverallQual]

One thing we notice off the bat is that there is a positive relationship between OverallQual and SalePrice. However, this relationship becomes less clear as OverallQual increases: the spread of prices widens, and OverallQual becomes a weaker predictor.
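The widening spread that a box plot shows can also be checked numerically. The sketch below computes the median and interquartile range of price within each quality level; the helper name and the eight toy rows are hypothetical and chosen only to make the pattern visible.

```python
import pandas as pd


def price_spread_by_group(df: pd.DataFrame, group_col: str, price_col: str) -> pd.DataFrame:
    """Median and interquartile range (IQR) of price within each group."""
    grouped = df.groupby(group_col)[price_col]
    q1 = grouped.quantile(0.25)
    q3 = grouped.quantile(0.75)
    return pd.DataFrame({"median": grouped.median(), "iqr": q3 - q1})


# Toy rows (made up): prices vary far more at quality 9 than at quality 5.
toy = pd.DataFrame({
    "OverallQual": [5, 5, 5, 5, 9, 9, 9, 9],
    "SalePrice": [140000, 145000, 150000, 155000, 300000, 380000, 450000, 600000],
})

spread = price_spread_by_group(toy, "OverallQual", "SalePrice")
print(spread)
```

A larger IQR at the higher quality levels corresponds to the taller boxes in the plot.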

Next, we look at how GrLivArea relates to SalePrice:

[Figure: GrLivArea versus SalePrice]

We can see there is a linear relationship, and when we remove some outliers, the relationship strengthens.

[Figure: GrLivArea versus SalePrice with outliers removed]

GrLivArea is a moderately strong predictor, with a correlation of about 0.75 even after the outliers are removed. Let's see if we can do better by using more features to predict SalePrice and applying regularization to limit the effect of multicollinearity among the variables.
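The post does not say how the outliers were chosen, so the sketch below assumes one simple rule: drop homes whose GrLivArea exceeds a cutoff. The 4,000 sq ft default, the helper name, and the seven toy rows are all assumptions for illustration.

```python
import pandas as pd


def corr_with_and_without_large_homes(df: pd.DataFrame, max_area: float = 4000.0):
    """Correlation of GrLivArea with SalePrice before and after an area cutoff."""
    kept = df[df["GrLivArea"] <= max_area]
    before = df["GrLivArea"].corr(df["SalePrice"])
    after = kept["GrLivArea"].corr(kept["SalePrice"])
    return before, after


# Toy rows (made up): one very large home that sold cheaply.
toy = pd.DataFrame({
    "GrLivArea": [900, 1200, 1500, 1800, 2100, 2400, 5600],
    "SalePrice": [95000, 125000, 150000, 185000, 210000, 245000, 160000],
})

before, after = corr_with_and_without_large_homes(toy)
print(f"before: {before:.2f}, after: {after:.2f}")
```

A single large, cheaply sold home is enough to pull the correlation down sharply, which is why the relationship strengthens once such rows are excluded.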

Preprocessing the Data

We first look at the distribution of SalePrice:

We see that SalePrice is right-skewed. Linear models tend to perform better when the target is close to normally distributed, so we apply a log transformation using NumPy:
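A minimal sketch of the transformation follows. It assumes np.log1p (the log of 1 + x); the post does not say which NumPy log function was used, and the prices here are synthetic right-skewed values rather than the Ames data.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices (made up) standing in for SalePrice.
rng = np.random.default_rng(42)
sale_price = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000), name="SalePrice")

# Log-transform the target; np.expm1 is the inverse and recovers the prices.
log_price = np.log1p(sale_price)

skew_before = sale_price.skew()
skew_after = log_price.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Predictions made on the log scale need to be passed through np.expm1 to get back to dollar amounts.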

Next, we look at how skewed the other features are and aim to reduce their skew with the same log transformation. Using Python, we identify that the following features have skewness values greater than 0.6:

  • MSSubClass
  • LotFrontage
  • LotArea
  • OverallCond
  • YearBuilt
  • YearRemodAdd
  • MasVnrArea
  • BsmtFinSF1
  • BsmtFinSF2
  • BsmtUnfSF
  • TotalBsmtSF
  • 1stFlrSF
  • 2ndFlrSF
  • LowQualFinSF
  • GrLivArea
  • BsmtFullBath
  • BsmtHalfBath
  • HalfBath
  • KitchenAbvGr
  • TotRmsAbvGrd
  • Fireplaces
  • GarageYrBlt
  • WoodDeckSF
  • OpenPorchSF
  • EnclosedPorch
  • 3SsnPorch
  • ScreenPorch
  • PoolArea
  • MiscVal
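A list like the one above can be produced with pandas' built-in skew method. The sketch below compares the absolute skewness to the 0.6 threshold, which is an assumption (the post only says "greater than"); the helper name, the Symmetric column, and the data are hypothetical.

```python
import numpy as np
import pandas as pd


def skewed_features(df: pd.DataFrame, threshold: float = 0.6) -> pd.Series:
    """Numeric columns whose absolute skewness exceeds the threshold."""
    skew = df.select_dtypes("number").skew()
    return skew[skew.abs() > threshold].sort_values(ascending=False)


# Synthetic data (made up): one skewed column, one symmetric, one non-numeric.
rng = np.random.default_rng(3)
toy = pd.DataFrame({
    "LotArea": rng.lognormal(9, 1, 500),
    "Symmetric": rng.normal(0, 1, 500),
    "Street": ["Pave"] * 500,
})

flagged = skewed_features(toy)
print(flagged)
```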

We correct these skews using the log transformation. However, some features cannot be normalized and some are considered irrelevant to our model, so we drop the following:

  • ScreenPorch
  • GarageYrBlt
  • PoolArea
  • GarageArea
  • Fireplaces
  • MasVnrArea
  • 2ndFlrSF
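The two steps above, dropping the listed columns and log-transforming the remaining skewed ones, might look like the following sketch. The helper name is hypothetical, np.log1p is an assumed choice of log function, and the three toy rows are made up.

```python
import numpy as np
import pandas as pd

# Columns the post drops.
DROPPED = ["ScreenPorch", "GarageYrBlt", "PoolArea", "GarageArea",
           "Fireplaces", "MasVnrArea", "2ndFlrSF"]


def transform_and_drop(df: pd.DataFrame, skewed_cols, drop_cols=DROPPED) -> pd.DataFrame:
    """Drop unwanted columns, then log-transform the skewed columns that remain."""
    out = df.drop(columns=drop_cols, errors="ignore")  # returns a new DataFrame
    for col in skewed_cols:
        if col in out.columns:
            out[col] = np.log1p(out[col])  # requires values greater than -1
    return out


# Toy rows (made up).
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "GrLivArea": [1710, 1262, 1786],
    "PoolArea": [0, 0, 512],
})

result = transform_and_drop(toy, skewed_cols=["LotArea", "GrLivArea", "PoolArea"])
print(result)
```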

Finally, we fill in missing values in the numerical variables with the mean of each column.
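Mean imputation of the numeric columns can be sketched in a few lines of pandas. The helper name and the toy rows are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def fill_numeric_with_mean(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values in numeric columns with each column's mean."""
    out = df.copy()
    num_cols = out.select_dtypes("number").columns
    out[num_cols] = out[num_cols].fillna(out[num_cols].mean())
    return out


# Toy rows (made up): one missing numeric value, one missing category.
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0],
    "Street": ["Pave", None, "Pave"],
})

filled = fill_numeric_with_mean(toy)
print(filled)
```

Non-numeric columns are left untouched here. When a separate test set is used, the means are normally computed on the training rows only and then reused.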

Our preprocessing is complete, so let's move on to the model!

Model

Because of the high degree of multicollinearity in the data set, we apply Ridge regularization. The disadvantage is that Ridge introduces some bias into the coefficient estimates in exchange for lower variance.
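To see why Ridge helps with collinear features, the sketch below uses the closed-form ridge solution in plain NumPy on two nearly identical synthetic predictors. This is an illustration of the technique under made-up data, with no intercept term, and not the model fitted in this post.

```python
import numpy as np


def ridge_coefficients(X: np.ndarray, y: np.ndarray, alpha: float) -> np.ndarray:
    """Closed-form ridge solution (X^T X + alpha * I)^-1 X^T y, without intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


# Two almost identical predictors (synthetic); the true signal is 3 * x1.
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=100)

ols = ridge_coefficients(X, y, alpha=0.0)    # alpha = 0 is ordinary least squares
ridge = ridge_coefficients(X, y, alpha=1.0)  # alpha = 1 is an arbitrary choice
print("OLS:  ", ols)
print("Ridge:", ridge)
```

With alpha set to zero the two coefficients are poorly determined and can swing far apart, while the penalty pulls them toward a shared, smaller value. That shrinkage is the bias being traded for stability.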

Using Python, we gather some evidence that Ridge regularization has improved our linear regression.

Using linear regression without Ridge regularization, our RMSE was 0.25. Since Ridge improved on this, we conclude that linear regression with Ridge is a solid predictor! As minimalist data scientists, we are satisfied and move forward!
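A comparison of this kind could be set up with scikit-learn roughly as follows. The data is synthetic, and the penalty strength alpha=1.0 and the 75/25 split are assumptions, so the printed numbers say nothing about the Ames results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def holdout_rmse(model, X_train, X_test, y_train, y_test) -> float:
    """Fit the model on the training split and return RMSE on the held-out split."""
    model.fit(X_train, y_train)
    return float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))


# Synthetic, highly collinear features (made up): five noisy copies of one signal.
rng = np.random.default_rng(7)
base = rng.normal(size=(300, 1))
X = np.hstack([base + rng.normal(scale=0.05, size=(300, 1)) for _ in range(5)])
y = X.sum(axis=1) + rng.normal(scale=0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

rmse_ols = holdout_rmse(LinearRegression(), X_train, X_test, y_train, y_test)
rmse_ridge = holdout_rmse(Ridge(alpha=1.0), X_train, X_test, y_train, y_test)
print(f"Linear regression RMSE: {rmse_ols:.3f}")
print(f"Ridge RMSE:             {rmse_ridge:.3f}")
```

In practice alpha would be tuned, for example by cross-validation, rather than fixed in advance.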

Data Takeaways and Further Enhancements

Additional important steps are to extract the features that most influence our model and to understand how unit changes in these features affect SalePrice.

We would also like to try Lasso regression and see how it compares to Ridge. In addition, we would like to analyze the multicollinearity of the features further and understand which variables influence each other the most, so that we can mitigate its effect on the model.

Finally, we would like to explore more advanced methods of filling in missing values instead of defaulting to the column mean.
