Analyzing Ames Housing Data to Predict Its Prices

Posted on Aug 17, 2021

House Sale Value Modeling

For our machine learning project, we were tasked with predicting the sale price of homes based on the Ames Housing dataset, which catalogues the major characteristics of 1,460 houses sold in Ames, Iowa during an undisclosed time period.


Exploratory Data Analysis

For anyone who has ever purchased a house, many of the major factors governing the sale price seem pretty obvious. Location, location, location, for one. In the Ames Housing dataset, the neighborhood, house style, year built, and overall quality scores immediately stood out as factors likely to have a major effect on the sale price. What was less clear was how the other 75 features would factor in. Upon analysis, it was apparent that the median sale price differs quite significantly from neighborhood to neighborhood, as one might expect.

[Figure: Sale Price Distribution by Neighborhood]

There is also a strong linear relationship between sale price and features such as total size (blue) and overall quality (red). For other features such as YearBuilt (green), it's clear that newer houses sell for substantially higher prices, though the relationship isn't strictly linear.



Data Cleaning

There were a number of issues that had to be dealt with before modelling the data. First, the sale prices are not normally distributed, making it likely that my models would be error-prone for houses at the top end of the price spectrum. To correct for this, I applied a log transformation. This improved the fit significantly, though the lower end of the price spectrum was still potentially going to cause some issues.
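A minimal sketch of that transformation, assuming the Kaggle file name train.csv and the SalePrice column from the Ames data:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("train.csv")  # Kaggle Ames training file (name assumed)

# log1p compresses the right-skewed price distribution; the model is then
# trained on log-prices and its predictions mapped back with expm1.
df["LogSalePrice"] = np.log1p(df["SalePrice"])

# After predicting on the log scale, invert the transform:
# dollars = np.expm1(log_price_prediction)
```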

Second, there was a substantial amount of missing data across several columns. For example, over 99% of the values are missing for PoolQC, likely because most houses simply don't have a pool. In other words, many of the missing data points are not missing at random.

To account for this, I imputed the NA values as "no pool". Similarly, I inferred that missing values related to the basement, garage, or fireplace indicated a lack of those features as well. However, in the case of other features such as the veneer type (e.g. stone/brick), missing values were rare (only 0.5% of the data), potentially suggesting that they could have been missing at random, and that it would be more appropriate to impute a specific value.
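A sketch of the absence-based imputation; the exact column list here is illustrative, not an exhaustive record of what I filled:

```python
import pandas as pd

df = pd.read_csv("train.csv")  # Kaggle Ames training file (name assumed)

# In these columns, NA means the house lacks the feature rather than that
# the value is unknown, so impute an explicit "None" category.
absence_cols = ["PoolQC", "FireplaceQu", "BsmtQual", "BsmtCond",
                "GarageType", "GarageFinish", "GarageQual"]
df[absence_cols] = df[absence_cols].fillna("None")
```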

To impute these values, I first separated the data into quantitative and categorical groups. I then sub-divided the categorical data into nominal and ordinal bins. At that point, I created a table of typical values for each feature by sub-grouping it by neighborhood and identifying the most common value (categorical columns) or the median value (numerical columns). I then replaced the missing cells with the corresponding typical values.
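A sketch of that neighborhood-based lookup for two illustrative columns (MasVnrType, the veneer type, and LotFrontage; the same idea applies feature by feature):

```python
import pandas as pd

df = pd.read_csv("train.csv")  # Kaggle Ames training file (name assumed)

cat_col, num_col = "MasVnrType", "LotFrontage"  # illustrative choices

# Typical value per neighborhood: the mode for categorical features...
cat_typical = df.groupby("Neighborhood")[cat_col].agg(lambda s: s.mode().iat[0])
# ...and the median for numeric features.
num_typical = df.groupby("Neighborhood")[num_col].median()

# Fill each missing cell with its neighborhood's typical value.
df[cat_col] = df[cat_col].fillna(df["Neighborhood"].map(cat_typical))
df[num_col] = df[num_col].fillna(df["Neighborhood"].map(num_typical))
```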

Numerical Conversion

Following imputation, I converted the categorical data to numerical data. To ensure that higher numerical assignments correlate with higher sale prices, I sorted the values of each categorical feature by their median sale price and assigned each value its index in that ordering. For example, the neighborhoods were numbered 1-25, with the most expensive neighborhoods getting the higher numbers.
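Using the neighborhood example, a minimal sketch of that rank encoding:

```python
import pandas as pd

df = pd.read_csv("train.csv")  # Kaggle Ames training file (name assumed)

# Order the neighborhoods by median sale price, then replace each label with
# its rank so that larger numbers mean pricier neighborhoods (1 = cheapest).
order = df.groupby("Neighborhood")["SalePrice"].median().sort_values().index
rank_map = {name: rank for rank, name in enumerate(order, start=1)}
df["Neighborhood"] = df["Neighborhood"].map(rank_map)
```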

Finally, outliers were removed. Specifically, the SaleCondition feature was reduced down to only Normal sales, while homes with more than 4,000 square feet of living space were removed.
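As a sketch (GrLivArea, the above-grade living area, is my assumption for "living space"):

```python
import pandas as pd

df = pd.read_csv("train.csv")  # Kaggle Ames training file (name assumed)

# Keep only normal sales and drop the handful of very large houses.
df = df[(df["SaleCondition"] == "Normal") & (df["GrLivArea"] <= 4000)]
```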


Data Modeling

Using the cleaned dataset, I trained a random forest model and used a grid search to fine-tune the hyperparameters. Specifically, I tested a range of 200 to 1,000 estimators, auto versus square root for the maximum number of features per split, and a maximum depth of up to 30. To help prevent overfitting, I restricted the minimum number of samples per leaf to between 3 and 7.
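A sketch of that search with scikit-learn; the specific grid values are illustrative picks within the stated ranges, and X and y (the cleaned features and log-transformed prices) are assumed from the steps above:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [200, 600, 1000],
    "max_features": ["auto", "sqrt"],  # "auto" predates scikit-learn 1.3
    "max_depth": [10, 20, 30],
    "min_samples_leaf": [3, 5, 7],
}
search = GridSearchCV(RandomForestRegressor(random_state=0),
                      param_grid, scoring="r2", cv=5, n_jobs=-1)
# search.fit(X, y); search.best_params_
```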

Despite several attempts at optimization, I found that the model only achieved an R-squared value of 0.86-0.88. However, when I plotted the errors, it became clear that the model had a slight tendency to overestimate prices at the low end and underestimate the prices of higher-end homes.

This indicated that it should be possible to correct the predicted values using a simple linear model. By plotting the error in dollars against the predicted value, I came up with a dollar amount to be added to or subtracted from each of the initially predicted values. These corrected values were significantly more accurate than the initial predictions, increasing the R-squared value to 0.92.
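A sketch of that correction; the toy arrays stand in for held-out predictions and true prices, and I model the signed error so the fitted value is directly the amount to add or subtract:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

predicted = np.array([95_000.0, 150_000.0, 210_000.0, 340_000.0])  # toy data
actual = np.array([88_000.0, 149_000.0, 215_000.0, 365_000.0])     # toy data

# Fit the dollar error as a linear function of the predicted price...
errors = actual - predicted
corrector = LinearRegression().fit(predicted.reshape(-1, 1), errors)

# ...then shift each prediction by its fitted error.
corrected = predicted + corrector.predict(predicted.reshape(-1, 1))
```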

Secondary Model Testing

Given that the random forest model wasn't perfect, I also tested an XGBoost model and a penalized linear regression model. Though the XGBoost model had a similar R-squared value to the random forest model at 0.87, Lasso regression yielded a higher R-squared value (0.91) for the test data. Using cross-validation, I was able to identify an optimal lambda value (~2.5), which allowed me to avoid overfitting.
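A sketch of the cross-validated Lasso fit; scikit-learn calls the lambda above "alpha", the alpha grid is an assumption, and X and y come from the earlier steps:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize features so the penalty treats them comparably, then let
# cross-validation pick the penalty strength from a grid.
lasso = make_pipeline(
    StandardScaler(),
    LassoCV(alphas=np.logspace(-3, 1, 50), cv=5),
)
# lasso.fit(X, y)
# lasso[-1].alpha_  # the selected penalty; roughly 2.5 here
```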

In the end, the Lasso model's predictor importances demonstrated that the strongest predictors are exactly what one would expect: overall house quality, size, and neighborhood.

