Using Data to Predict Home Prices in Ames, Iowa

Posted on Sep 2, 2019

The skills the author demonstrated here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.

Introduction

What drives home prices? The neighborhood? The square footage of the home? The amenities? To find out, I analyzed a Kaggle dataset on homes in Ames, Iowa, compiled by Dean De Cock of Truman State University. The training set contains 1,460 observations and the test set 1,459, each with 79 explanatory variables. The explanatory variables describe a home's different aspects, from the number of bathrooms and bedrooms to the size of the lot and the roof style. The outcome, or response variable, is the home's sale price.

Exploratory data analysis

Data types

The first step in the data analysis was examining the data types of the explanatory variables. Several were truly continuous, such as the area variables (basement square footage, or the square footage of the first floor). Deceptively, another set of variables was stored as numeric data types, such as the numbers of bedrooms and bathrooms. These are discrete counts, not continuous measurements.

I therefore set them up as ordinal categorical data. A third type of data in the dataset was nominal categorical data. These include zoning, sub-class, lot shape, type and style of roof, and also the quality of features such as the house itself, its garage, its basement, and its fence.

Selecting the variables

As a next step, I attempted to reduce the number of variables to be included in the model. Since I have no background in real estate and did not want to fly blind on this matter, I first computed the correlation coefficient of each continuous explanatory variable with the sale price, and selected every variable whose correlation with the outcome was 0.50 or more in absolute value. For all categorical and binary variables, I ran t-tests and ANOVAs, and selected those variables that showed a significant difference in sale price between categories.

At the end of this exercise, I was left with 50 variables out of the initial 79 to do additional work on for this analysis.
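
The screening described above can be sketched roughly as follows. This is a minimal illustration on made-up data, not the author's code: the function name `screen_features`, the toy columns, and the 0.05 significance level are all assumptions (the post states the 0.50 correlation cutoff but not the significance level).

```python
import numpy as np
import pandas as pd
from scipy import stats

def screen_features(df, target, continuous, categorical, r_min=0.50, alpha=0.05):
    """Return (kept_continuous, kept_categorical) after univariate screening."""
    # Continuous: keep columns whose |Pearson r| with the target is at least r_min.
    corr = df[continuous].corrwith(df[target])
    kept_cont = [c for c in continuous if abs(corr[c]) >= r_min]

    # Categorical: one-way ANOVA of the target across the levels of each column.
    # With exactly two levels this matches a pooled-variance t-test.
    kept_cat = []
    for col in categorical:
        groups = [g[target].to_numpy() for _, g in df.groupby(col) if len(g) > 1]
        if len(groups) < 2:
            continue
        _, p_value = stats.f_oneway(*groups)
        if p_value < alpha:
            kept_cat.append(col)
    return kept_cont, kept_cat

# Toy data (hypothetical): price depends on area and garage, not on noise_col.
rng = np.random.default_rng(0)
n = 200
area = rng.normal(1500, 400, n)
noise_col = rng.normal(0, 1, n)
garage = rng.choice(["yes", "no"], n)
price = 100 * area + 40000 * (garage == "yes") + rng.normal(0, 20000, n)
toy = pd.DataFrame({"area": area, "noise_col": noise_col,
                    "garage": garage, "price": price})

kept_cont, kept_cat = screen_features(toy, "price", ["area", "noise_col"], ["garage"])
print(kept_cont, kept_cat)
```

Univariate screening like this is simple, but it can discard a variable that only matters jointly with others, which is one reason a second, model-based selection step is useful.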

Variable transformations/Feature engineering

The next step in the data exploration was to look at the distribution of each variable and to transform it as needed. The outcome, or response variable, is the sale price of the home, a continuous variable with a marked right skew.

I therefore took the natural log of this variable in order to reduce the skew and bring its distribution closer to normal.

I also logged the other continuous variables, such as area in square feet. The year variables were also treated as continuous, but instead of logging them, I transformed them to durations. That is, I computed the number of years between the time the house was built and the time it was last sold as per the dataset, and the number of years between the time it was remodeled and the time it was last sold.
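
A minimal sketch of these transformations, assuming the column names of the Kaggle file (`SalePrice`, `GrLivArea`, `YearBuilt`, `YearRemodAdd`, `YrSold`). The function name, the new column names, and the use of `log1p` for area variables (so that zero areas stay defined) are my assumptions, not details given in the post.

```python
import numpy as np
import pandas as pd

def transform_features(df):
    """Log-transform price and area, and turn year columns into durations."""
    out = df.copy()
    out["LogSalePrice"] = np.log(out["SalePrice"])
    # Assumption: log1p keeps zero-valued areas (e.g. no basement) finite.
    out["LogGrLivArea"] = np.log1p(out["GrLivArea"])
    # Durations in years, measured up to the year of sale.
    out["AgeAtSale"] = out["YrSold"] - out["YearBuilt"]
    out["YearsSinceRemodel"] = out["YrSold"] - out["YearRemodAdd"]
    return out

# Three made-up rows for illustration.
homes = pd.DataFrame({
    "SalePrice": [208500, 181500, 755000],
    "GrLivArea": [1710, 1262, 4316],
    "YearBuilt": [2003, 1976, 1994],
    "YearRemodAdd": [2003, 1976, 1995],
    "YrSold": [2008, 2007, 2007],
})
transformed = transform_features(homes)
print(transformed[["LogSalePrice", "AgeAtSale", "YearsSinceRemodel"]])
```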

I then examined the outliers in the continuous variables, and since I did not want to drop any observations, I reset them to the mean.
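
One way to reset outliers to the mean is sketched below. The post does not say how outliers were defined or which mean was used, so both choices here are assumptions: a value counts as an outlier when it lies more than three standard deviations from the mean, and it is replaced with the mean of the remaining values.

```python
import numpy as np

def reset_outliers_to_mean(values, z_thresh=3.0):
    """Replace values beyond z_thresh standard deviations with the mean of the rest."""
    values = np.asarray(values, dtype=float)
    mean, std = values.mean(), values.std()
    if std == 0:
        return values.copy()
    is_outlier = np.abs(values - mean) > z_thresh * std
    cleaned = values.copy()
    # Assumption: use the mean of the non-outlying values as the replacement.
    cleaned[is_outlier] = values[~is_outlier].mean()
    return cleaned

# Nineteen ordinary values (90..108) and one extreme value.
values = list(range(90, 109)) + [1000]
cleaned = reset_outliers_to_mean(values)
print(cleaned[-1])
```

This keeps every observation in the data, at the cost of shrinking the variance of the affected variable.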

For all categorical variables, I examined the frequency distribution and the cell counts. Having a background in variance estimation, I am aware that cell-sizes of 1 are very problematic, and indeed small cell-sizes are in general problematic. I therefore combined categories where I could. For example, in terms of roof style, it would appear that gabled roofs are the most popular in Ames, and a majority of the homes (1141 out of 1460 in the train data) have gabled roofs.

The frequencies of the other roof styles are much smaller. I therefore combined all the other roof styles into one category, making the variable “Gabled Roof – yes or no”. Logically this seemed fine, since it would appear that either market forces or building codes dictate the most common roof style. I did this for all categorical data types after examining their cell frequencies. If a variable had sufficient observations in each cell, I left it as a multi-category variable.
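
Collapsing a sparse categorical variable into a yes/no indicator might look like the sketch below. `RoofStyle` and the level `Gable` follow the Kaggle file; the helper name and the toy values are hypothetical.

```python
import pandas as pd

def to_binary_indicator(series, dominant_level):
    """Return 1 where the value equals the dominant level, otherwise 0."""
    return (series == dominant_level).astype(int)

# Toy column for illustration.
roof = pd.Series(["Gable", "Hip", "Gable", "Flat", "Gable", "Gambrel"],
                 name="RoofStyle")

# Inspect the cell counts first, then collapse the small cells.
print(roof.value_counts())
is_gable = to_binary_indicator(roof, "Gable").rename("IsGableRoof")
print(is_gable.tolist())
```

Note that a missing value would be coded as 0 here, so true missings are better handled before this step.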

Missing values

Another set of categorical variables had large numbers of missing values to indicate the absence of the feature. For example, there were large numbers of missing values for basement, garage, fence, pool, and fireplace, to indicate that the house did not have these features. In such cases, I changed the missing values to zero to better indicate the absence of that feature.

There is a caveat here, however. Basement had a number of different categorical variables to describe it, and the missing values across them were inconsistent, which indicated that at least some were true missings rather than indicating the absence of a basement. I therefore set to zero only those observations where all the basement variables were missing. Anything else was treated as a true missing. Other variables, such as masonry veneer type, also had multiple true missings.
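
The basement rule can be sketched as follows, assuming the five categorical basement columns of the Kaggle file. The function name is hypothetical, and the sketch writes an explicit "NoBasement" level where the post uses zero; either serves as the "feature absent" code.

```python
import numpy as np
import pandas as pd

BSMT_COLS = ["BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2"]

def mark_absent_basements(df, cols=BSMT_COLS, absent_value="NoBasement"):
    """Recode rows as 'no basement' only when every basement column is missing."""
    out = df.copy()
    no_basement = out[cols].isna().all(axis=1)
    out.loc[no_basement, cols] = absent_value
    return out

# Toy rows: a full record, a home with no basement, and a true missing value.
homes = pd.DataFrame({
    "BsmtQual":     ["Gd", np.nan, "TA"],
    "BsmtCond":     ["TA", np.nan, "TA"],
    "BsmtExposure": ["No", np.nan, np.nan],
    "BsmtFinType1": ["GLQ", np.nan, "Unf"],
    "BsmtFinType2": ["Unf", np.nan, "Unf"],
})
cleaned = mark_absent_basements(homes)
print(cleaned)
```

The third row keeps its missing `BsmtExposure`, so it is left for imputation rather than being coded as a home without a basement.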

All true missing values across variables were imputed using KNN Impute.
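
For readers unfamiliar with KNN imputation, here is a small sketch using scikit-learn's `KNNImputer`. The post does not say which implementation was used, so the library and the choice of two neighbors are assumptions. Each missing entry is filled with the average of that feature over the nearest rows that have it observed.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy numeric matrix with two missing entries.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Distances are computed on the features both rows have observed.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

`KNNImputer` works on numeric input only, so categorical variables would need to be encoded numerically before a step like this.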

Additional feature selection

Once the variables were transformed, I ran them through a forward AIC model to further narrow down the number of variables to be included in the final models. In the end, I had selected 31 variables to include in the final regression models.
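
Forward selection by AIC starts from an intercept-only model and, at each step, adds the variable that lowers the AIC the most, stopping when no addition helps. The sketch below is a plain NumPy illustration on simulated data, not the author's procedure; the function names are hypothetical, and the AIC is the Gaussian OLS form n·ln(RSS/n) + 2k, which omits constants that do not affect the comparison.

```python
import numpy as np

def ols_aic(X, y):
    """AIC (up to an additive constant) of an OLS fit with an intercept."""
    n = len(y)
    design = np.column_stack([np.ones(n), X]) if X.shape[1] else np.ones((n, 1))
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    k = design.shape[1]  # number of fitted coefficients
    return n * np.log(rss / n) + 2 * k

def forward_select_aic(X, y, names):
    """Greedily add the column that most reduces AIC until nothing improves."""
    selected, remaining = [], list(range(X.shape[1]))
    best_aic = ols_aic(X[:, []], y)  # intercept-only baseline
    while remaining:
        scores = [(ols_aic(X[:, selected + [j]], y), j) for j in remaining]
        cand_aic, cand = min(scores)
        if cand_aic >= best_aic:
            break
        best_aic = cand_aic
        selected.append(cand)
        remaining.remove(cand)
    return [names[j] for j in selected]

# Simulated data: only x0 and x2 truly affect y.
rng = np.random.default_rng(1)
n = 300
X = rng.normal(size=(n, 5))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.5, size=n)
chosen = forward_select_aic(X, y, [f"x{j}" for j in range(5)])
print(chosen)
```

Because AIC penalizes each extra coefficient only mildly, a noise variable can occasionally slip in, so the selected set is best read as a shortlist.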

Regression models used

To predict the sale price of a home, I ran various models, and checked to see which provided the best prediction. The models included the following:

  1. The ordinary least squares model (OLS)
  2. Ridge regression
  3. Lasso regression
  4. Elastic-Net regression
  5. Gradient Boosting regression
  6. Decision Tree regression
  7. Random Forest regression

All models except the OLS were tuned using grid search with 10-fold cross-validation.
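
As an illustration of this tuning setup, the sketch below runs a grid search with 10-fold cross-validation for Ridge regression in scikit-learn. The simulated data, the alpha grid, the scaling step, and the scoring choice are all assumptions; the post does not list the grids that were searched.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simulated regression data standing in for the prepared features.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.3, size=200)

# Scaling inside the pipeline is refit on each training fold, avoiding leakage.
pipeline = make_pipeline(StandardScaler(), Ridge())
param_grid = {"ridge__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}  # hypothetical grid

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=KFold(n_splits=10, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

best_alpha = search.best_params_["ridge__alpha"]
cv_mse = -search.best_score_  # scores are negated MSE, so flip the sign
print(best_alpha, cv_mse)
```

The same pattern applies to the other tuned models by swapping the estimator and its parameter grid.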

Results

I checked the fit of each model using three metrics – the R-square, the Mean Square Error (MSE) and the Root Mean Square Error (RMSE). The best fit model would have the highest R-square, and the lowest MSE and RMSE.
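
For reference, the three metrics can be computed as in the sketch below, using made-up values. Since the sale price was logged, errors are on the log scale, and the RMSE is simply the square root of the MSE. The function name and the sample numbers are hypothetical.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return R-square, MSE and RMSE for paired observations and predictions."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    mse = ss_res / n
    return {"r2": 1 - ss_res / ss_tot, "mse": mse, "rmse": math.sqrt(mse)}

# Made-up log sale prices and predictions.
log_prices = [11.8, 12.1, 12.4, 12.0]
predicted = [11.9, 12.0, 12.5, 12.0]
metrics = regression_metrics(log_prices, predicted)
print(metrics)
```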

Here’s what these metrics looked like for each model, along with the graph of the fit:

  • OLS
    • R-square of .889
    • MSE of 0.017
    • RMSE of 0.129

This model fits the data well, although there is some over- and under-estimation at the extremes.

  • Tuned Ridge:
    • R-square of .870
    • MSE of 0.017
    • RMSE of 0.129

As with the OLS, this model fits the data fairly well, but again with some over and under-estimation at the extremes.

  • Tuned Lasso:
    • R-square of .821
    • MSE of 0.027
    • RMSE of 0.164

Unlike the OLS and Ridge models, this does not fit the data well at all.

  • Tuned Elastic-Net
    • R-square of .662
    • MSE of 0.051
    • RMSE of 0.226

The metrics and the graph show clearly that this is not a good model for predicting home prices with this set of variables.

  • Gradient Boosting Regression:
    • R-square of .842
    • MSE of 0.020
    • RMSE of 0.141

Although the metrics for the gradient boosting regression are not as good as those for the OLS and the Ridge regression, it nevertheless seems to fit the data very well.

  • Decision Tree:
    • R-square of .764
    • MSE of 0.017
    • RMSE of 0.129

The metrics for the decision-tree show that it does not fit the data well at all. However, the graph is as expected for a decision tree, and shows the clustering of the data at different nodes.

  • Random Forest:
    • R-square of .779
    • MSE of 0.026
    • RMSE of 0.129

The metrics and the graph both show that the random forest does not fit the data well at all. There is significant over and under-estimation of data points and not just at the extremes.

Models selected

Based on the metrics to determine the fit as well as the graphs, I chose the following models as being the best fit for the data:

  1. The OLS
  2. The Ridge regression
  3. The Gradient boosting regression

I submitted these to Kaggle for a score, and the OLS, which did in fact have the best fit of all the models, obtained the best score.

Concerns

From a variance estimation and traditional statistics perspective, there are a few issues to note:

  1. The clustering effect of the neighborhood variable.
  2. The significant spatial autocorrelation in the dataset. This violates the OLS assumption of independent errors.
  3. The very large standard errors around the estimates in the OLS.

Admittedly, they matter less for prediction models, but nevertheless, it is important to be aware of these issues.

Conclusion

Overall, this was an interesting exercise, and one that really brought home to me the predictive power of the different machine learning models. In the future, I would like to run a support-vector machine regression on these data, and also try including the full feature set (appropriately transformed and adjusted).
