Analyzing Data to Predict Housing Prices in Ames, Iowa

Posted on Mar 9, 2021
The skills I demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

Introduction

House flipping is a common real estate investment strategy: an investor purchases a property and resells it in the hope of making a profit. Flipping often requires the temporary owner to make substantial repairs or renovations before the house can sell for more than the total investment. The goal, in short, is to buy low and sell high.

However, house flipping can be financially risky because of market uncertainty. As data scientists, we approached this machine learning project with a two-fold goal: first, to explore which housing characteristics are correlated with sale price per square foot in Ames; and second, to build a model for estimating future sale prices and identifying which features make for the most impactful renovations, ultimately providing greater transparency to homeowners and house flippers.

Background: Understanding Ames, Iowa and Ames' Housing Market

Before diving into the project's details, a brief background on Ames, Iowa helps in understanding its housing market. According to the 2010 United States Census, Ames had a population of approximately 59,000. The city's economy and demographics are largely defined by Iowa State University, a public research university located in the middle of the city. More than 75% of Ames' population either studies at the university or works there as faculty, making Ames one large extended campus.

Therefore, it isn't surprising that Ames' largest employer is Iowa State University, which accounts for approximately 20% of total employment. As in many college towns, a substantial proportion of Ames' real estate market consists of rental properties, which helps explain the stability of the local housing market.

The Ames housing price distribution graph shows that prices have more outliers on the expensive side. Looking at the market trend from 2006 to 2010, Ames housing prices were relatively stable in terms of price per square foot.

If we look at the map, we can see that the cheaper homes are generally located in the city center or around the Iowa State University campus, while the more expensive houses are located in the northern part of the city. In general, cheaper houses cluster together and expensive houses cluster together.

About the Data

The data contains 2558 observations and 190 features on homes sold in Ames, Iowa from 2006 to 2010. We carefully selected a subset of these features and engineered some of our own to simplify and sharpen the focus of our subsequent models. We also ran random forest and Lasso regression to narrow the selection further before finalizing the inputs to our linear and tree-based machine learning models.

Data Cleaning

After carefully reviewing the documentation for each variable, we began with imputation. Most of the work involved missing values: N/A entries that corresponded to the absence of a feature rather than to unrecorded data. These values were replaced with either the string "None" or 0, depending on the type of the variable. For example, a missing value in the continuous variable GarageArea was imputed as 0, on the assumption that a missing value most likely meant the house had no garage.
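The imputation rule described above can be sketched as follows. This is a minimal illustration on made-up rows, not our actual cleaning script; GarageArea is the variable named in the text, and GarageType is assumed here as an example of a categorical column treated the same way.

```python
import pandas as pd

# Toy rows with made-up values; the second house has no garage,
# so both garage columns are missing for it.
homes = pd.DataFrame({
    "GarageType": ["Attchd", None, "Detchd"],   # categorical (assumed example column)
    "GarageArea": [548.0, None, 280.0],         # continuous, in square feet
})

# Categorical N/A means "feature absent", so label it explicitly.
homes["GarageType"] = homes["GarageType"].fillna("None")

# Continuous N/A means there is nothing to measure, so use 0.
homes["GarageArea"] = homes["GarageArea"].fillna(0)

print(homes)
```

The same two calls can be applied column by column once each variable has been checked against the data documentation, since N/A does not mean "absent" for every column.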

Exploratory Data Analysis

We conducted graphical and numerical exploratory data analysis to understand the dataset and the relationships between the features and our target variable, sale price per square foot. While no two homes are the same, price per square foot is helpful when comparing similar homes in the same neighborhood.

Because the dataset has so many features, not every feature or analysis is discussed in this post. Instead, the post focuses on a select few features for exploratory data analysis, feature selection, and feature engineering, chosen based on the correlation heatmap. We explore several features that might affect sale price per square foot and break them down into 5 categories: neighborhood, house size, house age, house features, and other features.

Neighborhood Data 

As mentioned above, the average Ames housing sale price differs by neighborhood. Neighborhoods around Iowa State University and the city center are generally cheaper, while the northern neighborhoods (Gilbert and Grand Ave/30th St) are more expensive. Therefore, whether you are an investor or a homeowner interested in house flipping, it is important to understand the neighborhood you are investing in.

House Size Data 

The plot above shows sale price per square foot against total living area. Based on the graph, there is a strong positive relationship between these two variables: in general, the larger the living area, the higher the sale price per square foot.

House Age

Orange - old; Yellow - fairly old; Green - fairly new; Dark green - new

This map shows a pattern similar to the earlier price maps. If we set aside the fairly new and fairly old houses, the new houses sit farther from the college campus and cluster in the northern neighborhoods, while the old houses cluster around the city center, the same pattern we saw with average sale price per square foot. The map also shows that the more recent houses were built on the outskirts of Ames, which suggests that the city is expanding outward.

House Features and Others

Based on the graph above, for additional house features such as heating quality, exterior quality, and fireplace quality, the better the quality, the higher the sale price.

Feature Engineering

Based on what we observed in our exploratory data analysis, we created several new features to reduce dimensionality and to better explain and predict sale price.

For example, Basement Total Finished Square Feet is the total basement area that is finished, and Building Age is calculated as Year Sold - Year Built. These newly created features are highly correlated with the sale price, so we use them as predictors in our models.
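The two engineered features above can be sketched in a few lines. The input column names (YrSold, YearBuilt, BsmtFinSF1, BsmtFinSF2) follow the public Ames data dictionary; the output names and the choice to sum the two finished-basement columns are assumptions made for illustration, and the rows are made up.

```python
import pandas as pd

# Toy rows with made-up values.
homes = pd.DataFrame({
    "YrSold": [2008, 2010],
    "YearBuilt": [1961, 2005],
    "BsmtFinSF1": [639.0, 0.0],   # finished basement area, type 1 (sq ft)
    "BsmtFinSF2": [120.0, 0.0],   # finished basement area, type 2 (sq ft)
})

# Building Age = Year Sold - Year Built
homes["BuildingAge"] = homes["YrSold"] - homes["YearBuilt"]

# Total finished basement area (assumed here to be the sum of both finished types).
homes["BsmtTotalFinSF"] = homes["BsmtFinSF1"] + homes["BsmtFinSF2"]

print(homes[["BuildingAge", "BsmtTotalFinSF"]])
```

Combining related columns this way replaces several raw inputs with one predictor, which is how feature engineering can reduce dimensionality.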

Machine Learning Models

We implemented several machine learning models for different purposes. We first started with Lasso for empirical feature selection. Then we created two predictive models - one linear and one non-linear model. Finally, we ran a multiple linear regression model to find which features make for the most impactful renovations.

1. Lasso

For empirical feature selection, we started with a Lasso model. Lasso favors simpler models by adding a penalty on the absolute size of the predictor coefficients, which shrink toward zero as the penalty increases. The strength of the penalty is controlled by the hyperparameter lambda; with an appropriate value, some coefficients are driven exactly to zero while others remain non-zero. Predictors that are correlated with other predictors have their overall impact regulated.

Using grid search with cross-validation, we selected a Lasso model that fit the dataset well without overfitting. The model reduced the original set of predictors to 81 features, including numerical variables highly correlated with sale price per square foot, such as GrLivArea, OverallQual, OverallCond, and GarageArea, and categorical variables such as Neighborhood, GarageType, and HouseStyle. Please see our GitHub repository for more information.
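Lasso-based feature selection with a cross-validated grid search might look like the sketch below. It uses scikit-learn on synthetic data rather than the Ames dataset, and the penalty grid and fold count are illustrative assumptions, not the values from our project. Note that scikit-learn calls the penalty strength `alpha` rather than lambda.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data: 20 predictors, only 5 informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Standardize first so the penalty treats every coefficient on the same scale.
pipeline = make_pipeline(StandardScaler(), Lasso(max_iter=10_000))

# Candidate penalty strengths (illustrative grid).
param_grid = {"lasso__alpha": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

# Predictors whose coefficient was not driven to zero are the selected features.
best_lasso = search.best_estimator_.named_steps["lasso"]
selected = np.flatnonzero(best_lasso.coef_)

print("best alpha:", search.best_params_["lasso__alpha"])
print("features kept:", len(selected), "of", X.shape[1])
```

The indices in `selected` play the role of the reduced feature list that is passed on to the later models.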

2. Elastic Net

With the features selected by our Lasso model, we ran an elastic net model to predict sale price. Using grid search and cross-validation, we chose parameters that fit well without overfitting. Our best parameters were Lambda = 1e-6 and L1 ratio = 1.0. An L1 ratio of 1.0 means the penalty is entirely L1, so our elastic net model ended up behaving like a Lasso model.
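The equivalence between an elastic net with an L1 ratio of 1.0 and a Lasso can be checked directly in scikit-learn. This is a small sketch on synthetic data; the penalty value 0.5 is chosen only to keep the example quick to fit and is not the value we selected.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic regression problem (not the Ames data).
X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)

# With l1_ratio=1.0 the L2 part of the elastic net penalty vanishes.
enet = ElasticNet(alpha=0.5, l1_ratio=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

same = np.allclose(enet.coef_, lasso.coef_)
print("identical coefficients:", same)
```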

3. Random Forest

For our next model, we selected a random forest as our non-linear predictive model, as it is a well-tested tree-based model that is robust to overfitting. However, compared to our regularized linear models, the random forest performed worse, mainly because house prices seem to be intrinsically linear. Intuitively, the value of a house typically increases as features are added or improved and decreases as features are removed.
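The effect described above, a tree-based model trailing a linear one when the target is close to linear in its inputs, can be reproduced on synthetic data. This sketch does not use the Ames dataset or our tuned models; the sample size, noise level, and forest settings are assumptions chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Target is an exactly linear function of 5 features plus noise.
X, y = make_regression(n_samples=500, n_features=5, n_informative=5,
                       noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit both models on the same split and compare held-out R^2.
linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

linear_r2 = linear.score(X_test, y_test)
forest_r2 = forest.score(X_test, y_test)

print(f"linear R^2: {linear_r2:.3f}")
print(f"forest R^2: {forest_r2:.3f}")
```

A forest approximates a straight line with many piecewise-constant steps, so on linear data it needs far more samples to match a model that assumes linearity from the start.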

4. Multiple Linear Regression

What can a homeowner do to increase the value of their property? In order to successfully flip a house, or in other words, if a homeowner wanted to make some renovations for profit, which ones would have the greatest impact on Sale Price?

To answer these questions, we finally ran a multiple linear regression model on a particular subset of predictors. We chose multiple linear regression for the interpretability and simplicity of its coefficients. In multiple linear regression, for every 1-unit increase in a given feature, holding the other features constant, you can expect the target variable to change by the value of that feature's coefficient. This allows for easy interpretation and, hence, straightforward insight for homeowners.

We started with the list of 81 features provided by our Lasso model. Because Lasso is simply linear regression with an added penalty, it makes sense to use Lasso's output features as the input features of our multiple linear regression model. The resulting model earned a train score of 0.912, which gives us confidence in its ability to explain the data and, ultimately, in its ranking of the most important features.
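The coefficient interpretation used here can be verified with a small sketch. The data below are synthetic with known coefficients (50 and 20), not our fitted housing model; the feature meanings are left abstract on purpose.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 50*x0 + 20*x1 + 10 + small noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 50.0 * X[:, 0] + 20.0 * X[:, 1] + 10.0 + rng.normal(0, 1.0, size=200)

model = LinearRegression().fit(X, y)

# Increase the first feature by 1 unit while holding the second constant.
base = np.array([[5.0, 5.0]])
bumped = base + np.array([[1.0, 0.0]])
delta = model.predict(bumped)[0] - model.predict(base)[0]

print("coefficients:", model.coef_)
print("change in prediction for +1 unit of x0:", delta)
print("train score (R^2):", model.score(X, y))
```

The change in the prediction equals the first coefficient exactly, which is what makes each coefficient readable as the expected effect of a one-unit improvement in that feature.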

Additional Insights

Based on the model, when deciding which renovations to make for a successful house-flipping project, a homeowner or investor in Ames, Iowa might consider the following features: total living area, distance from Iowa State University, overall quality and condition, garage area, number of bathrooms, kitchen quality, heating quality, basement exposure, fireplace, and exterior quality.

In terms of quality, the single most important factor in selling a home is the overall quality, material, and finish of the house. If one is prioritizing areas to remodel, outdoor finishes, followed by indoor finishes and finally basement finishes, may be the best approach. If remodeling over several years with plans to sell the home in the future, Exterior Quality has the advantage of staying in style many decades longer than interior finishes. Therefore, it may be important to order the interior work so that the most outdated areas of the home at the time of sale are those that contribute less strongly to Sale Price, given that the years since the last remodel also influence sale price.

Finally, as a simpler renovation, homeowners or investors could increase the finished percentage of the basement, which could attract buyers willing to spend more for a fully finished property.

Conclusion

Overall, our analysis showed that a regularized linear model makes better predictions than a tree-based model on this data. It also gave us a list of features ranked by importance, both for homeowners looking to add value to their property through renovations and for investors looking for a profitable house-flipping project.
