Analyzing Data to Predict Housing Prices in Ames, Iowa
The skills I demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Introduction
House flipping is a common real estate investment strategy in which an investor purchases a property and sells it in the hope of making a profit. Often the temporary owner has to make substantial repairs or renovations before the house can sell for more than the investment cost. The goal is to buy low and sell high.
However, house flipping can be financially risky because of market uncertainty. As data scientists, we approached this machine learning project with a two-fold goal: first, to explore which housing characteristics are correlated with sale price per square foot in Ames; and second, to build a model for estimating future sale prices and identifying which features make for the most impactful renovations, ultimately providing greater transparency to homeowners and house flippers.
Background: Understanding Ames, Iowa and Ames' Housing Market
Before diving into the project's details, a brief background on Ames, Iowa helps explain its housing market. According to a 2010 United States Census Bureau report, Ames had a population of approximately 59,000. The city's economy and demographics are largely defined by Iowa State University, a public research university located in the middle of the city. More than 75% of Ames' population are either students or faculty at Iowa State University, making Ames one large extended campus.
Therefore, it isn't surprising that Ames' largest employer is Iowa State University, which accounts for approximately 20% of total employment. As in many college towns, a substantial proportion of Ames' real estate market consists of rental properties, which helps explain the stability of the housing market.
The Ames housing price distribution graph shows that prices have more outliers on the expensive side. Looking at the market trend from 2006 to 2010, Ames housing was relatively stable in terms of price per square foot.
On the map, the cheaper homes are generally located in the city center or around the Iowa State University campus, while the more expensive houses are located in the northern part of the city. In general, cheaper houses cluster together, as do expensive ones.
About the Data
The data contains 2558 observations and 190 features on homes sold in Ames, Iowa from 2006 to 2010. We carefully selected a subset of these features and engineered some of our own to simplify and sharpen the focus of our subsequent models. We also ran random forest and lasso regression to narrow the feature set further before finalizing the inputs to our linear and tree-based machine learning models.
Data Cleaning
After carefully reviewing the documentation for each variable, we began with imputation. Most of the work involved missing values: N/A entries that corresponded to the absence of a feature rather than to unrecorded data. These values were replaced with either the string None or 0, depending on the type of the variable. For example, a missing value in the continuous variable GarageArea was imputed as 0, on the assumption that it most likely meant the house had no garage.
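The imputation rule described above can be sketched in pandas as follows. This is a minimal illustration on a made-up three-row table, not our actual cleaning script: the helper name impute_absent_features is hypothetical, and GarageType is used here only as an example of a string-valued column.

```python
import numpy as np
import pandas as pd


def impute_absent_features(df: pd.DataFrame) -> pd.DataFrame:
    """Fill N/A values that signal an absent feature.

    Numeric columns are filled with 0; all other columns are filled
    with the string "None". The input frame is left unmodified.
    """
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(0)
        else:
            out[col] = out[col].fillna("None")
    return out


# Toy data: the second house has no garage, so both garage fields are missing.
toy = pd.DataFrame(
    {
        "GarageArea": [480.0, np.nan, 300.0],
        "GarageType": ["Attchd", np.nan, "Detchd"],
    }
)
clean = impute_absent_features(toy)
print(clean)
```

A blanket rule like this is only safe for columns where the documentation confirms that N/A means "feature not present"; values that are missing for other reasons would need a different treatment.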
Exploratory Data Analysis
We conducted graphical and numerical exploratory data analysis to understand the dataset and the relationships between the features and our target variable, sale price per square foot. While no two homes are the same, price per square foot is helpful when comparing similar homes in the same neighborhood.
Because the dataset has so many features, this post does not discuss every feature and analysis. Instead, it focuses on a select few features for exploratory data analysis, feature selection, and feature engineering, chosen based on the correlation heatmap. We explore several features that might impact sale price per square foot and group them into five categories: neighborhood, house size, house age, house features, and other features.
Neighborhood Data
As mentioned above, the average Ames housing sale price differs by neighborhood. Neighborhoods around Iowa State University and the city center are normally cheaper, while the northern neighborhoods, Gilbert and Grand Ave/30th St, are more expensive. Therefore, as an investor or a homeowner interested in house flipping, it is important to understand the neighborhood you are investing in.
House Size Data
The plot above shows the sale price per square foot against the total living area. Based on the graph, there is a strong positive relationship between these two variables. In general, the larger the living area, the higher the sale price per square foot.
House Age
Looking at the map again, as we did for prices, if we single out the fairly new and the fairly old houses, we can see that new houses sit farther from the college campus and cluster around the northern neighborhoods, while old houses cluster around the city center. This is similar to the pattern we saw with the average sale price per square foot. The graph also shows that the more recent houses were built on the outskirts of Ames, which suggests that the city is expanding outward.
House Features and Others
Based on the graph above, for additional house features such as heating quality, exterior quality, and fireplace quality, the better the quality, the higher the sale price.
Feature Engineering
Based on what we observed in our exploratory data analysis, we created several new features to reduce dimensionality and to better explain and predict sale price.
For example, Basement Total Finished Square Feet is the total basement area that is finished and Building Age is calculated as Year Sold - Year Built. These newly created features are highly correlated with the sale price and these features will be used as our predictors for our models.
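As a rough sketch, the two engineered features above could be computed as shown below. The column names YrSold, YearBuilt, BsmtFinSF1, BsmtFinSF2, and SalePrice are assumptions about how the raw data is labeled, as are the new column names and the helper function itself. Summing two finished-area columns is one plausible way to get total finished basement area, and dividing by GrLivArea is one plausible definition of the price-per-square-foot target.

```python
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with illustrative engineered columns added."""
    out = df.copy()
    # Building Age = Year Sold - Year Built, as described in the text.
    out["BuildingAge"] = out["YrSold"] - out["YearBuilt"]
    # Assumption: total finished basement area is the sum of the two
    # finished-area columns.
    out["BsmtTotalFinSF"] = out["BsmtFinSF1"] + out["BsmtFinSF2"]
    # Assumption: the target is sale price divided by above-ground living area.
    out["PricePerSF"] = out["SalePrice"] / out["GrLivArea"]
    return out


# Two made-up houses, for illustration only.
toy = pd.DataFrame(
    {
        "YrSold": [2008, 2010],
        "YearBuilt": [1998, 1950],
        "BsmtFinSF1": [500, 0],
        "BsmtFinSF2": [100, 0],
        "SalePrice": [200000, 120000],
        "GrLivArea": [1600, 1000],
    }
)
features = add_engineered_features(toy)
print(features[["BuildingAge", "BsmtTotalFinSF", "PricePerSF"]])
```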
Machine Learning Models
We implemented several machine learning models for different purposes. We first started with Lasso for empirical feature selection. Then we created two predictive models - one linear and one non-linear model. Finally, we ran a multiple linear regression model to find which features make for the most impactful renovations.
1. Lasso
For empirical feature selection, we started with a Lasso model. Lasso favors simpler models by adding a penalty on the size of the predictor coefficients; as the penalty increases, the coefficients shrink toward zero. The strength of the penalty is set by the hyperparameter lambda. With an appropriate lambda, some predictor coefficients are set exactly to zero while others remain non-zero, and predictors that are correlated with other predictors have their overall impact regulated.
Using grid search with cross-validation, we selected the Lasso model that fit the dataset well without overfitting. The model reduced the number of predictors from the original dataset to 81 features. These include numerical variables that were highly correlated with sale price per square foot, such as GrLivArea, OverallQual, OverallCond, and GarageArea, and categorical variables such as Neighborhood, GarageType, and HouseStyle. Please see our GitHub repository for more information.
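The selection procedure can be sketched with scikit-learn as follows. This is not our project code: it runs on synthetic data from make_regression, and the grid of penalty values is an arbitrary choice for illustration. Note that scikit-learn calls the penalty strength alpha rather than lambda.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data: 30 predictors, only 5 informative.
X, y = make_regression(
    n_samples=300, n_features=30, n_informative=5, noise=10.0, random_state=0
)

# Standardize first so the penalty treats all coefficients on the same scale.
pipeline = make_pipeline(StandardScaler(), Lasso(max_iter=10_000))

# Illustrative grid of penalty strengths (alpha plays the role of lambda).
param_grid = {"lasso__alpha": np.logspace(-3, 1, 20)}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="r2")
search.fit(X, y)

# Predictors whose coefficients were not driven to zero are the "selected" ones.
best_lasso = search.best_estimator_.named_steps["lasso"]
selected = np.flatnonzero(best_lasso.coef_)
print("best alpha:", search.best_params_["lasso__alpha"])
print("predictors kept:", len(selected), "of", X.shape[1])
```

The indices in selected play the same role as our list of 81 features: they are the inputs passed on to the later models.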
2. Elastic Net
With the selected features from our Lasso model, we ran an elastic net model to predict sale price. Using grid search and cross-validation, we chose parameters that fit well without overfitting. Our best parameters were lambda = 1e-6 and L1 ratio = 1.0. An L1 ratio of 1.0 removes the L2 part of the penalty entirely, so our elastic net model ended up behaving like a lasso model.
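To see why an L1 ratio of 1.0 reduces elastic net to lasso, the small check below fits both models in scikit-learn on synthetic data and compares their coefficients. The data and the penalty value of 0.5 are arbitrary choices for illustration, not the values from our grid search.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic regression problem, used only to compare the two estimators.
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

alpha = 0.5  # illustrative penalty strength, not the 1e-6 reported above

# With l1_ratio=1.0 the L2 term of the elastic net penalty vanishes,
# leaving only the L1 (lasso) term.
enet = ElasticNet(alpha=alpha, l1_ratio=1.0, max_iter=10_000).fit(X, y)
lasso = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)

same_coefficients = np.allclose(enet.coef_, lasso.coef_)
print("identical coefficients:", same_coefficients)
```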
3. Random Forest
For our next model, we selected a random forest as our non-linear predictive model, as it is a well-tested tree-based model that is robust to overfitting. However, our random forest performed worse than our regularized linear models, mainly because house prices seem to be intrinsically linear. Intuitively, the value of a house typically increases as features are added or improved and decreases as features are removed.
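The intuition can be illustrated on synthetic data where the target really is a linear function of the predictors plus noise. The sketch below does not reproduce our results. The data, coefficients, and model settings are made up, and it only shows the mechanism: a random forest has to approximate a straight line with piecewise-constant steps, so a linear model tends to score higher in cross-validation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Made-up data: five predictors with a purely linear effect on the target.
X = rng.uniform(0, 10, size=(400, 5))
true_coefs = np.array([3.0, -2.0, 1.5, 0.5, 4.0])
y = X @ true_coefs + rng.normal(0, 1.0, size=400)

# Mean 5-fold cross-validated R^2 for a regularized linear model...
linear_r2 = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2").mean()

# ...and for a random forest with illustrative settings.
forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest_r2 = cross_val_score(forest, X, y, cv=5, scoring="r2").mean()

print(f"linear R^2: {linear_r2:.3f}")
print(f"forest R^2: {forest_r2:.3f}")
```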
4. Multiple Linear Regression
What can a homeowner do to increase the value of their property? In order to successfully flip a house, or in other words, if a homeowner wanted to make some renovations for profit, which ones would have the greatest impact on Sale Price?
To answer these questions, we finally ran a multiple linear regression model on a particular subset of predictors. We chose multiple linear regression for the interpretability and simplicity of its coefficients. In multiple linear regression, for every one-unit increase in a given feature, holding the other features constant, you can expect the target variable to change by the value of that feature's coefficient. This allows for easy interpretation and, hence, straightforward insight for homeowners.
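The coefficient interpretation above can be checked directly. In the sketch below, the two predictors, the price formula, and the example house are all made up for illustration. It fits a linear regression and confirms that raising one feature by one unit, with the other held fixed, changes the prediction by exactly that feature's coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Two hypothetical predictors: living area (sq ft) and a 1-10 quality score.
area = rng.uniform(800, 3000, size=200)
quality = rng.integers(1, 11, size=200).astype(float)
X = np.column_stack([area, quality])

# Made-up pricing rule plus noise, for illustration only.
price = 50_000 + 60 * area + 8_000 * quality + rng.normal(0, 5_000, size=200)

model = LinearRegression().fit(X, price)

# Compare a house with the same house after a one-point quality upgrade.
house = np.array([[1500.0, 6.0]])
upgraded = house + np.array([[0.0, 1.0]])
delta = model.predict(upgraded)[0] - model.predict(house)[0]

print(f"quality coefficient:       {model.coef_[1]:.2f}")
print(f"predicted change in price: {delta:.2f}")
```

This is what makes the model useful for ranking renovations: each coefficient reads as the expected change in the target per unit of improvement, provided the features are on comparable scales and not strongly collinear.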
We started with the list of 81 features provided by our Lasso model for house renovations. Because Lasso is nothing more than penalized linear regression, it makes sense to use Lasso's output features as our multiple linear regression model's input features. As a result, our model earned a train score of 0.912, which gives us confidence in the model's ability to explain the data, and ultimately its choice for the most important features.
Additional Insights
Based on the model, when deciding which renovations to make for a successful house-flipping project, a homeowner or investor in Ames, Iowa might consider the following features: total living area, distance from Iowa State University, overall quality and condition, garage area, number of bathrooms, kitchen quality, heating quality, basement exposure, fireplace, and exterior quality.
In terms of quality, the single most important factor in selling a home is the overall quality, material, and finish of the house. If one is prioritizing areas to remodel, starting with outdoor finishes, followed by indoor finishes and finally basement finishes, may be the best approach. If remodeling over several years with plans to sell the home in the future, exterior quality has the advantage of staying in style many decades longer than interior finishes. Because the years since the last remodel also influence sale price, it may be important to order the interior work so that the most outdated areas of the home are those that contribute least to sale price.
For a simpler renovation, homeowners or investors could increase the finished percentage of their basement, which could attract buyers willing to pay more for a fully finished property.
Conclusion
Overall, our analysis showed that a regularized linear model makes better predictions than a tree-based model. We also obtained a list of features ranked by importance for homeowners looking to add value to their property through renovations and for investors looking for a profitable house-flipping project.