Predicting Ames Housing Sale Prices Using Machine Learning Models
Ames is the 9th largest city in Iowa, with 66,427 residents as of the 2020 census. Iowa State University accounts for approximately half of the city's population and is its largest employer.
Putting myself in the shoes of a data scientist at an online real estate database company that provides house price estimates (like Zillow's home price estimates), I carried out this analysis with two goals:
- Investigate possible major factors and features that influence house sale prices through exploration and analysis of Ames housing sales data
- Build predictive machine learning models to make accurate predictions
The Ames house price dataset used for the analysis has 2,580 records of house sales from 2006 to 2010. Its 81 attributes cover sale prices and a wide range of characteristics, including exterior and interior features, condition, and quality. Additionally, the real estate market data provides the longitude and latitude of each house, which supported the analysis of neighborhoods.
Exploratory Data Analysis
The analysis started with EDA to understand house sales and how they relate to different house features. Of the 81 attributes, 79 describe characteristics of the houses. To analyze features one by one and detect connections between similar features, I grouped them into categories such as:
- Other exterior
- Quality & Condition
- No. of rooms
For each feature group, I checked variable distributions and visualized their relationships with house sale prices. Specifically, I drew scatter plots for numerical features and, for categorical features, box plots of sale prices by category. Variables were selected for modeling based on the following rules:
- Keep attributes that have clear relationships with sale prices
- Drop attributes that showed no relationship with prices, both technically and intuitively
- Drop attributes that are correlated with other independent variables, keeping only the most important one in each correlated group to avoid multicollinearity
- Keep attributes with ambiguous relationships with sale prices for further technical analysis
Alongside this analysis, I explored interesting patterns in sale prices and their relationships with other features. The major highlights are as follows:
1) Sale prices
Sale prices are right-skewed with a long tail from $200k to $755k. The middle 50% of prices range from $130k to $210k, and the median is about $160k.
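As a minimal sketch of how such a summary can be computed, the snippet below derives the quartiles and skewness of a price column with pandas. The values are made up for illustration and the column name `SalePrice` is an assumption; the real figures come from the full dataset.

```python
import pandas as pd

# Toy prices standing in for the real sale price column (illustrative values only).
prices = pd.Series(
    [105_000, 130_000, 145_000, 160_000, 175_000, 210_000, 320_000, 755_000],
    name="SalePrice",  # assumed column name
)

# Quartiles describe the middle 50% of prices.
q1, median, q3 = prices.quantile([0.25, 0.5, 0.75])

# Positive skewness (and a mean above the median) indicates a right-skewed distribution.
skewness = prices.skew()

print(f"Q1={q1:,.0f}  median={median:,.0f}  Q3={q3:,.0f}  skew={skewness:.2f}")
```

A long right tail like this is one reason a log transform of the target is sometimes considered before fitting linear models, although the write-up does not state whether one was applied.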
Across sale years, there were no major differences in prices or sale counts, even during the financial crisis in 2008-2009.
Seasonality in sales activity is obvious. The housing market is more active in summer (June and July), but this was not associated with higher prices.
Among the 19 neighborhoods in Ames, there are two major clusters: one west of the university and one on the north side of Ames. North Ames (the pink spots on the graph) is the neighborhood with the most sales over the years.
I visualized sale prices and house ages by neighborhood (sorted by distance to downtown Ames) using violin plots. The result shows that neighborhoods closer to downtown generally have older houses.
Northridge and Northridge Heights, the two newest neighborhoods (average age of 2 years), also have the highest average prices. On the other hand, IDOTRR (railroad) and Southwest of ISU, the two oldest neighborhoods, have mid-to-low-end sale prices on average.
Two types of features were found to be highly correlated with prices: size-related features (including the number of rooms) and quality and condition features.
The living area above ground, the total number of rooms above ground (excluding bathrooms), and the area of the garage are all correlated with sale prices.
4) Quality & condition
As for the quality features, higher quality is associated with higher prices, as one would expect. For the condition levels, however, the relationship is not clear, because the ratings are heavily concentrated at the middle level (5/10), which distorts the relationship with sale prices.
- Pools - An excellent pool quality rating is associated with significantly higher prices. For modeling, this pattern is better handled either by transforming the variable into a Y/N feature or by using tree-based models.
- Number of fireplaces - May be correlated with house size, as larger houses tend to have more fireplaces.
- Utilities and exterior - Houses with public utilities generally have higher prices than those with septic tanks, and sale prices vary across exterior types.
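The Y/N transformation mentioned for pools could look roughly like the sketch below. It assumes the pool quality column is named `PoolQC` and that houses without a pool have a missing value there; both are assumptions about the data layout rather than details given above.

```python
import pandas as pd

# Toy frame: most houses have no pool, so the quality rating is missing.
df = pd.DataFrame({"PoolQC": ["Ex", None, None, "Gd", None]})  # assumed column name

# 1 if any pool quality is recorded, 0 otherwise.
df["HasPool"] = df["PoolQC"].notna().astype(int)

print(df)
```

Collapsing a sparse multi-level rating into one indicator keeps the signal for linear models without spending several dummy columns on a handful of houses.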
New features were created or condensed to assist the analysis.
- Age of the house - year the house was sold minus the year it was built or remodeled
- Age of the garage - same approach
- Number of bathrooms - full bathrooms plus half bathrooms
- Number of bathrooms in the basement - same approach
- Low-quality area above ground / total area - this "low-quality ratio" is included because it negatively affects the house's sale price
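The engineered features listed above can be sketched in pandas as follows. The column names (`YrSold`, `YearBuilt`, `YearRemodAdd`, `GarageYrBlt`, `FullBath`, `HalfBath`, `BsmtFullBath`, `BsmtHalfBath`, `LowQualFinSF`, `GrLivArea`) follow the naming commonly seen in public copies of the Ames data and are assumptions here. Two further assumptions: age is measured from the later of the build and remodel years, and "total area" is taken to be the above-ground living area.

```python
import pandas as pd

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the derived features added (column names are assumed)."""
    out = df.copy()
    # Age at sale, measured from the later of the build year and the remodel year (assumption).
    last_update = out[["YearBuilt", "YearRemodAdd"]].max(axis=1)
    out["HouseAge"] = out["YrSold"] - last_update
    out["GarageAge"] = out["YrSold"] - out["GarageYrBlt"]
    # Bathroom counts: full and half bathrooms are simply added together.
    out["TotalBath"] = out["FullBath"] + out["HalfBath"]
    out["TotalBsmtBath"] = out["BsmtFullBath"] + out["BsmtHalfBath"]
    # Share of above-ground living area that is low quality (denominator is an assumption).
    out["LowQualRatio"] = out["LowQualFinSF"] / out["GrLivArea"]
    return out

# One made-up house to show the calculation.
toy = pd.DataFrame({
    "YrSold": [2008], "YearBuilt": [1990], "YearRemodAdd": [2000], "GarageYrBlt": [1990],
    "FullBath": [2], "HalfBath": [1], "BsmtFullBath": [1], "BsmtHalfBath": [0],
    "LowQualFinSF": [100], "GrLivArea": [2000],
})
features = add_engineered_features(toy)
print(features[["HouseAge", "GarageAge", "TotalBath", "TotalBsmtBath", "LowQualRatio"]])
```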
With the features selected and created so far, I first label-encoded the ordinal variables as numerical values. Then, for all numerical features, two analyses were conducted to refine the selection:
- correlation matrix among all numerical features to detect and resolve multicollinearity
- univariate analysis with house sale prices to refine feature selection
After this selection process, 58 numerical and categorical features were used for modeling, before dummy encoding.
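The two steps above, ordinal encoding followed by a correlation check, might be sketched as follows. The quality scale (`Po` to `Ex`), the column names, and the 0.8 threshold are all illustrative assumptions; the write-up does not state which cutoff was used.

```python
import pandas as pd

# Assumed ordinal scale: poor, fair, typical/average, good, excellent.
QUALITY_ORDER = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

def encode_ordinal(series: pd.Series, order: dict = QUALITY_ORDER) -> pd.Series:
    """Map ordered category labels to integers that preserve their ranking."""
    return series.map(order)

def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.8) -> list:
    """List feature pairs whose absolute Pearson correlation meets the threshold."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] >= threshold:
                pairs.append((cols[i], cols[j]))
    return pairs

# Toy data: garage capacity and garage area move together, month sold does not.
toy = pd.DataFrame({
    "GarageCars": [1, 2, 2, 3, 0],
    "GarageArea": [250, 480, 500, 800, 0],
    "MoSold": [6, 1, 12, 3, 7],
})
pairs = highly_correlated_pairs(toy)
encoded = encode_ordinal(pd.Series(["TA", "Ex", "Gd"]))
print(pairs, encoded.tolist())
```

For each flagged pair, one feature would be kept and the other dropped, in line with the multicollinearity rule stated earlier.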
The data was split 70%/30% into training and test sets. Seven models were fitted, covering linear, non-linear, and tree-based approaches (MLR, ridge, lasso, elastic net, SVR, random forest, and XGBoost). The models were tuned by grid search, and the results are as follows:
- XGBoost is the best model, with the best performance on both the training and test sets
- From a prediction perspective, tree-based models generally perform better than linear models.
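The split-and-tune workflow can be illustrated with scikit-learn as below. This is a sketch for a single model (ridge) on synthetic data, not the actual seven-model comparison; the hyperparameter grid, the 5-fold cross-validation, and the random seed are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the encoded housing features.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# 70% training, 30% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Scale inside the pipeline so each CV fold is standardized on its own training part.
pipeline = make_pipeline(StandardScaler(), Ridge())
param_grid = {"ridge__alpha": [0.1, 1.0, 10.0]}  # hypothetical grid

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out 30%.
test_r2 = search.score(X_test, y_test)
print(search.best_params_, round(test_r2, 3))
```

The same pattern extends to the other models by swapping the estimator and its parameter grid, with the test set touched only once per tuned model.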
To better understand the models, I checked feature importance for the Lasso and XGBoost models, even though the results of predictive models may not be intuitive. Lasso is a good modeling choice for feature selection, and XGBoost is the best predictor.
Although some important features cannot be explained by common sense, it is clear that overall quality, garage capacity, above-ground living area, number of fireplaces, etc. are the major price influencers in the predictive models as well, which is consistent with the EDA conclusions.
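One common way to read feature importance from a Lasso model is to rank standardized coefficients by absolute size, as in the sketch below. The data is synthetic, and the feature names and `alpha` value are hypothetical; it shows the general technique rather than the exact procedure used in this analysis.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
feature_names = ["strong", "irrelevant", "weak"]  # hypothetical features

# The target depends strongly on the first feature, weakly on the third, not on the second.
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Standardize so coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(X)
model = Lasso(alpha=0.1).fit(X_scaled, y)

# Rank features by absolute coefficient; Lasso shrinks unhelpful ones toward zero.
ranking = sorted(
    zip(feature_names, model.coef_), key=lambda pair: abs(pair[1]), reverse=True
)
for name, coef in ranking:
    print(f"{name:>10}: {coef:+.3f}")
```

Tree-based models such as XGBoost expose importance scores on a different basis, so rankings from the two model types are not directly comparable.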
As for next steps, the analysis could dig deeper into feature selection and feature engineering to generate better features for modeling. Additionally, model tuning is time-consuming. With more time, the analysis could be expanded to a wider range of hyperparameters to further improve predictive performance.