Data Analysis on Homes to Rent in College Towns
Homecoming weekend draws university alumni and destination tourists to college towns. The experiences of Homecoming attendees include event activities and programs, touring the college town, and hospitality services (1). As an example of the economic forces behind Homecoming, 18,000 visitors occupied all 600 hotel rooms in the city that hosts a local University of Arkansas campus (2). Similarly, data shows that homecoming attendees at Lane College in Jackson, TN, spent around $1 million in local hotels and restaurants over that single weekend.
In this project, we propose a new Experience Factor that applies to college towns and capitalizes on untapped revenue from the lodging associated with University events. Our goals were:
- Identify features that most affect housing value
- Predict housing values to tie them into the daily rate calculation
- Describe houses that meet the quality expectations of the Airbnb brand
- Use these predictions on identified houses to project potential revenue
To fulfill our goals, we focused on the college town of Ames, Iowa. The Ames, Iowa dataset from Kaggle was our source of information about the homes surrounding the Iowa State University campus.
Data Cleaning and Imputation
First, we needed to add location data (latitude and longitude) to our existing dataset. With these, we could calculate the distance to important landmarks such as downtown, the airport, and Jack Trice Stadium, the Iowa State University stadium that hosts most of the major football and alumni events.
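As a minimal sketch of the distance calculation, the haversine formula gives the great-circle distance between two latitude/longitude points. The stadium coordinates below are approximate and assumed for illustration, and the function names are hypothetical rather than taken from our project code.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

# Approximate coordinates of Jack Trice Stadium (an assumption for illustration).
JACK_TRICE_STADIUM = (42.0140, -93.6358)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_to_stadium_km(lat, lon):
    """Distance from a home to the (assumed) stadium coordinates."""
    return haversine_km(lat, lon, *JACK_TRICE_STADIUM)
```

The same helper can be reused for downtown or the airport by passing in a different landmark coordinate.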
Once the datasets were merged, we had to decide how to deal with missing data. Based on prior knowledge of the dataset, we knew that certain columns with missing data were not truly missing. For example, a missing value for garage quality instead meant that the house did not have a garage as part of the property. Therefore, we wrote a function that took the columns sharing this property with garage quality and imputed a label indicating that the house feature did not exist (‘DNE’).
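A function of this kind might look like the sketch below. The column names follow the Kaggle Ames data, but the rows are made up, and both the helper name and the list of affected columns are assumptions for illustration.

```python
import pandas as pd

# Toy rows; the values are invented for this example.
homes = pd.DataFrame({
    "GarageQual": ["TA", None, "Gd"],
    "BsmtQual": [None, "TA", "Ex"],
    "LotArea": [8450, 9600, 11250],
})


def fill_absent_features(df, columns, label="DNE"):
    """Return a copy where missing values in `columns` are replaced by `label`."""
    out = df.copy()
    out[columns] = out[columns].fillna(label)
    return out


# Assumed list of columns where "missing" means "feature does not exist".
absent_means_none = ["GarageQual", "BsmtQual"]
cleaned = fill_absent_features(homes, absent_means_none)
```

Working on a copy keeps the raw data intact, so the imputation can be rerun with a different column list.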
For certain features such as lot frontage or the year the garage was built (if there was a garage), we wanted to take a different approach than simply using the average. We posited that the development of neighborhoods in Ames occurred at different times and had different zoning requirements. If these assumptions were true, then homes in the same neighborhood would have similar characteristics and would provide more accurate values for filling in missing data. For any numerical features that we believed were independent of the neighborhood, we filled with the mean, while for categorical features we imputed missing values with the mode.
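The three imputation strategies can be sketched with pandas as follows. The data is a toy example, the per-neighborhood median is an assumed choice of statistic, and treating `MasVnrArea` as neighborhood-independent is an assumption made only to illustrate the mean fill.

```python
import pandas as pd

# Toy rows; column names follow the Kaggle Ames data, values are invented.
homes = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "Gilbert", "Gilbert"],
    "LotFrontage": [70.0, None, 80.0, 60.0, None],
    "MasVnrArea": [0.0, 100.0, None, 200.0, 100.0],
    "Electrical": ["SBrkr", "SBrkr", None, "FuseA", "SBrkr"],
})

# Neighborhood-dependent numeric feature: fill from homes in the same neighborhood
# (median is an assumption; the mean would work the same way).
neighborhood_median = homes.groupby("Neighborhood")["LotFrontage"].transform("median")
homes["LotFrontage"] = homes["LotFrontage"].fillna(neighborhood_median)

# Numeric feature assumed independent of neighborhood: fill with the overall mean.
homes["MasVnrArea"] = homes["MasVnrArea"].fillna(homes["MasVnrArea"].mean())

# Categorical feature: fill with the most frequent value (the mode).
homes["Electrical"] = homes["Electrical"].fillna(homes["Electrical"].mode().iloc[0])
```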
Data on Feature Selection
Originally, we used a linear regression model because we wanted a level of interpretability of home prices that would be less available in other machine learning models. However, we already had a large number of predictors, and adding more by encoding categorical features would cause our model to overfit. Therefore, we used Lasso regression for feature selection and eliminated 13 predictors from our model.
We selected the Lasso penalty over Ridge because we wanted the regression coefficients of less important predictors to shrink to exactly zero. After feature selection, we ran a linear regression and found that the most important features in predicting sale price were overall home quality, overall home condition, above-ground living area in square feet, and how new the home was (Figure 1).
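To illustrate how the Lasso penalty drives unhelpful coefficients to exactly zero, here is a small sketch on synthetic data using scikit-learn. The data, the `alpha` value, and the variable names are assumptions for illustration and are not our actual settings.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: only the first two of five predictors affect the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# The L1 penalty shrinks coefficients and sets weak ones to exactly zero.
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

# Keep only predictors with a non-zero Lasso coefficient.
selected = [i for i, coef in enumerate(lasso.coef_) if coef != 0.0]

# Refit an ordinary linear regression on the surviving predictors,
# which removes the shrinkage from the reported coefficients.
ols = LinearRegression().fit(X[:, selected], y)
```

With Ridge, the three noise predictors would keep small non-zero coefficients, so they could not be dropped this way.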
Figure 1. Standardized coefficients between features and log(SalePrice)
Data on the Gradient boosted model
In our linear regression model, we identified the most influential features on housing price. However, two outliers in the data reduced the predictive power of our model. This was especially detrimental to our goals because our estimation of the daily home-sharing rate depends on the value of the home. We calculated that every $10,000 by which our model underpredicts the true sale price is an additional 20 cents per night of lost revenue. To minimize our prediction error, we decided to implement a non-linear, tree-based model. Using gradient boosting and 5-fold cross-validation, our model produced an R^2 value of 0.91 and an RMSE of 0.118 on the log-transformed sale price.
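The evaluation setup can be sketched as below with scikit-learn. The synthetic features, default hyperparameters, and shuffled folds are assumptions for illustration, so the scores will not match the figures reported above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in for the housing features and log(SalePrice).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 4))
log_price = (
    11.5
    + 0.8 * X[:, 0]
    + 0.3 * np.sin(6 * X[:, 1])
    + rng.normal(scale=0.05, size=300)
)

model = GradientBoostingRegressor(random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Score each of the 5 folds on R^2 and RMSE of the log-transformed target.
scores = cross_validate(
    model, X, log_price, cv=cv,
    scoring=["r2", "neg_root_mean_squared_error"],
)
mean_r2 = scores["test_r2"].mean()
# scikit-learn reports RMSE negated so that larger is better; flip the sign.
mean_rmse = -scores["test_neg_root_mean_squared_error"].mean()
```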
Further evaluation showed that our gradient boosted model did very well at predicting prices of lower-valued homes but was less effective at pricing more expensive properties. This makes sense given that the distribution of home prices in Ames is skewed towards the less expensive side of the housing market. In future work, we would further tune our existing model, as well as try other tree-based models such as XGBoost or LightGBM.
Houses that satisfy our quality standards
One of the long-term goals of our project is to attract new hosts to offer their properties. To do that, we described houses that meet the quality expectations of the Airbnb brand. So, we filtered out the houses in the dataset with features that rated below average (Figure 2).
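A filter of this kind could be sketched as follows. The column names follow the Kaggle Ames data, where the overall scores run from 1 to 10 with 5 as average and "TA" is the typical/average grade; the rows and the specific set of columns checked are assumptions for illustration.

```python
import pandas as pd

# Toy rows; the values are invented for this example.
homes = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "OverallQual": [7, 4, 6, 8],
    "OverallCond": [5, 6, 3, 7],
    "KitchenQual": ["Gd", "TA", "Gd", "Fa"],
})

AVERAGE_SCORE = 5                      # midpoint of the 1-10 rating scale
ACCEPTABLE_GRADES = ["Ex", "Gd", "TA"]  # average or better on the letter scale

# A home qualifies only if every checked feature is rated average or better.
meets_standard = (
    (homes["OverallQual"] >= AVERAGE_SCORE)
    & (homes["OverallCond"] >= AVERAGE_SCORE)
    & homes["KitchenQual"].isin(ACCEPTABLE_GRADES)
)
candidates = homes[meets_standard]
```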
We also investigated how well the houses meet the needs of potential guests, since that shapes their experience. First, we added relevant information about the houses, such as the number of guests they can host and the number of bathrooms. We also reasoned that guests attending events such as Homecoming would benefit from knowing how far the houses are from the University’s stadium, so we added a feature that indicates whether each house is within walking, biking, or driving distance of the stadium (Figure 3).
Figure 2. Houses that satisfy our quality standards in Ames, IA.
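The walking, biking, or driving label described above could be derived from the distance to the stadium with simple cutoffs. The thresholds and function name below are assumptions for illustration, not the values we used.

```python
def travel_mode(distance_km, walk_max_km=1.5, bike_max_km=5.0):
    """Label a home by the most convenient way to reach the stadium.

    The cutoffs are hypothetical defaults and can be tuned per town.
    """
    if distance_km < 0:
        raise ValueError("distance_km must be non-negative")
    if distance_km <= walk_max_km:
        return "walking"
    if distance_km <= bike_max_km:
        return "biking"
    return "driving"
```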
Projected versus current Airbnb rates Data
The non-linear model we described allows us to predict the values of the houses. But how does that tie into a daily-rate calculation? The house value plays a role in determining the daily rate of Airbnb hosts. A suggestion made by the company is to base the daily rate on your monthly mortgage. We calculated the monthly mortgage payment of each property, using the average 30-year mortgage interest rate for the sale year.
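As a sketch of this step, the standard fixed-rate amortization formula gives the monthly payment from the home value and the annual interest rate. Financing the full home value with no down payment and spreading the payment over a 30-day month are assumptions made for illustration, and the function names are hypothetical.

```python
def monthly_mortgage_payment(principal, annual_rate, years=30):
    """Fixed monthly payment for a fully amortizing loan."""
    n_payments = years * 12
    monthly_rate = annual_rate / 12
    if monthly_rate == 0:
        return principal / n_payments
    growth = (1 + monthly_rate) ** n_payments
    return principal * monthly_rate * growth / (growth - 1)


def base_daily_rate(home_value, annual_rate, days_per_month=30):
    """Daily rate that covers the monthly payment (assumes a 30-day month)."""
    return monthly_mortgage_payment(home_value, annual_rate) / days_per_month
```

For example, `base_daily_rate(predicted_value, 0.06)` would convert a predicted sale price into a non-event nightly rate at a 6% annual rate.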
When compared to actual daily rates listed by the company, our daily rate numbers are smaller. We expected this underestimation since we did not take property taxes and other factors specific to each house into account. Also, there is a boost factor that increases the rates and depends on the lodging demand. We observed that the daily rates rose 2 to 8 times for Homecoming when compared to those of a non-holiday weekend.
If all the residences we selected became hosts, Airbnb could increase its revenue by $4,259 to $17,035 from one Homecoming weekend, depending on the boost factor of prices. On average, each host could earn $140 to $556.72, assuming a two-night stay.
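The effect of the boost factor on a single host's earnings is simple arithmetic, sketched below. The base nightly rate is a hypothetical round number chosen for illustration, so the upper figure will not exactly reproduce the average reported above.

```python
def host_earnings(base_daily_rate, boost_factor, nights=2):
    """Earnings for one stay when event demand multiplies the base rate."""
    return base_daily_rate * boost_factor * nights


base_rate = 35.0  # hypothetical non-event nightly rate in dollars

low = host_earnings(base_rate, boost_factor=2)   # 2x boost, two nights
high = host_earnings(base_rate, boost_factor=8)  # 8x boost, two nights
```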
Figure 3. Predicted daily rates with information relevant to the guests.
A surprising insight from our analysis is that most of the homes that meet our quality standards can host 6+ guests (Figure 4). We think that a good strategy would be to incentivize hosts to split the home into individually rented rooms. This could increase revenue by expanding availability for smaller groups of guests.
Figure 4. Most houses can host 6+ guests.
- We found that only a small proportion of the houses in the dataset meet our quality standards. We would like to explore which renovations would best improve quality and make houses suitable for Airbnb with a small investment, further increasing rental supply, especially for smaller-capacity homes.
- Our results indicate that implementing the Homecoming Experience factor in Ames would be profitable for Airbnb and for the hosts. Thus, we could scale this experience factor to include other college towns within the United States, making Airbnb the primary lodging choice for major events.
- Hongping Zhang et al. (2018). Place attachment and attendees’ experiences of homecoming event. Journal of Sport & Tourism, 22(3), 227-246.
- Timothy Johns (2016). The Financial Impact of Homecoming.