Data Analysis on Homes to Rent in College Towns

Douglas Pizac and Gabriela Huelgas Morales

Posted on Mar 15, 2021

The skills we demonstrate here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

Introduction

Homecoming weekend draws university alumni and destination tourists into college towns. Attendees' experiences include event activities and programs, touring the college town, and hospitality services (1). As an example of the economic forces behind Homecoming, 18,000 visitors occupied all 600 hotel rooms in the home city of a University of Arkansas campus (2). Similarly, data show that Homecoming attendees at Lane College in Jackson, TN, spent around $1 million in local hotels and restaurants over that single weekend.

In this project, we propose a new Experience Factor that applies to college towns and capitalizes on untapped revenue from the lodging associated with University events. Our goals were:

Identify features that most affect housing value
Predict housing values to tie them into the daily rate calculation
Describe houses that meet the quality expectations of the Airbnb brand
Use these predictions on identified houses to project potential revenue

To fulfill our goals, we focused on the college town of Ames, Iowa. The Ames, Iowa dataset from Kaggle was our source of information about the homes surrounding the Iowa State University campus.

Data Cleaning and Imputation

First, we needed to add location data (latitude and longitude) to our existing dataset. With these, we could calculate the distance to important landmarks such as downtown, the airport, and Jack Trice Stadium, the Iowa State University stadium that hosts most of the major football and alumni events.
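As a minimal sketch of the distance calculation, the haversine formula gives the great-circle distance between a house and a landmark. The stadium coordinates below are approximate and illustrative, and the `houses` records and the `dist_stadium_mi` field are hypothetical names, not taken from our dataset.

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    r = 3958.8  # mean Earth radius in miles
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Approximate, illustrative coordinates for the stadium (assumption).
STADIUM = (42.0140, -93.6358)

# Hypothetical house records with latitude and longitude already attached.
houses = [
    {"id": 1, "lat": 42.0200, "lon": -93.6400},
    {"id": 2, "lat": 42.0500, "lon": -93.6200},
]

for house in houses:
    house["dist_stadium_mi"] = haversine_miles(
        house["lat"], house["lon"], *STADIUM
    )
```

The same function can be reused for downtown and the airport by swapping in their coordinates.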

Once the datasets were merged, we had to decide how to handle missing data. Based on prior knowledge of the dataset, we knew that values in certain columns were not truly missing. For example, a missing value for garage quality meant that the property had no garage. Therefore, we wrote a function that took the columns sharing this property with garage quality and imputed a value indicating that the house feature did not exist ('DNE').
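A function of this kind might look like the sketch below, written with pandas. The column list `ABSENT_COLS` and the function name `fill_absent` are illustrative assumptions; the exact set of columns we treated this way is not listed here.

```python
import pandas as pd

# Hypothetical subset of columns where a missing value means "feature absent".
ABSENT_COLS = ["GarageQual", "BsmtQual", "PoolQC"]

def fill_absent(df, cols, label="DNE"):
    """Return a copy of df with NaN replaced by `label` in the given columns."""
    out = df.copy()
    present = [c for c in cols if c in out.columns]  # skip columns not in df
    out[present] = out[present].fillna(label)
    return out

# Toy data standing in for the housing table.
df = pd.DataFrame({
    "GarageQual": ["TA", None, "Gd"],
    "BsmtQual": [None, "TA", "Ex"],
    "LotFrontage": [60.0, None, 75.0],
})

clean = fill_absent(df, ABSENT_COLS)
```

Columns outside the list, such as `LotFrontage` here, are left untouched so they can be imputed by a different rule.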

For certain features, such as lot frontage or the year the garage was built (if there was a garage), we wanted to take a different approach than simply using the overall average. We posited that the neighborhoods in Ames were developed at different times and under different zoning requirements. If these assumptions were true, homes in the same neighborhood would have similar characteristics, so values from the same neighborhood would fill the gaps more accurately. For numerical features that we believed were independent of the neighborhood, we filled missing values with the mean, while for categorical features we imputed the mode.
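The three imputation rules can be sketched with pandas as follows. The per-neighborhood statistic (a mean here) and the choice of which columns fall under which rule are assumptions made for illustration, and the data are a toy example.

```python
import pandas as pd

df = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "Gilbert", "Gilbert"],
    "LotFrontage": [70.0, None, 80.0, 60.0, None],
    "MasVnrArea": [100.0, None, 0.0, 200.0, 100.0],
    "Electrical": ["SBrkr", None, "SBrkr", "SBrkr", "FuseA"],
})

# Rule 1: neighborhood-dependent numeric feature, filled per neighborhood.
neighborhood_mean = df.groupby("Neighborhood")["LotFrontage"].transform("mean")
df["LotFrontage"] = df["LotFrontage"].fillna(neighborhood_mean)

# Rule 2: neighborhood-independent numeric feature, filled with the overall mean.
df["MasVnrArea"] = df["MasVnrArea"].fillna(df["MasVnrArea"].mean())

# Rule 3: categorical feature, filled with the most frequent value (mode).
df["Electrical"] = df["Electrical"].fillna(df["Electrical"].mode()[0])
```

Using `transform` keeps the result aligned with the original rows, so each missing value receives the statistic of its own neighborhood.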

Data on Feature Selection

Originally, we used a linear regression model because we wanted a level of interpretability of home prices that is less available in other machine learning models. However, we already had a large number of predictors, and adding more by encoding categorical features would cause our model to overfit. Therefore, we used Lasso regression for feature selection and eliminated 13 predictors from our model.

We selected the Lasso penalty over Ridge because we wanted the regression coefficients of less important predictors to shrink to exactly zero. After feature selection, we ran a linear regression and found that the most important features in predicting sale price were overall home quality, overall home condition, above-ground living area in square feet, and how new the home was (Figure 1).
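The selection step can be illustrated with scikit-learn on synthetic data. This is a sketch of the general technique, not our exact pipeline: the data, the use of `LassoCV` to pick the penalty strength, and the variable names are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples, n_features = 300, 10
X = rng.normal(size=(n_samples, n_features))

# Synthetic target: only the first three features carry signal.
y = (3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2]
     + rng.normal(scale=0.5, size=n_samples))

# Standardize so the L1 penalty treats every predictor on the same scale.
X_std = StandardScaler().fit_transform(X)

# Cross-validation chooses the penalty strength alpha.
lasso = LassoCV(cv=5, random_state=0).fit(X_std, y)

kept = np.flatnonzero(lasso.coef_ != 0)      # predictors retained
dropped = np.flatnonzero(lasso.coef_ == 0)   # predictors eliminated
```

The predictors in `kept` would then be passed to an ordinary linear regression, whose standardized coefficients can be compared directly.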

Figure 1. Standardized coefficients between features and log(SalePrice)

Data on the Gradient boosted model

In our linear regression model, we identified the features with the most influence on housing price. However, two outliers in the data reduced the predictive power of our model. This was especially detrimental to our goals because our estimate of the daily home-sharing rate depends on the value of the home. We calculated that every $10,000 by which our model underpredicts the true sale price costs an additional 20 cents per night in lost revenue. To minimize our prediction error, we decided to implement a non-linear, tree-based model. Using gradient boosting and 5-fold cross-validation, our model produced an R^2 value of 0.91 and an RMSE of 0.118 on the normalized (log) sale price.
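A minimal sketch of this evaluation setup with scikit-learn is shown below. The data are synthetic and the hyperparameters are illustrative assumptions, so the scores it produces are unrelated to the figures reported above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_validate

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 4))

# Synthetic stand-in for log(SalePrice) with a non-linear component.
log_price = (11.5 + 1.2 * X[:, 0] + 0.5 * np.sin(6 * X[:, 1])
             + rng.normal(scale=0.1, size=300))

model = GradientBoostingRegressor(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
)

# 5-fold cross-validation, scoring both R^2 and RMSE on the log scale.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    model, X, log_price, cv=cv,
    scoring=("r2", "neg_root_mean_squared_error"),
)

mean_r2 = scores["test_r2"].mean()
mean_rmse = -scores["test_neg_root_mean_squared_error"].mean()
```

scikit-learn reports error metrics as negated scores so that higher is always better, which is why the RMSE is negated back before averaging.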

On further evaluation, our gradient boosted model did very well at predicting prices of lower-valued homes but was less effective at pricing more expensive properties. This makes sense given that the distribution of home prices in Ames is concentrated at the less expensive end of the housing market. In future work, we would further tune our existing model and try other tree-based boosting libraries such as XGBoost or LightGBM.

Houses that satisfy our quality standards

One of the long-term goals of our project is to attract new hosts to offer their properties. To do that, we described houses that meet the quality expectations of the Airbnb brand by filtering out the houses in the dataset with features rated below average (Figure 2).
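One way such a filter could be written is sketched below. The specific columns, the mapping `QUAL_RANK`, and the thresholds (a label of at least "TA" and a numeric score of at least 5) are assumptions for illustration; they are one reading of "below average", not a record of the exact rule we applied.

```python
import pandas as pd

# Ordinal quality labels from worst to best; "TA" denotes typical/average.
QUAL_RANK = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

def meets_quality(df, label_cols, score_cols, min_label="TA", min_score=5):
    """Keep rows where every listed quality feature is at least average."""
    mask = pd.Series(True, index=df.index)
    for col in label_cols:
        # Unknown labels (for example 'DNE') map to 0 and fail the check.
        mask &= df[col].map(QUAL_RANK).fillna(0) >= QUAL_RANK[min_label]
    for col in score_cols:
        mask &= df[col] >= min_score
    return df[mask]

# Toy data standing in for the housing table.
df = pd.DataFrame({
    "OverallQual": [7, 4, 6, 8],
    "KitchenQual": ["Gd", "TA", "Fa", "Ex"],
    "ExterQual": ["TA", "TA", "Gd", "Gd"],
})

eligible = meets_quality(df, ["KitchenQual", "ExterQual"], ["OverallQual"])
```

Whether an absent feature should disqualify a house is a design choice; here it does, which may be too strict for optional features such as a pool.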

We also investigated what potential guests need in order to have a positive experience. First, we added relevant information about the houses, such as the number of guests they can host and the number of bathrooms. We also reasoned that guests attending events such as Homecoming would benefit from knowing how far the houses are from the University's stadium, so we added a feature that indicates whether each house is within walking, biking, or driving distance of the stadium (Figure 3).
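The distance feature amounts to binning a continuous distance into three labels. In the sketch below, the cutoffs `walk_max` and `bike_max` are hypothetical values chosen for illustration; the thresholds we actually used are not stated here.

```python
def travel_mode(distance_miles, walk_max=1.0, bike_max=3.0):
    """Label a distance to the stadium as walking, biking, or driving."""
    if distance_miles < 0:
        raise ValueError("distance must be non-negative")
    if distance_miles <= walk_max:
        return "walking"
    if distance_miles <= bike_max:
        return "biking"
    return "driving"

# Hypothetical distances (in miles) for three houses.
labels = [travel_mode(d) for d in (0.4, 2.2, 5.0)]
```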

Figure 2. Houses that satisfy our quality standards in Ames, IA.

Projected versus current Airbnb rates Data

The non-linear model we described allows us to predict the values of the houses. But how does that tie into a daily-rate calculation? The house value plays a role in determining the daily rate of Airbnb hosts. A suggestion made by the company is to base the daily rate on your monthly mortgage. We calculated the monthly mortgage payment for each property using the average 30-year interest rate for the sale year.
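The mortgage step can be sketched with the standard fixed-rate amortization formula. Two simplifications below are assumptions made for illustration: the full home value is financed with no down payment, and the monthly payment is spread evenly over 30 days to obtain a base daily rate.

```python
def monthly_mortgage(principal, annual_rate, years=30):
    """Fixed-rate monthly payment from the standard amortization formula."""
    n = years * 12          # number of monthly payments
    r = annual_rate / 12    # monthly interest rate
    if r == 0:
        return principal / n
    growth = (1 + r) ** n
    return principal * r * growth / (growth - 1)

def base_daily_rate(home_value, annual_rate, days_per_month=30):
    """Spread the monthly payment evenly over the days of a month."""
    return monthly_mortgage(home_value, annual_rate) / days_per_month

# Example: a $200,000 home at a 6% annual rate over 30 years.
payment = monthly_mortgage(200_000, 0.06)
daily = base_daily_rate(200_000, 0.06)
```

Because the payment is linear in the principal, any error in the predicted home value carries through proportionally to the daily rate.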

Compared to the actual daily rates listed by the company, our daily rates are lower. We expected this underestimation since we did not take property taxes and other factors specific to each house into account. Also, there is a boost factor that increases the rates and depends on the lodging demand. We observed that the daily rates for Homecoming were 2 to 8 times those of a non-holiday weekend.

If all the residences we selected became hosts, Airbnb could increase its revenue by $4,259 to $17,035 from one Homecoming weekend, depending on the boost factor applied to prices. On average, each host could earn $140 to $556.72, assuming a two-night stay.
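The host-side arithmetic can be sketched as base daily rate times boost factor times number of nights. This is one plausible reading of the calculation rather than a record of it, and the base rates below are hypothetical values.

```python
def weekend_earnings(base_daily_rate, boost, nights=2):
    """Host earnings for one event stay under a given demand boost factor."""
    return base_daily_rate * boost * nights

# Hypothetical base daily rates for three eligible houses.
base_rates = [30.0, 35.0, 40.0]

# Range of per-host earnings for boost factors of 2 and 8.
low = [weekend_earnings(rate, boost=2) for rate in base_rates]
high = [weekend_earnings(rate, boost=8) for rate in base_rates]
```

Under this reading, a base rate of $35 per night with a boost factor of 2 over two nights yields $140.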

Figure 3. Predicted daily rates with information relevant to the guests.

A surprising insight from our analysis is that most of the homes that meet our quality standards can host 6+ guests (Figure 4). We think that a good strategy would be to incentivize hosts to split the home into individually rented rooms. This could increase revenue by expanding availability for smaller groups.

Figure 4. Most houses can host 6+ guests.

Future Work

We found that only a small proportion of the houses in the dataset meet our quality standards. We would like to explore which renovations would best improve quality and make houses suitable for Airbnb with a small investment, further increasing rental supply, especially for smaller-capacity homes.

Our results indicate that implementing the Homecoming Experience Factor in Ames would be profitable for Airbnb and for the hosts. Thus, we could scale this Experience Factor to other college towns within the United States, making Airbnb the primary lodging choice for major events.

References

(1) Hongping Zhang et al. (2018). Place attachment and attendees' experiences of homecoming event. Journal of Sport & Tourism, 22(3), 227-246.
(2) Timothy Johns (2016). The Financial Impact of Homecoming.

About Authors

Douglas Pizac

After spending over six years working in academic research, I am transitioning into the exciting world of data science. When I am not trying to unveil new data-driven insights, I enjoy traveling, exercise, and spending time with friends...

Gabriela Huelgas Morales

I am a Data Scientist with a Ph.D. in Biomedical Sciences. I enjoy the challenges of solving complex problems, finding meaningful relationships within the data, and providing actionable recommendations and insights. Before joining NYCDSA, I was a scientist...
