Finding Homes to Rent in College Towns

Douglas Pizac
Gabriela Huelgas Morales
Posted on Mar 15, 2021




Homecoming weekend draws university alumni and destination tourists into college towns. Attendees take part in event activities and programs, tour the college town, and use local hospitality services (1). The economic impact can be substantial: in one example, 18,000 visitors filled all 600 hotel rooms in the home city of a University of Arkansas campus (2). Similarly, Homecoming attendees at Lane College in Jackson, TN, spent around $1 million in local hotels and restaurants over a single weekend.

In this project, we propose a new Experience Factor that applies to college towns and capitalizes on untapped revenue from the lodging associated with University events. Our goals were:

  1. Identify features that most affect housing value
  2. Predict housing values to tie them into the daily rate calculation
  3. Describe houses that meet the quality expectations of the Airbnb brand
  4. Use these predictions on identified houses to project potential revenue

To fulfill our goals, we focused on the college town of Ames, Iowa. The Ames, Iowa dataset from Kaggle was our source of information about the homes surrounding the Iowa State University campus.


Data Cleaning and Imputation

First, we needed to add location data (latitude and longitude) to our existing dataset. With these, we could calculate the distance to important landmarks such as downtown, the airport, and Jack Trice Stadium, the Iowa State University stadium that hosts most of the major football and alumni events.
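As a rough illustration of the distance step, the sketch below computes the great-circle distance between a house and the stadium with the haversine formula. The post does not say which distance method was used, so the formula, the function name, the approximate stadium coordinates, and the example house location are all assumptions.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# Approximate coordinates, for illustration only.
JACK_TRICE_STADIUM = (42.0140, -93.6358)
example_house = (42.0300, -93.6200)  # hypothetical house location

distance = haversine_miles(*example_house, *JACK_TRICE_STADIUM)
print(f"{distance:.2f} miles to the stadium")
```

The same function could be applied to each row of the merged dataset for every landmark of interest (downtown, the airport, the stadium).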

Once the datasets were merged, we had to decide how to handle missing data. Based on prior knowledge of the dataset, we knew that some missing values were not truly missing. For example, a missing value for garage quality meant that the property had no garage. Therefore, we wrote a function that took the columns with this kind of structural missingness and imputed a value indicating that the feature did not exist (‘DNE’).
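A minimal sketch of this kind of imputation, assuming pandas. The function name and the list of columns are hypothetical; only garage quality is named in the text, and the other columns are picked from the Kaggle dataset for illustration.

```python
import pandas as pd

# Hypothetical subset of columns where NaN means "the feature does not exist".
ABSENT_FEATURE_COLUMNS = ["GarageQual", "BsmtQual", "PoolQC"]


def fill_absent_features(df, columns, label="DNE"):
    """Return a copy of df where missing values in `columns` become `label`."""
    out = df.copy()
    out[columns] = out[columns].fillna(label)
    return out


# Toy data standing in for the Ames dataset.
homes = pd.DataFrame(
    {
        "GarageQual": ["TA", None, "Gd"],
        "BsmtQual": [None, "TA", "Ex"],
        "PoolQC": [None, None, None],
        "LotArea": [8450, 9600, 11250],
    }
)

cleaned = fill_absent_features(homes, ABSENT_FEATURE_COLUMNS)
print(cleaned)
```

Keeping 'DNE' as its own category lets a later encoding step treat "no garage" as information rather than as a gap.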

For certain features, such as lot frontage or the year the garage was built (if there was a garage), we wanted a different approach than simply using the overall average. We posited that neighborhoods in Ames were developed at different times and under different zoning requirements. If these assumptions were true, homes in the same neighborhood would have similar characteristics, so values drawn from the same neighborhood would fill the gaps more accurately. For numerical features that we believed were independent of the neighborhood, we filled missing values with the mean, while for categorical features we imputed the mode.
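The three imputation strategies can be sketched as follows, assuming pandas. The post does not state which neighborhood-level statistic was used, so the median here is an assumption, and the choice of `MasVnrArea` and `Electrical` as the neighborhood-independent examples is hypothetical.

```python
import pandas as pd

# Toy data standing in for the Ames dataset.
homes = pd.DataFrame(
    {
        "Neighborhood": ["NAmes", "NAmes", "NAmes", "Gilbert", "Gilbert"],
        "LotFrontage": [70.0, None, 80.0, 60.0, None],
        "MasVnrArea": [100.0, None, 0.0, 200.0, 100.0],
        "Electrical": ["SBrkr", "SBrkr", None, "FuseA", "SBrkr"],
    }
)

# Neighborhood-dependent feature: fill with the neighborhood median
# (the choice of median is an assumption).
neighborhood_median = homes.groupby("Neighborhood")["LotFrontage"].transform("median")
homes["LotFrontage"] = homes["LotFrontage"].fillna(neighborhood_median)

# Numeric feature treated as neighborhood-independent: fill with the overall mean.
homes["MasVnrArea"] = homes["MasVnrArea"].fillna(homes["MasVnrArea"].mean())

# Categorical feature: fill with the most frequent value (the mode).
homes["Electrical"] = homes["Electrical"].fillna(homes["Electrical"].mode().iloc[0])

print(homes)
```

`transform` returns a column aligned with the original rows, which is what makes the per-neighborhood fill a one-liner.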


Feature Selection

Originally, we used a linear regression model because we wanted a level of interpretability of home prices that would be less available in other machine learning models. However, we already had a large number of predictors, and adding more by encoding categorical features would cause our model to overfit. Therefore, we used Lasso regression for feature selection and eliminated 13 predictors from our model.

We selected the Lasso penalty over Ridge because we wanted the regression coefficients of less important predictors to shrink to exactly zero. After feature selection, we ran a linear regression and found that the most important features in predicting sale price were overall home quality, overall home condition, above-ground living area in square feet, and how new the home was (Figure 1).
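To illustrate how a Lasso penalty performs feature selection, here is a small sketch on synthetic data, assuming scikit-learn. The feature names, the penalty strength `alpha=0.5`, and the data are made up for the example and are not the settings used in the project.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_homes = 200
X = rng.normal(size=(n_homes, 5))
# Only the first two columns drive the target; the rest are pure noise.
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n_homes)

# Hypothetical feature names for the synthetic columns.
feature_names = ["quality", "living_area", "noise_a", "noise_b", "noise_c"]

# Standardize so the penalty treats every feature on the same scale.
X_scaled = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=0.5).fit(X_scaled, y)

# Features whose coefficients were not shrunk to zero survive the selection.
selected = [name for name, coef in zip(feature_names, lasso.coef_) if coef != 0]
print(selected)
```

A Ridge penalty on the same data would shrink the noise coefficients toward zero without eliminating them, which is why Lasso is the more natural choice for pruning predictors.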

Figure 1. Standardized coefficients between features and log(SalePrice)


Gradient boosted model

With our linear regression model, we identified the features with the most influence on housing price. However, two outliers in the data reduced the predictive power of our model. This was especially detrimental to our goals because our estimate of the daily home-sharing rate depends on the value of the home. We calculated that every $10,000 by which our model underpredicts the true sale price costs an additional 20 cents per night in lost revenue. To minimize our prediction error, we decided to implement a non-linear, tree-based model. Using gradient boosting and 5-fold cross-validation, our model produced an R^2 of 0.91 and an RMSE of 0.118 on the log-transformed sale price.
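The evaluation setup described above might look roughly like this, assuming scikit-learn's `GradientBoostingRegressor`. The post does not name the library or the hyperparameters, and the data here is synthetic, so the scores it prints are not comparable to the reported 0.91 and 0.118.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in for the features and log(SalePrice).
rng = np.random.default_rng(42)
n_homes = 300
X = rng.uniform(0.0, 1.0, size=(n_homes, 3))
log_price = (
    12.0
    + 1.0 * X[:, 0]
    + 0.5 * np.sin(3.0 * X[:, 1])
    + rng.normal(scale=0.05, size=n_homes)
)

model = GradientBoostingRegressor(random_state=0)  # default hyperparameters
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Score each of the 5 folds on R^2 and on RMSE of the log target.
results = cross_validate(
    model, X, log_price, cv=cv, scoring=("r2", "neg_root_mean_squared_error")
)

mean_r2 = results["test_r2"].mean()
mean_rmse = -results["test_neg_root_mean_squared_error"].mean()  # flip the sign back
print(f"R^2: {mean_r2:.3f}, RMSE (log scale): {mean_rmse:.3f}")
```

Because the target is on a log scale, the RMSE measures relative rather than absolute pricing error.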

On further evaluation, our gradient boosted model did very well at predicting prices of lower-valued homes but was less effective at pricing more expensive properties. This makes sense given that the distribution of home prices in Ames is skewed toward the less expensive end of the housing market. In future work, we would further tune our existing model and try other tree-based boosting libraries that support parallel training, such as XGBoost or LightGBM.


Houses that satisfy our quality standards

One of the long-term goals of our project is to attract new hosts to offer their properties. To do that, we described houses that meet the quality expectations of the Airbnb brand by filtering out the houses in the dataset with features rated below average (Figure 2).
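A quality filter of this kind could be sketched as below, assuming pandas. The post does not list which features were checked, so the three columns used here are hypothetical picks from the Kaggle dataset, where the overall scores run from 1 to 10 with 5 as average and the quality codes use "TA" for typical/average.

```python
import pandas as pd

# Ordinal ranking for the dataset's quality codes (Po < Fa < TA < Gd < Ex).
QUALITY_RANK = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
AVERAGE_CODE_RANK = QUALITY_RANK["TA"]  # "TA" = typical/average
AVERAGE_SCORE = 5  # "average" on the 1-10 overall scales


def meets_quality_standards(df):
    """Keep rows whose checked features are all rated average or better."""
    mask = (
        (df["OverallQual"] >= AVERAGE_SCORE)
        & (df["OverallCond"] >= AVERAGE_SCORE)
        & (df["KitchenQual"].map(QUALITY_RANK) >= AVERAGE_CODE_RANK)
    )
    return df[mask]


# Toy data standing in for the Ames dataset.
homes = pd.DataFrame(
    {
        "OverallQual": [7, 4, 6, 8, 5],
        "OverallCond": [5, 6, 3, 7, 5],
        "KitchenQual": ["Gd", "TA", "Ex", "Fa", "TA"],
    }
)

eligible = meets_quality_standards(homes)
print(eligible)
```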

We also investigated how well each house would meet the needs of potential guests. First, we added relevant information for each house, such as the number of guests it can host and the number of bathrooms. We also reasoned that guests attending events such as Homecoming would benefit from knowing how far each house is from the University’s stadium, so we added a feature that indicates whether a house is within walking, biking, or driving distance of the stadium (Figure 3).
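One simple way to build such a feature is to bin the distance to the stadium. The cutoffs below (1 mile for walking, 3 miles for biking) and the function name are assumptions chosen for illustration; the post does not state the thresholds that were used.

```python
def travel_mode(distance_miles, walk_max=1.0, bike_max=3.0):
    """Label a distance as walking, biking, or driving (hypothetical cutoffs)."""
    if distance_miles < 0:
        raise ValueError("distance must be non-negative")
    if distance_miles <= walk_max:
        return "walking"
    if distance_miles <= bike_max:
        return "biking"
    return "driving"


# Example distances in miles from three hypothetical houses to the stadium.
for miles in (0.6, 2.4, 5.1):
    print(miles, travel_mode(miles))
```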

Figure 2. Houses that satisfy our quality standards in Ames, IA.


Projected versus current Airbnb rates

The non-linear model we described allows us to predict house values. But how does that tie into a daily-rate calculation? The house value plays a role in determining the daily rate that Airbnb hosts charge. One suggestion made by the company is to base the daily rate on the monthly mortgage payment. We calculated the monthly mortgage payment for each property using the average 30-year interest rate for the sale year.
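For reference, a monthly payment can be computed with the standard fixed-rate amortization formula, as sketched below. Treating the full predicted value as the loan principal and dividing the payment by 30 to get a daily rate are both assumptions; the post does not spell out either step, and the example value and interest rate are made up.

```python
def monthly_mortgage_payment(principal, annual_rate, years=30):
    """Fixed-rate monthly payment from the standard amortization formula."""
    n_payments = years * 12
    monthly_rate = annual_rate / 12
    if monthly_rate == 0:
        return principal / n_payments
    growth = (1 + monthly_rate) ** n_payments
    return principal * monthly_rate * growth / (growth - 1)


def daily_rate_from_mortgage(monthly_payment, days_per_month=30):
    """Spread a monthly payment evenly over the days of a month (an assumption)."""
    return monthly_payment / days_per_month


# Hypothetical example: a $200,000 predicted value at a 6% annual rate.
payment = monthly_mortgage_payment(200_000, 0.06)
print(f"monthly payment: ${payment:.2f}")
print(f"base daily rate: ${daily_rate_from_mortgage(payment):.2f}")
```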

When compared to actual daily rates listed by the company, our daily rates are lower. We expected this underestimation because we did not take property taxes and other house-specific factors into account. In addition, a boost factor that depends on lodging demand increases the rates. We observed that daily rates for Homecoming were 2 to 8 times those of a non-holiday weekend.

If all the residences we selected became hosts, Airbnb could increase its revenue by $4,259 to $17,035 from one Homecoming weekend, depending on the boost factor. On average, each host could earn $140 to $556.72, assuming a two-night stay.
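The per-host range is consistent with multiplying a base daily rate by the boost factor and the number of nights. The sketch below assumes that relationship; the base rate of $34.795 is not given in the post and is back-calculated here from the $556.72 figure at a boost factor of 8, so treat it as hypothetical.

```python
def host_weekend_earnings(base_daily_rate, boost_factor, nights=2):
    """Earnings for one stay: base rate scaled by the demand boost, per night."""
    if base_daily_rate < 0 or boost_factor <= 0 or nights < 0:
        raise ValueError("rate and nights must be non-negative, boost must be positive")
    return base_daily_rate * boost_factor * nights


base_rate = 34.795  # hypothetical, back-calculated from the reported upper bound

low = host_weekend_earnings(base_rate, boost_factor=2)
high = host_weekend_earnings(base_rate, boost_factor=8)
print(f"${low:.2f} to ${high:.2f} per host for a two-night stay")
```

Under these assumptions the lower bound comes out slightly under the reported $140, which suggests that figure may have been rounded.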

Figure 3. Predicted daily rates with information relevant to the guests.


A surprising insight from our analysis is that most of the homes that meet our quality standards can host 6+ guests (Figure 4). We think a good strategy would be to incentivize hosts to split the home into individually rented rooms. This could increase revenue by expanding availability for smaller groups.

Figure 4. Most houses can host 6+ guests.

Future Work

  • We found that only a small proportion of the houses in the dataset meet our quality standards. We would like to explore which renovations would best improve quality and make houses suitable for Airbnb with a small investment, further increasing rental supply, especially among smaller-capacity homes.
  • Our results indicate that implementing the Homecoming Experience Factor in Ames would be profitable for Airbnb and for the hosts. Thus, we could scale this Experience Factor to other college towns within the United States, making Airbnb the primary lodging choice for major events.



  1. Hongping Zhang et al. (2018). Place attachment and attendees’ experiences of homecoming event. Journal of Sport & Tourism 22(3), 227-246.
  2. Timothy Johns (2016). The Financial Impact of Homecoming.

