Data Analysis on Homes to Rent in College Towns

Posted on Mar 15, 2021

Github Repo

Gabriela's LinkedIn

Douglas's LinkedIn


Homecoming weekend draws university alumni and destination tourists into college towns. Attendees' experiences include event activities and programs, tours of the college town, and hospitality services (1). As an example of the economic forces behind Homecoming, 18,000 visitors occupied all 600 hotel rooms in the city that hosts a University of Arkansas campus (2). Similarly, homecoming attendees at Lane College in Jackson, TN, spent around $1 million at local hotels and restaurants over that single weekend.

In this project, we propose a new Experience Factor that applies to college towns and capitalizes on untapped revenue from the lodging associated with University events. Our goals were:

  1. Identify features that most affect housing value
  2. Predict housing values to tie them into the daily rate calculation
  3. Describe houses that meet the quality expectations of the Airbnb brand
  4. Use these predictions on identified houses to project potential revenue

To fulfill our goals, we focused on the college town of Ames, Iowa. The Ames, Iowa dataset from Kaggle was our source of information about the homes surrounding the Iowa State University campus.


Data Cleaning and Imputation

First, we needed to add location data (latitude and longitude) to our existing dataset. With these, we could calculate the distance to important landmarks such as downtown, the airport, and Jack Trice Stadium, the Iowa State University stadium that hosts most of the major football and alumni events.
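A distance feature like this can be sketched with the haversine formula. This is a minimal illustration, not the code used in the project: the function name, the stadium coordinates (approximate), and the example house coordinates (hypothetical) are all assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate coordinates of Jack Trice Stadium (assumption, for illustration only)
STADIUM = (42.0140, -93.6358)

# Hypothetical house location
house = (42.0347, -93.6200)
distance_to_stadium_km = haversine_km(*house, *STADIUM)
print(f"{distance_to_stadium_km:.2f} km to the stadium")
```

The same function can be applied to the downtown and airport coordinates to build one distance column per landmark.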

Once the datasets were merged, we had to decide how to deal with missing data. Based on prior knowledge of the dataset, we knew that some values recorded as missing were not truly missing. For example, a missing value for garage quality meant that the property did not have a garage. Therefore, we wrote a function that takes the columns whose missing values behave like garage quality and imputes a label indicating that the feature does not exist ('DNE').
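A minimal sketch of such a function with pandas is shown here. The column list and the helper name fill_absent are assumptions for illustration; the project's actual list of columns is not given in the post.

```python
import numpy as np
import pandas as pd

# Columns where NaN is assumed to mean "the house lacks this feature"
ABSENT_MEANS_NONE = ["GarageQual", "BsmtQual", "FireplaceQu"]


def fill_absent(df, columns, label="DNE"):
    """Return a copy of df where NaNs in the given columns become `label`."""
    out = df.copy()
    present = [c for c in columns if c in out.columns]  # skip columns not in df
    out[present] = out[present].fillna(label)
    return out


# Tiny hypothetical sample
homes = pd.DataFrame({
    "GarageQual": ["TA", np.nan, "Gd"],
    "BsmtQual": [np.nan, "TA", "Ex"],
    "LotFrontage": [65.0, np.nan, 80.0],
})
cleaned = fill_absent(homes, ABSENT_MEANS_NONE)
print(cleaned)
```

Columns outside the list, such as LotFrontage here, are left untouched so they can be imputed separately.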

For certain features, such as lot frontage or the year the garage was built (if there was a garage), we wanted to take a different approach than simply using the overall average. We posited that the neighborhoods in Ames were developed at different times and under different zoning requirements. If these assumptions were true, then homes in the same neighborhood would have similar characteristics and would provide more accurate values for filling in missing data. For numerical features that we believed were independent of the neighborhood, we filled missing values with the mean, while for categorical features we imputed missing values with the mode.
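The neighborhood-based fill could look roughly like this in pandas. The post does not say which neighborhood statistic was used, so the median here is an assumption, as are the helper names and the sample data.

```python
import numpy as np
import pandas as pd


def impute_by_neighborhood(df, column, group_col="Neighborhood"):
    """Fill NaNs with the neighborhood median, falling back to the overall median."""
    group_median = df.groupby(group_col)[column].transform("median")
    return df[column].fillna(group_median).fillna(df[column].median())


def impute_mode(series):
    """Fill NaNs in a categorical column with its most frequent value."""
    return series.fillna(series.mode().iloc[0])


# Tiny hypothetical sample
homes = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "Gilbert", "Gilbert"],
    "LotFrontage": [60.0, 80.0, np.nan, 100.0, np.nan],
    "Electrical": ["SBrkr", "SBrkr", np.nan, "FuseA", "SBrkr"],
})
filled_frontage = impute_by_neighborhood(homes, "LotFrontage")
filled_electrical = impute_mode(homes["Electrical"])
print(filled_frontage.tolist())
print(filled_electrical.tolist())
```

The fallback to the overall median covers neighborhoods where every value of the column is missing.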


Data on Feature Selection

Originally, we used a linear regression model because we wanted a level of interpretability of home prices that would be less available in other machine learning models. However, we already had a large number of predictors, and adding more by encoding categorical features would cause our model to overfit. Therefore, we used Lasso regression for feature selection and eliminated 13 predictors from our model.

We selected the Lasso penalty over Ridge because we wanted the regression coefficients of less important predictors to shrink to exactly zero. After feature selection, we ran a linear regression and found that the most important features in predicting sale price were overall home quality, overall home condition, above-ground living area in square feet, and how new the home was (Figure 1).
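The select-then-refit workflow can be sketched with scikit-learn on synthetic data. Everything below is illustrative: the data is simulated, the feature names are placeholders, and the penalty strength alpha=0.05 is an assumed value that would normally be tuned by cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
feature_names = ["OverallQual", "GrLivArea", "noise_1", "noise_2", "noise_3", "noise_4"]

# Synthetic stand-in for log(SalePrice): only the first two features matter
X = rng.normal(size=(300, len(feature_names)))
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.05, size=300)

# Standardize so the L1 penalty treats all features on the same scale
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.05))
lasso.fit(X, y)

# Features whose coefficient was shrunk to exactly zero are dropped
coefs = lasso.named_steps["lasso"].coef_
kept = np.flatnonzero(coefs != 0)
selected = [feature_names[i] for i in kept]
print("selected:", selected)

# Refit an unpenalized linear regression on the surviving features
ols = LinearRegression().fit(X[:, kept], y)
print("refit coefficients:", dict(zip(selected, ols.coef_.round(3))))
```

Refitting without the penalty removes the shrinkage that Lasso applies to the surviving coefficients, which makes them easier to interpret.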

Figure 1. Standardized coefficients between features and log(SalePrice)


Data on the Gradient boosted model

With our linear regression model, we identified the features with the most influence on housing price. However, two outliers in the data reduced the predictive power of our model. This was especially detrimental to our goals because our estimate of the daily home-sharing rate depends on the value of the home. We calculated that every $10,000 by which our model underpredicts the true sale price amounts to an additional 20 cents per night of lost revenue. To minimize our prediction error, we decided to implement a non-linear, tree-based model. Using gradient boosting and 5-fold cross-validation, our model achieved an R^2 of 0.91 and an RMSE of 0.118 on the log-transformed sale price.
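As a rough sketch of this evaluation setup, the snippet below runs 5-fold cross-validation on a gradient boosting regressor with scikit-learn. The data is synthetic and the hyperparameters are assumed values, so the scores it prints are unrelated to the results reported above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_validate

rng = np.random.default_rng(0)

# Synthetic, non-linear stand-in for the features and log(SalePrice)
X = rng.uniform(size=(400, 4))
y = 12 + np.sin(3 * X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.05, size=400)

# Hyperparameters are illustrative assumptions, not the project's tuned values
gbr = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.05, max_depth=3, random_state=0
)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    gbr, X, y, cv=cv, scoring=("r2", "neg_root_mean_squared_error")
)

mean_r2 = scores["test_r2"].mean()
# scikit-learn reports RMSE negated so that higher is always better
mean_rmse = -scores["test_neg_root_mean_squared_error"].mean()
print(f"R^2: {mean_r2:.3f}, RMSE (log scale): {mean_rmse:.3f}")
```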

On further evaluation, our gradient boosted model predicted the prices of lower-valued homes very well but was less effective at pricing more expensive properties. This makes sense given that the distribution of home prices in Ames is skewed toward the less expensive side of the housing market. In future work, we would further tune our existing model and try other parallelized tree-based models such as XGBoost or LightGBM.


Houses that satisfy our quality standards

One of the long-term goals of our project is to attract new hosts to offer their properties. To do that, we described houses that meet the quality expectations of the Airbnb brand. So, we filtered out the houses in the dataset with features that rated below average (Figure 2).

We also investigated aspects that matter to potential guests and could improve their experience. First, we added relevant information about the houses, such as the number of guests they can host and the number of bathrooms. We also reasoned that guests attending events such as Homecoming would benefit from knowing how far the houses are from the University's stadium. So we added a feature that indicates whether each house is within walking, biking, or driving distance of the stadium (Figure 3).
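A feature like this amounts to bucketing the distance to the stadium. The sketch below shows one way to do it; the cut-offs of 1.5 km for walking and 5 km for biking are assumptions chosen for illustration, since the post does not state the thresholds used.

```python
def travel_mode(distance_km, walk_max_km=1.5, bike_max_km=5.0):
    """Label a distance to the stadium as walking, biking, or driving.

    The default thresholds are illustrative assumptions.
    """
    if distance_km < 0:
        raise ValueError("distance must be non-negative")
    if distance_km <= walk_max_km:
        return "walking"
    if distance_km <= bike_max_km:
        return "biking"
    return "driving"


# Hypothetical distances in kilometres
for d in (0.8, 3.2, 9.5):
    print(d, "km ->", travel_mode(d))
```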

Figure 2. Houses that satisfy our quality standards in Ames, IA.


Projected versus current Airbnb rates Data

The non-linear model we described allows us to predict the values of the houses. But how does that tie into a daily-rate calculation? The house value plays a role in determining the daily rate of Airbnb hosts. A suggestion made by the company is to base the daily rate on your monthly mortgage. We calculated the monthly mortgage of each property, using the average 30-year interest-rate for the sale year.
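The monthly payment on a fixed-rate mortgage follows the standard amortization formula, M = P * r * (1 + r)^n / ((1 + r)^n - 1), where P is the principal, r the monthly interest rate, and n the number of payments. The sketch below applies it and then converts the result to a daily rate. Financing the full predicted value with no down payment and dividing the monthly payment by 30 are both assumptions made for illustration; the post does not give the exact conversion the authors used.

```python
def monthly_payment(principal, annual_rate, years=30):
    """Fixed-rate mortgage payment from the standard amortization formula."""
    n = years * 12            # number of monthly payments
    r = annual_rate / 12      # monthly interest rate
    if r == 0:
        return principal / n  # no interest: repay the principal evenly
    growth = (1 + r) ** n
    return principal * r * growth / (growth - 1)


def daily_rate(predicted_value, annual_rate, days_per_month=30):
    """Assumed conversion: spread the monthly payment evenly over the month."""
    return monthly_payment(predicted_value, annual_rate) / days_per_month


# Hypothetical inputs: a predicted home value and an average 30-year rate
payment = monthly_payment(100_000, 0.06)
print(f"monthly payment: ${payment:.2f}")
print(f"daily rate: ${daily_rate(100_000, 0.06):.2f}")
```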

Compared with the actual daily rates listed by the company, our daily rates are lower. We expected this underestimation since we did not take property taxes and other house-specific factors into account. In addition, a boost factor that depends on lodging demand increases the rates. We observed that daily rates for Homecoming were 2 to 8 times those of a non-holiday weekend.

If all the residences we selected became hosts, Airbnb could increase its revenue by $4,259 to $17,035 from one Homecoming weekend, depending on the boost factor of prices. On average, each host could earn $140 to $556.72, assuming a two-night stay.
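The per-host range can be read as the base daily rate multiplied by the boost factor and the number of nights. Here is a minimal sketch of that arithmetic; the base rate of $35 per night is a hypothetical round number, not a figure taken from the analysis.

```python
def host_earnings(base_daily_rate, boost, nights=2):
    """Earnings for one stay: base rate scaled by the demand boost factor."""
    return base_daily_rate * boost * nights


BASE_RATE = 35.0  # hypothetical base daily rate in dollars

low = host_earnings(BASE_RATE, boost=2)   # lower end of the observed boost range
high = host_earnings(BASE_RATE, boost=8)  # upper end of the observed boost range
print(f"two-night earnings: ${low:.2f} to ${high:.2f}")
```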

Figure 3. Predicted daily rates with information relevant to the guests.


A surprising insight derived from our analysis is that most of the homes that meet our quality standards can host 6+ guests (Figure 4). We think that a good strategy would be to incentivize hosts to split the home into individually rented rooms. This way, there would be a potential revenue increase by expanding availability for smaller guest sizes.

Figure 4. Most houses can host 6+ guests.


Future Work

  • We described that only a small proportion of the houses in the dataset meet our quality standards. We would like to explore which renovations would best improve quality and make houses suitable for Airbnb with a small investment to further increase rental supply, especially for smaller capacity homes.
  • Our results indicate that implementing the Homecoming Experience factor in Ames would be profitable for Airbnb and for the hosts. Thus, we could scale this experience factor to include other college towns within the United States, making Airbnb the primary lodging choice for major events.



  1. Hongping Zhang et al. (2018). Place attachment and attendees' experiences of homecoming event. Journal of Sport & Tourism 22(3), 227-246.
  2. Timothy Johns (2016). The Financial Impact of Homecoming.

About Authors

Douglas Pizac

After spending over six years working in academic research, I am transitioning into the exciting world of data science. When I am not trying to unveil new data-driven insights, I enjoy traveling, exercise, and spending time with friends...

Gabriela Huelgas Morales

I am a Data Scientist with a Ph.D. in Biomedical Sciences. I enjoy the challenges of solving complex problems, finding meaningful relationships within the data, and providing actionable recommendations and insights. Before joining NYCDSA, I was a scientist...
