Using Data Science to Analyze Airbnb Rental Market in Boston
Background
Since its inception in 2008, Airbnb has grown into one of the largest names in hospitality, paving the way as the first major peer-to-peer accommodation platform. Born from the recognition of a shortage of temporary housing options, the platform serves not only those looking for accommodations, but also those with vacant rooms, apartments, or entire homes looking to make an easy buck. Fast forward 10+ years, and the company has solidified itself as one of the first places travelers check when visiting new places, thanks to its enhanced flexibility, convenience, and the unique experiences available.
Airbnb Listing Prices
Unlike traditional hospitality avenues (hotels, hostels, etc.), Airbnb listing prices are more than meets the eye: they both closely reflect local real estate tendencies and directly influence those prices. Across the many studies conducted on Airbnb's effect on regional real estate trajectories, the positive correlation between the number of listings and rental/home prices is well established, with one study reporting a 0.018% increase in rental prices and a 0.026% increase in home prices for every 1% increase in total listings (1).
With a narrowed focus on Airbnb in Boston, MA, the aim of this analysis is threefold. First, I explore surface-level insights regarding the quantity of listings, pricing trends, and guest traffic across the city's neighborhoods. Second, I validate the relationship between rental/housing prices and the number of listings within the timeframe of the Airbnb dataset used. Third, using a predictive model, I develop a better understanding of the listing features that contribute most to listing price.
Data Used
For this analysis, I used a dataset provided on Kaggle comprising Airbnb listing data from September 2016 through August 2017. Over the yearlong collection period, data was sourced from 3,585 different listings across three tables: listings (general listing attributes), calendar (availability of each listing), and reviews (detailed guest reviews in paragraph format). For this project, I primarily used the listings table, with limited reference to calendar and no usage of reviews.
1. Exploratory Data Analysis (EDA)
Mean Listing Price vs. Time: How do listing prices fluctuate over the course of the year?
In order to better understand the behavior of listing prices over the course of the year, I first needed to prepare the calendar table. The calendar data consists of four attributes: listing_id, date, available (true/false), and price. Each row represents one listing's status for one day of the year, along with its price and availability (total rows = number of listings × 365).
An immediate hurdle identified when attempting to plot the mean daily listing prices was that listing prices were not included when the listing was booked (seen in the available column). To impute those null prices, a function was applied to the table to fill null values with the price from the day immediately preceding the booked period.
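Below is a minimal sketch of that imputation with dplyr/tidyr, assuming the calendar table has been read into a data frame named calendar (the dollar-string price parsing is an assumption about the raw format):

```r
library(dplyr)
library(tidyr)

# Parse "$1,250.00"-style price strings to numeric, then, within each listing,
# carry the last known available price forward over booked (null-price) days.
calendar <- calendar %>%
  mutate(price = as.numeric(gsub("[$,]", "", price))) %>%
  arrange(listing_id, date) %>%
  group_by(listing_id) %>%
  fill(price, .direction = "down") %>%
  ungroup()
```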

Looking over the daily mean listing prices across the year, a couple of interesting takeaways stand out:
- Clear seasonal trends exist, with fall and summer seeing heightened prices, most likely due to increased tourist activity during those months (who wants to come to New England during the winter months?!)
- On a weekly basis, average pricing noticeably increases heading into the weekend compared to during the week
- Slight disruptions to the larger trends appear during holiday breaks (Thanksgiving, Christmas), along with a large spike for April break
Seasonal & Weekly Trends
A closer look at mean prices relative to seasons, holiday breaks, and weekday vs. weekend reinforces the observations above. Looking at mean listing prices per season (Figure 1.2), Summer ($197.55) and Fall ($194.62) are the clear leaders, followed by Spring ($191.06) and then Winter ($179.94). Similarly, mean listing prices over the weekend are 1.4% higher than those during weekdays (Figure 1.3).
Figure 1.2
From Figure 1.2 above, we notice that (1) seasonality affects listing prices and (2) weekend listing prices are reliably more expensive than weekday prices regardless of the season. Results of a two-way ANOVA reinforced those observations, with significant p-values for both season and day_type and an insignificant interaction term (a minimal sketch of the test follows the list below). From the test, we can confidently say...
- The seasonal means are not all equal, and neither are the day type means
- There is no significant interaction between the season and day type variables
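As a sketch, assuming a hypothetical data frame daily_means holding one mean price per day tagged with season and day_type factors, the test looks like:

```r
# Two-way ANOVA: do mean prices differ by season and by day type,
# and do the two factors interact?
fit <- aov(price ~ season * day_type, data = daily_means)
summary(fit)  # significant main effects, insignificant season:day_type term
```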
Seasonal & Holiday Trends
To better understand the impact of various holiday periods throughout the year, I referenced a 2016-2017 school calendar and calculated mean listing prices during those periods (Christmas, Thanksgiving, Spring break, Columbus Day weekend, etc.). Because seasonality has such a significant effect on pricing, simply comparing holiday central tendencies wouldn't paint the complete picture. In the figure below (Figure 1.4), I've grouped mean holiday listing prices with the mean prices of the season in which each holiday falls (Spring break with Spring, Thanksgiving with Fall, Christmas with Winter, etc.).
As depicted in the time series in Figure 1.1, Spring break prices soar roughly 5% above springtime averages. Other holidays with mean values well above their seasonal averages include Columbus Day weekend (~4%), Christmas break (~2%), and Memorial Day weekend (~1.5%). Interestingly, Thanksgiving break falls on the opposite end of the spectrum, showing the biggest discrepancy in the other direction: the seasonal average is almost 4% higher than that of the holiday period.
For a budget-conscious tourist interested in visiting Boston before the temperatures drop below freezing, a trip in mid-to-late November is the way to go!
Figure 1.3
Listing Trends by Neighborhood: Which neighborhoods have the highest number of listings?
Moving from the calendar table to the listings table, which contains additional information regarding location, stay details, listing descriptions, and so forth, I was happy to find that the majority of null values existed in columns that wouldn't be useful for the analysis. After omitting the few listings (~10) without an assigned neighborhood, I started by getting a better sense of the number and distribution of listings by neighborhood.
Figure 1.4
Figure 1.5
Figure 1.5 shows the sheer count of listings per neighborhood and, unsurprisingly, the largest neighborhoods find their way to the top of the list. To control for varying neighborhood size, I imported data from Analyze Boston containing the total number of housing units per neighborhood, which let me compute the ratio of Airbnb listings to total housing units for each neighborhood (Figure 1.6). With this adjustment, we notice a slight reordering, with Allston and Beacon Hill taking the top two spots; each has more than 3% of its housing units listed on Airbnb.
With the help of the longitude and latitude coordinates, I was able to create a density plot, with red marking the highest-density areas and blue the lowest.
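A minimal ggplot2 sketch of such a plot, assuming the listings data frame carries longitude and latitude columns:

```r
library(ggplot2)

# 2D kernel density of listing coordinates; red = densest areas, blue = sparsest.
ggplot(listings, aes(x = longitude, y = latitude)) +
  stat_density_2d(aes(fill = after_stat(level)), geom = "polygon", alpha = 0.6) +
  scale_fill_gradient(low = "blue", high = "red") +
  coord_quickmap()
```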

Listing Trends by Neighborhood: Which neighborhoods have the most and least expensive listings?
While the number and density of listings can be valuable insights for current and potential hosts, having a handle on listing prices across the city appeals to both hosts and potential guests. I first wanted to visualize the distribution of all listing prices in order to gauge outlier values as well as skewness.
Figure 1.8
Figure 1.9
As both distributions make very apparent, we are dealing with a non-normal, clearly right-skewed shape. By the standard 1.5 × IQR definition, the upper outlier threshold is $425, but applying it would omit more than 100 listings from the original list. As a compromise, for the rest of the price analysis I set the outlier threshold at $600, which discounts roughly 50 of the 3,585 original listings.
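For reference, a sketch of that threshold arithmetic, assuming prices are already numeric and that the $425 figure comes from the conventional boxplot fence:

```r
# Conventional boxplot fence: Q3 + 1.5 * IQR (~$425 for this data).
q <- quantile(listings$price, c(0.25, 0.75), na.rm = TRUE)
upper_fence <- q[2] + 1.5 * (q[2] - q[1])

# Compromise threshold actually used, dropping ~50 of the 3,585 listings.
listings_trimmed <- subset(listings, price <= 600)
```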
Looking at the central tendencies (mean, median) of listing prices per neighborhood (considering only neighborhoods with at least 30 listings), we find that South Boston Waterfront, Downtown, Chinatown, Back Bay, and West End make up the five most expensive neighborhoods, while Allston, West Roxbury, Roslindale, Dorchester, and Hyde Park round out the cheapest areas per night.
Figure 1.10
Figure 1.11
The idea of 'value' found in one of the review-related features (review_scores_value) led me to ask the same question relative to listing price and neighborhood: what would make a listing a better value in comparison to others? To answer it, I imported mean rental prices for each neighborhood and compared them to the equivalent mean listing prices. For this exercise, value is defined as a lower listing price relative to the neighborhood's average rent.
Figure 1.12
From Figure 1.12, neighborhoods below the best-fit line (created from a simple linear model) represent the best value. While neighborhoods such as the North End, the South End, Beacon Hill, and Fenway didn't crack the top five most expensive listings, options in these neighborhoods could offer guests the best bang for their buck.
Neighborhood | Avg. Rent ($) | Mean Listing Price ($) | Ratio (Listing Price / Avg. Rent) |
North End | 4,141 | 195.67 | 0.04725388 |
South End | 3,683 | 196.17 | 0.05326471 |
Fenway | 3,569 | 192.97 | 0.05406984 |
Beacon Hill | 3,484 | 199.48 | 0.05725793 |
Chinatown | 4,003 | 232.35 | 0.05804449 |
West End | 3,566 | 209.59 | 0.05877505 |
Back Bay | 3,919 | 232.04 | 0.05920905 |
Downtown | 3,817 | 236.45 | 0.06194899 |
Leather District | 4,003 | 253.60 | 0.06335249 |
Bay Village | 4,048 | 266.83 | 0.06591733 |
Listing Trends by Neighborhood: Which neighborhoods are the most widely booked areas in Boston?
As the final portion of the initial exploratory analysis, I thought it would be interesting to get a feel for the amount of guest traffic each neighborhood in the city tracked. With a baseline idea of how the neighborhoods stack up against one another on listing count and price, would one of those factors correlate more strongly with actual reservations than the other? If there's a discrepancy between the sheer count of listings, price, and the amount of guest traffic, we would know that other factors play into the decision-making process (proximity to tourist attractions, quality of listing, neighborhood characteristics, etc.).
An Indicator of Area Popularity
Using the number of reviews each listing receives as an indicator of area popularity, we can rank the most and least booked areas. The one major flaw in this approach is that not all guests leave reviews after their stay. For the sake of this analysis, I assume that the ratio of guests who leave reviews to those who don't is consistent across all neighborhoods, allowing for an accurate depiction of relative booking frequency. Grouping by neighborhood once again, I summed all the monthly reviews and divided the total by the number of listings per neighborhood, as sketched below; the resulting neighborhood averages per month are charted in Figure 1.13, along with an accompanying heat map (Figure 1.14).
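A minimal dplyr sketch of that grouping; the column names (neighbourhood, reviews_per_month) are assumed to follow the Kaggle listings table:

```r
library(dplyr)

# Sum monthly reviews per neighborhood, then normalize by listing count.
traffic <- listings %>%
  group_by(neighbourhood) %>%
  summarise(total_monthly_reviews = sum(reviews_per_month, na.rm = TRUE),
            n_listings = n()) %>%
  mutate(avg_monthly_reviews = total_monthly_reviews / n_listings) %>%
  arrange(desc(avg_monthly_reviews))
```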
Figure 1.13
Figure 1.14
I was quite surprised to see East Boston's average running 49% higher than the next highest review average, especially considering the neighborhood's rank in listing count (11th out of 26). With moderate pricing, a low ratio of Airbnb listings to total neighborhood housing units, and high levels of guest traffic, East Boston would be a clear favorite in the eyes of a prospective host searching for an optimal location.
After a closer look at the reviews ranking, it's worth mentioning that lower-cost listings have a noticeably higher review count (4 of the 5 least expensive neighborhoods sit in the top half). One explanation could be that reviews complaining about various deficiencies are more likely to be left for lower-quality listings than for higher-quality apartments in more expensive neighborhoods. If that's the case, this strategy for gauging popularity is invalidated, as more reviews would be left for listings with negative qualities despite equal traffic elsewhere.
2. Impact of Airbnb Listing Count on Rental Trends
As mentioned in the intro, when doing my initial search on the topic of Airbnb listings, I continually came across published articles citing the relationship between the total number of Airbnb listings and long-term rental prices. A positive correlation between the number of listings and average rental prices makes sense in the context of supply and demand: the more short-term home-sharing options there are, the smaller the supply of traditional long-term leases.
Assuming the population of those looking for long-term lease agreements remains constant (or increases), landlords are able to raise rental prices. While most of the studies I found demonstrate the correlation over a longer stretch of time (4+ years), I was curious whether the effect existed on a smaller time scale, namely between September 2016 and September 2017, the date range of the dataset used.
Average Rental Prices
Thanks to Zillow, I was able to export average rental prices for the time period of interest. The rental data was split by apartment type (studio, 1-bedroom, 2-bedroom, etc.), and to make the appropriate comparisons, I grouped the Airbnb listing data by apartment type as well. Below (Figures 2.1 & 2.2) are the time series plots of total listing count and average rent by month and apartment type:
Figure 2.1
Figure 2.2
The Pearson correlation coefficients between listing count and average rent by apartment type reflect the inconsistent relationship suggested by a quick look at the plots.
Pearson Correlation Coefficients:
Studio | 1-Bedroom | 2-Bedroom | 3-Bedroom |
-0.18 | 0.78 | -0.64 | 0.03 |
Grouped by apartment type, it's apparent that a clear relationship between the two doesn't exist. Interestingly, by summing the Airbnb listing counts and average rent indices across all apartment types by month, a more defined relationship emerges.
Figure 2.3
Figure 2.4
With a Pearson coefficient of 0.79 between the number of listings and average rent once the apartment-type classification is ignored, the next question to ask is: which influences the other? From the two plots, we notice two significant drops: the number of listings trends downward starting in 12/16, and average rent falls significantly in 01/17. Using the cross-correlation function (ccf in R), we can tease out correlation coefficients between the two time series at different lags.
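A sketch of the call, assuming a hypothetical monthly data frame with one row per month holding total_listings and avg_rent:

```r
# Cross-correlation between the two monthly series. With plain numeric
# vectors, each lag step is one month; ccf(x, y) pairs x[t + k] with y[t],
# so a peak at lag -1 means listings lead rent by one month.
ccf(monthly$total_listings, monthly$avg_rent, lag.max = 6,
    main = "Total Listings vs. Average Rent")
```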
Figure 2.5
Keeping the average rent values fixed and generating correlation coefficients from lagged/leading total listing values gives a better idea of the causal interval. From the ccf plot above, we observe...
Lag: | -2 | -1 | 0 | 1 | 2 |
Correlation: | 0.71 | 0.82 | 0.80 | 0.64 | 0.44 |
The number of lags was specified such that each lag represents a month. For example, if we shift the total listings count back by one month (lag -1) and calculate the correlation coefficient with the unaltered average rent time series, the coefficient is 0.82. The fact that a lag of -1 has the highest correlation suggests that trends in the total number of Airbnb listings are most strongly correlated with average rent costs one month later.
I have to be careful with my word selection since, as many people know, correlation does not imply causation. There is a multitude of potentially stronger factors contributing to average rent activity (macro real estate trends, regional economic activity, changes in school systems and other public services, etc.); the number of Airbnb listings could be another. Still, seeing that positive relationship between the two variables on a much smaller scale reinforces the relationship analyzed in many other, more expansive studies.
3. Top Listing Features that Influence Airbnb Price
With the increasing popularity of short-term leasing as a form of passive, secondary income, understanding the attributes of a listing which most notably influence listing price is a major advantage for current and potential hosts. For those interested in purchasing a property for the sole purpose of renting, important considerations would include listing location (neighborhood, city, proximity to tourist locations), property type (apartment, entire homes, B&B, etc.), type of room (entire, shared, private), and so on.
For those hosts already posting listings, it would be very helpful to know which amenities are most highly sought after, whether to allow extra guests (and if so, how many), and how to present the listing on the Airbnb platform (host verification, host profile picture, etc.). Through the use of predictive ML models, the last section of my research aims to answer the question: which Airbnb listing features have the largest impact on listing price? While there are many ways to approach this problem, I decided to create a random forest model and examine the predictor variables' feature importances to extract business context.
Data Preparation - Feature Selection
Before running the first model, there was a significant amount of prep work to be done. Starting with the original list of 95 variables in the listings data frame, the first step was to eliminate columns that wouldn't provide any insight into listing price (the target variable), such as unique identifiers, listing descriptions, and host names, as well as columns with a large proportion of missing values (listing square footage: 95% missing). With a collection of relevant columns for the model, it was then time to tackle missing values in select columns.
Data Imputation
In addressing missing values, I did consider a tree-based imputation algorithm, but on a second pass through the columns with missing data, the majority of missing values could be classified as structurally missing, meaning they were missing for a logical reason. For instance, observations in the security_deposit column were left blank when a deposit was not required for that particular listing. Similarly, missing values in the bedrooms column occurred when the listing was a studio apartment, in which case 0 was imputed.
Missing Data Pattern
Missing values in the review-related columns presented an interesting scenario: while missing values in the total number of reviews column simply translated to 0 total reviews, the other columns, which rank specific characteristics of the apartment, needed more thought. Simply imputing a score of 0 for listing cleanliness, host communication, overall value, and so on would unfairly punish a listing for not having any reviews. Furthermore, the other columns in the dataset wouldn't help in hypothesizing user reviews on, say, host communication. Treating these as missing completely at random (MCAR), I imputed the median value of each respective column.
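A condensed sketch of both imputation rules (column names are assumed to follow the listings table):

```r
library(dplyr)

listings <- listings %>%
  mutate(
    # Structurally missing: no deposit required / studio listing -> 0
    security_deposit = coalesce(security_deposit, 0),
    bedrooms         = coalesce(bedrooms, 0)
  ) %>%
  # Review sub-scores treated as MCAR: fill with each column's median
  mutate(across(starts_with("review_scores_"),
                ~ coalesce(.x, median(.x, na.rm = TRUE))))
```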
Data Preparation - Feature Engineering
After reducing the original dataset from 95 to roughly 40 total variables, I made a point to bring external data into consideration. Pairing my knowledge of Airbnb and the primary demographic it serves (tourists) with a desire to incorporate supplemental geographical data into the model, I was curious to tease out the possible influence of popular tourist locations on surrounding listing prices. For instance, would Airbnbs near Fenway Park have noticeably higher nightly pricing? To investigate, I used the longitude and latitude coordinates provided to calculate the distance from each listing to 16 popular tourist attractions in Boston: Boston Common, the Museum of Fine Arts, the Tea Party museum, and Old North Church, to name a few.
Averaging the Tourist Location Distances
For each listing, I then took the average distance across all the tourist locations (sketched below). Knowing that I would be using a tree-based model down the road, it wasn't necessary to delete each individual destination distance, as random forests aren't as sensitive to multicollinearity (to an extent). In addition to the tourist location distances, I also included a column holding the total number of amenities each listing provides (summed across the dummy amenity columns). Keeping the individual tourist destination distances as well as each dummy amenity column, despite being potentially damaging to the model, was mainly motivated by the underlying objective of the model: each of those columns carries a feature importance.
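A sketch of the distance engineering with the geosphere package; the attraction list is truncated, and the coordinates are approximate, illustrative values:

```r
library(geosphere)

# A few of the 16 attractions as (lon, lat) pairs.
attractions <- list(
  boston_common    = c(-71.0656, 42.3550),
  fenway_park      = c(-71.0972, 42.3467),
  old_north_church = c(-71.0544, 42.3663)
)

# Haversine distance (meters) from every listing to each attraction,
# then the per-listing average across all attractions.
for (name in names(attractions)) {
  listings[[paste0("dist_", name)]] <-
    distHaversine(cbind(listings$longitude, listings$latitude),
                  attractions[[name]])
}
listings$avg_tourist_dest_dist <-
  rowMeans(listings[, paste0("dist_", names(attractions))])
```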
Feature importance
Knowing each individual feature importance lets us differentiate one tourist location from the next, or one amenity from another, which proves to be valuable information in a business context.
With no missing values remaining, all relevant columns accounted for, and the new features created, the last step in the data preparation process was to create dummy variables for each of the categorical features and assign numerical values to the binary columns.
Model Selection
Moving from data prep to the modeling portion of the project, it was very clear which type of model I was going to use. With a large number of predictor variables (163) comprising both numerical (41) and categorical (122) types, I was looking for a model type that handles both kinds of variables and, more importantly, has a lower sensitivity to multicollinearity. Additionally, I was well aware that the data contains both linear and non-linear relationships, as well as notable outliers in the target variable (see Figures 1.8 & 1.9).
The fact that random forests handle linear and non-linear relationships, are generally robust to outliers (more to come here), work in regression settings, and implicitly temper multicollinearity through bootstrapping and random feature subsets made model selection an easy decision. That said, I would need to be very conscious of the bias-variance tradeoff, as RFs are known to overfit training data. Also, with feature importance being the primary objective of the model, it would be much more difficult to gather further insight into the decision-making behind the feature ranking, as random forests are more of a 'black box' relative to other models.
Initial Model Results
If there's one thing I've gathered from researching machine learning model types, it's that every blanket statement is met with a caveat: it depends on your data. The first batch of models was quickly run to test the generally accepted characteristics of random forests touched on above. Knowing that total_amenities and avg_tourist_dest_dist would be highly correlated with the individual amenity dummy columns and individual tourist destination distances, I wanted to confirm whether they would affect model performance if both the totals/averages and the individual columns were included. Similarly, with a considerable number of extreme outliers in the target variable, would RF performance change if they were left in?
Baseline Models
In response, I ran six separate baseline RF models, with and without outliers, for various collections of predictor variables (a minimal sketch of one baseline run follows the list):
- All features - all 163 predictor variables
- No Totals / Averages - all features besides avg_tourist_dest_dist and total_amenities (161 total features)
- Only Totals / Averages - all features without individual amenity dummy columns or individual tourist destination distances (replaced by only avg_tourist_dest_dist and total_amenities)
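The sketch below shows one such baseline with ranger; the data frame name model_df and the 80/20 split are assumptions:

```r
library(ranger)

set.seed(42)
train_idx <- sample(nrow(model_df), 0.8 * nrow(model_df))

# Baseline random forest (near-default hyper-parameters) for one scenario.
rf_base <- ranger(price ~ ., data = model_df[train_idx, ], num.trees = 500)

preds     <- predict(rf_base, model_df[-train_idx, ])$predictions
test_rmse <- sqrt(mean((preds - model_df$price[-train_idx])^2))
```

The results of all six scenarios are summarized below: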
Feature Set | Outliers Included | No Outliers (<= $600) |
All Features | Scenario 1: R2 = 0.358, Train RMSE = 125.19, Test RMSE = 67.20 | Scenario 4: R2 = 0.702, Train RMSE = 55.25, Test RMSE = 57.07 |
No Totals / Averages (all dummy amenities, all individual distances) | Scenario 2: R2 = 0.358, Train RMSE = 125.04, Test RMSE = 67.20 | Scenario 5: R2 = 0.702, Train RMSE = 55.21, Test RMSE = 56.77 |
Only Totals / Averages (no dummy amenities, no individual distances) | Scenario 3: R2 = 0.379, Train RMSE = 122.89, Test RMSE = 70.64 | Scenario 6: R2 = 0.706, Train RMSE = 54.87, Test RMSE = 57.31 |
Across the six models, the results didn't vary much with changes in the feature space; outliers, however, clearly hurt model performance when included. Moving forward, I decided to continue with all features and outliers removed (Scenario 4), as performance looks to be negligibly affected by the presence of all features.
Final Model Tuning
Due to the higher dimensionality of the data, I used the ranger and caret packages in R to tune the baseline model in hopes of decreased error (RMSE) and higher accuracy (R2). The RF hyper-parameters contributing to optimal model performance include (not an exhaustive list):
- ntree: number of trees in the forest
- mtry: the number of variables considered at each node split
- node_size: minimum size of subsample before node split (the smaller the node_size, the more splits)
- max_depth: maximum depth of each tree in the forest (the higher the max_depth, the more splits)
- sample_size: size of each bootstrapped sample to train each individual tree
Grid Search
By performing a grid search over varying values of each hyper-parameter, we can extract the top-performing values to then use in our final RF (sketched below). I'd like to note that grid searches can be extremely computationally expensive, as they try every possible hyper-parameter combination. For instance, if you were to manipulate only three of the parameters listed above with 4 options each, the algorithm would have to evaluate 4^3 = 64 different model variations. In my case, I evaluated a total of 270 different combinations (not a fast process).
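A hedged sketch of the loop (the candidate values below are illustrative, not the actual 270-combination grid, and train_df is a hypothetical name):

```r
library(ranger)

grid <- expand.grid(mtry        = c(13, 26, 41),
                    node_size   = c(3, 5, 9),
                    sample_size = c(0.632, 0.8, 1.0))

grid$oob_rmse <- NA
for (i in seq_len(nrow(grid))) {
  fit <- ranger(price ~ ., data = train_df,
                num.trees       = 500,
                mtry            = grid$mtry[i],
                min.node.size   = grid$node_size[i],
                sample.fraction = grid$sample_size[i],
                seed            = 42)
  grid$oob_rmse[i] <- sqrt(fit$prediction.error)  # OOB MSE -> RMSE
}
grid[which.min(grid$oob_rmse), ]  # best-performing combination
```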
Optimized Model Hyper-parameters:
- ntree = 460
- mtry = 26
- node_size = 3
- sample_size = 1 (100%)
In the baseline models run prior to tuning, I 'cheated' by evaluating against the holdout test set. In a more formal context, the test set would only be evaluated once the model has been fully tuned, and prior versions of the model would be evaluated using the OOB (out-of-bag) sample. The hyper-parameters listed above were chosen because they best minimized the OOB sample RMSE. To get a better sense of the OOB RMSE distribution, I ran the optimized model (optimized_ranger) 100 times and plotted the results (Figure 3.1).

Final Model Performance Metrics:
- RMSE (OOB) = 56.19
- RMSE (Test) = 56.58
- R2 (OOB) = 0.69
Using optimized_ranger to generate predictions on the hold-out test set, we get a test RMSE of 56.58, slightly lower than the baseline model. All in all, the final model proves to have low variance and higher-than-expected bias, contradicting the stereotypical perception of RFs (low bias, high variance). Exploring predicted versus actual values within the test set highlights exactly where the model excelled and where it failed.
Figure 3.2
Figure 3.3
Figure 3.4
First looking at the density plot (Figure 3.2), we notice the final model...
- Over-assigned listing prices in both the $70-$100 and $175-$250 ranges
- Under-assigned listing prices between $100-$175 and $300+
In Figure 3.3, the black dashed line represents the scenario where model predictions exactly equal the actual listing values. Points above the dark dashed line are listings where the model prediction is greater than the actual price; points below it are listings where the actual price is greater than predicted. The parallel gray dashed lines and the point color scale convey varying levels of absolute error (dashed lines in intervals of $50).
Special Cases
Focusing especially on actual listing prices exceeding $400, we see that the model under-predicted every one of them, illuminating how poorly it handled relatively expensive listings. Figure 3.4 provides additional color on the ratio of under- and over-predictions by actual listing price. Here we see a clear trend: high percentages of over-predictions for the least expensive listings (100% over-estimates for listings under $50) gradually transitioning to high percentages of under-predictions for the most expensive listings (100% under-estimates for listings >$350).
If sheer model accuracy were a major priority of the project, there would be multiple avenues to investigate. As mentioned before, a target variable with both extreme outliers and a noticeable right skew is a prime suspect. With more time, it would be interesting to track model accuracy after transforming the target variable (log, Box-Cox, normalization) and/or lowering the target variable outlier threshold (originally set at $600).
Additionally, in the predictor feature space, possible improvements include a more extensive feature selection / dimensionality reduction process (e.g., recursive feature elimination). Lastly, since RF implementations can interpret multi-level categorical variables directly, running a model without dummy variables could also make a difference. With the goal of this section being feature importance, optimized_ranger's performance, while not ideal, is sufficient for reliable variable importance values.
Deeper Dive Into Feature Importance
Before diving into the feature importances themselves, it's important to distinguish between the two major types of variable importance for random forest regressions: variance reduction and permutation. Put simply, variance-reduction importance defines important features as those which, on average, best reduce variance at each node split across all trees in the forest. Permutation importance, on the other hand, is calculated after the model training phase as the degree to which model performance degrades when each predictor variable is randomly reshuffled.
While variance reduction is slightly less computationally expensive than permutation, it is known to misrepresent the importance of continuous and high-cardinality categorical features. Despite not having any categorical features with more than two levels (due to feature dummying), I decided to move forward with permutation importance, as the dataset consists of both continuous and non-continuous predictors. Below are the 25 variables with the highest importance, computed as sketched here:
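The fit below reuses the tuned hyper-parameters reported earlier; the data frame name train_df is an assumption:

```r
library(ranger)

# Final model refit with permutation importance enabled.
optimized_ranger <- ranger(price ~ ., data = train_df,
                           num.trees       = 460,
                           mtry            = 26,
                           min.node.size   = 3,
                           sample.fraction = 1,
                           importance      = "permutation",
                           seed            = 42)

# Top 25 predictors by permutation importance.
head(sort(optimized_ranger$variable.importance, decreasing = TRUE), 25)
```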
Figure 3.5
Of the top 25 features, nearly all are continuous variables (23/25); interestingly, the only two categorical variables land within the top three spots, and both are dummy columns of the overarching room_type variable. From this first bar chart, we see that room type, the numbers of bedrooms / accommodates / beds / bathrooms, distances to popular tourist destinations, the cleaning fee, and the host's total listings count were the most influential in determining listing price. With 16 of the top 25 features being tourist destination distances, a major flaw of permutation feature importance is highlighted: it often overestimates the importance of correlated features.
Subgrouping by feature importance
Next, I filtered features and their importances into subgroups based on the following themes:
- Tourist Location Distance (3.6)
- Neighborhoods (3.7)
- Reviews (3.8)
- Listing Characteristics (3.9)
- Amenities (3.10)
- Host Related Variables (3.11)
Figure 3.6
Figure 3.7
Figure 3.8
As previously mentioned, RF models being 'black boxes' makes the interpretation of feature importances slightly difficult. Unlike linear regression, where variable coefficients provide both a magnitude and a direction relative to the target variable, feature importance from a random forest includes only the magnitude of each feature's influence on the dependent variable.
Without direction (+/-), we can only make projections based on EDA and contextual knowledge. With that said, feature importance in random forest models is a more effective tool for gaining insight into the model itself (informing performance improvement strategies) than for extracting subject-matter insight. Furthermore, general insight derived from feature importance should be taken with a grain of salt and deserves additional validation.
Tourist Location Distance Insights / Observations (Figure 3.6)
- As expected, locations close in distance to one another have similar feature importances
- Interestingly Beacon Hill / Boston Common / Public Garden distance distributions most closely reflect that of the price variable
Neighborhoods Insights / Observations (Figure 3.7)
- Unsurprisingly, if we look at the top 10 most important neighborhood dummy columns (East Boston through South Boston), 8 of the 10 also exist in the top 10 highest number of listings per neighborhood
- East Boston having the highest feature importance while not existing in the top 10 most listings could suggest a wider price range relative to other neighborhoods
Reviews Insights / Observations (Figure 3.8)
- Review amount and frequency (Number of Reviews / Reviews per Month) atop the list suggest a stronger relationship with price than the actual review scores themselves.
- Review scores referring to specific guest experiences (in order from most important to least: cleanliness, location, accuracy, value, communication, check-in) provide a potentially useful ordering reflective of what guests truly value (or don't value).
Figure 3.9
Figure 3.10
Figure 3.11
Listing Characteristics Insights / Observations (Figure 3.9)
- Bedrooms ranking higher than both accommodates and beds offers an interesting distinction. Do guests pay a premium for listings with a separate room for every bed versus multiple sleeping options per room (I assume yes)?
- With the numbers of bedrooms and bathrooms being common real estate valuation metrics, this model suggests that the number of bedrooms is more influential on listing price.
- Room type - entire home / apartment above both room type - private room and room type - shared room could lend insight as to how to best divide a larger property into sub-listings.
Amenities Insights / Observations (Figure 3.10)
- Similar to the neighborhood-related variables, because each individual amenity was encoded as a binary column, we need to be careful with the insights we derive from permutation feature importance, as the frequency of an amenity could bias the results
- Of the top 10 amenities (gym, washer, free parking on premise, TV, family/kid friendly, elevator in building, air conditioning, cable TV, pool, hair dryer), pool is the only variable which less than 10% of the total listings have
- Assuming these amenities increase a listing's price (a fairly safe assumption), this list offers valuable insight to hosts interested in raising prices without any major changes to the property
Host Related Variables Insights / Observations (Figure 3.11)
- host_since and host_total_listings_count at the top of the host-specific variables (and compiled feature importance list) brings up an interesting question - does increased host activity (time duration as a host & total number of listings) increase their listing price(s)?
Concluding Thoughts & Next Steps
To close out the project, below are the condensed takeaways from each section:
EDA
- Seasons, days of the week (weekend vs. weekday), and holidays matter!
- Listings are most expensive during the summer and fall and prices consistently increase over the weekend vs. during the week.
- Listings are more expensive over holiday periods relative to the seasonal averages that the holiday falls under, although, there are notable exceptions (Thanksgiving & Fall).
- Allston and Beacon Hill have the highest concentration of Airbnb listings (>3% of all housing units).
- North End, South End, and Fenway provide the most cost friendly housing options among neighborhoods with the most expensive listings on average.
- Based on frequency of reviews, East Boston proves to be the most trafficked neighborhood by Airbnb guests.
Number of Listings vs. Rental Trends
- A fairly strong correlation exists between the number of Airbnb listings and rental prices, with the two being most correlated when listings lead rental prices by one month, hinting at causality (though not proven).
Influential Listing Features
- Expected features such as beds/bedrooms, accommodates, bathrooms, and room_type played pivotal roles in determining listing price
- Listing distances to popular tourist destinations showed high feature importance stressing the value of location in real estate related studies
- The most influential amenities mainly consisted of those not commonly found in listings
- Gym, free parking on premise, elevator in building, air conditioning, pool, etc.
Concerning next steps of the analysis: in the EDA portion of the project, I focused strictly on grouping and analyzing the listings by neighborhood. To deepen the insights, it could be interesting to consider additional filters beyond the neighborhood layer. For listing density and price, how do neighborhoods compare when also considering factors such as the number of accommodates, bedrooms, or bathrooms? Additionally, it would be interesting to invest additional effort into measuring guest traffic beyond the number of reviews.
Extra Variables
Access to variables such as the number of guests each listing has hosted or the total number of nights guests have stayed would help validate the popularity approach taken and would prove extremely useful in the hands of hosts. Also, I only scratched the surface in finding strategies to determine the value of each listing by comparing listing prices to rent averages in similar neighborhoods.
After a quick look at Airbnb's site, there don't seem to be any indicators stacking one listing against others with similar traits. Providing more context regarding comparable listings, real estate trends, or local activity could be a useful search extension, helping users gauge value (good deal vs. bad deal).
Airbnb listings and long-term rental trends
Tackling the relationship between the number of Airbnb listings and long-term rental trends is no easy feat, and the brief correlation analysis conducted in the second part of the project is only the tip of the iceberg. To continue the analysis, more extensive and recent data covering a longer period of time (more than one year) would be required. Moreover, implementing techniques beyond cross-correlation to analyze synchrony between two time series, such as Dynamic Time Warping (DTW) and instantaneous phase synchrony, would increase confidence in the results.
Takeaway
If there's one thing I've taken from designing and tuning ML models, it's that there will always be something you could improve. Despite thorough random forest model tuning, I was unable to decrease RMSE values under $50 or improve R2 above 0.7. If I were to continue with this model, I would need to further investigate outlier thresholds, feature selection, the treatment of categorical variables, and potentially transform the target variable.
In hindsight, with feature importances as the objective, using a more interpretable model (i.e. regularized linear models) could've been a better route as drawing contextual information from variable importance would've been more straightforward.
Analyzing the Boston Airbnb dataset offered a great introductory look into EDA and machine learning techniques and is a great starting point for more directed future research.
References
- Barron, K., Kung, E., & Proserpio, D. (2017, July 25). The Effect of Home-Sharing on House Prices and Rents: Evidence from Airbnb. Retrieved from https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3006832
- Zumper. (n.d.). Average Rent in Boston, MA and Cost Information. Retrieved December 15, 2021, from https://www.zumper.com/rent-research/boston-ma