‘Airbnb.com vs Hotels.com’ - A Webscraping Project

Posted on May 8, 2019



Introduction:

When people look for a place to stay while traveling, they can opt for a standard hotel room or an apartment. Two websites they can turn to are Airbnb.com and Hotels.com. Scraping those sites can reveal which factors drive the price of rooms and rentals. Analyzing the results could provide insights into which choice is best for a particular location based on features and price.

About the Businesses:

  • Founded in 2008 in San Francisco, Airbnb.com is a community-based online platform for listing and renting local homes. While it does not own any of the properties itself, it connects hosts and travelers and facilitates the process of renting. Moreover, it cultivates a sharing economy by allowing property owners to rent out private flats. It earns its revenue by charging guests a non-refundable service fee of 6-12% and hosts a 3% processing fee.

 

  • Established in 1991 in Dallas, Texas, Hotels.com is a website for booking hotel rooms online and by telephone. Its inventory includes hotels and B&Bs, as well as some condos and other types of commercial lodging. It has a commission-based business model.

Why do we care about this kind of data?

  • To understand what fits best in your budget
  • To understand what can make your travel experience better
  • To compare the two businesses and see where each could improve

Web Scraping Challenges and Solutions:

  • I started my Airbnb web scraping journey by creating a spider in Scrapy. Although the crawl ran, it yielded an empty result set: Airbnb loads its data via AJAX, so the Scrapy XPaths had nothing to match. I therefore switched to Selenium. It worked like a charm, but it has its own drawback: it is extremely slow, because it mimics user actions. On the positive side, that also makes it harder to detect as a robot. Another possible option is Scrapy Splash.
  • Airbnb requires you to provide a travel location before you can browse listings. Consequently, you cannot easily scrape the whole website. I decided to analyze two locations of a different nature: Manhattan and Orlando.
  • With Selenium you can call next_button.click() to click through to the next page like a user and grab its listings. The challenge is that Airbnb limits you to a maximum of 17 pages: if you click “next” after page 17, you are redirected to page 1. I collected the listing links from the hrefs on all 17 pages and iterated over them to scrape each individual listing.
  • While scraping, I also found that some variables were not captured because the scraper tried to fetch them before the page had loaded entirely. To solve this, I added a wait until the page was completely loaded. Adding sleep times between actions to mimic a user is also very helpful.
  • Next came the issue of differing page layouts:

 

As you can see, the price, review rating, number of ratings, etc. that were available at the top of the page in picture 1 were not available at the same XPaths in picture 2. I handled this by using different XPaths in Selenium.
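The two Airbnb workarounds described above, capping pagination at 17 pages and trying more than one XPath per field, might look roughly like the sketch below. It is not the project's actual code: the function names and the XPath arguments are hypothetical, and the locator strategy is passed as the plain string "xpath" (the value behind Selenium's By.XPATH) so that the helpers work with any driver-like object.

```python
def collect_listing_links(driver, link_xpath, next_xpath, max_pages=17):
    """Collect unique listing URLs from up to max_pages result pages."""
    links, seen = [], set()
    for page in range(max_pages):
        # Grab every listing anchor on the current page and keep its href.
        for anchor in driver.find_elements("xpath", link_xpath):
            href = anchor.get_attribute("href")
            if href and href not in seen:
                seen.add(href)
                links.append(href)
        # Stop before clicking past the last allowed page, because the
        # site wraps back to page 1 after the limit.
        next_buttons = driver.find_elements("xpath", next_xpath)
        if page == max_pages - 1 or not next_buttons:
            break
        next_buttons[0].click()
    return links


def first_text(driver, xpaths):
    """Return the text of the first XPath that matches, or None."""
    for xpath in xpaths:
        matches = driver.find_elements("xpath", xpath)
        if matches:
            return matches[0].text
    return None
```

In a real run, each page would also need a wait before its elements are read, as noted in the list above. Keeping a set of already-seen hrefs means that an accidental wrap back to page 1 does not add duplicate listings.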

I used Selenium for scraping Hotels.com too, as it came with most of the same challenges. One additional challenge here was:

  • The page loads more listings on scroll; there is no next button to click to get the next set of listings. To solve this, I used the window.scrollTo() method to scroll and load the listings. I saved the listing hrefs in a list, then iterated over the links to scrape the required data.
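The scroll-and-load approach can be sketched as follows. This is a minimal illustration rather than the project's code: scroll_until_loaded, pause and max_rounds are hypothetical names, and the sketch assumes the page height grows whenever new listings are appended. It relies only on the driver's execute_script method.

```python
import time


def scroll_until_loaded(driver, pause=1.0, max_rounds=50):
    """Scroll to the bottom repeatedly until the page stops growing."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        # Jump to the bottom so the site loads the next batch of listings.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the new listings time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was appended, so we reached the end
        last_height = new_height
    return last_height
```

The max_rounds cap is there so that a page which keeps growing cannot trap the scraper in an endless loop.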

Data Cleaning and Manipulation

I loaded the data from CSV into a pandas DataFrame and cleaned it in Python.

Airbnb:

Here is a sample of the data that I scraped from the Airbnb website:

This data needed a lot of cleaning before it could be used to derive insights. As a rule, numerical values are better suited to statistical analysis. I stripped the $ from price, extracted the number from the number of reviews, and cast both to int. I split occupancy into its elements, saved them as guests, house size, number of beds and number of baths, and extracted the numbers from each.

The variable 'house' holds the number of bedrooms. For example, the house variable for a 3-bedroom house contains 3. 'Studio' apartments don't have any bedrooms, so my code logically set them to 0. In practice, though, that misjudges the size of the house, because a studio apartment does include living space, and its prices are pretty close to those of 1-bedroom apartments.

Keeping them as 0 would distort the correlation between price and house size. Consequently, I set the house size of 'Studio' listings to 0.5 and used it as a numerical feature. I checked for missing values and dropped NAs/nulls. The following is a snapshot of sample data after cleaning the Airbnb data:
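To make the Airbnb cleaning steps concrete, here is a minimal sketch using only the standard library. The raw string formats (for example "4 guests · Studio · 2 beds · 1 bath") and the function names are assumptions for illustration; the scraped data may be laid out differently.

```python
import re


def to_int(text):
    """Return the first integer found in text (commas ignored), or None."""
    match = re.search(r"\d+", text.replace(",", ""))
    return int(match.group()) if match else None


def parse_occupancy(occupancy):
    """Split an occupancy string into numeric guest/house/bed/bath fields."""
    fields = {"guests": None, "house": None, "beds": None, "baths": None}
    for part in re.split(r"[·,]", occupancy):
        part = part.strip().lower()
        number = re.search(r"\d+(?:\.\d+)?", part)
        if "studio" in part:
            fields["house"] = 0.5  # studio counted as half a bedroom
        elif number is None:
            continue  # nothing numeric to extract from this element
        elif "guest" in part:
            fields["guests"] = int(float(number.group()))
        elif "bedroom" in part:
            fields["house"] = float(number.group())
        elif "bed" in part:
            fields["beds"] = int(float(number.group()))
        elif "bath" in part:
            fields["baths"] = float(number.group())
    return fields
```

For example, to_int("$150") gives 150, and a studio listing ends up with a house size of 0.5 rather than 0.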

 

Hotels:

Here is a sample of the data that I scraped from Hotels.com:

I cleaned price, hotel_star and no_of_ratings using Python's re.findall() function and cast them to integer/float. I used the occupancy variable to derive two new variables, no_of_guests and room. I scraped latitude and longitude for future work (since Airbnb doesn't disclose a listing's address until the booking is confirmed, I couldn't scrape addresses, so a comparison on exact location is out of scope for my current analysis). I cross-checked for NAs and nulls. The following is a snapshot of sample data after cleaning the Hotels.com data:
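A rough sketch of the re.findall() cleaning for one hotel record is shown below. The column names come from the text above, but the raw string formats (such as "3.5-star" or "1 room, 2 guests") and the helper names are assumptions made for illustration.

```python
import re


def first_number(text, cast=float):
    """Return the first number in text converted with cast, or None."""
    found = re.findall(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return cast(float(found[0])) if found else None


def clean_hotel_row(raw):
    """Convert one scraped hotel record (all strings) to numeric fields."""
    # Pair each count with the word that follows it, e.g. ("1", "room").
    pairs = re.findall(r"(\d+)\s*(room|guest)", raw["occupancy"].lower())
    occupancy = {label: int(count) for count, label in pairs}
    return {
        "price": first_number(raw["price"], int),
        "hotel_star": first_number(raw["hotel_star"], float),
        "no_of_ratings": first_number(raw["no_of_ratings"], int),
        "no_of_guests": occupancy.get("guest"),
        "room": occupancy.get("room"),
    }
```

Fields that cannot be parsed come back as None, which makes them easy to find when checking for NAs and nulls afterwards.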

Data Visualization and Analysis

My data analysis is based on the following key questions:

  • What factors influence Airbnb prices?
  • What factors influence hotel prices?
  • How do prices vary with the type of location (Manhattan vs Orlando)?
  • How does the Airbnb business compare to the Hotels.com business?
  • What kind of accommodation should one choose based on budget and the number of people travelling?

I did my visualizations using seaborn and matplotlib (with the ggplot style).

I started with a hypothesis test on the Airbnb and hotel price data, taking as the null hypothesis that the two have the same mean price. I chose SciPy's two-sample t-test, and this is what the result looks like:

Ttest_indResult(statistic=-6.462759938122756, pvalue=1.482752809922366e-10)

The results show that the p-value is extremely small. This indicates that hotel and Airbnb prices are unlikely to have the same mean and are statistically different.

I decided to start by getting an idea of what prices look like across the two cities for both businesses.
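For readers who want to see what is behind the statistic, here is a small standard-library sketch of the two-sample t statistic. It assumes the equal-variance (pooled) form, which is the default of SciPy's ttest_ind; the post does not say which variant was used. The prices are made up for illustration and are not the scraped data.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(sample_a, sample_b):
    """Student's two-sample t statistic (equal variances) and its df."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    # Pool the two sample variances, weighted by their degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)
    ) / df
    std_err = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / std_err, df


# Made-up nightly prices, for illustration only (not the scraped data).
airbnb_prices = [120, 150, 95, 180, 140]
hotel_prices = [160, 210, 130, 250, 190]
t_stat, df = pooled_t_statistic(airbnb_prices, hotel_prices)
```

A negative statistic, as in the result above, means the first sample has the lower mean. The p-value is then read from the t distribution with df degrees of freedom, which is the part SciPy computes for you.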

Price Range:

The boxplot above shows that Airbnb prices are higher in Manhattan than in Orlando, as expected, averaging around $150 and $80 respectively. Similarly, hotel prices in Manhattan are slightly higher than in Orlando. However, contrary to my expectation, hotels seem to have an average price similar to Airbnb rentals, but look at the outliers that hotels have. The priciest accommodation in Manhattan on Airbnb is just $500, whereas you have to pay $1200 for the best of the hotels.

Let’s check the probability density function and see where most of the population lies:

 

House Type, Popularity and Price:

Next, I wanted to analyze which kinds of accommodation are most popular and what prices they are offered at. Here is the thing about reviews: which one would you go for? Is a listing with a 5-star rating from only 2 reviewers better than a listing with a 3.5-star rating from 100+ reviewers? I would trust the 3.5-star rating more. Holding onto that thought, I derived popularity from both high review stars and a high number of reviewers.

The following plot shows popularity of house types vs price for Airbnb Manhattan:
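The post does not give the exact popularity formula, so the sketch below is only one possible way to combine the two signals: the star rating weighted by the logarithm of the review count. The function name, the formula and the two example listings are all hypothetical.

```python
from math import log1p


def popularity_score(stars, n_reviews):
    """Hypothetical score: stars weighted by log(1 + number of reviews)."""
    return stars * log1p(n_reviews)


# Two made-up listings matching the example in the text.
listings = [
    {"name": "A", "stars": 5.0, "reviews": 2},
    {"name": "B", "stars": 3.5, "reviews": 100},
]
ranked = sorted(
    listings,
    key=lambda item: popularity_score(item["stars"], item["reviews"]),
    reverse=True,
)
```

With this weighting, listing B (3.5 stars from 100 reviewers) ranks above listing A (5 stars from 2 reviewers), which matches the intuition described above.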

“Private room in loft” is the most popular rental type with an average price of $225/night.

“Entire Serviced apartment” and “Entire guest suite” are the next two most popular types and are moderately priced at around $120-$130 on average. “Room in hotel” is the least popular type on Airbnb and is highly priced at around $500.

The following plot shows popularity of house types vs price for Airbnb Orlando:

In Orlando, Airbnb “Room in hotel” listings fare better than in Manhattan (however, they are still not the most popular type).

The most popular accommodations are “Private Room in guest suite” and “Private room in cottage”, which are moderately priced at only $70 per night on average.

Let's compare similar statistics for hotel accommodation types in Manhattan:

 

As expected, 5-star hotels are less popular due to their high price. However, 1-star hotels are also not so popular: even though they are cheap, they don’t deliver on service, which can result in a bad experience. These 1-star hotels would cost you around $70. The most popular choices are 2.5-star hotels, with an average price of $140. If you spend $140 on Airbnb, you can rent an entire serviced apartment in Manhattan.

Here are the statistics for hotel accommodation types in Orlando:

One amazing fact emerged from the data: a larger percentage of visitors in Orlando prefer to book a 5-star hotel. One factor in that preference could be the relatively affordable price of the better hotels: an average of $240. Another could be the preference for a more spacious room, because Orlando is usually a family vacation destination. Still, the most popular hotel type is the 3.5-star hotel, which costs $125 on average. The most popular choice on Airbnb for Orlando costs just $70.

House Type and its Size:

Next, I wanted to see what Airbnb can offer that hotels can't.

If you remember from the previous plots, “Private room in loft” was a popular choice. When I went into the details, I found that it can accommodate up to 5 guests, whereas the most popular choice on Hotels.com was a 1-room, 2.5-star hotel that can accommodate 2 guests (sometimes with a 3rd guest as an add-on).

A picture is worth a thousand words:

An "Entire townhouse" on Airbnb can accommodate 7-8 guests on average and is priced at $140 in Manhattan on average. You would need to pay for 3-4 hotel rooms for 7-8 people. So if you are travelling in a large group, Airbnb could be the preferable choice.

Price by Ratings:

One might have a preconceived notion that better service comes at an extra price. But that does not appear to be the case for Airbnb.

As the regression line shows, the popularity of a rental doesn't drive its price in the Airbnb market. You can find highly rated places to rent without paying extra.

A point worth mentioning: Airbnb rentals have a minimum rating of 4 stars, which is an indicator of how satisfied customers are with the service.

Hotel prices are correlated with the popularity of the hotel, and we also see a linear relationship between them.

Let's drill through other factors and see if we can find what drives airbnb prices.

Airbnb Price and Number of Guests:

We do see a trend here. Price and number of guests seem to have a positive linear relationship in both Manhattan and Orlando. The slope is slightly smaller in Orlando, because larger spaces there are not as pricey as they are in Manhattan, although the values are comparatively more correlated.

The plot above gave me a direction, and I decided to visualize price vs the size of the house.

Price and House Size:

As we see from the correlation matrix, price is most strongly correlated with house (the attribute for house size). There is also a linear relationship between price and the size of the house.

It's important to see whether this correlation is statistically significant. My next plot sheds light on this.

Pearson correlation and p-value:

The relationships and p-values are shown in the following plots:
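As a reminder of what these two numbers measure, here is a standard-library sketch of the Pearson coefficient and the t statistic used to test whether it differs from zero. The function names are hypothetical; in practice scipy.stats.pearsonr returns both the coefficient and the p-value directly.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_t_statistic(r, n):
    """t statistic for H0: no linear correlation (n - 2 degrees of freedom).

    Only defined for |r| < 1; a perfect correlation would divide by zero.
    """
    return r * sqrt((n - 2) / (1 - r ** 2))
```

The larger the absolute t statistic for a given sample size, the smaller the p-value, so a strong correlation measured on many listings is the most convincing case.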

Where should I rent if I am travelling in Manhattan?

Where should I rent if I am travelling in Orlando?

The factors that contribute to this decision are how many guests are staying together and the budget for the trip. If you are on a low budget and travelling in a large group, an Entire townhouse is the best option. For small groups, though, average-rated hotels can be an option (but they are average). Airbnb Entire serviced apartments are highly rated and provide extra hospitality at a lower cost. If budget is not an issue, then 4-star and 5-star hotels can be a choice, too.

Conclusion

This analysis is based on data that I scraped from both websites. This data can change over time, and so can the outcome. Here are my conclusions based on the trends analysed within the scope of this project:

Based on data-driven evidence, we can conclude that:

  • Airbnb price has a significant positive linear relationship with the size of the house and the number of guests it can accommodate.
  • Prices don’t inflate with the popularity of a rental. Hence, highly rated and popular rentals are not necessarily pricey.
  • Airbnb customers mostly have great things to say. Their review ratings (mostly > 4.0) are way higher than the average hotel review rating.
  • A hotel’s price has a significant positive linear relationship with its star rating and the hotel’s popularity.
  • Highly rated hotels are costlier than highly rated rentals on Airbnb. However, there are average or below-average hotels that might be cheaper than Airbnb rentals (but may come with poor service).
  • In Manhattan, Airbnb rentals or average-star-rated hotels are more popular, whereas the Orlando crowd appreciates the luxury, and accepts the high price, of 5-star hotels.

This analysis didn’t have exact location details to compare (Airbnb provides location only upon confirmation of booking). Location would have a huge impact on price and choices one can make.

Non-data-driven facts:

  • Millennials love Airbnb; it gives you a whole new travel experience as opposed to the cookie-cutter hotel experience.
  • Not all rentals on Airbnb are secure: all one needs to host a rental is an email ID and a phone number, whereas hotels come with a set level of security.
  • Hotels come with special packages like kids' clubs, indoor pools, etc., due to which a lot of families might still prefer hotels. If Airbnb could improve on such specialties, it could expand its business to another level.
  • Hotels are mostly situated near tourist attractions, while Airbnb rentals are spread across a wide area. Either could be a fit based on where you want to tour.

Future Work

  • Scrape additional fields like amenities and compare them to learn which amenities add value to the user experience in both kinds of business.
  • Create an interactive app in Dash and plug these plots into it.
  • Scrape data for locations on the West Coast and compare.
  • Scrape latitude and longitude (from Google center of listing) from Airbnb and compare with hotels based on location.

Thank you for taking the time to read my blog.

You can find my code on GitHub: https://github.com/priyasrivast/WebscrapingAirBnbAndHotels

Feel free to take a tour of my PowerPoint presentation: Web Scraping airbnb and hotels

 

About Author

Priya Srivastava

Priya Srivastava is an analytical thinker with business acumen. Her first love was STEM, which she pursued in earning a bachelor’s degree in Engineering and building a career as Software Engineer and data warehousing consultant in the technology...