Data Scraping Tripadvisor: Finding the Best Bargains
The skills I demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.
Ever wondered when the best time and date to book a flight is? Or how about which airlines really provide the best bang for your buck? As a casual flyer myself, I've pondered these same questions countless times without ever reaching a real conclusion that wasn't influenced by the last clickbait ad I saw on Twitter five minutes ago. Well, I finally said enough was enough and took matters into my own hands by shamelessly extracting nearly an entire calendar year's worth of flight deals on Tripadvisor.
I say "nearly" because the Tripadvisor website did not flight deals available past December 21, 2020. Keep in mind that these deals were scraped in real time on January 27, 2020.
Although I desperately wanted to use Scrapy Web Crawler for its relatively fast scraping rate, I resorted to Python's rather slow Selenium library package simply for the fact that it was able to scrape my desired site, Tripadvisor. Like most major corporations, Tripadvisor runs their site dynamically, allowing for cool features such as searching for flight deals based on lowest to highest prices or booking hotel rooms based on reviews and user ratings. Unfortunately, Scrapy is unable to handle sites that are configured this way. As a result, I created a Selenium web driver that executed the following steps:
- Opened a Google Chrome web browser that navigated to Tripadvisor's Flights website
- Entered starting departure date as January 27, 2020, origin airport as NYC (i.e. JFK, LGA, EWR), and first destination airport as LAX; submitted query
- Once the page results loaded, scrolled through every page of flight deals for NYC to LAX and wrote the pieces of text I identified from each deal (i.e. price, Fly Score, departure date, etc.) to a CSV file
- The previous step is repeated but for the four other destination airports
- Once all five destinations for an entire day are scraped, the collection of flight deals for that day are written and stored in a CSV file
- Then, the scraper repeats the search process again with LAX as the first destination but with the departure date set to the next day, January 28th, 2020
- The scraper was interrupted and stopped when it reached December 22, 2020 since there were no flight deals available by Tripadvisor for that date and beyond
Overall, I was left with one CSV file per day of flights between 01/27/20 - 12/21/20. From these daily files, I concatenated them into 12 separate monthly files based on the month of the flight's departure date. Then, I combined the 12 monthly files into one single file with every row representing a flight deal and the following variables to describe each flight:
- Departure Date
- Departure Time
- Origin Airport Code
- Arrival Date
- Arrival Time
- Destination Airport Code
- Flight Duration in Hours
- Overnight Flight Flag
- Name of Airline
- Flight Price
- Flight Deal Rating (Fly Score)
- Day of the Week Flight Scheduled (i.e. Monday, Tuesday, etc.)
The following analysis was performed using the twelve key variables listed above and a total of 57,801 individual flight deals.
Although the bulk of my analysis was exploratory in nature, I extracted a handful of insights that could be valuable to both casual flyers (those who fly at least once a year) and frequent flyers (those who fly at least once a month for business and/or leisure), as well as airline corporations. According to the trade organization Airlines for America, commercial aviation helps drive more than $1.5 trillion per year in U.S. economic activity and more than 11 million U.S. jobs.
So, if people like us are going to keep flying but also continue searching for the best quality deals while the airline industry thinks of new ways to get us to purchase those same deals, I was left with one last question: are there any potential areas of compromise where both sides can essentially get what they want?
Before we reach that juicy conclusion, we need to build up our understanding of the data as a whole through the exploration of three key factors:
- Fly Scores
Daily Flight Prices
First, let's take a look at the daily average price of flights over the entire dataset (approximately 10.5 months between 01/27/20 - 12/21/20) to five different cities including Los Angeles (LAX), Miami (MIA), Chicago (ORD), San Juan (SJU), Montreal (YUL).
Right off the bat, it's apparent that there are peaks and valleys when it comes to flight prices throughout a given year. The obvious suspects for price peaks were the major federal U.S. holidays like Thanksgiving, Christmas, New Year's, Memorial Day, and my personal favorite the 4th of July. Indeed, all of these holidays were confirmed as local price maximums in the data that I scraped shown in the plot above. However, I didn't anticipate Valentine's Day (February 14th) to have anywhere near the highest average prices for any given time span in this dataset, let alone amongst any of the federal holidays where people actually have time off work to travel. Looking further into the data, it makes a lot more sense why the days leading up to, on, and the day after Valentine's Day experience the price inflations that they did.
My evaluation is that the near $400 average daily price increase from February 11th ($103.95) to February 15th ($504.13) was most likely due to Valentine's Day falling on a Friday in 2020. Whether it was finally taking that couple's vacation to Miami or Puerto Rico, or visiting that long distance significant other out in L.A. or Chicago, I think New Yorkers were more inclined to travel this year for Valentine's Day since it didn't fall on an "inconvenient" day such as a Tuesday or Wednesday. With the day falling on a weekend, couples had the option to plan memorable weekend getaways instead of settling for overpriced dinner reservations at a local restaurant they visited last week. My contention is that airlines used a similar type of analysis to mark up their prices for flights leading up to, on, and immediately after Valentine's Day in order to maximize profits. Essentially, these airlines anticipated the demand for outgoing flights between February 11th and February 15th to be unusually high this year for the reasons listed above and took full advantage of it. Indeed, this is just one angle of approach to explain this data and, in reality, there are probably several different factors that might play into this particular price surge. With more time and resources, I think it would be valuable to conduct further research on the factors and circumstances that drove airlines to quadruple their prices during this five day stretch in February.
Monthly Flight Prices
One thing that is clearly shown by this day-to-day time series is that the average daily price of flights are constantly fluctuating. As a consumer, there aren't many practical insights I can draw from this data. However, when the time series is manipulated to show average monthly prices versus average daily prices, a few insights seem to stick out.
Here, we're able to see that at the time this data was scraped in January 2020, September 2020 was projected to have the cheapest overall flights averaging a whopping $147 per flight. Keep in mind that this data included flights from NYC to both Montreal, Canada and San Juan, Puerto Rico. Meanwhile, June 2020 was projected to have the most expensive flights averaging about $204 per flight.
For bargain hunters, this information could be quite useful when planning their next vacation as they could potentially save an average of $57 per ticket by waiting an extra 3 months if their schedule and circumstances are flexible.
Flight Prices by Day of the Week
Great, now we know which months to look into and which ones to avoid. So, which day of the week has the cheapest flights available? Well, if you happen to have a very flexible schedule and can pick any day of the week to fly, your best bet is booking a Tuesday flight as shown in the distribution below. To not get distracted by outliers, I limited the data to only show flights $500 and below.
Out of all the days of the week, Tuesday has the lowest projected median price at $115 as well as the smallest range of prices.
According to the distribution of prices shown above, 50% of flights available on Tuesdays were between $89 and $154, the first quartile and third quartile respectively. Wednesday had the next lowest median flight price at $132 followed by Monday with a median flight price of $154. Contrastingly, Fridays had the most expensive prices with a median flight price of $179 followed by Sundays at $171 and Thursdays at $169.
Fly Scores by Day of the Week
Similarly, the distribution of Fly Scores for each day of the week only reinforces the conclusion that Tuesdays and Wednesday are the best days to book a flight.
Indeed, Tuesday and Wednesday had almost identical distributions of fly scores with a median score of 7.1 and 50 percent of the scores falling between 5.9 and 7.7 points. Thus, it can be inferred that Tuesdays and Wednesdays are the best days of the week to fly because they have the lowest median prices and the highest median fly scores. Keep in mind that these findings are based on flight prices for days of the week over the entire year. Therefore, Tuesdays and Wednesdays are the best days of the week to book the cheapest flights with the best ratings regardless of what month it is.
Contrastingly, Thursdays and Sundays had the worst fly scores with median scores of 6.5 and 6.4 respectively.
This is consistent with the distribution of flight prices previously shown where Thursdays and Sundays had the highest median flight prices. Therefore, based on prices and fly scores, it is safe to say that Thursdays and Sundays are the worst days to fly out of NYC because they have the highest median flight prices as well as the lowest median fly scores out of any given day of the week.
Flight Prices by Airline
Now that we've identified the best months and days of the week to purchase flights, let us consider how flight prices are distributed amongst our participating airlines.
For all flights out of NYC at or below $500, Frontier airlines had the lowest projected median flight price of $119 while WestJet had the highest projected median flight price of $182. Of course, this is an oversimplification since every airline shown above doesn't service every destination included in this dataset. As a result, this analysis is more useful when a single destination is chosen in order to equally compare the "best" and "worst" airlines.
The example above shows the distribution of prices for all flights $500 or below, departing NYC and arriving in LAX. Here, there are only four airlines that service flights from NYC to LAX (American, Alaska, United, and JetBlue). Amongst these four airlines, Alaska airlines has the lowest projected median flight price of $154 while United airlines has the highest projected median flight price of $217. Therefore, if your desired destination is Los Angeles and you have a choice between these four airlines, then the cheapest overall option would be to fly with Alaska airlines.
Based on the five core analyses I conducted on this data scraped from Tripadvisor, my suggestion for consumers seeking the cheapest flights with the highest ratings is follows:
- Book flights in the months of September or May
- Book flights that depart on either a Tuesday or Wednesday
- If your desired destination is Los Angeles, book a flight with Alaska airlines if possible
So, if your next vacation includes Los Angeles, Chicago, Miami, San Juan, or Montreal as possible destinations, strongly consider this extensive analysis before you purchase that flight. Thank you for your time and safe travels!