Data Science Analysis of Scraped TripAdvisor Reviews
The skills I demoed here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Check out my source code at: https://github.com/tdchoi7/Web_Scrape_Proj
Data Science Background
Good reviews are important for any business today, especially for city attractions and dining. Potential customers rely on the information provided by previous visitors. Although the tourism industry already has a means of gauging business through revenue or ticket sales, one potentially untapped method of gauging an attraction’s value and boosting sales could lie in the words of reviewers.
The written word can have a great impact on future visitors – both in number and overall sentiment. Therefore, an analysis of reviews of the top attractions in a city can reveal what people like and what they find wanting. Based on that insight, proprietors can respond to popular demand and increase revenue.
Data Analytic Method
To make the data set manageable, 400 reviews were scraped on December 12, 2020 from each of the five top attractions in four major cities: Boston, Chicago, Los Angeles, and New York City. Attraction names, city, review posted date, attraction visit date, number of user reviews, number of user helpful votes, number of review helpful votes, ratings, reviews, review titles, username, and user location were scraped using a combination of Scrapy and Selenium.
Scrapy alone did not suffice because it was unable to properly expand TripAdvisor’s review text box at the time of scraping. In total, there were 38,294 rows and 12 columns. Figure 1 shows a single sample row with its 12 features.
Figure 1: Example of Original 12 columns
There was a slight issue when scraping attractions in New York, which was most likely due to computer memory capacity at the time. The 9/11 Memorial and Central Park reviews were scraped again separately and added into the DataFrame using Pandas.
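As a rough illustration of adding separately scraped reviews back into the main DataFrame, here is a minimal Pandas sketch. The column names and sample rows are hypothetical, and dropping duplicate rows is an assumption made for the example rather than a step confirmed in the original project.

```python
import pandas as pd

# Hypothetical stand-ins for the main scrape and the separate re-scrape.
main_df = pd.DataFrame({
    "attraction": ["Fenway Park", "Central Park"],
    "username": ["user_a", "user_b"],
    "review": ["Great game.", "Lovely walk."],
})
rescrape_df = pd.DataFrame({
    "attraction": ["Central Park", "9/11 Memorial"],
    "username": ["user_b", "user_c"],
    "review": ["Lovely walk.", "Very moving."],
})

# Stack the two frames, then drop any review captured by both scrapes
# (assumed step; the original write-up does not say whether this was needed).
combined = (
    pd.concat([main_df, rescrape_df], ignore_index=True)
    .drop_duplicates(subset=["attraction", "username", "review"])
    .reset_index(drop=True)
)
print(combined)
```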
Prior to analysis, null values for visit dates were filled with the posted date, and the number of features was expanded to 16 to include all possible fields for a user’s location. If the user’s location included a state, the state's abbreviation was used as the replacement. Before analyzing the Boston attractions specifically, ratings of 1 to 3 stars were grouped into a single broader rating of “Poor,” since there were too few reviews with 1, 2, or 3 stars to analyze each level reliably on its own.
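The two cleaning steps described above (filling missing visit dates and grouping low ratings) might look something like the following in Pandas. This is a sketch under assumptions: the column names, the sample data, and the "4 stars" and "5 stars" labels are hypothetical; only the "Poor" label comes from the analysis itself.

```python
import pandas as pd

# Hypothetical sample; the real data set has more columns.
df = pd.DataFrame({
    "rating": [5, 2, 4, 1, 3],
    "posted_date": ["2020-11-01", "2020-10-15", "2020-09-30", "2020-08-02", "2020-07-19"],
    "visit_date": ["2020-10-20", None, "2020-09-01", None, "2020-07-01"],
})

# Fill missing visit dates with the date the review was posted.
df["visit_date"] = df["visit_date"].fillna(df["posted_date"])

def rating_group(stars: int) -> str:
    """Collapse 1-3 stars into "Poor"; keep 4 and 5 as separate labels."""
    return "Poor" if stars <= 3 else f"{stars} stars"

df["rating_group"] = df["rating"].map(rating_group)
print(df)
```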
Analysis was performed mainly using basic Natural Language Processing (NLP) and Sentiment Analysis. Counts of single words, word pairs (bigrams), and word triples (trigrams) were graphed for all reviews scraped for Boston. Single words generally consisted of names or words associated with the attractions, but bigrams showed a visually interesting trend (Graph 1): certain word pairings were found more often in reviews giving poor ratings.
This pattern was most apparent for the word pairing: “gift” and “shop.” All but one of the poorly rated reviews containing the word pairing of “gift” and “shop” were for The Boston Tea Party Ships & Museum (BTPSM) as noted in Graph 2.
Graph 1: Bigram Count by Rating
Graph 2: Count of Word Pairing for “Gift” and “Shop” in Poor Reviews
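For readers who want to see how bigram and trigram counts per rating could be produced, here is a minimal sketch using only the Python standard library. The stop-word list, the sample reviews, and the helper names are hypothetical, and removing stop words before pairing is an assumption; the original project may have used a different tokenizer or library.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stop-word list (assumption).
STOP_WORDS = {"the", "a", "an", "and", "of", "was", "is", "in", "to", "it", "but", "for"}

def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def ngrams(tokens, n):
    """Return consecutive n-word tuples from a token list."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Invented (rating group, review text) pairs for illustration only.
reviews = [
    ("Poor", "The gift shop was the best part of the tour."),
    ("Poor", "Nice gift shop but the tour guide was rushed."),
    ("5 stars", "Our tour guide was fantastic."),
]

# One Counter of bigrams per rating group.
bigram_counts = defaultdict(Counter)
for rating, text in reviews:
    bigram_counts[rating].update(ngrams(tokenize(text), 2))

for rating, counts in bigram_counts.items():
    print(rating, counts.most_common(3))
```

Passing `n=3` to `ngrams` gives trigrams in the same way, so the per-rating counts can be built for any n-gram size.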
Boston Tea Party Ships & Museum's Gift Shop Analysis
Considering that this word pairing was mentioned so often in reviews that gave poor ratings, one might expect that the gift shop needed significant improvement. However, closer inspection of the actual reviews indicated that the gift shop was not the issue; rather, it was the best that BTPSM had to offer. One example is shown in Figure 2 below.
Figure 2: Review of Poor Remarks for BTPSM by DavvaW
Although Graph 3 shows that the negative reviews for BTPSM did not have strongly negative polarity scores, the review in Figure 2 indicates that the attraction itself was not good and that the gift shop was the only part of the attraction that was worth the user’s while. In fact, a careful read-through of the reviews rating BTPSM poorly shows that the overall sentiment regarding the attraction leaned towards the negative.
Sentiment Analysis was unable to distinguish between the negative views of the main attraction and the positive views of the gift shop, which reinforces the importance of the ratings and of combing through the reviews themselves. In the one poorly rated review for Fenway Park that contained the pairing “gift” and “shop,” the reviewer purchased a tour upgrade allowing them on the field, but the activity on the field seemed limited (Figure 3).
Graph 3: Polarity of Reviews Giving BTPSM Poor Ratings
Figure 3: Review with “Gift” and “Shop” Pairing Giving Fenway Park a Poor Rating
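To illustrate why a single polarity score can hide a review that is negative about the attraction but positive about the gift shop, here is a toy lexicon-based scorer. It is not the sentiment tool used in this project (which is not named here); the lexicon, the sample review, and the `polarity` helper are all hypothetical and exist only to show the averaging effect.

```python
import re

# Hypothetical mini-lexicon: word -> polarity in [-1, 1] (assumption).
LEXICON = {
    "boring": -1.0, "overpriced": -1.0, "disappointing": -1.0,
    "great": 1.0, "lovely": 1.0, "best": 1.0,
}

def polarity(text):
    """Average the lexicon scores of the words found; 0.0 if none match."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

# Invented review mixing a negative and a positive opinion.
review = "The show was boring and overpriced. The gift shop was lovely and great."

# Scoring the whole review blends both opinions into one number.
overall = polarity(review)

# Scoring sentence by sentence keeps the two opinions apart.
sentences = [s for s in re.split(r"(?<=[.!?])\s+", review) if s]
per_sentence = [polarity(s) for s in sentences]

print(overall, per_sentence)
```

In this toy case the whole-review score comes out neutral even though each sentence is strongly opinionated, which mirrors the pattern seen in the BTPSM reviews.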
Fenway Park Tour Analysis
A closer look at the comments for Fenway Park seemed to indicate that reviews of games tended to be better than reviews of the tour. This is more apparent when delving into the reviews that rated attractions poorly and mentioned the word pairing “tour” and “guide” (Graph 4). Most of these reviews mentioned some issue with the tour guide, including the guide’s lack of awareness or training. Other reviews mentioned that the tour lacked a substantial experience of the park (such as going to the dugout or seeing the press box). At times, there were complaints about a lack of planning and scheduling of the tours (Figure 4).
Graph 4: Polarity of Reviews Containing "Tour" and "Guide"
Figure 4: Review with “Tour” and “Guide” Pairing Giving Fenway Park a Poor Rating
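Pulling out the poorly rated reviews that mention a given word pairing, and seeing which attractions they belong to, could be done along these lines in Pandas. This is a sketch with hypothetical column names and invented sample rows; it matches the literal phrase "tour guide" rather than reproducing the exact bigram logic used in the analysis.

```python
import pandas as pd

# Invented sample rows for illustration only.
df = pd.DataFrame({
    "attraction": [
        "Fenway Park",
        "Fenway Park",
        "Boston Tea Party Ships & Museum",
        "Fenway Park",
    ],
    "rating_group": ["Poor", "5 stars", "Poor", "Poor"],
    "review": [
        "Our tour guide seemed unprepared.",
        "The tour guide was excellent.",
        "The gift shop was the highlight.",
        "Tour Guide skipped the dugout.",
    ],
})

# Keep poor reviews whose text contains the phrase, ignoring case.
mentions_phrase = df["review"].str.contains(r"\btour guide\b", case=False, regex=True)
poor_tour_guide = df[(df["rating_group"] == "Poor") & mentions_phrase]

# Count the matching reviews per attraction.
counts = poor_tour_guide.groupby("attraction").size()
print(counts)
```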
One interesting observation concerned the trigram “waste,” “money,” and “time.” This trigram appears in only three reviews across Fenway Park and BTPSM, but it may still reflect a need for change at these two attractions.
Possible Improvements for Attractions
Possible improvements for BTPSM, despite its being a money-making attraction, would be easing time constraints and improving scheduling. Some complaints in reviews mentioned not being able to see all the artifacts in the museum. Staggering times for tourists with children and those without could also help, but it would require actors who know the history of the Tea Party well enough to keep adults informed about the history behind the attraction.
For Fenway, allowing tourists to see the hidden aspects of the field and park could help improve tourists’ view of the park, since the tour seems to be less of an experience than an actual baseball game. Adding a part of the tour that lets visitors see the dugout, the locker rooms, or displays of past Hall of Famers and trophies would be a good addition to the routine.
Another possible way to bring the experience of a ballgame into the tour would be to offer discounted tickets to a Red Sox game with the purchase of a tour ticket. Having players practice or warm up during a tour, or even having someone from management address the tour group, could also help improve the experience. Proper training of tour guides, or providing them with notes, could improve the experience even more.
Possible Routes for Further Data Analysis
In the future, analyzing the patterns of how reviews change over time and looking for repetitive patterns or new patterns that develop could help attractions increase revenue. Also, analyzing possible responses to reviews throughout the years could give an indication of whether the attractions have been taking proper steps to increase revenue. Even if attractions are not focused on reviews, noting trends could help more when coupled with targeted fixes based on reviews and ratings.