Data Science Analysis of Scraped TripAdvisor Reviews

Posted on Dec 14, 2021

The skills demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

Check out my source code at: https://github.com/tdchoi7/Web_Scrape_Proj

 

Data Science Background

Good reviews are important for any business today, especially for city attractions and dining. Potential customers rely on the information provided by previous visitors. Although the tourism industry already measures business performance through revenue or ticket sales, one potentially untapped method of gauging an attraction’s value and boosting sales could lie in the words of reviewers.

The written word can have a great impact on future visitors – both in number and overall sentiment. Therefore, an analysis of reviews of the top attractions in a city can reveal what people like and what they find wanting. Based on that insight, proprietors can respond to popular demand and increase revenue.

Data Analytic Method

To keep the data set manageable, 400 reviews were scraped on December 12, 2020 from each of the top 5 attractions in four major cities: Boston, Chicago, Los Angeles, and New York City. The fields scraped were attraction name, city, review posted date, attraction visit date, number of user reviews, number of user helpful votes, number of review helpful votes, rating, review text, review title, username, and user location. Scraping used a combination of Scrapy and Selenium.

Scrapy alone did not suffice because it was unable to properly expand TripAdvisor’s textbox at the time of scraping. In total, the data set contained 38,294 rows and 12 columns. Figure 1 shows a single sample row with its 12 features.

Figure 1: Example of Original 12 columns

There was a slight issue when scraping attractions in New York, which was most likely due to computer memory capacity at the time. The 9/11 Memorial and Central Park reviews were scraped again separately and added into the DataFrame using Pandas.
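A minimal sketch of how separately scraped reviews could be added to the main DataFrame with Pandas. The column names and sample rows are hypothetical, and the de-duplication step is an assumption rather than something the original workflow is known to have done.

```python
import pandas as pd

# Hypothetical stand-ins for the main scrape and the re-scraped attractions.
main_df = pd.DataFrame({
    "attraction": ["Fenway Park", "Central Park"],
    "city": ["Boston", "New York City"],
    "review": ["Great game!", "Lovely walk."],
})
rescraped_df = pd.DataFrame({
    "attraction": ["Central Park", "9/11 Memorial"],
    "city": ["New York City", "New York City"],
    "review": ["Lovely walk.", "Very moving."],
})

# Stack the two frames, then drop rows that both runs captured
# (assumption: an identical row in both frames is the same review).
combined = (
    pd.concat([main_df, rescraped_df], ignore_index=True)
    .drop_duplicates()
    .reset_index(drop=True)
)
print(combined)
```

Dropping duplicates guards against a partially completed first run contributing the same review twice.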

Prior to analysis, null values for visit dates were filled with the posted date, and the number of features was expanded to 16 to include all possible fields for a user’s location. If the user’s location included a state, it was replaced with the state’s abbreviation. Prior to analyzing the attractions in the city of Boston specifically, ratings of 1 to 3 stars were grouped into a broader rating of “Poor,” since there were not enough reviews with 1, 2, or 3 stars to analyze each rating well on its own.
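The two cleaning steps above (filling missing visit dates and grouping low ratings) might look roughly like this in Pandas. The column names, the sample rows, and the "4 stars" / "5 stars" labels are assumptions made for illustration; only the "Poor" label comes from the text.

```python
import pandas as pd

# Hypothetical sample with one missing visit date.
reviews = pd.DataFrame({
    "posted_date": pd.to_datetime(["2020-11-01", "2020-10-15", "2020-09-30"]),
    "visit_date": pd.to_datetime(["2020-10-20", None, "2020-09-01"]),
    "rating": [5, 2, 4],
})

# Fill missing visit dates with the date the review was posted.
reviews["visit_date"] = reviews["visit_date"].fillna(reviews["posted_date"])

def rating_group(stars):
    """Collapse 1-3 stars into "Poor"; keep 4 and 5 as their own labels."""
    return "Poor" if stars <= 3 else f"{stars} stars"

reviews["rating_group"] = reviews["rating"].map(rating_group)
print(reviews)
```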

Data Analysis

Analysis was performed mainly using basic Natural Language Processing (NLP) and Sentiment Analysis. Counts of single words, pairs of words (bigrams), and triads of words (trigrams) were graphed for all reviews scraped for Boston. Single words were generally names or words associated with the attractions, but bigrams showed a visually interesting trend (Graph 1): certain word pairings were found more often in reviews giving poor ratings.
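As a rough sketch of the counting step, the snippet below tallies word n-grams per rating group using only the standard library. The helper name `ngrams`, the tokenizing regular expression, and the sample reviews are all hypothetical. The original analysis may well have used an NLP library and removed stop words, which this sketch does not do.

```python
import re
from collections import Counter, defaultdict

def ngrams(text, n=2):
    """Return the lowercase word n-grams in `text` as a list of tuples."""
    words = re.findall(r"[a-z']+", text.lower())
    return list(zip(*(words[i:] for i in range(n))))

# Hypothetical (rating_group, review) pairs.
sample = [
    ("Poor", "The gift shop was the best part."),
    ("Poor", "Skip the tour and visit the gift shop."),
    ("5 stars", "Our tour guide was fantastic."),
]

# One Counter of bigrams per rating group.
bigram_counts = defaultdict(Counter)
for rating, review in sample:
    bigram_counts[rating].update(ngrams(review, 2))

print(bigram_counts["Poor"].most_common(3))
```

Passing `n=3` to the same helper yields trigrams, so one function covers all three counts described above.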

This pattern was most apparent for the word pairing: “gift” and “shop.” All but one of the poorly rated reviews containing the word pairing of “gift” and “shop” were for The Boston Tea Party Ships & Museum (BTPSM) as noted in Graph 2.
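A breakdown like the one in Graph 2 could be produced by counting, per attraction, the poorly rated reviews that contain a given word pairing. This is only a sketch: the helper `contains_phrase` and the sample rows are hypothetical, not taken from the scraped data.

```python
import re
from collections import Counter

def contains_phrase(text, phrase):
    """True if the words of `phrase` appear consecutively in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    target = phrase.lower().split()
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))

# Hypothetical (attraction, rating_group, review) rows.
rows = [
    ("Boston Tea Party Ships & Museum", "Poor", "Only the gift shop was worth it."),
    ("Boston Tea Party Ships & Museum", "Poor", "Rushed tour. Nice gift shop though."),
    ("Fenway Park", "Poor", "We waited by the gift shop for ages."),
    ("Fenway Park", "5 stars", "Loved the gift shop!"),
]

# Count poorly rated reviews mentioning "gift shop", grouped by attraction.
poor_gift_shop = Counter(
    attraction
    for attraction, rating, review in rows
    if rating == "Poor" and contains_phrase(review, "gift shop")
)
print(poor_gift_shop)
```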

Graph 1: Bigram Count by Rating

Graph 2: Count of Word Pairing for “Gift” and “Shop” in Poor Reviews

Boston Tea Party Ships & Museum's Gift Shop Analysis

Because this word pairing was mentioned so often in reviews that gave poor ratings, one might expect that the gift shop needed much improvement. However, closer inspection of the actual reviews indicated that the gift shop was not the issue; rather, it was the best that BTPSM had to offer. One example is shown in Figure 2 below.

Figure 2: Review of Poor Remarks for BTPSM by DavvaW

Although Graph 3 shows that the negative reviews for BTPSM were only mildly negative in polarity, the review in Figure 2 indicates that the attraction itself was not good and that the gift shop was the only part of the attraction that was worth the user’s while. In fact, a careful read-through of the reviews rating BTPSM poorly would demonstrate that the overall sentiment regarding the attraction leaned towards the negative.

Sentiment Analysis was unable to distinguish the negative views of the main attraction from the positive views of the gift shop, which reinforces the importance of the ratings and encourages combing through the reviews. In the one poorly rated review for Fenway Park that contained the pairing “gift” and “shop,” the reviewer purchased an upgrade of the tour allowing them on the field, but the activity on the field seemed limited (Figure 3).
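The toy example below illustrates why a single polarity score per review can hide mixed opinions. It is not the sentiment tool used in this analysis, which the post does not name. The six-word lexicon, the function names, and the sample review are all invented for illustration, and scoring sentence by sentence is just one possible way to separate the two opinions.

```python
import re

# Tiny illustrative lexicon (assumption); real tools use far larger ones.
LEXICON = {
    "great": 1.0, "lovely": 1.0, "best": 1.0,
    "boring": -1.0, "rushed": -1.0, "disappointing": -1.0,
}

def polarity(text):
    """Mean lexicon score of the scored words in `text`; 0.0 if none."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

def polarity_by_sentence(text):
    """Score each sentence separately so mixed opinions stay visible."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    return [(s, polarity(s)) for s in sentences]

review = "The tour was rushed and boring. The gift shop was great and lovely."

print(polarity(review))               # whole-review score
for sentence, score in polarity_by_sentence(review):
    print(f"{score:+.1f}  {sentence}")
```

Here the whole-review score averages out to neutral even though one sentence is clearly negative and the other clearly positive, which mirrors the gift shop pattern described above.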

Graph 3: Polarity of Reviews Giving BTPSM Poor Ratings

Figure 3: Review with “Gift” and “Shop” Pairing Giving Fenway Park a Poor Rating

Fenway Park Tour Analysis

A closer look at the comments for Fenway Park seemed to indicate that the reviews for games tended to be better than reviews for the tour. This is more apparent when delving into the reviews that rated attractions poorly and mentioned the word pairing “tour” and “guide” (Graph 4). Most of these reviews mentioned some issue with the tour guide, including the guide’s lack of awareness or training. Other reviews mentioned that the tour lacked a substantial experience of the park (such as going to the dugout or seeing the press box). At times, there were complaints about a lack of planning and scheduling of the tours (Figure 4).

Graph 4: Polarity of Reviews Containing "Tour" and "Guide"

Figure 4: Review with “Tour” and “Guide” Pairing Giving Fenway Park a Poor Rating

One interesting observation concerned the trigram “waste,” “money,” and “time.” This particular trigram appears in only three reviews for Fenway Park and BTPSM, though it may reflect a possible need for change at these two attractions.

Possible Improvements for Attractions

Possible improvements for BTPSM, despite its being a money-making attraction, would be easing time constraints and improving scheduling. Some complaints in reviews mentioned not being able to see all the artifacts in the museum. Staggering times for tourists with children and those without could also help, but it would require actors who know the history of the Tea Party well enough to keep adults informed about the history behind the attraction.

For Fenway, allowing tourists to see the hidden aspects of the field and park could help improve tourists’ view of the park since the tour seems to be less of an experience than an actual baseball game. Adding a part of the tour that allows visitors to visit the dugout, locker rooms, or past Hall of Famers and trophies would be a good addition to the routine.

Other possible ways to incorporate the experience of a ballgame into the tour could include discounted tickets to a Red Sox game with the purchase of a tour ticket. Having players practice or warm up during a tour, or even having someone from management address the tour group, could also help improve the experience. Proper training of tour guides or providing tour guides with notes could help the experience even more.

Possible Routes for Further Data Analysis

In the future, analyzing the patterns of how reviews change over time and looking for repetitive patterns or new patterns that develop could help attractions increase revenue. Also, analyzing possible responses to reviews throughout the years could give an indication of whether the attractions have been taking proper steps to increase revenue. Even if attractions are not focused on reviews, noting trends could help more when coupled with targeted fixes based on reviews and ratings.
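As one possible starting point for such a time-based analysis, reviews could be grouped by the month they were posted and summarized. The sketch below uses Pandas with hypothetical column names and made-up sample values; it has not been run on the scraped data.

```python
import pandas as pd

# Hypothetical sample of posted dates and star ratings.
reviews = pd.DataFrame({
    "posted_date": pd.to_datetime(
        ["2020-01-05", "2020-01-20", "2020-02-03", "2020-02-25"]
    ),
    "rating": [5, 3, 2, 4],
})

# Mean rating and review count for each calendar month.
monthly = (
    reviews
    .groupby(reviews["posted_date"].dt.to_period("M"))["rating"]
    .agg(["mean", "count"])
)
print(monthly)
```

The same grouping could be applied per attraction or per rating group to look for recurring or newly emerging patterns.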

 

About Author

Theodore

Theodore is a jack of many trades and an expert in overthinking. He has worked in healthcare, healthcare administration, and finance and has experience in medical research. Having volunteered with medical missions abroad, managed building a new primary...