Do Tourists Review Differently on Yelp: An NLP Classification Task

Posted on Nov 6, 2018


Every day, millions of restaurant-goers record their experiences and express their opinions on the web. Given the enormous amount of money spent on tourism in the United States, identifying these tourists and their preferences is a worthy endeavor. My project started with the hypothesis that tourists to a region generally have different expectations, preferences, and satisfaction thresholds than locals from the same region, and that this should be reflected in the speech behavior and utterance patterns of internet users. Using Yelp as a corpus for sentiment analysis and opinion mining, I set out to confirm that there are inherent linguistic differences between the reviews of tourists and locals that allow traditional machine learning classifiers to accurately predict whether a review was written by a tourist. The workflow for this project is represented in the following diagram:

The Data: Gathering the Corpus

I decided to restrict my research to English-language restaurant reviews written by tourists in the United States. However, similar techniques could be applied to other languages and domains. To this end, I coded a web crawler to scrape reviews for my corpus from five of the most visited and most populous areas:

  • New York, NY
  • Chicago, IL
  • Las Vegas, NV
  • Los Angeles, CA
  • Orlando, FL

This allows for equal sampling of local and tourist reviews from a range of cities. The final scrape yielded 53,000 reviews across 42,800 URLs. To avoid over-representing cities with more reviews, I took a sample of 6,859 reviews from each area. While scraping, I also labeled the data for training and testing by comparing each user's location against the location of the business being reviewed.
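The labeling and balanced sampling steps can be sketched roughly as follows. This is a minimal illustration, not the original crawler code: the record fields (`area`, `user_location`, `business_location`), the helper names, and the exact-match rule are all assumptions, and real Yelp locations would likely need fuzzier matching (neighborhoods, suburbs, and so on).

```python
import random
from collections import defaultdict


def label_review(user_location, business_location):
    """Label a review 'local' when the reviewer's stated location matches
    the business location, otherwise 'remote'.
    Assumption: a case-insensitive exact match is good enough here."""
    same = user_location.strip().lower() == business_location.strip().lower()
    return "local" if same else "remote"


def balanced_sample(reviews, per_area, seed=0):
    """Draw up to `per_area` reviews from each area, without replacement,
    so that no single area dominates the corpus."""
    by_area = defaultdict(list)
    for review in reviews:
        by_area[review["area"]].append(review)

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = []
    for area in sorted(by_area):
        items = by_area[area]
        sample.extend(rng.sample(items, min(per_area, len(items))))
    return sample


# Hypothetical toy records, for illustration only.
toy_reviews = [
    {"area": "Orlando, FL", "user_location": "Orlando, FL", "business_location": "Orlando, FL"},
    {"area": "Orlando, FL", "user_location": "Boston, MA", "business_location": "Orlando, FL"},
    {"area": "Orlando, FL", "user_location": "Austin, TX", "business_location": "Orlando, FL"},
    {"area": "Chicago, IL", "user_location": "Chicago, IL", "business_location": "Chicago, IL"},
    {"area": "Chicago, IL", "user_location": "Denver, CO", "business_location": "Chicago, IL"},
]

for review in toy_reviews:
    review["label"] = label_review(review["user_location"], review["business_location"])

toy_sample = balanced_sample(toy_reviews, per_area=2)
```

With the figures from this post, `per_area` would be 6,859.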

Corpus Analysis

First, I analyzed the distribution of word frequencies. The graph below shows the forty most common words in the corpus, as well as the size of the vocabulary (the number of unique words) and the number of hapax legomena (words that occur exactly once):
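These corpus statistics are straightforward to compute. The sketch below assumes a simple lowercase regex tokenizer, which is probably cruder than whatever tokenization was actually used; the function names and toy texts are illustrative only.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word-like tokens.
    Assumption: a simple regex tokenizer stands in for the real one."""
    return re.findall(r"[a-z0-9']+", text.lower())


def corpus_stats(texts, top_n=40):
    """Return the most common words, the vocabulary size, and the number
    of hapax legomena (words that occur exactly once)."""
    counts = Counter(token for text in texts for token in tokenize(text))
    hapax_count = sum(1 for freq in counts.values() if freq == 1)
    return {
        "top": counts.most_common(top_n),
        "vocabulary_size": len(counts),
        "hapax_count": hapax_count,
    }


# Hypothetical toy corpus, for illustration only.
toy_texts = ["The food was great", "The service was slow", "Great tacos"]
stats = corpus_stats(toy_texts, top_n=3)
```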

Then I checked that the distribution followed Zipf’s law, which states that the frequency of any word type in a natural language corpus is inversely proportional to its rank in the frequency table. This is easy to check on the corpus of Yelp reviews by plotting the frequencies of the word types in rank order on a log-log graph.
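Besides plotting, one rough numerical check is to fit a straight line to log(frequency) against log(rank): under an idealized Zipf's law the slope is close to -1. The sketch below uses a plain least-squares fit with hypothetical counts; the helper names are assumptions, and a real corpus will only approximate this slope.

```python
import math
from collections import Counter


def rank_frequency(counts):
    """Return (rank, frequency) pairs, with rank 1 for the most frequent type."""
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))


def loglog_slope(pairs):
    """Least-squares slope of log(frequency) against log(rank).
    Needs at least two ranks to be defined."""
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Hypothetical counts that follow frequency = 1200 / rank exactly.
toy_counts = Counter({"w1": 1200, "w2": 600, "w3": 400, "w4": 300, "w5": 240, "w6": 200})
slope = loglog_slope(rank_frequency(toy_counts))  # approximately -1 for ideal Zipfian data
```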

I used the Stanford Log-linear Part-Of-Speech Tagger, a Java implementation of the log-linear maximum entropy part-of-speech taggers described in Toutanova et al. (2003), as a POS-tagger for the corpus.

This allowed me to do a pairwise POS-tag analysis on the corpus to find possible indicators for classification, following Pak & Paroubek (2010):

P^T_{1,2} = (N^T_1 - N^T_2) / (N^T_1 + N^T_2)

where N^T_1 and N^T_2 denote the number of occurrences of tag T in local and remote reviews, respectively. The following graph represents the P^T values for local vs. remote reviews:

The Stanford POS-tagger uses the Penn Treebank tag set, whose documentation describes each tag’s abbreviation.
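The pairwise tag comparison can be sketched as below, assuming the normalized difference (N1 - N2) / (N1 + N2) from Pak & Paroubek (2010), with local reviews as set 1 and remote reviews as set 2. Positive scores then mean a tag is more common in local reviews. The function name and toy tag sequences are hypothetical, and raw counts are only comparable when the two sets are of similar size.

```python
from collections import Counter


def pairwise_tag_scores(local_tags, remote_tags):
    """Compute (N1 - N2) / (N1 + N2) for every POS tag, where N1 and N2
    are the tag's counts in local and remote reviews respectively."""
    n_local = Counter(local_tags)
    n_remote = Counter(remote_tags)
    scores = {}
    for tag in set(n_local) | set(n_remote):
        total = n_local[tag] + n_remote[tag]  # always > 0 for tags in the union
        scores[tag] = (n_local[tag] - n_remote[tag]) / total
    return scores


# Hypothetical tag sequences, for illustration only.
toy_local = ["VBZ", "VBZ", "VBZ", "VBD", "RB"]
toy_remote = ["VBZ", "VBD", "VBD", "VBD"]
tag_scores = pairwise_tag_scores(toy_local, toy_remote)
```

A score of 1 or -1 means the tag occurs in only one of the two sets.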

You can see that verbs in the singular present (VBZ, VBP) are more common in local reviews, while verbs in the simple past are more common in remote reviews. This is consistent with the intuition that locals are in a more salient position to speak in the present or the future than tourists. The graph also shows that adverbs (RBS) are more indicative of local speech. Adverbs can express information about the time, manner, degree, place, or frequency of a state or event. Many adverbs of time presuppose recurrence (e.g. ‘never’, ‘lately’, ‘often’, ‘recently’, ‘Tuesday’, ‘today’, ‘yet’, ‘soon’, ‘always’, etc.). One conjecture is that people whose assumed audience is local are more likely to use adverbs of this variety. This intuition is confirmed later, once I introduce the notion of salience, which measures how likely an n-gram is to appear in one category rather than the other.

Predictive Modeling

For features, I constructed an n-gram model of the corpus that included both unigrams and bigrams, with a combined vocabulary of 434,103 n-grams. To increase accuracy and decrease noise, I implemented Pak & Paroubek’s (2010) strategy of calculating salience to filter out common n-grams. Salience is calculated as follows:
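As a minimal sketch, assuming the two-class form of Pak & Paroubek's salience, 1 - min(c_local, c_remote) / max(c_local, c_remote), computed on raw counts from equally sized classes. The function names, the toy counts, and the threshold value of 0.5 are all hypothetical; the actual threshold used in this project is not stated.

```python
from collections import Counter


def ngrams(tokens):
    """Return the unigrams of a token list followed by its bigrams."""
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return list(tokens) + bigrams


def salience(count_a, count_b):
    """Two-class salience: 0 when an n-gram is equally common in both
    classes, approaching 1 when it appears almost only in one of them.
    Assumption: raw counts are comparable because classes are balanced."""
    high = max(count_a, count_b)
    if high == 0:
        return 0.0
    return 1.0 - min(count_a, count_b) / high


def filter_by_salience(local_counts, remote_counts, threshold):
    """Keep only n-grams whose salience reaches the threshold."""
    vocabulary = set(local_counts) | set(remote_counts)
    return {
        gram
        for gram in vocabulary
        if salience(local_counts[gram], remote_counts[gram]) >= threshold
    }


# Hypothetical counts, for illustration only.
toy_local_counts = Counter({"often": 36, "the": 100, "disney": 5})
toy_remote_counts = Counter({"often": 11, "the": 98, "disney": 43})
kept = filter_by_salience(toy_local_counts, toy_remote_counts, threshold=0.5)
```

Here the near-uniform n-gram "the" is dropped, while "often" and "disney" survive because their counts are strongly skewed toward one class.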

The following table includes some n-grams with a high salience:

n-gram     salience    class
disney     0.883721    remote
hotel      0.812500    remote
nyc        0.810526    remote
often      0.694444    local
parking    0.685000    local

I used a salience threshold to filter out common n-grams. Then I built a classifier that used the filtered n-gram model and the distribution of POS frequencies to estimate whether a review was written by a tourist or a local. The final classifier was evaluated on the test data and produced the following confusion matrix, normalized across true labels.
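Normalizing a confusion matrix across true labels means dividing each row by the number of examples that truly belong to that class, so every row sums to 1. A rough sketch with hypothetical predictions (the function name and toy labels are assumptions, not the project's actual results):

```python
from collections import Counter


def normalized_confusion(y_true, y_pred, labels=("local", "remote")):
    """Build a confusion matrix as nested dicts, matrix[true][predicted],
    with each row divided by the number of examples of that true label."""
    counts = Counter(zip(y_true, y_pred))
    matrix = {}
    for true_label in labels:
        row_total = sum(counts[(true_label, pred)] for pred in labels)
        matrix[true_label] = {
            pred: (counts[(true_label, pred)] / row_total if row_total else 0.0)
            for pred in labels
        }
    return matrix


# Hypothetical labels and predictions, for illustration only.
toy_true = ["local", "local", "local", "local", "remote", "remote"]
toy_pred = ["local", "local", "local", "remote", "remote", "local"]
confusion = normalized_confusion(toy_true, toy_pred)
```

The diagonal entries of the normalized matrix are the per-class recall values.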
