Are Yelp Reviews Harsher in Your City?

Sammy Dolgin
Posted on Oct 21, 2019

Since moving to New York City a month ago, I've been able to derive many insights regarding the cultural differences between the Midwest (where I had spent my whole life prior to the move) and the Big Apple. The people here are direct and straight to the point. The walking speed is empirically and undeniably more brisk. The weather? Similarly volatile. But what stood out to me off the bat were the patterns I noticed in the way people review restaurants on Yelp.

Yelp is one of the largest crowd-sourced review platforms on the internet, and it is especially convenient when you're trying to figure out where to eat. Users rate any restaurant on a scale of 1 to 5 stars. Yelp aggregates and averages these individual ratings into a single, all-encompassing score for each restaurant, displayed to the nearest half star (4.5, 3.5, and so on) as the restaurant's official rating. Needless to say, restaurants have a substantial incentive to keep a good score on display in order to attract customers.

Living in the Midwest, we all knew which restaurants were the fan favorites, and the Yelp scores typically reflected each city's affection. In New York, however, what I observed in my first few weeks was that many of the staple, highly popular restaurants were sporting labels of 4.0 or 3.5 stars. I've eaten enough folded pizza and bagels to know that the food in NYC is fantastic, so I suspected that this had much less to do with the quality of the food and was more a reflection of New Yorkers' tendency to be critical of their restaurants. Thus, I decided to conduct research around the following question: do geographic, demographic, and/or cultural factors influence the way cities review their restaurants?


Approach #1: Comparing review ratings between the top 5 most populated U.S. cities


To start my analysis, I focused on comparing scores between the top 5 most populated U.S. cities: New York City, Los Angeles, Chicago, Houston, and Phoenix. The data that was scraped and analyzed encompasses the top 120 most reviewed restaurants within each city.
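The post links its actual scraping code at the end, so the following is only a hypothetical sketch of how a "top 120 most reviewed" list could be pulled, assuming the public Yelp Fusion API. Its /businesses/search endpoint supports sorting by review count but caps each request at 50 results, so 120 restaurants would take three paged requests:

```python
# Hypothetical paging plan for the Yelp Fusion API /businesses/search
# endpoint (the original project's scraper may work differently).
# Each returned dict is the query-parameter set for one request.

def search_pages(location, total=120, page_size=50):
    """Build the query parameters for each paged search request."""
    pages = []
    for offset in range(0, total, page_size):
        pages.append({
            "location": location,
            "categories": "restaurants",
            "sort_by": "review_count",   # most reviewed first
            "limit": min(page_size, total - offset),
            "offset": offset,
        })
    return pages

# Each parameter dict would then be sent as, e.g.:
# requests.get("https://api.yelp.com/v3/businesses/search",
#              headers={"Authorization": f"Bearer {API_KEY}"},
#              params=params)
for params in search_pages("New York, NY"):
    print(params["offset"], params["limit"])
```

For 120 restaurants this yields three pages at offsets 0, 50, and 100, with limits of 50, 50, and 20.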



The above violin plot allows us to visually analyze the density at each rating: the wider the plot at a given score, the more restaurants hold that score. While the cities' distributions all follow a roughly normal pattern centered around a 4.0 score, some standout characteristics emerge, such as the wider spread of scores among Houston's restaurants.
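As a minimal sketch of how a plot like this can be produced, the snippet below draws violins from synthetic half-star scores (the distributions here are made up for illustration; the post's real figure uses the scraped top-120 lists):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic half-star ratings standing in for the scraped data.
rng = np.random.default_rng(0)
cities = ["NYC", "LA", "Chicago", "Houston", "Phoenix"]
scores = [np.clip(np.round(rng.normal(4.0, 0.4, 120) * 2) / 2, 1.0, 5.0)
          for _ in cities]

fig, ax = plt.subplots(figsize=(8, 4))
ax.violinplot(scores, showmedians=True)   # width = density at each score
ax.set_xticks(range(1, len(cities) + 1))
ax.set_xticklabels(cities)
ax.set_ylabel("Yelp score (stars)")
fig.savefig("violins.png")
```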



True to my initial hypothesis, we do observe New York restaurants as having the lowest average scores. Still, I knew there would be limitations to comparing the means of ordinal data, so I wanted to look at the data with a more categorical methodology.




The above bar chart clearly illustrates how scores cluster around 4.0; in fact, 61.6% of all restaurant scores across these cities sat at exactly 4.0. What immediately stood out to me was the lower concentration of New York scores at 4.5 stars or greater: just over 12% of NYC's top 120 restaurants scored 4.5 or above, compared to nearly 32% of Phoenix's. To examine this more closely, I chose to expand my analysis to a regional level.
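The per-city proportions above reduce to a simple grouped aggregation. A minimal sketch with a toy frame (the city/score values here are illustrative, not the scraped data):

```python
import pandas as pd

# Toy frame standing in for the scraped data: one row per restaurant,
# with its city and displayed Yelp score.
df = pd.DataFrame({
    "city":  ["NYC"] * 4 + ["Phoenix"] * 4,
    "score": [4.0, 4.0, 3.5, 4.5, 4.5, 4.5, 4.0, 5.0],
})

# Share of each city's restaurants scoring 4.5 stars or better:
# the boolean mask averages to a proportion within each city group.
high_share = (df["score"] >= 4.5).groupby(df["city"]).mean()
print(high_share)
```

For this toy data the shares come out to 0.25 for NYC and 0.75 for Phoenix.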


Approach #2: Comparing review ratings between regional clusters of urban populations



For my second approach, I took 20 cities with populations of 500,000 or more and split them into regional clusters, as illustrated above. Note that for simplicity I used the label "East," though it more accurately represents the Northeast region of the United States. As before, I am observing the top 120 most reviewed restaurants per city, so each region's sample contains 600 restaurants in total.



Similar to our findings with the individual cities, the East region exhibits the lowest average scores. Also in line with prior findings is the heightened standard deviation of southern scores, implying that highly populated cities in that region are delivering a generally wider range of scores. Once again, I wanted to drill into these findings by isolating each individual score as its own category.



Like the findings for New York City alone, the categorical breakdown of scores shows that the East region gives the lowest proportion of ultra-high reviews. To make these findings more conclusive, I chose to compare the proportions of ratings of 4.5 stars or greater between East region cities and non-East region cities, as illustrated below.



The test yields a Z-statistic of -3.22, corresponding to a p-value of 0.0013: if the two populations truly rated restaurants at 4.5+ at the same rate, a difference this large would arise by chance only about 0.13% of the time. Thus, we can comfortably reject the null hypothesis that the proportions are equal, and continue with the conclusion that the East region does indeed rate a lower proportion of its restaurants at 4.5 stars or more.
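A two-proportion z-test like this one is short enough to sketch directly. The counts below are hypothetical (the post reports z = -3.22 but does not list the raw tallies), so the resulting statistic differs from the post's; the mechanics are the same:

```python
from math import erfc, sqrt

def two_prop_ztest(x1, n1, x2, n2):
    """Two-sided pooled z-test for equality of two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)          # pooled success rate under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = erfc(abs(z) / sqrt(2))        # two-sided normal tail probability
    return z, p_value

# Hypothetical tallies: say 90 of 600 East-region restaurants scored
# 4.5+ (15%), versus 396 of 1800 restaurants elsewhere (22%).
z, p = two_prop_ztest(90, 600, 396, 1800)
print(f"z = {z:.2f}, p = {p:.4f}")
```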



This point is further emphasized when looking at the proportions of these ratings on a per-city basis. Three of the five lowest proportions of high reviews belong to East region cities, and every Eastern city awards these high ratings to fewer than 20% of its restaurants.



For my final observation within this approach, I wanted to take a look at ultra-low reviews, given that the East region exhibited the highest proportion of those. Seeing that this proportion is fairly skewed by the 9 ultra-low ratings in Baltimore, I chose to move forward without further statistical testing on these proportions. 


Approach #3: Comparing review ratings between mid-sized and large urban areas

For my final analytical approach, I grouped the cities by population. Specifically, I created a binary split: 20 cities under one "big city" umbrella (the same 20 cities with populations of 500,000 or more from Approach #2) against 20 "mid-sized" cities, each with a population between 100,000 and 300,000. Cities in this range include Boise, Lincoln, Little Rock, and Providence. Once again, I am looking at the top 120 most reviewed restaurants per city.



Unlike our previous violin plots that only illustrated minute differences between categories, we immediately observe clear differences in this plot. Namely, the big city category displays a very dense concentration of 4.0 scores when compared to the mid-sized cities, which exhibit a much wider range. This is further shown when comparing the standard deviations of the groups, where mid-sized cities have over a 25% greater standard deviation of scores than the big cities.



Broken down into a ranked list of cities, we see that big cities exhibit a slightly higher mean review than mid-sized cities. It's not until we gather further insights about the variance within the data that we understand the full picture.



We see above that the mid-sized city with the lowest standard deviation, Salt Lake City, still has a standard deviation higher than those of 12 of the big cities. What's more, of the 6 cities with the lowest standard deviation overall, 4 fall within the top 5 most populated U.S. cities. Ultimately, we observe a clear inverse relationship between a city's population and the variance in its restaurants' review scores. This falls in line with standard statistical theory: each displayed score is an average of individual reviews, and an increased population (which in this case means not just more people but more reviews per restaurant) yields averages that are narrower and more convergent on the population's mean.
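The convergence argument above can be demonstrated with a small simulation: the standard deviation of a mean of n i.i.d. reviews is sigma / sqrt(n), so cities whose restaurants accumulate more reviews should show tighter score distributions. The review behavior below is entirely made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
sigma = 1.0  # assumed spread of individual star ratings

# For each review volume, simulate 2,000 restaurants whose displayed
# score is the average of n_reviews individual reviews.
stds = {}
for n_reviews in (50, 500, 5000):
    avg_scores = rng.normal(4.0, sigma, size=(2000, n_reviews)).mean(axis=1)
    stds[n_reviews] = avg_scores.std()

for n_reviews, s in stds.items():
    print(n_reviews, round(s, 4))  # shrinks roughly as 1 / sqrt(n_reviews)
```

The spread of displayed scores drops by about a factor of sqrt(10) at each step, matching the theory.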



To bring further evidence to this point, I conducted a Chi-square goodness of fit test comparing the two population groupings across three scoring levels: low (< 4.0 stars), medium (exactly 4.0 stars), and high (> 4.0 stars). Going in, the expectation was that medium scores would be far more concentrated within the big cities we measured, while both low and high scores would be more densely observed within the mid-sized cities.



The test resulted in a Chi-square statistic of 164.20, with a p-value of 2.22 × 10^-36: if the two groups shared the same underlying proportions, differences this large would essentially never arise by chance. This provides strong statistical evidence that the two populations differ in their proportions of scoring levels.
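One way to run this kind of comparison is scipy's chi-square test on a 2×3 contingency table. The counts below are hypothetical (the post does not list its raw tallies, so this statistic differs from the reported 164.20), but the shape of the computation is the same:

```python
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = city-size group, columns = low (< 4.0),
# medium (exactly 4.0), and high (> 4.0) scores. Each group holds
# 20 cities x 120 restaurants = 2400 scores.
observed = [
    [520, 1480, 400],   # big cities: medium scores dominate
    [830,  900, 670],   # mid-sized cities: wider spread
]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.2e}")
```

With two groups and three scoring levels the test has (2-1) × (3-1) = 2 degrees of freedom.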



Finally, I wanted to observe the categorical differences in scoring proportions on a regional basis. Generally, each region carried over the characteristics we observed in the full population: mid-sized cities had more low-scoring reviews across all regions, big cities had more 4.0 reviews, and in every region except the South, mid-sized cities contained more high-scoring reviews. Coupled with evidence from the previous approaches, this suggests that large Southern cities share some characteristics with mid-sized cities.




My analysis produced a wide variety of insights, but more generally, I was comfortable concluding the following:


1. Geographic, demographic, and cultural factors do influence the way cities review their restaurants.

2. Major U.S. cities (population of 500,000 or above) in the Northeastern region of the U.S. are less likely than other regions to give a popular restaurant (top 120 most reviewed) a score of 4.5 stars or above.

3. Restaurants in mid-sized U.S. cities (population of 100,000 to 300,000) generally experience a wider range of review scores than major U.S. cities, which center more closely around a 4.0 rating.


After living in New York City for nearly a month, the initial hypothesis I formed about seeing lower restaurant ratings, while informal and potentially biased by a small sample of restaurants, did appear to have merit. The next time I dismiss a popular ramen restaurant because it doesn't have 4.5 stars, I'll have to think twice, knowing now how sparsely that score is actually given around these parts. If a restaurant wants a really high rating from a New Yorker, it will have to do a little extra to earn it.


Sammy Dolgin is a student at the NYC Data Science Academy, and a graduate of Loyola University Chicago's Quinlan School of Business. He can be contacted at [email protected]. All Python code used within this project for web scraping, data cleaning, analysis, and visualization can be found here.

