Restaurant Reviews as Foodborne Illness Indicators

Posted on Dec 6, 2015

Contributed by Brian Saindon. He took the NYC Data Science Academy 12-week full-time Data Science Bootcamp program from Sept 23 to Dec 18, 2015. This post is based on his fourth class project (due in the 8th week of the program).


The CDC estimates that every year nearly 1 in 6 Americans gets sick from a foodborne illness. If you have ever experienced this type of sickness, you understand that foodborne illness is serious and unpleasant. In fact, the CDC also estimates that 3,000 Americans die from a foodborne illness every year. You may have experienced a foodborne illness shortly after visiting a new restaurant and decided never to go back. During your two days of sickness you may have wished that you had picked a different restaurant. Is there any way you could have known, before eating there, whether that restaurant would make you sick?

This is a difficult question to tackle using only publicly available sources. However, we can reframe it into one that publicly available data can answer more directly: Can you use restaurant reviews as a public health indicator? More specifically, can you use publicly available restaurant ratings to infer how well a restaurant handles its food? This project uses publicly available data to explore whether there is a relationship between a restaurant's rating and its food handling practices by asking two main questions.


  1. Do restaurants with A food grades (administered by the NYC DOH) have higher restaurant ratings compared to restaurants with B food grades?
  2. Do restaurants with a higher count of words associated with foodborne illness in their reviews tend to have higher restaurant ratings compared to restaurants with a lower count of words associated with foodborne illness?


  1. Using a sample of 106 restaurants, the results show that restaurants with higher restaurant grades (administered by the NYC DOH) tend to have higher restaurant ratings. The graph below shows that B grade restaurants have an average rating of 3.7, whereas A grade restaurants have an average rating of 4.25.
  2. Using the same sample of 106 restaurants, the results show that restaurants with no words related to foodborne illness in their comments tend to have higher average ratings than restaurants with 1 or more such words in their comments. The graph below shows that the average rating for restaurants with no foodborne illness flags is about 4.8, whereas the average rating for restaurants with more than 1 foodborne illness flag is around 3.6.



In order to answer the two main questions, I used two primary data sources: the NYC DOH and Yelp. I used Python for the entire process, from data extraction to descriptive analysis. The sections below outline the data sources and the code flow used to develop this project. My Python code is available on GitHub.

Data Sources

Code Approach - Python

  1. Scrape the 20 most recent Yelp reviews from the top 200 restaurants in Manhattan. This led to a total of 3,988 reviews from 200 restaurants.
  2. Create a foodborne illness flag if any of the following words associated with foodborne illness appear in the comment text: [' ill', 'foodborne', 'sick', 'vomit', 'sickness', 'poisoning', 'headache', 'fever', 'cramp']
  3. Download the restaurant grade data and select the latest restaurant grade per restaurant.
  4. Merge restaurant reviews with restaurant grade data by phone number. 106 restaurants matched.
  5. Compute summary statistics to answer the project's main questions.

Python Code Available On Github

Python Code Details

See below for snippets of the Python code developed for this project. For a complete view of my code, click on my GitHub link above.

Initially, I import several packages which will be leveraged downstream:

import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

First, I create a 'master list' of the URLs required to extract the names of the top 200 restaurants on Yelp. This list is consumed by the loop further down.

master_list = set(list([',+NY&start=0',                 

Next, I run the above url list through a for loop in order to create a list of urls which will include the direct link to each restaurant's first page of reviews.

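final_url_list = []
for i in master_list:
    tmp = requests.get(i).text
    tmp = BeautifulSoup(tmp, 'html.parser')
    url_list = []
    # Collect every link on the page that points at a business page
    for a in tmp.select('a[href^="/biz/"]'):
        url_list.append(a['href'])
    # Drop duplicate links while keeping their first-seen order
    url_list = sorted(set(url_list), key=url_list.index)
    string = ''
    string2 = '?sort_by=date_desc'
    url_list = [string + x + string2 for x in url_list]
    # Accumulate the review URLs from every search results page
    final_url_list.extend(url_list)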
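
The de-duplication and URL assembly in this loop can be tried in isolation. The sketch below is only an illustration: the hrefs are made up, and BASE_URL is a placeholder standing in for the site prefix, which is not shown in the snippet.

```python
# Hypothetical hrefs, as they might come back from a search results page.
hrefs = ['/biz/alpha-cafe', '/biz/beta-grill', '/biz/alpha-cafe', '/biz/gamma-deli']

# dict.fromkeys keeps the first occurrence of each key in insertion order,
# which gives the same result as sorted(set(x), key=x.index) in one pass.
unique_hrefs = list(dict.fromkeys(hrefs))

BASE_URL = 'https://example.com'    # placeholder, not the real prefix
SORT_SUFFIX = '?sort_by=date_desc'  # same suffix as in the loop

review_urls = [BASE_URL + h + SORT_SUFFIX for h in unique_hrefs]
for url in review_urls:
    print(url)
```

Sorting by date is what makes the first page of each restaurant hold its most recent reviews.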

Once I have a list of the URLs for the top 200 restaurants, I run final_url_list through a loop and use BeautifulSoup to extract the data relevant to this analysis. Specifically, the loop below extracts the date, comment text, rating and user location from each restaurant's 20 most recent reviews, along with the restaurant's name, postal code and phone number:

review_frames = []
for k in final_url_list:
    reviews = requests.get(k).text
    reviews = BeautifulSoup(reviews, 'html.parser')
    # Restaurant-level fields
    restname1 = reviews.find('h1', {'itemprop': 'name'}).get_text().strip()
    postal_code = reviews.find('span', {'itemprop': 'postalCode'}).get_text().strip()
    phone = reviews.find('span', {'class': 'biz-phone'}).get_text().strip()
    phone = "".join(c for c in phone if c in "1234567890")  # keep digits only
    # Review-level fields: one list entry per review on the page
    comments1 = reviews.find_all('p', {'itemprop': 'description'})
    comments_string = [tag.get_text() for tag in comments1]
    comment_dt1 = reviews.find_all('meta', {'itemprop': 'datePublished'})
    comments_dt_string = [tag['content'] for tag in comment_dt1]
    rating_value = reviews.find_all('meta', {'itemprop': 'ratingValue'})
    rating_string = [tag['content'] for tag in rating_value]
    user_location = reviews.find_all('li', {'class': 'user-location'})
    user_location_string = [tag.get_text() for tag in user_location]
    # zip() is lazy in Python 3, so materialize it before building the frame
    rows = list(zip(comments_dt_string, comments_string, rating_string, user_location_string))
    df = pd.DataFrame(rows, columns=['date', 'comment', 'rating', 'userloc'])
    df['phone'] = phone
    df['zip'] = postal_code
    df['restaurant'] = restname1
    review_frames.append(df)
# DataFrame.append was removed in pandas 2.0; concatenate once at the end instead
df_list = pd.concat(review_frames, ignore_index=True)

The next lines of code create a foodborne illness flag if any of the following words appear in a restaurant's review: [' ill', 'foodborne', 'sick', 'vomit', 'sickness', 'poisoning', 'headache', 'fever', 'cramp']

df_list.columns = ['date', 'comment', 'rating', 'userloc', 'phone', 'zip', 'restaurant']
mylist = [' ill', 'foodborne', 'sick', 'vomit', 'sickness', 'poisoning', 'headache', 'fever', 'cramp']
pattern = '|'.join(mylist)
df_list['fbi_presence'] = df_list.comment.str.contains(pattern)
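
Plain substring matching is simple but can over-match: 'sick' also fires on 'homesick', and ' ill' on ' illustrated'. As one possible refinement (an assumption, not part of the original analysis), the sketch below wraps the same keywords in word boundaries, allows a few common suffixes and ignores case. The helper name has_fbi_word and the sample comments are made up.

```python
import re

# Same keywords as in the flag ('ill' without the leading space, since the
# word boundary \b now does that job).
FBI_WORDS = ['ill', 'foodborne', 'sick', 'vomit', 'sickness',
             'poisoning', 'headache', 'fever', 'cramp']

# Allow a few common endings so 'vomiting', 'cramps' and 'illness' still count.
SUFFIX = r'(?:s|ed|ing|ness)?'
FBI_PATTERN = re.compile(
    r'\b(?:' + '|'.join(FBI_WORDS) + r')' + SUFFIX + r'\b',
    re.IGNORECASE,
)

def has_fbi_word(comment):
    """Return True if the comment contains one of the keywords as a whole word."""
    return FBI_PATTERN.search(comment) is not None

comments = [
    'Great pasta, but I was sick all night.',
    'I felt homesick after the lovely illustrated menu.',
    'My friend started vomiting an hour later.',
    'Best brunch in town!',
]
flags = [has_fbi_word(c) for c in comments]
print(flags)
```

The same compiled pattern can be passed to a pandas str.contains call, so the rest of the pipeline would not need to change.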

A necessary part of this analysis is the NYC DOH data, which contains the restaurant grade information for the city. The code below reads this data into a pandas dataframe.

doh_data = pd.read_csv('DOHMH_New_York_City_Restaurant_Inspection_Results.csv')

For each restaurant, I only want the latest restaurant grade. The code below selects the latest restaurant grade administered to each restaurant.

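doh_data['phone'] = doh_data['PHONE']
# Parse the grade date so that sorting is chronological, not alphabetical
doh_data['GRADE DATE'] = pd.to_datetime(doh_data['GRADE DATE'])
# sort_values returns a new frame, so the result has to be assigned back
doh_data = doh_data.sort_values(['CAMIS', 'GRADE DATE'], ascending=[True, False])
# After sorting, the first row per CAMIS is the most recent grade
doh_data_1 = doh_data.drop_duplicates('CAMIS')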
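
To see why the sort order matters when picking the latest grade, here is a minimal sketch on made-up inspection rows. The MM/DD/YYYY date format is an assumption about how the dates are stored; the CAMIS values and grades are invented.

```python
import pandas as pd

# Made-up inspection rows; the real file has many more columns.
inspections = pd.DataFrame({
    'CAMIS': [101, 101, 202, 202, 202],
    'GRADE': ['B', 'A', 'A', 'B', 'A'],
    'GRADE DATE': ['03/15/2014', '01/10/2015', '06/01/2013',
                   '11/20/2014', '02/05/2014'],
})

# Parse the dates first: as plain MM/DD/YYYY strings they would sort
# by month rather than chronologically.
inspections['GRADE DATE'] = pd.to_datetime(inspections['GRADE DATE'],
                                           format='%m/%d/%Y')

latest = (inspections
          .sort_values(['CAMIS', 'GRADE DATE'], ascending=[True, False])
          .drop_duplicates('CAMIS'))  # keeps the first, i.e. newest, row
print(latest)
```

For restaurant 101, sorting the raw strings would have put the March 2014 row ahead of the January 2015 one and returned the older B grade.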

Time to match the data sources (ratings and DOH restaurant grades) so that we can begin our analysis:

reviews_grades = pd.merge(df_list, doh_data_1, on='phone', how='inner')
reviews_grades = pd.DataFrame(reviews_grades)
reviews_grades = reviews_grades[['CAMIS', 'rating','fbi_presence', 'date', 'comment', 'userloc', 'phone', 'zip', 'restaurant', 'GRADE', 'RECORD DATE', 'GRADE DATE']]
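
Since only 106 of the 200 scraped restaurants matched, it can help to see which ones fell out of the inner join. The sketch below is one way to check this, using made-up phone numbers and restaurant names; it is not part of the original code.

```python
import pandas as pd

# Made-up stand-ins for the scraped restaurants and the DOH grades.
reviews = pd.DataFrame({
    'phone': ['2125550101', '2125550102', '2125550103'],
    'restaurant': ['Alpha Cafe', 'Beta Grill', 'Gamma Deli'],
})
grades = pd.DataFrame({
    'phone': ['2125550101', '2125550103', '2125550199'],
    'GRADE': ['A', 'B', 'A'],
})

# how='left' keeps every scraped restaurant; indicator=True adds a
# '_merge' column saying whether a grade record was found.
checked = pd.merge(reviews, grades, on='phone', how='left', indicator=True)
matched = int((checked['_merge'] == 'both').sum())
unmatched = checked.loc[checked['_merge'] == 'left_only', 'restaurant'].tolist()

print(f'{matched} of {len(checked)} restaurants matched')
print('Unmatched:', unmatched)
```

Inspecting the unmatched rows is a quick way to tell whether the misses come from phone formatting or from restaurants missing in the grade file.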

In order to move forward with the analysis, I must obtain each restaurant's average rating and the number of comments containing a foodborne illness flag for each restaurant:

# Ratings are scraped as strings, so convert them before averaging
df_list['rating'] = pd.to_numeric(df_list['rating'])
# reset_index turns 'phone' back into a regular column for the merges below
restaurant_mean_rating = df_list.groupby('phone')['rating'].mean().reset_index()
restaurant_fbi_count = df_list.groupby('phone')['fbi_presence'].sum().reset_index()

The below code merges both summary measures into one data frame:

summary_merge = pd.merge(restaurant_mean_rating, restaurant_fbi_count, on='phone', how='inner')
summary_merge = pd.merge(summary_merge, doh_data_1, on='phone', how='inner')
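
As a small worked example of the two summary measures, the sketch below computes both in a single groupby on made-up review rows. It is an alternative formulation under invented data, not the code used for the results.

```python
import pandas as pd

# Made-up review rows for two restaurants.
toy = pd.DataFrame({
    'phone': ['111', '111', '111', '222', '222'],
    'rating': ['5.0', '4.0', '3.0', '2.0', '4.0'],  # strings, as scraped
    'fbi_presence': [False, False, True, True, True],
})
toy['rating'] = pd.to_numeric(toy['rating'])

# Named aggregation: mean rating and number of flagged comments per phone.
summary = (toy.groupby('phone')
              .agg(rating=('rating', 'mean'),
                   fbi_presence=('fbi_presence', 'sum'))
              .reset_index())
print(summary)
```

Summing a boolean column counts its True values, which is why fbi_presence ends up as the number of flagged comments per restaurant.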

Now that we have the data ready for analysis, we first create a density plot of average restaurant rating, grouped by restaurant grade. This is the first image presented above.

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(15, 10))
# fill=True replaces the deprecated shade=True in recent seaborn versions
sns.kdeplot(summary_merge[summary_merge.GRADE == 'A'].rating, fill=True, label='A Grade')
sns.kdeplot(summary_merge[summary_merge.GRADE == 'B'].rating, fill=True, label='B Grade')
#sns.kdeplot(summary_merge[summary_merge.GRADE == 'Z'].rating, fill=True, label='Z')
plt.xlabel('YELP RATING', fontsize=20)
plt.ylabel('DENSITY', fontsize=20)
plt.title('Average Yelp Rating: Grouped by Restaurant Grades (n=106)', fontsize=30)
plt.legend(loc='upper left', frameon=True, fontsize=20)
plt.xticks(fontsize = 20)
plt.yticks(fontsize = 20)

For a second visualization, I created another density plot of average restaurant ratings, grouped by the number of comments that included a foodborne illness flag.

import seaborn as sns
%matplotlib inline
plt.figure(figsize=(20, 10))
# fill=True replaces the deprecated shade=True in recent seaborn versions
sns.kdeplot(summary_merge[summary_merge.fbi_presence == 0].rating, fill=True, label='0 FBI Flag')
sns.kdeplot(summary_merge[summary_merge.fbi_presence == 1].rating, fill=True, label='1 FBI Flag')
sns.kdeplot(summary_merge[summary_merge.fbi_presence > 1].rating, fill=True, label='>1 FBI Flag')
plt.xlabel('YELP RATING', fontsize=20)
plt.ylabel('DENSITY', fontsize=20)
plt.title('Average Yelp Rating: Grouped by FBI Flags (n=106)', fontsize=30)
plt.legend(loc='upper left', frameon=True, fontsize=20)
plt.xticks(fontsize = 20)
plt.yticks(fontsize = 20)

Conclusion & Next Steps

The results of this project suggest that average restaurant ratings may indicate how well a restaurant performs during a food inspection administered by the NYC Department of Health. Additionally, we see that restaurants with a higher number of foodborne illness flags (as derived from the text of a restaurant's reviews) tend to have lower average restaurant ratings. These results only suggest that restaurant ratings may be correlated with foodborne illness outbreaks. To further investigate a possible association between restaurant reviews and foodborne illness outbreaks, additional statistical and study design methods must be considered to improve the validity and robustness of this project:

  • More Data
    • Additional review sources: Incorporate restaurant reviews from other sources in addition to Yelp.
    • Additional Restaurants: Include more than 200 restaurants in order to have a bigger sample size.
    • Improve match rate between restaurant grade data and restaurant review data: Use restaurant name and address.
  • Refine foodborne illness flag (FBI) classification: Perform a more thorough literature review to identify the best words to use for the FBI flag derivation.
  • Apply additional statistical methods and machine learning techniques:
    • Apply a t-test/ANOVA to identify whether the difference in average restaurant rating between restaurant grade groups or FBI flag groups is statistically significant.
    • Explore the ability to prospectively predict foodborne illness outbreak or restaurant grade given previous restaurant review ratings.  Potential supervised algorithms that will add value to this recommendation are:
      • Logistic Regression
      • Classification Tree
      • Support Vector Machine
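
As a rough sketch of the t-test idea above, the snippet below computes Welch's t statistic (the unequal-variance form of the two-sample t-test) using only the standard library. The ratings are made up purely for illustration and are not the project's data; in practice a library routine such as scipy.stats.ttest_ind with equal_var=False would also return a p-value.

```python
from math import sqrt
from statistics import mean, variance

# Made-up average ratings for two grade groups, for illustration only.
ratings_a = [4.5, 4.0, 4.5, 4.0, 5.0, 4.5]
ratings_b = [3.5, 4.0, 3.0, 4.0, 3.5]

def welch_t(x, y):
    """Welch's t statistic for two samples with possibly unequal variances."""
    # statistics.variance is the sample variance (n - 1 in the denominator).
    standard_error = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / standard_error

t_stat = welch_t(ratings_a, ratings_b)
print(round(t_stat, 2))
```

A large absolute t statistic points toward a real difference in means, but it needs to be compared against the t distribution before drawing any conclusion.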

It is important not to conclude from these results that poor restaurant reviews indicate whether an individual will experience a foodborne illness. With the above recommendations implemented, one can move closer to identifying whether restaurant ratings are statistically associated with foodborne illness outbreaks.

About Author

Brian Saindon

As a Health Data Scientist, Brian Saindon (MPH) leverages innovative data science tools to identify underlying patterns within healthcare systems. As a Health Data Analyst for Predilytics, he applied advanced statistics to predict disease likelihood, member disenrollment, member...