Restaurant Reviews as Foodborne Illness Indicators

Posted on Dec 6, 2015

Contributed by Brian Saindon. He took the NYC Data Science Academy 12-week full-time Data Science Bootcamp program from Sept 23 to Dec 18, 2015. This post is based on his fourth class project (due in the 8th week of the program).


The CDC estimates that every year nearly 1 in 6 Americans gets sick from a foodborne illness. If you have ever experienced this type of sickness, you understand that foodborne illness is serious and unpleasant. In fact, the CDC also estimates that 3,000 Americans die from a foodborne illness every year. You may have fallen ill shortly after visiting a new restaurant and decided never to go back. During your two days of sickness, you may have wished you had chosen a different restaurant. Is there any way you could have known, before eating there, that the restaurant might make you sick?

This is a difficult question to tackle using only publicly available sources. However, we can reframe it as one that publicly available data can answer more directly: Can restaurant reviews serve as a public health indicator? More specifically, can publicly available restaurant ratings be used to infer how well a restaurant handles its food? This project explores whether there is a relationship between a restaurant's rating and its food handling practices by asking two main questions.


  1. Do restaurants with A food grades (administered by the NYC DOH) have higher restaurant ratings compared to restaurants with B food grades?
  2. Do restaurants with a higher count of words associated with foodborne illness in their reviews tend to have higher restaurant ratings compared to restaurants with a lower count of words associated with foodborne illness?


  1. Using a sample of 106 restaurants, the results show that restaurants with higher restaurant grades (administered by the NYC DOH) tend to have higher restaurant ratings. The graph below shows that B-graded restaurants have an average rating of 3.7, whereas A-graded restaurants have an average rating of 4.25.
  2. Using the same sample of 106 restaurants, the results show that restaurants with no words related to foodborne illness in their comments tend to have higher average ratings than restaurants with one or more such words in their comments. The graph below shows that the average rating for restaurants with no foodborne illness flags is about 4.8, whereas the average rating for restaurants with more than one foodborne illness flag is around 3.6.



To answer the two main questions, I used two primary data sources: the NYC DOH and Yelp. I used Python for the entire process, from data extraction to descriptive analysis. The sections below outline the data sources and the code flow used to develop this project. My Python code is available on GitHub.

Data Sources

Code Approach - Python

  1. Scrape the 20 most recent Yelp reviews for each of the top 200 restaurants in Manhattan. This led to a total of 3,988 reviews from 200 restaurants.
  2. Create a foodborne illness flag if any of the following words associated with foodborne illness appears in the comment text: ['ill', 'foodborne', 'sick', 'vomit', 'sickness', 'poisoning', 'headache', 'fever', 'cramp']
  3. Download the restaurant grade data and select the latest grade per restaurant.
  4. Merge the restaurant reviews with the restaurant grade data by phone number. 106 restaurants matched.
  5. Compute summary statistics to answer the project's main questions.

Python Code Available On Github

Python Code Details

See below for snippets of the Python code developed for this project. For a complete view of my code, click on my GitHub link above.

Initially, I import several packages which will be leveraged downstream:

import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

First, I create a 'master list' of the URLs required to extract the names of the top 200 restaurants on Yelp. This list is processed by the code that follows.

master_list = set(list([',+NY&start=0',                 

Next, I run the above url list through a for loop in order to create a list of urls which will include the direct link to each restaurant's first page of reviews.

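final_url_list = []
for i in master_list:
    tmp = requests.get(i).text
    tmp = BeautifulSoup(tmp, 'html.parser')
    url_list = []
    # The loop source and body were lost in this listing; selecting the matching
    # anchors and collecting their href values is an assumed reconstruction.
    for a in tmp.select('a[href^="/biz/"]'):
        url_list.append(a['href'])
    # Drop duplicate links while preserving their order on the page.
    url_list = sorted(set(url_list), key=url_list.index)
    string = ''  # site prefix, left empty as in the original listing
    string2 = '?sort_by=date_desc'
    url_list = [string + x + string2 for x in url_list]
    # Assumed: accumulate the links from every results page.
    final_url_list.extend(url_list)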
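
The de-duplication step above keeps the first occurrence of each link in page order. As a small sketch of the same idea (the sample links are made up for illustration), `dict.fromkeys` gives the same result without the repeated `list.index` lookups:

```python
# Hypothetical links, standing in for the hrefs scraped from a results page.
links = ['/biz/a', '/biz/b', '/biz/a', '/biz/c', '/biz/b']

# Approach used in the post: sort the unique links by their first position.
by_index = sorted(set(links), key=links.index)

# Order-preserving alternative: dict keys keep insertion order (Python 3.7+).
by_dict = list(dict.fromkeys(links))

print(by_index)
print(by_dict)
```

Both lists contain each link once, in the order it first appeared. The difference only matters for long lists, where the `index` lookups become slow.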

Once I have a list of the URLs for the top 200 restaurants, I run final_url_list through a loop and use BeautifulSoup to extract the data relevant to this analysis. Specifically, the loop below extracts the date, comment text, rating, and reviewer location for each of a restaurant's 20 most recent reviews, along with the restaurant's name, postal code, and phone number:

frames = []
for k in final_url_list:
    reviews = requests.get(k).text
    reviews = BeautifulSoup(reviews, 'html.parser')
    restname1 = reviews.find('h1', {'itemprop': 'name'}).get_text().strip()
    comments1 = reviews.find_all('p', {'itemprop': 'description'})
    comments_string = [tag.get_text() for tag in comments1]
    comment_dt1 = reviews.find_all('meta', {'itemprop': 'datePublished'})
    comments_dt_string = [tag['content'] for tag in comment_dt1]
    rating_value = reviews.find_all('meta', {'itemprop': 'ratingValue'})
    rating_string = [tag['content'] for tag in rating_value]
    user_location = reviews.find_all('li', {'class': 'user-location'})
    user_location_string = [tag.get_text() for tag in user_location]

    postal_code = reviews.find('span', {'itemprop': 'postalCode'}).get_text().strip()
    phone = reviews.find('span', {'class': 'biz-phone'}).get_text().strip()
    # Keep only the digits of the phone number.
    phone = "".join(_ for _ in phone if _ in "1234567890")
    # One row per review: date, comment text, rating, reviewer location.
    df = pd.DataFrame(list(zip(comments_dt_string, comments_string, rating_string, user_location_string)))
    df['restaurant'] = restname1
    df['postal_code'] = postal_code
    df['phone'] = phone
    frames.append(df)
# DataFrame.append was removed in pandas 2.0, so combine the frames with concat.
df_list = pd.concat(frames, ignore_index=True)
# Order the columns to match the names assigned in the next step.
df_list = df_list[[0, 1, 2, 3, 'phone', 'postal_code', 'restaurant']]

The next lines of code create a foodborne illness flag if any of the following words appears in a restaurant's review: [' ill', 'foodborne', 'sick', 'vomit', 'sickness', 'poisoning', 'headache', 'fever', 'cramp']

df_list.columns = ['date', 'comment', 'rating', 'userloc', 'phone', 'zip', 'restaurant']
mylist = [' ill', 'foodborne', 'sick', 'vomit', 'sickness', 'poisoning', 'headache', 'fever', 'cramp']
pattern = '|'.join(mylist)
df_list['fbi_presence'] = df_list.comment.str.contains(pattern)
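
The flag above is a case-sensitive substring match, so it can miss "Sick" at the start of a sentence and can fire on unrelated words such as "cramped". The sketch below compares it with a stricter variant; the whole-word, case-insensitive pattern and the sample sentences are assumptions for illustration, not part of the original project:

```python
import re

TERMS = ['ill', 'foodborne', 'sick', 'vomit', 'sickness', 'poisoning',
         'headache', 'fever', 'cramp']

# Substring pattern in the style of the post (note the leading space in ' ill').
substring_pattern = re.compile('|'.join([' ill'] + TERMS[1:]))

# Assumed stricter variant: whole words only (optionally plural), any case.
word_pattern = re.compile(r'\b(?:' + '|'.join(TERMS) + r')s?\b', re.IGNORECASE)

def substring_flag(text):
    """True if any term appears anywhere in the text, case-sensitively."""
    return bool(substring_pattern.search(text))

def word_flag(text):
    """True if any term appears as a whole word, ignoring case."""
    return bool(word_pattern.search(text))

# Made-up review sentences.
samples = [
    'I got sick after the oysters.',
    'Sick for two days afterwards.',
    'The tables were cramped but the food was great.',
]
for s in samples:
    print(substring_flag(s), word_flag(s), s)
```

The stricter pattern trades one kind of error for another: it no longer flags "cramped", but it also skips inflected forms such as "vomiting", so the term list would need to be extended to cover them.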

A necessary part of this analysis is the NYC DOH data, which contains the restaurant grades for the city. The code below loads this data as a pandas DataFrame.

# pandas was already imported as pd above.
doh_data = pd.read_csv('DOHMH_New_York_City_Restaurant_Inspection_Results.csv')

For each restaurant, I only want the latest restaurant grade. The code below selects the latest restaurant grade administered to each restaurant.

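doh_data['phone'] = doh_data['PHONE']
# Parse the grade dates so that sorting is chronological rather than alphabetical.
doh_data['GRADE DATE'] = pd.to_datetime(doh_data['GRADE DATE'])
# sort_values returns a new DataFrame, so assign the result back.
doh_data = doh_data.sort_values(['CAMIS', 'GRADE DATE'], ascending=[True, False])
# After sorting, the first row per CAMIS holds the most recent grade.
doh_data_1 = doh_data.drop_duplicates('CAMIS')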
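
Selecting the latest grade depends on comparing the grade dates chronologically. The plain-Python sketch below makes that selection explicit on made-up rows (the MM/DD/YYYY date format is an assumption about the CSV). It parses the dates before comparing them, since comparing them as text would rank 12/01/2013 after 02/15/2015:

```python
from datetime import datetime

# Hypothetical rows; the date format is an assumption about the CSV export.
rows = [
    {'CAMIS': 1, 'GRADE': 'B', 'GRADE DATE': '12/01/2013'},
    {'CAMIS': 1, 'GRADE': 'A', 'GRADE DATE': '02/15/2015'},
    {'CAMIS': 2, 'GRADE': 'A', 'GRADE DATE': '06/30/2014'},
]

def latest_grades(rows):
    """Return {CAMIS: row}, keeping the row with the most recent grade date."""
    latest = {}
    for row in rows:
        when = datetime.strptime(row['GRADE DATE'], '%m/%d/%Y')
        best = latest.get(row['CAMIS'])
        if best is None or when > best[0]:
            latest[row['CAMIS']] = (when, row)
    return {camis: row for camis, (when, row) in latest.items()}

result = latest_grades(rows)
print(result[1]['GRADE'], result[2]['GRADE'])
```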

Time to match the data sources (ratings and DOH restaurant grades) so that we can begin our analysis:

reviews_grades = pd.merge(df_list, doh_data_1, on='phone', how='inner')
reviews_grades = pd.DataFrame(reviews_grades)
reviews_grades = reviews_grades[['CAMIS', 'rating','fbi_presence', 'date', 'comment', 'userloc', 'phone', 'zip', 'restaurant', 'GRADE', 'RECORD DATE', 'GRADE DATE']]
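
Because the merge is an exact match on phone, any formatting difference between the two sources would silently drop a restaurant. Below is a minimal sketch of a normalizing helper. The name `normalize_phone` and the sample numbers are hypothetical, and the post does not confirm whether the DOH file actually needs this treatment:

```python
import re

def normalize_phone(raw):
    """Keep digits only and drop a leading US country code, giving 10 digits."""
    digits = re.sub(r'\D', '', str(raw))
    if len(digits) == 11 and digits.startswith('1'):
        digits = digits[1:]
    return digits

# Hypothetical values showing two formats that should compare equal.
yelp_phone = '(212) 555-0100'
doh_phone = '+1 212-555-0100'
print(normalize_phone(yelp_phone) == normalize_phone(doh_phone))
```

Applying a helper like this to both phone columns before merging is one way to test whether formatting explains part of the 106-of-200 match rate.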

In order to move forward with the analysis, I must obtain each restaurant's average rating and the number of comments containing a foodborne illness flag for each restaurant:

# Ratings are scraped as text, so convert them to numbers before averaging.
df_list['rating'] = pd.to_numeric(df_list['rating'])
# reset_index turns the phone index into a regular column for the merges below.
restaurant_mean_rating = df_list.groupby('phone')['rating'].mean().reset_index()
restaurant_fbi_count = df_list.groupby('phone')['fbi_presence'].sum().reset_index()

The code below merges both summary measures into one data frame:

summary_merge = pd.merge(restaurant_mean_rating, restaurant_fbi_count, on='phone', how='inner')
summary_merge = pd.merge(summary_merge, doh_data_1, on='phone', how='inner')

Now that the data is ready for analysis, we first create density plots of the average restaurant rating, grouped by restaurant grade. This is the first graph described above.

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(15, 10))
# fill=True replaces the deprecated shade=True argument in recent seaborn releases.
sns.kdeplot(summary_merge[summary_merge.GRADE == 'A'].rating, fill=True, label='A Grade')
sns.kdeplot(summary_merge[summary_merge.GRADE == 'B'].rating, fill=True, label='B Grade')
#sns.kdeplot(summary_merge[summary_merge.GRADE == 'Z'].rating, fill=True, label='Z')
plt.xlabel('YELP RATING', fontsize=20)
plt.ylabel('DENSITY', fontsize=20)
plt.title('Average Yelp Rating: Grouped by Restaurant Grades (n=106)', fontsize=30)
plt.legend(loc='upper left', frameon=True, fontsize=20)
plt.xticks(fontsize = 20)
plt.yticks(fontsize = 20)

For a second visualization, I created another density plot of average restaurant ratings, grouped by the number of comments that included a foodborne illness flag.

import seaborn as sns
%matplotlib inline
plt.figure(figsize=(20, 10))
# fill=True replaces the deprecated shade=True argument in recent seaborn releases.
sns.kdeplot(summary_merge[summary_merge.fbi_presence == 0].rating, fill=True, label='0 FBI Flag')
sns.kdeplot(summary_merge[summary_merge.fbi_presence == 1].rating, fill=True, label='1 FBI Flag')
sns.kdeplot(summary_merge[summary_merge.fbi_presence > 1].rating, fill=True, label='>1 FBI Flag')
plt.xlabel('YELP RATING', fontsize=20)
plt.ylabel('DENSITY', fontsize=20)
plt.title('Average Yelp Rating: Grouped by FBI Flags (n=106)', fontsize=30)
plt.legend(loc='upper left', frameon=True, fontsize=20)
plt.xticks(fontsize = 20)
plt.yticks(fontsize = 20)

Conclusion & Next Steps

The results of this project suggest that average restaurant ratings may indicate how well a restaurant performs during a food inspection administered by the NYC Department of Health. Additionally, restaurants with more foodborne illness flags (as derived from the text of a restaurant's reviews) tend to have lower average ratings. These results only suggest that restaurant ratings may be correlated with foodborne illness outbreaks. To further investigate a possible association between restaurant reviews and foodborne illness outbreaks, additional statistical and study design methods should be considered to improve the validity and robustness of this project:

  • More Data
    • Additional review sources: Incorporate restaurant reviews from other sources in addition to Yelp.
    • Additional Restaurants: Include more than 200 restaurants in order to have a bigger sample size.
    • Improve match rate between restaurant grade data and restaurant review data: Use restaurant name and address.
  • Refine foodborne illness flag (FBI) classification: Perform a more thorough literature review to identify the best words to use for the FBI flag derivation.
  • Apply additional statistical methods and machine learning techniques:
    • Apply a t-test/ANOVA to identify whether the difference in average restaurant rating between restaurant grade groups or FBI flag groups is statistically significant.
    • Explore the ability to prospectively predict a foodborne illness outbreak or restaurant grade given previous restaurant review ratings. Potential supervised algorithms that could add value here are:
      • Logistic Regression
      • Classification Tree
      • Support Vector Machine
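
To illustrate the t-test suggested in the list above, here is a sketch of Welch's t statistic using only the standard library. The helper name `welch_t` and the ratings are made up for illustration, and the numbers are not the project's data:

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Made-up average ratings for illustration only.
grade_a = [4.5, 4.0, 4.5, 4.0, 4.5]
grade_b = [3.5, 4.0, 3.5, 3.0, 4.0]

t, df = welch_t(grade_a, grade_b)
print(round(t, 2), round(df, 1))
```

In practice, `scipy.stats.ttest_ind(a, b, equal_var=False)` computes the same statistic and also returns a p-value, which is what the significance question needs.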

It is important not to conclude from these results that poor restaurant reviews indicate whether an individual will experience a foodborne illness. With the above recommendations implemented, one can move closer to identifying whether restaurant ratings are statistically associated with foodborne illness outbreaks.

About Author

Brian Saindon

As a Health Data Scientist, Brian Saindon (MPH) leverages innovative data science tools to identify underlying patterns within healthcare systems. As a Health Data Analyst for Predilytics, he applied advanced statistics to predict disease likelihood, member disenrollment, member...