How do NYC restaurant health inspection results vary by location and cuisine?

Posted on Apr 28, 2016

Contributed by Ho Fai Wong. He is currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program, taking place from April 11th to July 1st, 2016. This post is based on his first class project, R visualization (due in the 2nd week of the program).

I. Introduction

NYC's Department of Health and Mental Hygiene (DOHMH) conducts unannounced inspections of restaurants at least once a year to check food handling, food temperature, personal hygiene and vermin control. Since 2010, NYC restaurants have had to prominently post their grade (e.g. A/B/C), which gives diners information for deciding where to eat and gives establishments an incentive to improve their hygiene.

I was interested in how a restaurant's location and type of cuisine could affect its inspection results, in order to be a better-informed diner in New York City. For example, some of my initial questions were:

  • Are Manhattan restaurants cleaner than those in Queens or the Bronx?
  • Do restaurants in Chinatown and Flushing have worse scores than those on the Upper East Side?
  • Do Chinese restaurants perform worse in health inspections than French or American restaurants?

The data exploration and visualization were conducted in R. The code can be found on GitHub.

II. Initial Data Preparation

Load and cleanup

For this exploratory analysis, I used the NYC DOHMH Restaurant Inspection Results as of April 13, 2016, which contained inspection results from 2010 onwards.

The original data was in the form of a single table containing, at a high level, restaurants (incl. zipcode and cuisine), inspection dates and individual violations. Since I was more interested in restaurant grades, I removed the violation-related information and deduplicated the data to obtain unique inspections (~150k inspections i.e. rows covering ~25k restaurants).

The data was loaded and given some initial cleanup, such as:

  • Change formats (e.g. date, factor, string)
  • Shorten text values
  • Add columns for future analysis
  • Fix some data issues

At this stage, 2 things are worth noting:

  • Created "New Grade": Each violation is assigned a certain number of points; at the end of an inspection, the total number of points is the restaurant's inspection score. The lower the score, the better the grade (e.g. A/B/C). However, not all inspections with a score lead to a grade (see NYC's grading process for details). For simplicity, I created a new grade variable based on the inspection score, using the same ranges as the official grades.
  • Focused on inspections with valid scores: Not all inspections lead to scores, such as inspections with a type of "Administrative", "Calorie Posting", "Smoke Free Air Act" or "Trans Fat". 1039 restaurants didn't have any inspections at all. 54 inspections led to negative scores, which shouldn't be the case based on the scoring process. Based on these observations, I focused this analysis on inspections that led to a score of zero or above.
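
The grade derivation and score filter described above can be sketched as follows. The original analysis was done in R; this is an illustrative Python sketch, not the author's code. The function names and the sample rows are hypothetical, and the cut-offs (0-13 for A, 14-27 for B, 28 and above for C) are assumed to match the official DOHMH grade ranges.

```python
from typing import Optional


def new_grade(score: Optional[int]) -> Optional[str]:
    """Map an inspection score to a letter grade (lower score = better grade)."""
    # Missing or negative scores are treated as invalid and get no grade.
    if score is None or score < 0:
        return None
    if score <= 13:   # assumed official range for an A
        return "A"
    if score <= 27:   # assumed official range for a B
        return "B"
    return "C"        # 28 points or more


def valid_inspections(inspections: list[dict]) -> list[dict]:
    """Keep only inspections whose score is zero or positive."""
    return [row for row in inspections
            if row.get("score") is not None and row["score"] >= 0]


# Made-up rows for illustration only (not taken from the DOHMH data).
sample = [
    {"camis": "1", "score": 9},
    {"camis": "2", "score": 21},
    {"camis": "3", "score": -1},    # negative score: dropped
    {"camis": "4", "score": None},  # inspection without a score: dropped
    {"camis": "5", "score": 35},
]
graded = [(row["camis"], new_grade(row["score"]))
          for row in valid_inspections(sample)]
print(graded)
```

Deriving the grade from the score means every scored inspection gets a comparable letter, including inspections that never received an official grade.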

Intermediate analysis tables

I created intermediate tables that were reused throughout the analysis:

  • Unique inspections from 2010-2016
  • Latest inspection for each restaurant
  • Unique list of restaurants (their unique ID is 'camis' in the data)
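
A minimal Python sketch of how these three tables could be derived from violation-level rows (the original work was done in R, so this is only an illustration). It assumes that an inspection is identified by the pair of 'camis' and inspection date, and that the score is repeated on every violation row of the same inspection; the field names other than 'camis', the helper names and the sample rows are hypothetical.

```python
from datetime import date


def unique_inspections(rows):
    """Drop violation detail and keep one row per (restaurant, inspection date)."""
    seen = {}
    for row in rows:
        key = (row["camis"], row["inspection_date"])
        # Keep the first row seen for each inspection (assumption: the score
        # is identical on every violation row of that inspection).
        seen.setdefault(key, {
            "camis": row["camis"],
            "inspection_date": row["inspection_date"],
            "score": row["score"],
        })
    return list(seen.values())


def latest_inspections(inspections):
    """Return the most recent inspection for each restaurant, keyed by camis."""
    latest = {}
    for insp in inspections:
        current = latest.get(insp["camis"])
        if current is None or insp["inspection_date"] > current["inspection_date"]:
            latest[insp["camis"]] = insp
    return latest


# Made-up violation-level rows: one row per violation.
rows = [
    {"camis": "1", "inspection_date": date(2015, 3, 1), "score": 12, "violation": "v1"},
    {"camis": "1", "inspection_date": date(2015, 3, 1), "score": 12, "violation": "v2"},
    {"camis": "1", "inspection_date": date(2016, 2, 9), "score": 20, "violation": "v1"},
    {"camis": "2", "inspection_date": date(2015, 7, 4), "score": 7, "violation": "v3"},
]

inspections = unique_inspections(rows)     # unique inspections
latest = latest_inspections(inspections)   # latest inspection per restaurant
restaurants = sorted(latest)               # unique list of restaurant IDs
print(len(inspections), restaurants)
```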

III. Grades and scores by location

Grades by borough

Since most New Yorkers are more familiar with the letter grades, I started the data exploration by visualizing the number of restaurants by borough and by their latest grade in the bar plot below. A few observations could be made at this point:

  • Manhattan had the largest number of restaurants of all the boroughs, which had to be taken into account moving forward so as not to overshadow trends in other boroughs
  • Most restaurants had an A grade across the 5 boroughs
  • The distribution of grades for each borough seemed relatively similar, so I dug deeper to find potential differentiators


Scores by borough

Since grades couldn't differentiate boroughs, I plotted restaurants by scores instead and used a density plot to account for the disparity in number of restaurants by borough. Immediately, a few points became apparent:

  • Most restaurants obtained a grade of A but fewer obtained a low score within the A range. Most restaurants seem to aim for an A grade since they are legally obligated to share their grade with the public (and can avoid fines in some scenarios), but have less incentive to further improve hygiene once they have achieved the A grade
  • There was a much sharper drop between the A/B grades than between the B/C grades, which seems understandable since most restaurants want to obtain an A grade, and having a B grade is already a setback
  • The distribution of restaurants by score was very similar for all boroughs, so I needed to keep digging deeper for differentiators
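
The reason a density plot helps here is that it normalizes each borough to the same total, so Manhattan's larger restaurant count does not dominate. The Python sketch below illustrates that normalization with binned shares rather than a kernel density estimate, which is a simpler stand-in and not what the original R plot computed. The function name, field names, bin width and sample rows are all hypothetical.

```python
from collections import Counter, defaultdict


def score_shares(latest, bin_width=5):
    """Return, per borough, the share of restaurants falling in each score bin."""
    counts = defaultdict(Counter)
    for row in latest:
        # Label each bin by its lower bound, e.g. scores 10-14 fall in bin 10.
        bin_start = (row["score"] // bin_width) * bin_width
        counts[row["boro"]][bin_start] += 1
    shares = {}
    for boro, bins in counts.items():
        total = sum(bins.values())
        # Dividing by the borough total makes boroughs of different sizes comparable.
        shares[boro] = {start: n / total for start, n in sorted(bins.items())}
    return shares


# Made-up latest scores for two boroughs of different sizes.
latest = [
    {"boro": "MANHATTAN", "score": 7},
    {"boro": "MANHATTAN", "score": 11},
    {"boro": "MANHATTAN", "score": 12},
    {"boro": "MANHATTAN", "score": 20},
    {"boro": "BRONX", "score": 9},
    {"boro": "BRONX", "score": 22},
]
shares = score_shares(latest)
for boro, bins in shares.items():
    print(boro, bins)
```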


Scores by zipcode

Since comparing grades and scores by borough didn't lead to any notable differentiation, I looked at the average scores by zipcode using the zip_choropleth function from the choroplethrZip package. This view helped answer some of my initial questions (e.g. Flushing restaurants have a higher average score, i.e. worse hygiene, than those on the Upper East Side) and could distinguish neighborhoods, though most average scores fell within the 9-12 range.
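
The aggregation behind such a map is a plain group-by average. Below is a minimal Python sketch of that step only, with the map rendering left out; the original was done in R. The helper name, field names and scores are made up for illustration and are not results from the data.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by(rows, key):
    """Average the inspection score of rows grouped by the given field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["score"])
    return {group: mean(scores) for group, scores in groups.items()}


# Made-up latest scores; the values are illustrative only.
latest = [
    {"zipcode": "10021", "score": 8},
    {"zipcode": "10021", "score": 12},
    {"zipcode": "11354", "score": 13},
    {"zipcode": "11354", "score": 17},
]
averages = mean_score_by(latest, "zipcode")
print(averages)
```

The same helper works for any grouping field, such as borough or cuisine, by changing the `key` argument.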


IV. What about inspection closures?

Scores don't actually tie directly to restaurant closures. For example, a restaurant could theoretically have only 1 violation with a total inspection score under 13, which would give it an A grade, but that violation could be a public health hazard leading to the restaurant's closure.

To explore inspection closures further, I defined the following ratios:

  • Inspection closure ratio: Percentage of inspections that lead to the restaurant being closed
  • Repeat closure ratio: Percentage of restaurants that were closed during more than one inspection cycle
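
The two ratios can be sketched in Python as below (the original analysis used R). Two assumptions are worth flagging: the repeat closure ratio is computed here over restaurants that were closed at least once, since the definition above does not state its denominator, and each inspection with a closure counts as one closure event. The `closed` flag, the function name and the sample rows are hypothetical.

```python
from collections import defaultdict


def closure_ratios(inspections):
    """Return (inspection closure ratio, repeat closure ratio) as fractions."""
    n_inspections = len(inspections)
    closures_per_restaurant = defaultdict(int)
    for insp in inspections:
        if insp["closed"]:
            closures_per_restaurant[insp["camis"]] += 1

    n_closures = sum(closures_per_restaurant.values())
    ever_closed = len(closures_per_restaurant)
    repeat_closed = sum(1 for n in closures_per_restaurant.values() if n > 1)

    # Share of inspections that led to a closure.
    inspection_ratio = n_closures / n_inspections if n_inspections else 0.0
    # Assumption: denominator is restaurants closed at least once.
    repeat_ratio = repeat_closed / ever_closed if ever_closed else 0.0
    return inspection_ratio, repeat_ratio


# Made-up inspections for three restaurants.
inspections = [
    {"camis": "1", "closed": True},
    {"camis": "1", "closed": False},
    {"camis": "1", "closed": True},
    {"camis": "2", "closed": True},
    {"camis": "2", "closed": False},
    {"camis": "3", "closed": False},
    {"camis": "3", "closed": False},
    {"camis": "3", "closed": False},
]
inspection_ratio, repeat_ratio = closure_ratios(inspections)
print(inspection_ratio, repeat_ratio)
```

To compare boroughs or cuisines, the inspections would first be split by that field and the function applied to each subset.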

Closures by borough

I first calculated the inspection and repeat closure ratios by borough:

  • While the inspection closure ratios ranged from ~1% to 2%, there was a wider range for the repeat closure ratio (~6% to 11%)
  • Brooklyn had the highest ratio of repeat closures while the Bronx had the highest ratio of inspection closures


Closures by cuisine

Similarly, I wanted to visualize the closure ratios by type of cuisine. The original data categorized restaurants into 84 cuisine types. For simplicity, I focused on the top 20 types which covered 80%+ of NYC restaurants and filtered down the data accordingly.
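
Picking the most common categories and checking how much of the data they cover can be sketched in Python as follows (the original was done in R). The function name and the sample cuisine list are made up for illustration; with the real data the call would use n=20.

```python
from collections import Counter


def top_categories(values, n):
    """Return the n most frequent values and the share of items they cover."""
    counts = Counter(values)
    top = [name for name, _ in counts.most_common(n)]
    covered = sum(counts[name] for name in top)
    coverage = covered / len(values) if values else 0.0
    return top, coverage


# Made-up cuisine labels, one per restaurant.
cuisines = ["American"] * 5 + ["Chinese"] * 3 + ["French"] + ["Pizza"]
top, coverage = top_categories(cuisines, n=2)
print(top, coverage)

# Filter the data down to restaurants in the selected cuisines.
kept = [c for c in cuisines if c in set(top)]
```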

Once again, while most cuisines have average scores in the 9-12 range, which would give them an A grade, the closure ratios differentiate cuisines. For example, Chinese restaurants have higher inspection closure ratios and repeat closure ratios than French and American restaurants, as I initially hypothesized. Many more comparisons and observations can be made from this scatterplot depending on your selected cuisine type!


Closures by cuisine and borough

Finally, what if we combined both dimensions of location and cuisine? Intuitively, certain cuisines could fare better or worse in health inspections depending on the neighborhood. To illustrate this, I used faceted bar plots of inspection closure ratios by borough with refactored cuisine types to order by descending counts of restaurants.

Once again, many observations can be made. For example, of the top 20 cuisines in Manhattan, Chinese restaurants have the highest inspection closure ratio. Also of note, Asian restaurants in the Bronx and Staten Island have an alarmingly high inspection closure ratio!


V. Conclusion


  • Displaying scores in addition to grades could further improve hygiene, since most restaurants already have an A and the grade alone gives them little reason to improve further
  • My specific initial questions were addressed...
    • Scores differentiate neighborhoods but not boroughs; closure ratios provide additional comparisons
    • Chinatown is on par with the Upper East Side but Flushing is not
    • Chinese restaurants have higher closure ratios than French and American restaurants
  • ... but more observations can be made based on your culinary and geographic selections, e.g.:
    • Brooklyn and the Bronx have the worst rates of repeat and inspection closures respectively
    • Asian restaurants in the Bronx and Staten Island have a much higher inspection closure ratio, etc

Further analysis

  • Investigate violation types: do certain neighborhoods or types of restaurants have recurring types of violations?
  • Compare with NYC demographic data by neighborhood
  • Analyze trends over time by neighborhood and cuisine
  • Correlate with popularity of restaurants

About Author


Ho Fai Wong

With a diverse background in computer science and 9 years in Financial Services Technology Consulting, Ho Fai has been applying his analytical, problem-solving, relationship and team management skills at PwC, one of the Big Four consulting firms, focusing...