How do NYC restaurant health inspection results vary by location and cuisine?
Contributed by Ho Fai Wong. He is currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program, taking place from April 11th to July 1st, 2016. This post is based on his first class project - R visualization (due in the 2nd week of the program).
I. Introduction
NYC's Department of Health and Mental Hygiene (DOHMH) conducts unannounced inspections of restaurants at least once a year to check food handling, food temperature, personal hygiene and vermin control. Since 2010, NYC restaurants have had to prominently post their Grade (e.g. A/B/C), which empowers diners with decision-making information and incentivizes establishments to improve their hygiene.
I was interested in how a restaurant's location and type of cuisine could affect its inspection results, in order to be a better-informed diner in New York City. For example, some of my initial questions were:
- Are Manhattan restaurants cleaner than those in Queens or the Bronx?
- Do restaurants in Chinatown and Flushing have worse scores than those in the Upper East Side?
- Do Chinese restaurants perform worse in health inspections than French or American restaurants?
The data exploration and visualization were conducted in R. The code can be found on GitHub.
II. Initial Data Preparation
Load and cleanup
For this exploratory analysis, I used the NYC DOHMH Restaurant Inspection Results as of April 13, 2016, which contained inspection results from 2010 onwards.
The original data was in the form of a single table containing, at a high level, restaurants (incl. zipcode and cuisine), inspection dates and individual violations. Since I was more interested in restaurant grades, I removed the violation-related information and deduplicated the data to obtain unique inspections (~150k inspections i.e. rows covering ~25k restaurants).
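As a rough illustration of this deduplication step, here is a minimal Python sketch (the analysis itself was done in R, so this is not the author's code). Apart from `camis`, the field names and the sample rows are assumptions made for illustration rather than the dataset's actual schema. It drops the violation-level fields and keeps one copy of each remaining row.

```python
# Hypothetical raw rows: one row per violation, as in the original single table.
raw_rows = [
    {"camis": "1001", "inspection_date": "2016-03-01", "score": 12, "violation_code": "04L"},
    {"camis": "1001", "inspection_date": "2016-03-01", "score": 12, "violation_code": "08A"},
    {"camis": "1001", "inspection_date": "2015-02-10", "score": 20, "violation_code": "02G"},
    {"camis": "2002", "inspection_date": "2016-01-15", "score": 7, "violation_code": "10F"},
]

# Assumed names of the violation-level fields to discard.
VIOLATION_FIELDS = {"violation_code", "violation_description"}


def unique_inspections(rows):
    """Drop violation fields, then keep the first copy of each remaining row."""
    seen = set()
    inspections = []
    for row in rows:
        trimmed = {k: v for k, v in row.items() if k not in VIOLATION_FIELDS}
        key = tuple(sorted(trimmed.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            inspections.append(trimmed)
    return inspections


inspections = unique_inspections(raw_rows)
print(len(raw_rows), "violation rows ->", len(inspections), "unique inspections")
```

Deduplicating on the whole trimmed row, rather than on a chosen key, avoids having to decide up front which columns identify an inspection.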
The code below was used to load the data and perform some initial cleanup such as:
- Change formats (e.g. date, factor, string)
- Shorten text values
- Add columns for future analysis
- Fix some data issues
At this stage, 2 things are worth noting:
- Created "New Grade": Each violation is worth a certain number of points; at the end of an inspection, the total number of points is the restaurant's inspection score. The lower the score, the better the Grade (e.g. A/B/C). However, not all inspections with a score lead to a Grade (see NYC's grading process for details). For simplicity, I created a new grade variable based on the inspection score, using the same ranges as the official grades.
- Focused on inspections with valid scores: Not all inspections lead to scores, such as inspections with a type of "Administrative", "Calorie Posting", "Smoke Free Air Act" or "Trans Fat". 1039 restaurants didn't have any inspections at all. 54 inspections led to negative scores, which shouldn't be the case based on the scoring process. Based on these observations, I focused this analysis on inspections that led to a score of zero or higher.
https://gist.github.com/hofaiwong/bf4b91ed4d364635298c20f0255af877
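The derived grade and the valid-score filter can be sketched in Python as follows. This is only an illustration, not the R code from the gist. It assumes the official ranges are 0-13 for A, 14-27 for B, and 28 or more for C; the function name `new_grade` and the sample scores are hypothetical.

```python
def new_grade(score):
    """Return a letter grade from an inspection score, or None if the score is invalid.

    Assumed official ranges: A = 0-13, B = 14-27, C = 28 or more.
    """
    if score is None or score < 0:
        return None  # missing or negative scores are treated as invalid
    if score <= 13:
        return "A"
    if score <= 27:
        return "B"
    return "C"


# Hypothetical scores: None stands for an inspection type that produces no score.
scores = [0, 13, 14, 27, 28, -2, None]

# Keep only inspections with a score of zero or higher, then grade them.
valid_scores = [s for s in scores if s is not None and s >= 0]
grades = [new_grade(s) for s in valid_scores]
print(list(zip(valid_scores, grades)))
```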
Intermediate analysis tables
I created intermediate tables that were reused throughout the analysis:
- Unique inspections from 2010-2016
- Latest inspection for each restaurant
- Unique list of restaurants (their unique ID is 'camis' in the data)
https://gist.github.com/hofaiwong/7d84d35d604e8aa0b727cdcf0a999b83
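For readers who prefer to see the idea in code, the sketch below shows one way to derive the "latest inspection" table and the unique restaurant list from a table of unique inspections. It is written in Python rather than the R used in the gist, and the field names other than `camis`, as well as the sample rows, are assumptions.

```python
from datetime import date

# Hypothetical unique inspections (one row per inspection).
inspections = [
    {"camis": "1001", "inspection_date": date(2015, 2, 10), "score": 20},
    {"camis": "1001", "inspection_date": date(2016, 3, 1), "score": 12},
    {"camis": "2002", "inspection_date": date(2016, 1, 15), "score": 7},
    {"camis": "2002", "inspection_date": date(2014, 6, 30), "score": 31},
]


def latest_inspections(rows):
    """Return a dict mapping each restaurant ID to its most recent inspection."""
    latest = {}
    for row in rows:
        current = latest.get(row["camis"])
        if current is None or row["inspection_date"] > current["inspection_date"]:
            latest[row["camis"]] = row
    return latest


latest = latest_inspections(inspections)
restaurants = sorted(latest)  # unique list of restaurant IDs

for camis in restaurants:
    print(camis, latest[camis]["inspection_date"], latest[camis]["score"])
```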
III. Grades and scores by location
Grades by borough
Since most New Yorkers are more familiar with the letter grades, I started the data exploration by visualizing the number of restaurants by borough and by their latest grade in the bar plot below. A few observations could be made at this point:
- Manhattan had the largest number of restaurants compared to other boroughs, which had to be taken into account moving forward to not overshadow trends in other boroughs
- Most restaurants had an A grade across the 5 boroughs
- The distribution of grades for each borough seemed relatively similar, so I dug deeper to find potential differentiators
https://gist.github.com/hofaiwong/24f9817382d2e400cd7a6c5ce7c322d5
Scores by borough
Since grades couldn't differentiate boroughs, I plotted restaurants by scores instead and used a density plot to account for the disparity in number of restaurants by borough. Immediately, a few points became apparent:
- Most restaurants obtained a grade of A, but fewer obtained a low score within the A range. Most restaurants seem to aim for an A grade since they are legally obligated to share their grade with the public (and can avoid fines in some scenarios), but have less incentive to further improve hygiene once they have achieved the A grade
- There was a much sharper drop at the A/B boundary than at the B/C boundary, which seems understandable since most restaurants want to obtain an A grade, and having a B grade is already a setback
- The distribution of restaurants by score was very similar for all boroughs, so I needed to keep digging deeper for differentiators
https://gist.github.com/hofaiwong/67aea5f8ecb58998336a28e147cbea89
Scores by zipcode
Since comparing grades and scores by borough didn't lead to any notable differentiation, I looked at the average scores by zipcode using the zip_choropleth function from the choroplethrZip package. This view helped answer some of my initial questions (e.g. Flushing restaurants have a higher average score, i.e. worse hygiene, than UES) and could distinguish neighborhoods, though most average scores fell within the 9-12 range.
https://gist.github.com/hofaiwong/66512ef247416ad2b25e41a45fc8fd0d
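The aggregation behind the map is a simple group-by. The Python sketch below only computes per-zipcode averages and does not draw the choropleth; it is not the R code from the gist. The field names, zipcodes and scores are made-up values for illustration, and whether to feed it all inspections or only each restaurant's latest one is left as an assumption.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: one inspection score per row, tagged with a zipcode.
rows = [
    {"zipcode": "10021", "score": 9},
    {"zipcode": "10021", "score": 11},
    {"zipcode": "11354", "score": 14},
    {"zipcode": "11354", "score": 16},
]


def mean_score_by_zip(rows):
    """Group scores by zipcode and return the average score for each."""
    scores = defaultdict(list)
    for row in rows:
        scores[row["zipcode"]].append(row["score"])
    return {zipcode: mean(values) for zipcode, values in scores.items()}


averages = mean_score_by_zip(rows)
for zipcode in sorted(averages):
    print(zipcode, averages[zipcode])
```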
IV. What about inspection closures?
Scores don't actually tie directly to restaurant closures. For example, a restaurant could theoretically have only 1 violation with a total inspection score under 13 which would give it an A grade, but that violation could be a public health hazard leading to the restaurant's closure.
To explore inspection closures further, I defined the following ratios:
- Inspection closure ratio: Percentage of inspections that lead to the restaurant being closed
- Repeat closure ratio: Percentage of restaurants that were closed during more than one inspection cycle
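A minimal Python sketch of these two ratios follows (again, not the author's R code). The boolean `closed` field is a hypothetical stand-in for however a closure is recorded in the data, and each closing inspection is assumed to count as its own inspection cycle. The sketch reads the repeat closure ratio literally, as a share of all restaurants; using only restaurants closed at least once as the denominator would be another reasonable reading.

```python
from collections import Counter

# Hypothetical inspections: `closed` marks inspections that led to a closure.
inspections = [
    {"camis": "1001", "closed": True},
    {"camis": "1001", "closed": False},
    {"camis": "1001", "closed": True},
    {"camis": "2002", "closed": True},
    {"camis": "2002", "closed": False},
    {"camis": "3003", "closed": False},
    {"camis": "3003", "closed": False},
    {"camis": "4004", "closed": False},
    {"camis": "4004", "closed": False},
    {"camis": "4004", "closed": False},
]


def closure_ratios(rows):
    """Return (inspection closure ratio, repeat closure ratio) as percentages."""
    closures = Counter(row["camis"] for row in rows if row["closed"])
    restaurants = {row["camis"] for row in rows}
    # Share of inspections that led to a closure.
    inspection_ratio = 100 * sum(closures.values()) / len(rows)
    # Share of restaurants closed in more than one inspection (assumed denominator: all restaurants).
    repeat_ratio = 100 * sum(1 for n in closures.values() if n > 1) / len(restaurants)
    return inspection_ratio, repeat_ratio


inspection_ratio, repeat_ratio = closure_ratios(inspections)
print(inspection_ratio, repeat_ratio)
```

Grouping the rows by borough or cuisine before calling `closure_ratios` would give the per-group ratios.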
Closures by borough
I first calculated the inspection and repeat closure ratios by borough:
- While the inspection closure ratios ranged from ~1% to 2%, there was a wider range for the repeat closure ratio (~6% to 11%)
- Brooklyn had the highest ratio of repeat closures while the Bronx had the highest ratio of inspection closures
https://gist.github.com/hofaiwong/35090ec7126abc056e7200028cbc223f
Closures by cuisine
Similarly, I wanted to visualize the closure ratios by type of cuisine. The original data categorized restaurants into 84 cuisine types. For simplicity, I focused on the top 20 types which covered 80%+ of NYC restaurants and filtered down the data accordingly.
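The "top N cuisines and their coverage" step can be sketched in Python as below (not the R from the gist). The `cuisine` field name and the toy list of restaurants are assumptions; with the real data, `n` would be 20.

```python
from collections import Counter

# Hypothetical unique restaurants with their cuisine type.
restaurants = (
    [{"cuisine": "American"}] * 4
    + [{"cuisine": "Chinese"}] * 3
    + [{"cuisine": "Pizza"}] * 2
    + [{"cuisine": "French"}] * 1
)


def top_cuisines(rows, n):
    """Return the n most common cuisines and the share of restaurants they cover."""
    counts = Counter(row["cuisine"] for row in rows)
    top = [cuisine for cuisine, _ in counts.most_common(n)]
    coverage = sum(counts[cuisine] for cuisine in top) / len(rows)
    return top, coverage


top, coverage = top_cuisines(restaurants, n=2)

# Filter the data down to restaurants in the selected cuisines.
filtered = [row for row in restaurants if row["cuisine"] in top]
print(top, round(coverage, 2), len(filtered))
```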
Once again, while most cuisines have average scores in the 9-12 range, which would give them an A grade, the closure ratios differentiate cuisines. For example, Chinese restaurants have higher inspection closure ratios and repeat closure ratios than French and American restaurants, as I initially hypothesized. Many more comparisons and observations can be made from this scatterplot depending on your selected cuisine type!
https://gist.github.com/hofaiwong/4558aa4a772685f8a8947af9bca41915
Closures by cuisine and borough
Finally, what if we combined both dimensions of location and cuisine? Intuitively, certain cuisines could fare better or worse in health inspections depending on the neighborhood. To illustrate this, I used faceted bar plots of inspection closure ratios by borough with refactored cuisine types to order by descending counts of restaurants.
Once again, many observations can be made. For example, of the top 20 cuisines in Manhattan, Chinese restaurants have the highest inspection closure ratio. Also of note, Asian restaurants in the Bronx and Staten Island have an alarmingly high inspection closure ratio!
https://gist.github.com/hofaiwong/e55bf4a32fa08cd5c565465eee38cfa9
V. Conclusion
Findings
- Displaying scores in addition to grades could further improve hygiene, since most restaurants already have an A
- My specific initial questions were addressed...
  - Scores differentiate neighborhoods but not boroughs; closure ratios provide additional comparisons
  - Chinatown is on par with UES but Flushing is not
  - Chinese restaurants have higher closure ratios than French and American restaurants
- ... but more observations can be made based on your culinary and geographic selections, e.g.:
  - The Bronx and Brooklyn have the worst rates of inspection and repeat closures respectively
  - Asian restaurants in the Bronx and Staten Island have a much higher inspection closure ratio, etc.
Further analysis
- Investigate violation types: do certain neighborhoods or types of restaurants have recurring types of violations?
- Compare with NYC demographic data by neighborhood
- Analyze trends over time by neighborhood and cuisine
- Correlate with popularity of restaurants