San Francisco Restaurant Inspection Analysis and Visualization

Hans Lau
Posted on Oct 15, 2017


Starting in 2012, jurisdictions across the country, including San Francisco, began publishing health inspection scores on Yelp using a standardized scoring system called LIVES. This open data allowed consumers to make informed decisions about where to eat and motivated many restaurants to improve their inspection scores in the hope of attracting a bigger customer base. The situation for health inspectors was quite different. As the number of restaurants in San Francisco grew after 2012, inspectors could not keep up with the pace of inspections, and many restaurants went uninspected for years. This is a problem because some of those restaurants are known to have committed high risk violations. If they are not inspected at least twice a year (the standard procedure), the risk of food-related illness increases dramatically. In 2014 there were only 30 health inspectors for at least 4,500 restaurants in San Francisco, and one third of those restaurants had previously committed high risk health violations.

Breakdown Questions

An interactive application was developed to provide a possible solution for the growing risk of food-related illnesses due to uninspected restaurants. The following questions were addressed:

  1. Trend Analysis: What is the overall trend for the aggregate number of different risk violations from 2014 to 2016?
  2. Location Specific Contributions: Which regions in San Francisco have the most high risk violations? What share of the total high risk violations in San Francisco do those regions account for?
  3. Identifying Individual Restaurants: Which top 20 restaurants within those regions are committing the most high risk violations?

The app is intended to give the user a comparative measure of the different types of violations committed at daily, monthly, or yearly intervals and to reveal the regions of San Francisco with a high density of high risk violations.

Data Overview

The data was provided by the San Francisco Health Department via the San Francisco Open Data Portal as downloadable files of San Francisco Restaurant Inspection Scores data. There were approximately 50 thousand observations and 17 features. The features include:

  • Business ID
  • Business Name
  • Business Address
  • Business City
  • Business State
  • Business Postal Code
  • Business Latitude
  • Business Longitude
  • Business Location
  • Business Phone Number
  • Inspection ID
  • Inspection Date
  • Inspection Score (0-100)
  • Inspection Type
  • Violation ID
  • Violation Description
  • Violation Risk Category (No Risk, Low Risk, Moderate Risk, High Risk)

Trend Analysis

In the time series above, low risk violations have the highest aggregate count of any risk category in San Francisco from 2014 to 2016. For example, on June 3, 2014, there were 72 low risk violations, while moderate, high, and no risk violations had far fewer.

The time series provides both a macro and a micro perspective on the inspection performance of restaurants in San Francisco. The San Francisco health department can use this visualization and its interactive features either to discover the general trend of violations committed or to pick a specific month and day and read off the respective violation counts. Moreover, the health department would be able to anticipate when a problem is arising. For example, if total high risk violations exceed low, moderate, or no risk violations for a certain period of time, the health department can plan ahead and act to prevent a possible food-related crisis.
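The counting behind such a time series can be sketched in a few lines. The following is a minimal illustration, not the code behind the application: the sample rows, the function names, and the rule used to flag a period (high risk violations outnumbering every other category) are all assumptions made for the example.

```python
from collections import Counter
from datetime import date

# Hypothetical sample rows: (inspection date, violation risk category).
# Real rows would come from the Inspection Date and Violation Risk Category columns.
SAMPLE_ROWS = [
    (date(2014, 6, 3), "Low Risk"),
    (date(2014, 6, 3), "Low Risk"),
    (date(2014, 6, 3), "High Risk"),
    (date(2014, 7, 1), "Moderate Risk"),
    (date(2015, 1, 12), "Low Risk"),
]


def period_key(day, interval):
    """Truncate a date to the chosen interval: 'day', 'month', or 'year'."""
    if interval == "day":
        return day.isoformat()
    if interval == "month":
        return day.strftime("%Y-%m")
    if interval == "year":
        return str(day.year)
    raise ValueError(f"unknown interval: {interval}")


def count_violations(rows, interval="day"):
    """Count violations per (period, risk category) pair."""
    return Counter((period_key(day, interval), risk) for day, risk in rows)


def periods_where_high_leads(counts):
    """Return the periods in which high risk violations outnumber every other category."""
    by_period = {}
    for (period, risk), n in counts.items():
        by_period.setdefault(period, Counter())[risk] = n
    flagged = []
    for period, per_risk in sorted(by_period.items()):
        high = per_risk.get("High Risk", 0)
        others = [n for risk, n in per_risk.items() if risk != "High Risk"]
        if high > 0 and all(high > n for n in others):
            flagged.append(period)
    return flagged
```

Calling `count_violations(SAMPLE_ROWS, "month")` groups the same rows by month instead of by day, which mirrors the daily, monthly, or yearly views described above.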

Location Specific Contributions

One of the objectives of the health department is to re-inspect restaurants that failed their last inspection because of the number of high risk violations committed. Therefore, the following heat map considers only high risk violations. The heat map shows the high risk violation count for each postal code. Seven postal codes (94133, 94103, 94109, 94110, 94102, 94122, and 94108) contributed 52% of the total high risk violations in San Francisco. On closer inspection, those postal codes correspond to some of the busiest areas in San Francisco, including Chinatown, Pier and most of downtown San Francisco, where the number of restaurants is much higher than in other postal codes. Further statistical tests would give more insight into the correlation between the number of restaurants and the number of high risk violations, as well as other types of risk violations.
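A contribution figure of this kind comes from a simple calculation: count high risk violations per postal code, take the top few codes, and divide by the citywide total. The sketch below shows that arithmetic under stated assumptions; the function name and the row layout are hypothetical and are not taken from the application.

```python
from collections import Counter


def high_risk_share(rows, top_n):
    """Return the top_n postal codes by high risk violation count and their combined share.

    rows is an iterable of (postal_code, risk_category) pairs, an assumed layout
    based on the Business Postal Code and Violation Risk Category columns.
    """
    counts = Counter(code for code, risk in rows if risk == "High Risk")
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    top = counts.most_common(top_n)
    share = sum(n for _, n in top) / total
    return top, share
```

Applied to the full data set with `top_n=7`, a calculation like this is what a figure such as the 52% above would correspond to.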

Like the heat map, the bar plot above depicts the number of risk violations by individual postal code in San Francisco. The main difference is that the bar plot provides a visual comparison of all the different types of risk violation across postal codes. The bar plot reveals that the postal codes with the most high risk violations also have the most low and moderate risk violations. One possible explanation, mentioned above, is that these postal codes simply contain more restaurants. Another is that there may be interaction effects among the types of risk violation. Statistical analysis would test the hypothesis that as the number of restaurants increases, the number of risk violations committed also increases. It could also verify whether relationships exist between the different types of risk violation.
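One simple way to start on that hypothesis is a correlation coefficient between the number of restaurants and the number of violations per postal code. The sketch below computes Pearson's r by hand. It is only one possible test and was not part of the original analysis, and the function name is an assumption for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

Here `xs` would hold the restaurant count for each postal code and `ys` the matching violation count. A value near 1 would be consistent with the hypothesis, although correlation alone would not establish why the counts move together.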

Identifying Individual Restaurants

After narrowing the analysis down to the seven postal codes that contribute most to high risk violations, individual restaurants were analyzed to see which ones within each postal code contribute the most high risk violations. In postal code 94133, one restaurant had 10 high risk violations, the most of any single restaurant among the 167 in this postal code. In fact, the top 20 restaurants contributed 28% of the high risk violations in postal code 94133.



The San Francisco health department should allocate a larger fraction of its existing health inspectors to postal codes with a high density of high risk violations. This approach would reduce the growing risk of food-related incidents by focusing attention on the restaurants that commit more high risk violations. Of course, hiring more health inspectors would equally solve the problem and decrease the overall time spent on inspections in San Francisco. However, if the health department is in a tight financial situation, or other circumstances prevent it from hiring more inspectors, this short-term allocation solution will suffice. An interesting follow-up project would work out the specific details of the approach, including the number of health inspectors to send to each postal code, the time and date of inspections that make the best use of inspectors' time, and the order in which restaurants are inspected. This could be done by first building a predictive model to determine which restaurants are more likely to commit high risk violations.
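As a rough illustration of what such an allocation could look like, the sketch below splits a fixed number of inspectors across postal codes in proportion to their high risk violation counts, using largest-remainder rounding so the total is preserved. This rule is an assumption chosen for the example, not a method from the analysis, and the function name and counts are hypothetical.

```python
def allocate_inspectors(high_risk_counts, inspectors):
    """Split inspectors across postal codes in proportion to high risk violation counts.

    high_risk_counts maps postal code -> number of high risk violations.
    Uses largest-remainder rounding so the allocations sum to `inspectors`.
    """
    total = sum(high_risk_counts.values())
    if total <= 0 or inspectors < 0:
        raise ValueError("need a positive violation total and a non-negative inspector count")
    # Exact proportional quota for each postal code, then its whole-number part.
    quotas = {code: inspectors * n / total for code, n in high_risk_counts.items()}
    allocation = {code: int(q) for code, q in quotas.items()}
    leftover = inspectors - sum(allocation.values())
    # Hand remaining inspectors to the codes with the largest fractional parts,
    # breaking ties in favor of the code with more violations.
    by_remainder = sorted(
        quotas,
        key=lambda code: (quotas[code] - allocation[code], high_risk_counts[code]),
        reverse=True,
    )
    for code in by_remainder[:leftover]:
        allocation[code] += 1
    return allocation
```

A real schedule would also have to account for travel time, restaurant opening hours, and the mandatory inspection frequency, which this sketch ignores.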



The application could be found on and respective code could be found here.

About Author

Hans Lau

Hans graduated from University of Illinois Urbana-Champaign with a major in Bioengineering and a minor in Computer Science. He is passionate in the applications of data science especially in medical technology and biological systems. In his downtime, he...