San Francisco Restaurant Data Analysis and Visualization

Posted on Oct 15, 2017


Starting in 2012, jurisdictions across the country, including San Francisco, began publishing health inspection scores on Yelp using a standardized scoring system called LIVES. This open data allowed restaurant customers to make informed decisions about where to eat, and it motivated many restaurants to improve their inspection scores in the hope of attracting a bigger customer base.

However, the situation for health inspectors was quite different. As the number of restaurants in San Francisco grew after 2012, inspectors found themselves unable to keep up with the pace of inspections.

As a result, many restaurants were not inspected for years. This became a problem because some of those restaurants were known to have committed high risk violations. If restaurants are not inspected at least twice a year (the standard procedure), the risk of food-related illness increases dramatically. In 2014 there were only 30 health inspectors for at least 4500 restaurants in SF, and one third of those restaurants had previously committed high risk health violations.
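To make the staffing gap concrete, the figures quoted above can be turned into a rough workload estimate. This is only a back-of-the-envelope sketch: it assumes inspections are split evenly among inspectors and treats 4500 as the restaurant count, although the text gives it as a lower bound.

```python
# Figures quoted in the text for 2014.
restaurants = 4500           # "at least 4500", so this is a lower bound
inspectors = 30
inspections_per_year = 2     # standard procedure: at least twice a year

# Total inspections needed per year to meet the standard procedure.
required_inspections = restaurants * inspections_per_year

# Assumption: the workload is shared evenly among all inspectors.
per_inspector = required_inspections / inspectors

print(required_inspections)  # 9000
print(per_inspector)         # 300.0
```

Under these assumptions each inspector would need to complete at least 300 inspections a year to keep every restaurant on schedule.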

Breakdown Questions

An interactive application was developed to provide a possible solution for the growing risk of food-related illnesses due to uninspected restaurants. The following questions were addressed:

  1. Trend Analysis: What is the overall trend for the aggregate number of different risk violations from 2014 to 2016?
  2. Location Specific Contributions: Which regions in San Francisco have the most high risk violations? How much do those regions contribute to the total number of high risk violations in San Francisco?
  3. Identifying Individual Restaurants: Which 20 restaurants within those regions are committing the most high risk violations?

The app is intended to give the user a comparative measure of the different types of violations committed at daily, monthly, or yearly intervals, and to reveal regions in San Francisco with a high density of high risk violations.

Data Overview

The data was provided by the San Francisco Health Department via the San Francisco Open Data Portal as downloadable files of San Francisco Restaurant Inspection Scores data. There were approximately 50 thousand observations and 17 features. The features include:

  • Business ID
  • Business Name
  • Business Address
  • Business City
  • Business State
  • Business Postal Code
  • Business Latitude
  • Business Longitude
  • Business Location
  • Business Phone Number
  • Inspection ID
  • Inspection Date
  • Inspection Score (0-100)
  • Inspection Type
  • Violation ID
  • Violation Description
  • Violation Risk Category (No Risk, Low Risk, Moderate Risk, High Risk)
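As a rough illustration of how a few of these features might be read into a program, the sketch below parses a small CSV extract with Python's standard library. The column names, the date format, the sample rows, and the possibility of a blank score are all assumptions made for this example; they are not taken from the actual download.

```python
import csv
import io
from datetime import datetime

# Hypothetical two-row extract; column names and values are illustrative only.
SAMPLE_CSV = """business_id,business_name,business_postal_code,inspection_date,inspection_score,risk_category
101,Example Cafe,94133,06/03/2014,88,Low Risk
102,Sample Diner,94103,06/03/2014,,High Risk
"""

def load_rows(text):
    """Parse CSV text into dicts with a typed date and a typed score."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = dict(raw)
        # Assumed date format: MM/DD/YYYY.
        row["inspection_date"] = datetime.strptime(
            raw["inspection_date"], "%m/%d/%Y"
        ).date()
        # Assumption: a score can be blank; represent that as None.
        score = raw["inspection_score"]
        row["inspection_score"] = int(score) if score else None
        rows.append(row)
    return rows

rows = load_rows(SAMPLE_CSV)
print(rows[0]["business_name"], rows[0]["inspection_date"])
```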

Trend Data Analysis

[Figure: time series of violation counts by risk category]

In the time series above, low risk violations have the highest aggregate count of all risk categories in San Francisco from 2014 through 2016. For example, on June 3, 2014, there were 72 low risk violations, while moderate, high, and no risk violations had far fewer.

The time series provides both a macro and a micro perspective on the inspection performance of restaurants in San Francisco. The San Francisco health department can use this visualization and its interactive features either to discover the general trend in violations or to pick a specific month and day and read off the violation counts. Moreover, the health department would be able to anticipate when a problem is arising. For example, if total high risk violations exceed low, moderate, or no risk violations over a certain period, the health department can plan ahead and act to prevent a possible food-related crisis.
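The aggregation behind such a time series can be sketched in a few lines. The example below is not the app's code; it is a minimal Python illustration that counts violations per risk category at a daily, monthly, or yearly interval and then flags periods in which high risk violations outnumber every other category. The records and function names are hypothetical.

```python
from collections import Counter
from datetime import date

# Hypothetical (inspection_date, risk_category) pairs, for illustration only.
violations = [
    (date(2014, 6, 3), "Low Risk"),
    (date(2014, 6, 3), "Low Risk"),
    (date(2014, 6, 3), "High Risk"),
    (date(2014, 6, 17), "Moderate Risk"),
    (date(2015, 1, 9), "High Risk"),
]

def period_key(day, interval):
    """Map a date to the label of the period it falls in."""
    if interval == "daily":
        return day.isoformat()
    if interval == "monthly":
        return day.strftime("%Y-%m")
    if interval == "yearly":
        return str(day.year)
    raise ValueError(f"unknown interval: {interval}")

def count_by_period(records, interval):
    """Count violations per (period, risk category)."""
    counts = Counter()
    for day, risk in records:
        counts[(period_key(day, interval), risk)] += 1
    return counts

def periods_where_high_risk_leads(counts):
    """Return periods where high risk outnumbers each other category."""
    by_period = {}
    for (period, risk), n in counts.items():
        by_period.setdefault(period, Counter())[risk] = n
    leading = []
    for period, per_risk in by_period.items():
        others = [n for risk, n in per_risk.items() if risk != "High Risk"]
        if per_risk["High Risk"] > max(others, default=0):
            leading.append(period)
    return sorted(leading)

daily = count_by_period(violations, "daily")
monthly = count_by_period(violations, "monthly")
print(daily[("2014-06-03", "Low Risk")])
print(periods_where_high_risk_leads(monthly))
```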

Location Specific Contributions

[Figure: heat map of high risk violation counts by postal code]

One of the objectives of the health department is to re-inspect restaurants that failed their last inspection because of the number of high risk violations committed. Therefore, the heat map considers only high risk violations and shows their count for each postal code. Seven postal codes (94133, 94103, 94109, 94110, 94102, 94122, and 94108) contributed 52% of the total high risk violations in San Francisco.
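A figure like the 52% share can be obtained by ranking postal codes by their high risk violation count and accumulating their share of the total. The sketch below shows one way to do that in Python; the input list is made up for illustration and the function name is hypothetical.

```python
from collections import Counter

# Hypothetical data: one postal code entry per high risk violation.
high_risk_postal_codes = (
    ["94133"] * 4 + ["94103"] * 3 + ["94110"] * 2 + ["94121"] * 1
)

def cumulative_shares(postal_codes):
    """Rank postal codes by count and attach the running share of the total."""
    counts = Counter(postal_codes)
    total = sum(counts.values())
    running = 0
    ranked = []
    for code, n in counts.most_common():
        running += n
        ranked.append((code, n, running / total))
    return ranked

for code, n, share in cumulative_shares(high_risk_postal_codes):
    print(code, n, f"{share:.0%}")
```

Reading down the last column gives the share contributed by the top one, two, three, and so on postal codes.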

Upon inspection, those postal codes correspond to some of the busiest areas in San Francisco, including Chinatown, Pier, and most of downtown San Francisco, where the number of restaurants is much higher than in other postal codes. Further statistical tests would reveal more about the correlation between the number of restaurants and the number of high risk violations, as well as other types of risk violations.

[Figure: bar plot of violation counts by risk category and postal code]

Like the heat map, the bar plot above depicts the number of risk violations by individual postal code in San Francisco. The main difference is that the bar plot provides a visual comparison between all the different types of risk violation across postal codes. The bar plot reveals that the postal codes with the most high risk violations also have the most low and moderate risk violations.

One possible explanation for this finding was mentioned above: these postal codes simply contain more restaurants. Another explanation is that there may be interaction effects between the different types of risk violation. Statistical analysis would test the hypothesis that as the number of restaurants increases, the number of risk violations committed also increases. It could also verify whether relationships exist between the different types of risk violation.
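One simple starting point for that analysis is the Pearson correlation between the number of restaurants and the number of high risk violations per postal code. The sketch below computes it from scratch in Python. The per-postal-code numbers are invented for illustration, and a correlation coefficient on its own is not a significance test, so this is a first look rather than a conclusion.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)

# Hypothetical values, one entry per postal code.
restaurants_per_code = [150, 120, 90, 60, 30]
high_risk_per_code = [40, 35, 20, 18, 5]

r = pearson_r(restaurants_per_code, high_risk_per_code)
print(round(r, 3))
```

The same function could be applied to pairs of risk categories (for example, low risk versus high risk counts per postal code) to probe the relationships between violation types.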

Identifying Data of Individual Restaurants

After narrowing the analysis down to the 7 postal codes that contribute most to high risk violations, individual restaurants were analyzed to see which ones within each postal code contribute the most high risk violations. For postal code 94133, one restaurant had 10 high risk violations, the most of any single restaurant among the 167 in this postal code. In fact, the top 20 restaurants contributed 28% of the high risk violations in postal code 94133.
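The per-restaurant step can be sketched the same way: filter to one postal code, count high risk violations per restaurant, and report the share held by the top N. The records and names below are hypothetical, and the example uses N = 2 only because the toy data is small.

```python
from collections import Counter

# Hypothetical (postal_code, business_name) pairs, one per high risk violation.
records = (
    [("94133", "Restaurant A")] * 3
    + [("94133", "Restaurant B")] * 2
    + [("94133", "Restaurant C")] * 1
    + [("94103", "Restaurant D")] * 2
)

def top_restaurants(records, postal_code, n):
    """Return the top-n restaurants in a postal code and their combined share."""
    counts = Counter(name for code, name in records if code == postal_code)
    total = sum(counts.values())
    top = counts.most_common(n)
    # Share of the postal code's high risk violations held by the top n.
    share = sum(c for _, c in top) / total if total else 0.0
    return top, share

top, share = top_restaurants(records, "94133", 2)
print(top)
print(f"{share:.0%}")
```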



The San Francisco health department should allocate a larger fraction of its existing health inspectors to postal codes with a high density of high risk violations. This approach would reduce the growing risk of food-related incidents by focusing attention on restaurants that commit more high risk violations. Of course, hiring more health inspectors would also address the problem and decrease the overall time spent on inspections in San Francisco. However, if the San Francisco health department is in a tight financial situation, or other circumstances prevent it from hiring more inspectors, this allocation approach can serve as a short-term solution.
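One way to make such an allocation concrete is to split a fixed pool of inspectors in proportion to each postal code's high risk violation count. The sketch below does this with a largest-remainder rounding rule so that the whole-number assignments add up to the pool size. The proportional rule, the violation counts, and the function name are assumptions for illustration, not a method taken from the analysis; only the pool size of 30 comes from the text, and the three postal codes stand in for the whole city.

```python
def allocate_inspectors(violations_by_code, inspectors):
    """Split inspectors proportionally to violation counts (largest remainder)."""
    total = sum(violations_by_code.values())
    if total == 0:
        raise ValueError("no violations to allocate against")
    # Exact (fractional) proportional quota for each postal code.
    quotas = {
        code: inspectors * n / total for code, n in violations_by_code.items()
    }
    # Start from the whole-number part of each quota.
    allocation = {code: int(q) for code, q in quotas.items()}
    leftover = inspectors - sum(allocation.values())
    # Give the remaining inspectors to the largest fractional remainders.
    by_remainder = sorted(
        quotas, key=lambda code: quotas[code] - allocation[code], reverse=True
    )
    for code in by_remainder[:leftover]:
        allocation[code] += 1
    return allocation

# Hypothetical high risk violation counts per postal code.
high_risk_counts = {"94133": 46, "94103": 30, "94110": 24}

allocation = allocate_inspectors(high_risk_counts, inspectors=30)
print(allocation)
```

A real schedule would also need to account for travel time, inspection duration, and restaurants that are already overdue, which this sketch ignores.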

An interesting follow-up project would work out the specific details of this approach, including the number of health inspectors to send to each postal code, the times and dates of inspection that make the best use of inspectors' time, and the order in which to inspect restaurants. This could be done by building a predictive model to determine which restaurants are most likely to commit high risk violations, so that they can be inspected first.



The application can be found on Shiny.io, and the corresponding code can be found here.

About Author

Hans Lau

Hans graduated from the University of Illinois Urbana-Champaign with a major in Bioengineering and a minor in Computer Science. He is passionate about the applications of data science, especially in medical technology and biological systems. In his downtime, he...