San Francisco Restaurant Data Analysis and Visualization
Starting in 2012, jurisdictions across the country, including San Francisco, began publishing health inspection scores on Yelp using a standardized scoring system called LIVES. This open data allowed restaurant customers to make informed decisions about where to eat, and motivated many restaurants to improve their inspection scores in the hope of attracting a bigger customer base.
However, the situation for health inspectors was quite different. Since 2012, the number of restaurants in San Francisco has grown, and health inspectors have been unable to keep up with the pace of inspections.
As a result, many restaurants were not inspected for years. This became a problem because some of those restaurants were known to have committed high risk violations. When restaurants are not inspected at least twice a year, which is the standard procedure, the risk of food-related illness rises sharply. In 2014 there were only 30 health inspectors for at least 4,500 restaurants in San Francisco, and one third of those restaurants had previously committed high risk health violations.
An interactive application was developed as a possible way to address the growing risk of food-related illness caused by uninspected restaurants. It addresses the following questions:
- Trend Analysis: What is the overall trend for the aggregate number of different risk violations from 2014 to 2016?
- Location Specific Contributions: Which regions of San Francisco have the most high risk violations? How much do those regions contribute to the total number of high risk violations in San Francisco?
- Identifying Individual Restaurants: Which 20 restaurants within those regions commit the most high risk violations?
The app is intended to let the user compare the different types of violations committed at daily, monthly, or yearly intervals, and to reveal the regions of San Francisco where high risk violations are concentrated.
The data was provided by the San Francisco Health Department via the San Francisco Open Data Portal, as downloadable files of San Francisco Restaurant Inspection Scores data. There were approximately 50 thousand observations and 17 features. The features include:
- Business ID
- Business Name
- Business Address
- Business City
- Business State
- Business Postal Code
- Business Latitude
- Business Longitude
- Business Location
- Business Phone Number
- Inspection ID
- Inspection Date
- Inspection Score (0-100)
- Inspection Type
- Violation ID
- Violation Description
- Violation Risk Category (No Risk, Low Risk, Moderate Risk, High Risk)
Trend Data Analysis
In the time series above, low risk violations have the highest aggregate count of all the risk categories in San Francisco from 2014 to 2016. For example, on June 3, 2014, there were 72 low risk violations, while moderate, high, and no risk violations had far fewer.
The time series provides both a macro and a micro perspective on the inspection performance of restaurants in San Francisco. The San Francisco health department can use this visualization and its interactive features either to discover the general trend of violations committed or to select a specific month and day and find the corresponding violation counts. Moreover, the health department would be able to anticipate when a problem is arising. For example, if total high risk violations exceed low, moderate, or no risk violations for a certain period of time, the health department can plan ahead and act to prevent a possible food-related crisis.
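The daily, monthly, and yearly counts behind such a time series can be sketched in a few lines. This is only an illustration, not the app's actual code: the sample rows are invented, and the names `period_key` and `count_violations` are hypothetical.

```python
from collections import Counter
from datetime import date

# Hypothetical sample rows: (inspection date, violation risk category).
# These values are invented for illustration, not taken from the dataset.
SAMPLE = [
    (date(2014, 6, 3), "Low Risk"),
    (date(2014, 6, 3), "Low Risk"),
    (date(2014, 6, 3), "High Risk"),
    (date(2014, 6, 17), "Moderate Risk"),
    (date(2015, 1, 9), "Low Risk"),
]

def period_key(day, interval):
    """Truncate a date to the start of its day, month, or year."""
    if interval == "daily":
        return day
    if interval == "monthly":
        return day.replace(day=1)
    if interval == "yearly":
        return day.replace(month=1, day=1)
    raise ValueError(f"unknown interval: {interval!r}")

def count_violations(rows, interval="daily"):
    """Count violations per (period, risk category) pair."""
    return Counter((period_key(day, interval), risk) for day, risk in rows)

# Low risk violations recorded in June 2014 in the sample rows.
monthly = count_violations(SAMPLE, "monthly")
print(monthly[(date(2014, 6, 1), "Low Risk")])
```

Keying the counter on both the period and the risk category keeps one series per category, which is what a multi-line time series plot needs.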
Location Specific Contributions
One of the objectives of the health department is to re-inspect restaurants that failed their last inspection because of the number of high risk violations committed. Therefore, the following heat map considers only high risk violations and shows their count for each postal code. Seven postal codes (94133, 94103, 94109, 94110, 94102, 94122, and 94108) contributed 52% of the total high risk violations in San Francisco.
Those postal codes correspond to some of the busiest areas in San Francisco, including Chinatown, Pier and most of downtown San Francisco, where the number of restaurants is much higher than in other postal codes. Further statistical tests could reveal more about the correlation between the number of restaurants and the number of high risk violations, as well as other types of risk violations.
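A figure such as the combined share of the top postal codes can be computed as in the sketch below. The rows are made-up examples, and `high_risk_share` is a hypothetical helper name, not something taken from the app.

```python
from collections import Counter

# Hypothetical (postal code, risk category) pairs; counts are illustrative only.
SAMPLE = [
    ("94133", "High Risk"), ("94133", "High Risk"), ("94133", "Low Risk"),
    ("94103", "High Risk"), ("94110", "High Risk"), ("94110", "Moderate Risk"),
    ("94121", "High Risk"),
]

def high_risk_share(rows, top_n):
    """Return the top_n postal codes by high risk count and their combined share."""
    counts = Counter(code for code, risk in rows if risk == "High Risk")
    total = sum(counts.values())
    top = counts.most_common(top_n)
    share = sum(n for _, n in top) / total if total else 0.0
    return top, share

top_codes, share = high_risk_share(SAMPLE, top_n=2)
print(top_codes, f"{share:.0%}")
```

With the real data, calling this with `top_n=7` would be one way to check the 52% figure reported above.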
Similar to the heat map, the bar plot above depicts the number of risk violations by individual postal code in San Francisco. The main difference is that the bar plot provides a visual comparison of all the different types of risk violation across postal codes. The bar plot reveals that the postal codes with the most high risk violations also have the most low and moderate risk violations.
One possible explanation, mentioned above, is that these postal codes simply contain more restaurants. Another is that there may be interaction effects between the different types of risk violation. Statistical analysis could test the hypothesis that as the number of restaurants increases, the number of risk violations committed also increases. It could also check for relationships between the different types of risk violation.
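As a first step toward such a test, a Pearson correlation between restaurant counts and high risk violation counts per postal code could be computed as sketched here. The figures below are invented placeholders rather than results from the dataset, and a correlation coefficient alone does not establish significance or causation.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)

# Hypothetical per-postal-code figures, invented for illustration only.
restaurants = [50, 120, 80, 200]   # number of restaurants in each postal code
high_risk = [5, 14, 9, 21]         # high risk violations in the same postal codes
r = pearson(restaurants, high_risk)
print(f"r = {r:.3f}")
```

The same function could be applied to pairs of risk categories (for example high risk versus low risk counts per postal code) to explore the second hypothesis.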
Identifying Data of Individual Restaurants
After narrowing the analysis down to the 7 postal codes that contribute most to high risk violations, individual restaurants were analyzed to see which ones within each postal code account for the most high risk violations. In postal code 94133, one restaurant had 10 high risk violations, the most of any single restaurant among the 167 in this postal code. In fact, the top 20 restaurants contributed 28% of the high risk violations in postal code 94133.
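The per-restaurant ranking within a postal code follows the same counting pattern. The sketch below uses invented restaurant names and a hypothetical helper, `top_offenders`, purely for illustration.

```python
from collections import Counter

# Hypothetical (business name, postal code, risk category) rows for illustration.
SAMPLE = [
    ("Cafe A", "94133", "High Risk"),
    ("Cafe A", "94133", "High Risk"),
    ("Cafe A", "94133", "Low Risk"),
    ("Diner B", "94133", "High Risk"),
    ("Grill C", "94133", "High Risk"),
    ("Bistro D", "94103", "High Risk"),
]

def top_offenders(rows, postal_code, top_n=20):
    """Rank restaurants in one postal code by high risk violations.

    Returns the top_n (name, count) pairs and their share of that
    postal code's high risk violations.
    """
    counts = Counter(
        name for name, code, risk in rows
        if code == postal_code and risk == "High Risk"
    )
    total = sum(counts.values())
    top = counts.most_common(top_n)
    share = sum(n for _, n in top) / total if total else 0.0
    return top, share

top, share = top_offenders(SAMPLE, "94133", top_n=1)
print(top, f"{share:.0%}")
```

With real data it would probably be safer to group by Business ID rather than Business Name, since different establishments can share a name.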
The San Francisco health department should allocate a larger fraction of its existing health inspectors to postal codes with a high density of high risk violations. This approach would reduce the growing risk of food-related incidents by focusing attention on the restaurants that commit more high risk violations. Of course, hiring more health inspectors would also address the problem and reduce the overall time needed to inspect restaurants in San Francisco. However, if the San Francisco health department is financially constrained, or other circumstances prevent it from hiring more health inspectors, this reallocation can serve as a short-term solution.
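One way to make this allocation concrete is to split the available inspectors across postal codes in proportion to their high risk violation counts. The sketch below is one possible reading of the idea, not a method described by the health department: it uses the largest-remainder method, the violation counts are invented, and `allocate_inspectors` is a hypothetical name. The figure of 30 inspectors is the 2014 headcount cited earlier.

```python
def allocate_inspectors(violations_by_code, inspectors):
    """Split inspectors across postal codes in proportion to high risk counts.

    Uses the largest-remainder method so the allocations sum to `inspectors`.
    """
    total = sum(violations_by_code.values())
    if total == 0 or inspectors <= 0:
        return {code: 0 for code in violations_by_code}
    # Exact (fractional) share of inspectors each postal code would receive.
    quotas = {code: inspectors * n / total for code, n in violations_by_code.items()}
    # Start from the whole-number part of each quota.
    allocation = {code: int(q) for code, q in quotas.items()}
    leftover = inspectors - sum(allocation.values())
    # Give the remaining inspectors to the codes with the largest fractional parts.
    by_remainder = sorted(quotas, key=lambda c: quotas[c] - allocation[c], reverse=True)
    for code in by_remainder[:leftover]:
        allocation[code] += 1
    return allocation

# Hypothetical high risk violation counts per postal code.
counts = {"94133": 60, "94103": 30, "94110": 10}
plan = allocate_inspectors(counts, 30)
print(plan)
```

A purely proportional split ignores travel time and the number of restaurants per inspector, so it would only be a starting point for a real schedule.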
An interesting follow-up project would work out the specific details of this approach, including the number of health inspectors to send to each postal code, the times and dates of inspection that make the best use of inspectors' time, and the order in which restaurants are inspected. This could be done by building a predictive model to determine which restaurants are most likely to commit high risk violations and should therefore be inspected first.