Data Analysis of New York Restaurant Inspections
The skills demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Choosing a favorite restaurant in New York City is a joyful task with many possibilities depending on the occasion, mood and even the time of year. New York is among the most diverse cities in the world, with around 800 languages spoken, so you can find restaurants serving almost any cuisine. Mobile and web applications such as Yelp, Google Business Reviews and Grubhub are often the starting point for research, because they give diners an idea of other restaurant goers' experiences at each restaurant.
To operate, every restaurant is graded and must pass an inspection program conducted by NYC every year, which is also an important aspect that many restaurant goers consider before eating at a restaurant. However, few applications that offer restaurant search provide details on a restaurant's health record over time.
Problem Statement & Motivation
There are two major problems identified as a part of this research:
- Lack of sanitation-based restaurant rankings
- Existing applications rank restaurants based on mostly anonymous user feedback
The motivation of this research is to analyze NYC inspection data to identify the best restaurant in the best borough, better understand the overall health of restaurants, and analyze one of the many potential user journeys.
The technology stack for my project is composed of:
- MariaDB, which is a fork of MySQL, for data storage and filtering
- Python for data processing
- Pandas and PyPlot for data visualization
In addition, I replicated the views generated in Python in Google Sheets so they are easier to integrate and view in the Google Slides presentation from the link above, as well as in this blog post.
About Data Set
For the Python Analysis Project, I chose the NYC Restaurants Health Inspection open data source provided by NYC Open Data.
The dataset contains every sustained or not yet adjudicated violation citation from every full inspection for restaurants in an active status on the record date (the date of the data pull). Establishments are uniquely identified by their CAMIS (unique ID) number. Establishments with an inspection date of 1/1/1900 are new establishments that have not yet received an inspection.
For the purposes of this research, all establishments with an inspection date of 1/1/1900 were excluded from the dataset.
- Update Frequency: Daily
- Number of records: 301,194
- Agency: Department of Health and Mental Hygiene (DOHMH)
- Each row is a: Restaurant Citation
The table below lists the major variables I used in this research.
|Column Name|Column Description|
|---|---|
|CAMIS|Unique identifier for the establishment (restaurant)|
|DBA|Establishment (restaurant) name|
|BORO|Borough of establishment (restaurant) location|
|CUISINE DESCRIPTION|Establishment (restaurant) cuisine|
|INSPECTION DATE|Date of Inspection|
|SCORE|Total score for a particular inspection|
|GRADE|Grade associated with the inspection|
|INSPECTION TYPE|A combination of the inspection program and the type of inspection performed|
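The exclusion of never-inspected establishments can be sketched in pandas as follows. The post performs its filtering in MariaDB, so this is an illustrative equivalent rather than the author's code. It assumes a CSV export that uses the column names from the table and `MM/DD/YYYY` dates; the inline sample rows and the `load_inspections` helper are hypothetical.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for the real export; values are invented.
SAMPLE_CSV = """CAMIS,DBA,BORO,CUISINE DESCRIPTION,INSPECTION DATE,SCORE,GRADE
1001,CAFE ONE,Manhattan,American,01/01/1900,,
1002,SCOOP SHOP,Staten Island,Frozen Desserts,03/15/2021,9,A
1003,NOODLE BAR,Queens,Chinese,11/02/2019,21,B
"""


def load_inspections(source):
    """Read inspection rows and drop establishments that were never inspected."""
    df = pd.read_csv(source)
    # Assumption: dates are formatted as MM/DD/YYYY in the export.
    df["INSPECTION DATE"] = pd.to_datetime(df["INSPECTION DATE"], format="%m/%d/%Y")
    # An inspection date of 1/1/1900 marks a new, not yet inspected establishment.
    never_inspected = df["INSPECTION DATE"] == pd.Timestamp("1900-01-01")
    return df.loc[~never_inspected].reset_index(drop=True)


inspections = load_inspections(io.StringIO(SAMPLE_CSV))
print(inspections[["CAMIS", "DBA", "INSPECTION DATE"]])
```

Passing a file path instead of the `io.StringIO` object would read a real export the same way.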
Overview & Interesting Facts
The New York City Restaurant Inspection Results dataset contains 301,194 records, and after cleaning the data set there are around 19,366 restaurants across all five boroughs. Manhattan, the smallest borough by area, has 7,302 restaurants, the most cuisine variations and the highest number of inspections. Brooklyn has 5,026 restaurants, Queens 4,540, the Bronx 1,747 and Staten Island 721.
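Because each row is a citation rather than a restaurant, counting restaurants means counting distinct CAMIS values per borough. A minimal sketch, using invented toy rows rather than the real data:

```python
import pandas as pd

# Toy citation rows (invented): one restaurant can appear on several rows.
rows = pd.DataFrame(
    {
        "CAMIS": [1, 1, 2, 6, 3, 5, 4, 4],
        "BORO": [
            "Manhattan", "Manhattan", "Manhattan", "Manhattan",
            "Brooklyn", "Brooklyn", "Queens", "Queens",
        ],
    }
)

# Count distinct establishments per borough, largest first.
restaurants_per_boro = (
    rows.groupby("BORO")["CAMIS"].nunique().sort_values(ascending=False)
)
print(restaurants_per_boro)
```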
Looking at the overall number of restaurants in the five boroughs, it is easy to conclude that Manhattan is the easiest place for New Yorkers to find diverse food offerings.
Manhattan also has a very high health score, almost beating Staten Island, which has the highest. But what is the favorite cuisine in all five boroughs?
We can see that American cuisine is the most common type of cuisine in New York City by number of records. While some boroughs are influenced by specific cultures, the data shows that American and Chinese cuisine are very popular in most boroughs, along with pizza and coffee and tea.
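One way to find the cuisine with the most records in each borough is to count rows per borough and cuisine, then keep the largest count per borough. This is a sketch on invented toy rows, not the author's query:

```python
import pandas as pd

# Toy rows (invented) with the two columns needed for the count.
rows = pd.DataFrame(
    {
        "BORO": ["Manhattan"] * 4 + ["Queens"] * 3,
        "CUISINE DESCRIPTION": [
            "American", "American", "American", "Chinese",
            "Chinese", "Chinese", "American",
        ],
    }
)

# Number of records per (borough, cuisine) pair.
counts = (
    rows.groupby(["BORO", "CUISINE DESCRIPTION"])
    .size()
    .rename("records")
    .reset_index()
)

# After sorting by count, the first row seen for each borough is its top cuisine.
top_cuisine = (
    counts.sort_values("records", ascending=False)
    .drop_duplicates("BORO")
    .set_index("BORO")
)
print(top_cuisine)
```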
Now that we know which cuisine type has the most records, let's look at the average weighted score per cuisine type.
Reading the chart from left to right, the two cuisines with the lowest scores are Portuguese and Bangladeshi, while reading the chart from right to left, the cuisines with the highest average weighted scores are American and Scandinavian.
Individual Case Data Processing & Main Takeaways - Pt. 1
Because the dataset contains in excess of 300,000 rows, I had to categorize and filter the data logically. This meant that I would:
- First, rank boroughs by cleanliness grade and identify the borough with the top overall grade
- Second, rank each cuisine in that borough by cleanliness grade and identify the cleanest cuisine to dine on
- Last, rank the top restaurants in that borough and cuisine by cleanliness grade and choose the cleanest restaurant to dine in
The first step is to rank all five boroughs by their average cleanliness score for each year over the five years of data that we have. I visualized this as a column chart showing the grade each borough got each year, and a table showing the maximum average score for each year and the corresponding borough.
We can clearly see that in every year except 2020, Staten Island got the highest average grade.
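The year-by-borough averages and the "best borough per year" table can be sketched as below. It assumes, as the post does, that a higher score is better. The rows and the borough labels are invented placeholders, not results from the dataset.

```python
import pandas as pd

# Invented placeholder rows; "Borough A" and "Borough B" are not real results.
df = pd.DataFrame(
    {
        "INSPECTION DATE": pd.to_datetime(
            ["2019-02-01", "2019-08-15", "2019-05-20",
             "2020-03-10", "2020-06-01", "2020-09-30"]
        ),
        "BORO": ["Borough A", "Borough A", "Borough B",
                 "Borough A", "Borough B", "Borough B"],
        "SCORE": [95, 93, 90, 91, 94, 96],
    }
)

# Average score per year and borough: years as rows, boroughs as columns.
yearly_avg = (
    df.assign(year=df["INSPECTION DATE"].dt.year)
    .groupby(["year", "BORO"])["SCORE"]
    .mean()
    .unstack("BORO")
)

# For each year, the highest average and the borough that achieved it.
# Assumption: higher score means cleaner, following the post's convention.
best_per_year = pd.DataFrame(
    {"best_boro": yearly_avg.idxmax(axis=1), "best_avg": yearly_avg.max(axis=1)}
)
print(best_per_year)
```

`yearly_avg.plot(kind="bar")` would produce a grouped column chart of the same table.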
The second step is to identify the cleanest cuisines by average ranked score, which involves two tasks:
- Establish the trend of average score per cuisine type in the borough with the best health score, grouped by year and ranked by overall average health score in descending order
- Based on the average trend, choose a cuisine from the previously selected borough with the best health score trend
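The two tasks above could be sketched with a pivot table: cuisines as rows, years as columns, ordered by each cuisine's overall average. The `cuisine_trend` helper, the precomputed `year` column and all values are hypothetical, and a higher score is again assumed to be better.

```python
import pandas as pd

# Invented rows; "year" is assumed to be derived from INSPECTION DATE beforehand.
df = pd.DataFrame(
    {
        "BORO": ["Staten Island"] * 4 + ["Manhattan"],
        "CUISINE DESCRIPTION": [
            "Pizza", "Pizza", "Frozen Desserts", "Frozen Desserts", "Pizza",
        ],
        "year": [2019, 2020, 2019, 2020, 2019],
        "SCORE": [90, 92, 96, 98, 99],
    }
)


def cuisine_trend(df, boro):
    """Average score per cuisine and year in one borough, best cuisine first."""
    subset = df[df["BORO"] == boro].dropna(subset=["SCORE"])
    trend = subset.pivot_table(
        index="CUISINE DESCRIPTION", columns="year", values="SCORE", aggfunc="mean"
    )
    # Order rows by overall average score, descending.
    order = (
        subset.groupby("CUISINE DESCRIPTION")["SCORE"]
        .mean()
        .sort_values(ascending=False)
        .index
    )
    return trend.loc[order]


trend = cuisine_trend(df, "Staten Island")
print(trend)
```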
Individual Case Data Processing & Main Takeaways - Pt. 2
Because this data set is composed of multiple cuisines, which may or may not appear in every year from 2018 to 2022, we had to grade each cuisine with two different metrics: the average score the cuisine received in a given year and how many times the cuisine was inspected that year.
By reading the bubble chart displayed above, we can determine that frozen desserts not only received excellent scores on average each year, but were also inspected the most year over year.
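Both metrics can be computed in one pass with a named aggregation. A minimal sketch on invented rows; the `avg_score` and `inspections` column names are my own, not the author's:

```python
import pandas as pd

# Invented inspection rows for a single borough.
df = pd.DataFrame(
    {
        "CUISINE DESCRIPTION": [
            "Frozen Desserts", "Frozen Desserts", "Frozen Desserts",
            "Frozen Desserts", "Pizza", "Pizza",
        ],
        "year": [2019, 2019, 2019, 2020, 2019, 2019],
        "SCORE": [96, 98, 94, 97, 90, 92],
    }
)

# One row per (cuisine, year): mean score and number of inspections.
metrics = (
    df.groupby(["CUISINE DESCRIPTION", "year"])["SCORE"]
    .agg(avg_score="mean", inspections="count")
    .reset_index()
)
print(metrics)
```

A cuisine that is missing in some year simply has no row for that year, which is why the inspection count is useful alongside the average.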
The third step is to identify the cleanest Staten Island restaurants in the Frozen Desserts cuisine type by average ranked score. I did this by looking at:
- The average score the restaurant received in a given year and
- How many times the restaurant was inspected that year
Again, by reading the bubble chart, we can clearly determine from the size of the blue bubbles that Carvel not only has a very high overall average score, but was also inspected the most of all restaurants in the frozen desserts cuisine on Staten Island.
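A bubble chart of this kind can be drawn with PyPlot's `scatter`, mapping the inspection count to the marker size. The sketch below is not the author's plotting code: the `bubble_chart` helper, the `size_scale` factor, and the restaurant names and values are all hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Invented per-restaurant, per-year metrics (average score, inspection count).
metrics = pd.DataFrame(
    {
        "name": ["Shop A", "Shop A", "Shop B", "Shop B"],
        "year": [2019, 2020, 2019, 2020],
        "avg_score": [95.0, 97.0, 91.0, 93.0],
        "inspections": [4, 6, 1, 2],
    }
)


def bubble_chart(metrics, label_col="name", size_scale=60):
    """Plot average score per year; bubble area grows with inspection count."""
    fig, ax = plt.subplots(figsize=(8, 5))
    for label, group in metrics.groupby(label_col):
        ax.scatter(
            group["year"],
            group["avg_score"],
            s=group["inspections"] * size_scale,  # marker area in points^2
            alpha=0.6,
            label=label,
        )
    ax.set_xlabel("Year")
    ax.set_ylabel("Average score")
    ax.set_xticks(sorted(metrics["year"].unique()))
    ax.legend(title=label_col)
    return fig, ax


fig, ax = bubble_chart(metrics)
```

Calling `fig.savefig(...)` with a file name writes the chart to disk. The same helper works for the cuisine-level chart if `label_col` points at the cuisine column.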
Finally, we take a deeper dive into the data that we have for Carvel to better understand its average ranking and cumulative score curve over time. As we can see, Carvel received a score of 90 or higher every year and a cumulative score of 450 from 2018 to 2022. The cumulative score curve rises steadily, with no sudden drop.
We can also determine that the restaurant received its lowest score, 90, in 2022 and its highest, 97, in 2020, during the height of the pandemic.
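A cumulative score curve is a running sum of the yearly averages, and the lowest and highest years fall out of the same series. The sketch below uses invented values for a hypothetical restaurant, not Carvel's actual scores.

```python
import pandas as pd

# Invented yearly average scores for a hypothetical restaurant.
yearly = pd.Series(
    {2018: 88.0, 2019: 91.0, 2020: 95.0, 2021: 93.0, 2022: 90.0},
    name="avg_score",
)

summary = pd.DataFrame(
    {
        "avg_score": yearly,
        "cumulative": yearly.cumsum(),  # running total across years
        "change": yearly.diff(),        # year-over-year change; NaN in first year
    }
)

lowest_year = yearly.idxmin()
highest_year = yearly.idxmax()
print(summary)
print(f"lowest: {lowest_year}, highest: {highest_year}")
```

Since every yearly score is positive, the cumulative column always rises; a sharp fall in score shows up as a large negative value in `change` and as a flatter segment of the curve.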
Future Work
- Integrate with additional 3rd party information from Yelp
- Perform more complex analysis combining ratings from the NYC public data set and the 3rd party reviews
- Develop a website to allow users to query and view this data analysis