Data Analysis of New York Restaurant Inspections

Posted on Jun 22, 2022

Github | Presentation | Linkedin

The skills the author demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.

Choosing a favorite restaurant in New York City is a joyful task with many possibilities depending on the occasion, mood and even the time of year. As one of the most diverse cities in the world, with around 800 languages spoken, New York offers restaurants serving almost every kind of cuisine. Mobile and web applications such as Yelp, Google Business Reviews and Grubhub are often the starting point for research, and for many diners they provide enough data to get an idea of other restaurant goers' experiences at each restaurant.

To operate, every restaurant is graded and has to pass an inspection program conducted by NYC every year. The result is an important factor that many restaurant goers consider before eating at a restaurant. However, few restaurant search applications offer details on a restaurant's health record over time.

Problem Statement & Motivation

There are two major problems identified as a part of this research:

  • Lack of sanitation-based restaurant rankings
  • Existing applications rank restaurants based on mostly anonymous user feedback

The motivation of this research is to analyze NYC inspection data to identify the best restaurant in the best borough, better understand the overall health of the city's restaurants, and analyze one of the many potential user journeys.

Tools Used

The technology stack for my project is composed of:

  • MariaDB, which is a fork of MySQL, for data storage and filtering
  • Python for data processing
  • Pandas and PyPlot for data visualization

In addition, I replicated the views generated in Python in Google Sheets so they are easier to integrate and view in the Google Slides presentation linked above, as well as in this blog post.

About Data Set

For the Python Analysis Project, I chose the NYC Restaurants Health Inspection open data source provided by NYC Open Data.

The dataset contains every sustained or not yet adjudicated violation citation from every full inspection for restaurants in an active status on the record date (date of the data pull). Establishments are uniquely identified by their CAMIS (unique ID) number. Establishments with inspection date of 1/1/1900 are new establishments that have not yet received an inspection.

For the purposes of this research all establishments with inspection date of 1/1/1900 were excluded from the dataset.
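The exclusion above can be sketched in pandas roughly as follows. This is a minimal illustration, not the author's actual code: the sample rows are made up, the helper name `drop_uninspected` is hypothetical, and the `MM/DD/YYYY` date format is an assumption about how the export stores dates.

```python
import pandas as pd

# Tiny made-up sample that reuses the dataset's column names.
sample = pd.DataFrame({
    "CAMIS": [1, 2, 3],
    "DBA": ["A", "B", "C"],
    "INSPECTION DATE": ["01/01/1900", "03/15/2021", "07/04/2019"],
})

def drop_uninspected(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows whose inspection date is the 1/1/1900 placeholder."""
    # Assumed date format; adjust if the export uses a different one.
    dates = pd.to_datetime(df["INSPECTION DATE"], format="%m/%d/%Y")
    return df.loc[dates != pd.Timestamp("1900-01-01")].copy()

inspected = drop_uninspected(sample)
print(inspected["CAMIS"].tolist())
```

Parsing the column to real dates first avoids depending on the exact string used for the placeholder.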

Key Overview:

  • Update Frequency: Daily
  • Number of records: 301,194
  • Agency: Department of Health and Mental Hygiene (DOHMH)
  • Each row is a: Restaurant Citation

The table below lists the major variables I used in this research.

Column Name            Column Description
CAMIS                  Unique identifier for the establishment (restaurant)
DBA                    Establishment (restaurant) name
BORO                   Borough of establishment (restaurant) location
CUISINE DESCRIPTION    Establishment (restaurant) cuisine
INSPECTION DATE        Date of inspection
SCORE                  Total score for a particular inspection
GRADE                  Grade associated with the inspection
INSPECTION TYPE        A combination of the inspection program and the type of inspection performed

Table: Major Variables

Overview & Interesting Facts

The New York City Restaurant Inspection Results dataset contains 301,194 records, and after cleaning the data set there are around 19,366 restaurants across all five boroughs. Manhattan, the smallest borough by area, has 7,302 restaurants, the most cuisine variations and the highest number of inspections. Brooklyn has 5,026 restaurants, Queens 4,540, the Bronx 1,747 and Staten Island 721.
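Because each row of the dataset is a citation rather than a restaurant, per-borough restaurant counts like these come from counting distinct CAMIS values. A minimal pandas sketch, using made-up rows and a hypothetical helper name:

```python
import pandas as pd

# Made-up citation rows: restaurant 101 and 301 each appear twice.
rows = pd.DataFrame({
    "CAMIS": [101, 101, 102, 201, 301, 301],
    "BORO": ["Manhattan", "Manhattan", "Manhattan",
             "Brooklyn", "Queens", "Queens"],
})

def restaurants_per_borough(df: pd.DataFrame) -> pd.Series:
    """Count distinct establishments (CAMIS) in each borough."""
    return df.groupby("BORO")["CAMIS"].nunique().sort_values(ascending=False)

counts = restaurants_per_borough(rows)
print(counts)
```

Counting rows instead of distinct CAMIS values would overstate boroughs whose restaurants have more citations.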

Looking at the overall number of restaurants in the five boroughs, it is easy to conclude that Manhattan is the easiest place for New Yorkers to find diverse food offerings.

Staten Island has the highest health score, but Manhattan comes close with a very high health score of its own. But what is the favorite cuisine in all five boroughs?

We can see that American cuisine is the most common, and by that measure the favorite, type of cuisine in New York City. While some boroughs are influenced by a specific culture, the data show that American and Chinese cuisine, along with pizza and coffee and tea, are very popular in most of the boroughs.
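A ranking of cuisines by record count within each borough could be produced along these lines. The rows and the helper `top_cuisines` are illustrative assumptions, not the code behind the original chart.

```python
import pandas as pd

# Made-up records using the dataset's column names.
rows = pd.DataFrame({
    "BORO": ["Manhattan"] * 4 + ["Queens"] * 3,
    "CUISINE DESCRIPTION": ["American", "American", "Chinese", "Pizza",
                            "Chinese", "Chinese", "American"],
})

def top_cuisines(df: pd.DataFrame, n: int = 2) -> pd.DataFrame:
    """Return the n cuisines with the most records in each borough."""
    counts = (
        df.groupby(["BORO", "CUISINE DESCRIPTION"])
        .size()
        .rename("records")
        .reset_index()
    )
    # Sort so the largest record counts come first within each borough.
    counts = counts.sort_values(["BORO", "records"], ascending=[True, False])
    return counts.groupby("BORO").head(n).reset_index(drop=True)

top = top_cuisines(rows, n=1)
print(top)
```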

Now that we know which cuisine type has the most records, let's look at the average weighted score per cuisine type.

Reading the chart from left to right, the two cuisines with the lowest scores are Portuguese and Bangladeshi; reading it from right to left, the cuisines with the highest average weighted scores are American and Scandinavian.

Individual Case Data Processing & Main Takeaways - Pt. 1

Because the dataset contained more than 300,000 rows, I had to categorize and filter the data logically. This meant that I would:

  1. Rank boroughs by cleanliness grade and identify the borough with the top overall grade
  2. Grade each cuisine in that borough by cleanliness grade and identify the cleanest cuisine to dine on
  3. Rank the top restaurants in that borough and cuisine by cleanliness grade and choose the cleanest restaurant to dine in

The first step is to rank all five boroughs by their average cleanliness score for each of the five years of data that we have. I visualized this data as a column chart showing the grade each borough received each year, and a table showing the maximum average score for each year and the corresponding borough.

We can clearly see that in every year except 2020, Staten Island received the highest average grade.
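The yearly borough ranking from this first step can be sketched as a pivot of average scores followed by a per-year maximum. The numbers below are invented for illustration, the function names are hypothetical, and the sketch follows the post's convention that a higher score is better.

```python
import pandas as pd

# Invented inspection rows; real data would come from the cleaned dataset.
rows = pd.DataFrame({
    "BORO": ["Staten Island", "Staten Island", "Manhattan",
             "Manhattan", "Staten Island", "Manhattan"],
    "INSPECTION DATE": pd.to_datetime([
        "2019-02-01", "2019-08-01", "2019-03-01",
        "2020-05-01", "2020-06-01", "2020-09-01",
    ]),
    "SCORE": [95, 93, 90, 96, 91, 94],
})

def yearly_borough_scores(df: pd.DataFrame) -> pd.DataFrame:
    """Average score per year (rows) and borough (columns)."""
    year = df["INSPECTION DATE"].dt.year.rename("YEAR")
    return df.groupby([year, "BORO"])["SCORE"].mean().unstack("BORO")

def best_borough_per_year(table: pd.DataFrame) -> pd.DataFrame:
    """For each year, the borough with the highest average score."""
    return pd.DataFrame({
        "BORO": table.idxmax(axis=1),
        "AVG SCORE": table.max(axis=1),
    })

table = yearly_borough_scores(rows)
best = best_borough_per_year(table)
print(best)
```

The first table feeds a column chart directly (for example with `table.plot(kind="bar")`), while the second corresponds to the summary table of yearly winners.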

The second step is to identify the cleanest cuisines by average ranked score. This involves:

  • Establishing a trend of average score per cuisine type in the selected borough (the one with the best health score), grouped by year and ranked by overall average health score in descending order
  • Choosing, based on that trend, the cuisine in the selected borough with the best health score trend

Individual Case Data Processing & Main Takeaways - Pt. 2

Because this data set contains many cuisines, not all of which appear in every year from 2018 to 2022, we had to evaluate each cuisine with two different metrics: the average score the cuisine received in a given year and how many times that cuisine was inspected in that year.

Reading the bubble chart displayed above, we can determine that frozen desserts not only received excellent scores on average each year, but were also inspected the most year over year.
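The two metrics behind the bubble chart (average score and number of inspections per cuisine and year) can be computed in one grouped aggregation. This is a sketch under assumptions: the rows are made up, `cuisine_year_metrics` is a hypothetical helper, and one inspection is assumed to be identified by a CAMIS and inspection date pair, since the raw data has one row per citation.

```python
import pandas as pd

# Made-up citation rows; the first two belong to the same inspection.
rows = pd.DataFrame({
    "CAMIS": [1, 1, 2, 3, 4],
    "BORO": ["Staten Island"] * 4 + ["Manhattan"],
    "CUISINE DESCRIPTION": ["Frozen Desserts", "Frozen Desserts",
                            "Frozen Desserts", "Pizza", "Pizza"],
    "INSPECTION DATE": pd.to_datetime([
        "2021-03-01", "2021-03-01", "2021-06-01", "2021-04-01", "2021-04-01",
    ]),
    "SCORE": [95, 95, 91, 88, 70],
})

def cuisine_year_metrics(df: pd.DataFrame, boro: str) -> pd.DataFrame:
    """Average score and inspection count per cuisine and year in one borough."""
    # Assumption: (CAMIS, INSPECTION DATE) identifies a single inspection.
    subset = df[df["BORO"] == boro].drop_duplicates(["CAMIS", "INSPECTION DATE"])
    year = subset["INSPECTION DATE"].dt.year.rename("YEAR")
    return (
        subset.groupby(["CUISINE DESCRIPTION", year])["SCORE"]
        .agg(avg_score="mean", inspections="count")
        .reset_index()
    )

metrics = cuisine_year_metrics(rows, "Staten Island")
print(metrics)
```

Each output row maps to one bubble: the year and `avg_score` give its position and `inspections` its size, for instance through the `s` argument of `matplotlib.pyplot.scatter`.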

The third step is to identify the cleanest restaurants, by average ranked score, among Staten Island restaurants in the Frozen Desserts cuisine type. I did this by looking at:

  • The average score the restaurant received in a given year and
  • How many times the restaurant was inspected that year

Again, reading the bubble chart, we can clearly determine that Carvel not only has a very high overall average score, but was also inspected the most out of all frozen desserts restaurants on Staten Island, as shown by the size of the blue bubbles.

Finally, we take a deeper dive into the data for Carvel to better understand its average ranking and cumulative score curve over time. As we can see, Carvel received a score of 90 or higher every year and a cumulative score of 450 from 2018 to 2022. The cumulative score curve rises steadily with no sudden drops.

We can also see that the restaurant received its lowest score, 90, in 2022 and its highest, 97, in 2020, during the height of the pandemic.
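A cumulative score curve of this kind is a running sum over yearly average scores. The sketch below uses a hypothetical establishment with made-up scores (not Carvel's actual values) and a hypothetical helper name.

```python
import pandas as pd

# Made-up inspections for two hypothetical establishments.
rows = pd.DataFrame({
    "CAMIS": [7, 7, 7, 7, 8],
    "INSPECTION DATE": pd.to_datetime([
        "2018-05-01", "2019-03-01", "2019-09-01", "2020-06-01", "2020-01-15",
    ]),
    "SCORE": [80, 84, 86, 90, 60],
})

def cumulative_yearly_scores(df: pd.DataFrame, camis: int) -> pd.DataFrame:
    """Yearly average score and its running total for one establishment."""
    one = df[df["CAMIS"] == camis]
    year = one["INSPECTION DATE"].dt.year.rename("YEAR")
    yearly = one.groupby(year)["SCORE"].mean().sort_index()
    return pd.DataFrame({"avg_score": yearly, "cumulative": yearly.cumsum()})

curve = cumulative_yearly_scores(rows, camis=7)
print(curve)
```

A year with a sharply lower average would show up as a visibly flatter segment in the cumulative line.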

Future Work

  • Integrate with additional 3rd party information from Yelp
  • Perform more complex analysis combining ratings from the NYC public data set and the 3rd party reviews
  • Develop a website to allow users to query and view this data analysis

Stay tuned!

About Author

Adna Lakisic

I am a Data Scientist in the making with a background in software engineering.