Exploratory Data Analysis on Dinner Recipes

Posted on May 24, 2020
The skills the author demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.

Data analysis starts with data. A vast amount of information is available on the internet that, once gathered and structured, can be analyzed to extract insight.


With today's fast-paced lifestyle, many people do not want to spend much time cooking dinner or shopping for groceries, especially on weekdays. Many recipe websites have collected a large number of recipes and categorized them so that users can find dishes by main ingredient type (beef, chicken, vegetarian, etc.). However, few of them, if any, let users choose a recipe based on the time needed to get the dish ready and/or the ingredients on hand. I scraped the simplyrecipes.com website for its dinner recipes and focused on the following:

  1. perform exploratory analysis on the distribution and correlation of time (prep and cook) and ingredients across dish types (beef, chicken, pork, seafood, lamb, and vegetarian)
  2. write a script that recommends recipes to the user based on one or more criteria:
    1. dish type desired
    2. time intended for preparing & cooking
    3. ingredients at hand

Data Analysis

Among the dinner recipes labeled with a main ingredient (dish type), chicken dishes ranked first, followed by vegetarian, beef, and pork. Seafood, pasta, and lamb are less common. However, about 30% of the recipes were not clearly labeled. Some have mixed main ingredients, but the majority could have been categorized into existing dish types. For the website, this is a reminder that missing labels may cause those recipes to fall through the cracks when users search by label.


Prep & Cook Time


In general, prep time for a dish averages around 17 minutes, and cook time averages around 50 minutes. The prep time and cook time of a recipe are not correlated (Pearson coefficient 0.0384). However, an ANOVA test revealed that prep and cook times do differ between dish types. By t-test, the dish types can be roughly separated into 3 tiers: seafood is the fastest, followed by pasta, vegetarian, and chicken, while beef, pork, and lamb generally take the longest to prepare and cook.
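
As a rough illustration of the two statistics mentioned above, here is a minimal sketch in plain Python. The numbers are made-up placeholder values, not the scraped data, and the function names `pearson_r` and `anova_f` are hypothetical. The original analysis most likely relied on a statistics library rather than hand-written formulas.

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of observations."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical prep and cook times in minutes (illustrative only).
prep = [10, 15, 20, 25, 15]
cook = [30, 60, 20, 45, 90]
print(f"Pearson r: {pearson_r(prep, cook):.3f}")

# Hypothetical cook times grouped by dish type (illustrative only).
cook_by_type = {
    "seafood": [15, 20, 25],
    "chicken": [40, 45, 50],
    "beef": [90, 100, 110],
}
print(f"ANOVA F: {anova_f(list(cook_by_type.values())):.1f}")
```

A large F value suggests that at least one dish type differs from the others, which is what motivates the follow-up pairwise t-tests. Turning F into a p-value requires the F distribution, which this sketch leaves out.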

Natural language processing packages were used to analyze the ingredients section of each recipe. Non-nouns (adjectives, verbs, etc.) were identified by part of speech and removed. Measurement units, spices, and sauces were removed manually by adding them to the stop words list. As a result, the ingredients section now mainly reflects the ingredient items themselves, which can be analyzed with a word cloud.
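
The stop-word part of this cleaning step could look roughly like the sketch below. It is a simplification: it uses only the standard library, so the part-of-speech filtering described above is not reproduced. The stop word list and the function name `clean_ingredient_line` are assumptions for illustration, not the author's actual code.

```python
import re

# Hypothetical stop words: measurement units, seasonings, and preparation
# words. The list actually used in the post is not given.
STOP_WORDS = {
    "cup", "cups", "tablespoon", "tablespoons", "teaspoon", "teaspoons",
    "pound", "pounds", "ounce", "ounces", "salt", "sauce", "of", "and",
    "chopped", "diced", "minced", "fresh", "large",
}

def clean_ingredient_line(line):
    """Return lowercase alphabetic tokens that are not in STOP_WORDS."""
    # Keeping only runs of letters also drops quantities such as "2" or "1/2".
    tokens = re.findall(r"[a-z]+", line.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(clean_ingredient_line("2 cups chopped yellow onion"))
print(clean_ingredient_line("1 tablespoon soy sauce"))
```

In the first example the adjective "yellow" survives because it is not in the stop word list. That is the kind of leftover a part-of-speech filter would catch.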

Word Cloud Data Analysis 

Word cloud analysis of the auxiliary ingredients (that is, ingredients other than beef for a beef dish, with spices and sauces excluded) reveals that onion, pepper, tomato, cheese, and egg dominate across all types of dishes.
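
A word cloud is essentially a picture of term frequencies, so the counts behind it can be sketched with `collections.Counter`. The sample recipes and the function name `auxiliary_counts` below are hypothetical. Treating the dish type label as the main ingredient to exclude is also an assumption about how "auxiliary" was defined.

```python
from collections import Counter

# Hypothetical cleaned ingredient lists, one entry per recipe.
sample_recipes = [
    {"dish_type": "beef", "ingredients": ["beef", "onion", "tomato"]},
    {"dish_type": "beef", "ingredients": ["beef", "onion", "cheese"]},
    {"dish_type": "lamb", "ingredients": ["lamb", "tomato", "onion"]},
]

def auxiliary_counts(recipes):
    """Count how many recipes use each ingredient, skipping the main one."""
    counts = Counter()
    for recipe in recipes:
        # A set makes each ingredient count at most once per recipe.
        aux = set(recipe["ingredients"]) - {recipe["dish_type"]}
        counts.update(aux)
    return counts

print(auxiliary_counts(sample_recipes).most_common(3))
```

Restricting the input to the recipes of a single dish type gives the per-type rankings.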

This information can serve as a reference from several perspectives: a) for customers, keeping those ingredients on hand maximizes the number of recipes you can make; b) for stores, consider stocking up on those ingredients and placing them somewhere easily accessible, and consider bundling some less popular ingredients with the most popular ones as a sale package; c) for the recipe website, this could reflect American dietary preferences, or it might be a sign of a lack of diversity in the recipes collected.

All Recipes Word Cloud

The same dominant auxiliary ingredients appear at the top of the list for each dish type, although their ranks shift. Apparently lamb is most often cooked with tomato, beef with onion, and seafood with lemon and onion. So if you don’t have access to a recipe at some point, go with those general rules and you probably won’t stray too far.

Beef Dishes Word Cloud

Lamb Dishes Word Cloud

Seafood Dishes Word Cloud

Vegetarian Dishes Word Cloud

Data Results

Finally, I wrote an interactive Python script to help the user choose recipes based on the criteria provided. A user can choose the type of dish (beef, chicken, pork, seafood, lamb, or vegetarian), the maximum prep and cook time in minutes, and the ingredients on hand. Any step can be skipped if it is not a concern. For ingredients, the script considers a recipe a match if half of the ingredients it requires are matched. After the user enters these requirements, the script outputs a table of recipes that meet the user’s needs, filtered from over 700 recipes in the original database.
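
The filtering logic described above might be sketched as follows. This is not the author's script: the function names, the record fields, and the sample recipes are all hypothetical. It also assumes that the time limit applies to prep plus cook time combined and that "half" means at least 50% of a recipe's ingredients.

```python
def matches_ingredients(required, on_hand, threshold=0.5):
    """True if at least `threshold` of the required ingredients are on hand."""
    required = set(required)
    if not required:
        return True
    matched = len(required & set(on_hand))
    return matched / len(required) >= threshold

def recommend(recipes, dish_type=None, max_minutes=None, on_hand=None):
    """Return names of matching recipes; a criterion left as None is skipped."""
    results = []
    for r in recipes:
        if dish_type is not None and r["dish_type"] != dish_type:
            continue
        if max_minutes is not None and r["prep"] + r["cook"] > max_minutes:
            continue
        if on_hand is not None and not matches_ingredients(r["ingredients"], on_hand):
            continue
        results.append(r["name"])
    return results

# Hypothetical records standing in for the scraped database.
sample_recipes = [
    {"name": "Beef Stew", "dish_type": "beef", "prep": 20, "cook": 120,
     "ingredients": ["beef", "onion", "carrot", "potato"]},
    {"name": "Lemon Shrimp", "dish_type": "seafood", "prep": 10, "cook": 10,
     "ingredients": ["shrimp", "lemon", "garlic", "butter"]},
    {"name": "Beef Tacos", "dish_type": "beef", "prep": 15, "cook": 15,
     "ingredients": ["beef", "onion", "tomato", "cheese"]},
]

print(recommend(sample_recipes, dish_type="beef", max_minutes=45,
                on_hand=["beef", "onion"]))
```

Using `None` as the default for every criterion mirrors the "each step can be skipped" behavior: calling `recommend(sample_recipes)` with no filters returns every recipe.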

Future Developments

  1. The database could be expanded to include other recipe websites. For websites that provide nutrition and calorie information, those parameters could be added to the filtering script as well.
  2. The cleaned database could be exported to SQL so other users can run queries.
  3. Based on analysis of recipe contents and descriptions, a recipe-writing algorithm could be developed. Newly created recipes could be validated by chefs before being announced to users.
  4. The filtering system could be integrated with the recipe website and an online grocery shopping app, e.g., Amazon Fresh. The user would select next week’s recipes, and the app would calculate the kinds and amounts of ingredients needed and add them to a shopping list on the recipe website. That list would connect to the shopping cart of, say, Amazon Fresh, and place the order, so all ingredients for next week’s dinners would be delivered in 2 hours.
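
As a small illustration of the SQL export idea in item 2, the sketch below loads a few rows into SQLite using Python's built-in `sqlite3` module and runs a query. The table name, column names, and rows are hypothetical, and an in-memory database is used only to keep the example self-contained.

```python
import sqlite3

# Hypothetical rows: (name, dish type, prep minutes, cook minutes).
rows = [
    ("Beef Tacos", "beef", 15, 15),
    ("Lemon Shrimp", "seafood", 10, 10),
    ("Beef Stew", "beef", 20, 120),
]

# ":memory:" keeps the database in RAM; pass a file path to persist it.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recipes ("
    "name TEXT, dish_type TEXT, prep_min INTEGER, cook_min INTEGER)"
)
conn.executemany("INSERT INTO recipes VALUES (?, ?, ?, ?)", rows)
conn.commit()

# Example query: recipes that take at most 30 minutes in total.
quick = conn.execute(
    "SELECT name FROM recipes WHERE prep_min + cook_min <= ? ORDER BY name",
    (30,),
).fetchall()
conn.close()
print(quick)
```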

About Author

Lu Yu

Certified data scientist with a Ph.D. in biology and experience in genomic sequencing data analysis. Specialized in machine learning, big data, and deep learning. A detail-oriented and goal-driven researcher that is also organized in project management. Confident in...