What's for Dinner? - a web scraping project

Posted on May 24, 2020

To do data analysis, we first need data. A vast amount of information is floating around the internet that, if gathered, could be structured and analyzed to extract insight.


With today's fast-paced lifestyle, people may not want to spend much time cooking dinner and shopping for groceries, especially on weekdays. Many recipe websites have collected a great number of recipes and neatly categorized them so users can find dishes by main ingredient type (beef, chicken, vegetarian, etc.). However, few of them, if any, allow users to choose a recipe based on the time needed to get the dish ready and/or the ingredients on hand. I scraped the simplyrecipes.com website for its dinner recipes and focused on the following:

  1. exploratory analysis of the distribution and correlation of time (prep and cook) and ingredients across dish types (beef, chicken, pork, seafood, lamb, and vegetarian)
  2. a script that recommends recipes to the user based on one or more criteria:
    1. dish type desired
    2. time intended for preparing & cooking
    3. ingredients at hand


Among the dinner recipes that are labeled with their main ingredient (dish type), chicken dishes rank first, followed by vegetarian, beef, and pork. Seafood, pasta, and lamb are less common. However, about 30% of the recipes were not clearly labeled. Some have mixed main ingredients, but the majority could have been categorized into existing dish types. A note for the website: missing labels may cause those recipes to fall through the cracks when users search by label.

On average, prep time for a dish is around 17 minutes and cook time is around 50 minutes. The prep time and cook time of a recipe are not correlated (Pearson coefficient 0.0384). However, an ANOVA test revealed that prep/cook times do differ between dish types. By t-test, the prep/cook times can be roughly separated into 3 tiers: seafood is the fastest; followed by pasta, vegetarian, and chicken; while beef, pork, and lamb generally take the longest to prepare and cook.
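For readers who want to see what these statistics compute, here is a minimal pure-Python sketch of the Pearson coefficient and the one-way ANOVA F statistic. The function names and the cook times below are made up for illustration and are not the scraped data. In practice a library such as SciPy (`scipy.stats.pearsonr`, `f_oneway`, `ttest_ind`) would also supply the p-values.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups (each a list of numbers)."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each value around its own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical cook times in minutes, for illustration only
cook_times = {
    "seafood": [15, 20, 25],
    "chicken": [35, 40, 45],
    "beef": [80, 90, 100],
}
f_stat = anova_f(list(cook_times.values()))
```

A large F statistic means the differences between dish types are large relative to the spread within each type, which is the pattern the ANOVA result above describes.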

Natural language processing packages were used to analyze the ingredients section of each recipe. Non-nouns (adjectives, verbs, etc.) were identified by part of speech and removed. Measurement units, spices, and sauces were removed manually by adding them to the stop-word list. As a result, the ingredients section mainly reflects the ingredient items themselves, which can be analyzed with a word cloud.
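The cleaning step can be sketched roughly as follows. This is not the original code: the word lists are hypothetical, and the small `NON_NOUNS` set only stands in for a real part-of-speech tagger from an NLP package so that the example runs with the standard library alone.

```python
import re

# Hypothetical stop-word list: measurement units, spices, and sauces
STOP_WORDS = {
    "cup", "cups", "tablespoon", "tablespoons", "teaspoon", "teaspoons",
    "pound", "pounds", "ounce", "ounces", "salt", "paprika", "sauce",
}

# Stand-in for part-of-speech tagging: a real pipeline would ask a tagger
# whether each token is a noun instead of using a fixed set like this one.
NON_NOUNS = {"chopped", "diced", "minced", "sliced", "fresh", "large", "small"}


def clean_ingredient_line(line):
    """Return the ingredient words left after removing numbers, non-nouns, and stop words."""
    tokens = re.findall(r"[a-z]+", line.lower())  # letters only, so quantities drop out
    return [t for t in tokens if t not in STOP_WORDS and t not in NON_NOUNS]


cleaned = [
    clean_ingredient_line("2 tablespoons chopped fresh parsley"),
    clean_ingredient_line("1 large onion, diced"),
]
```

Here `cleaned` holds one short word list per ingredient line, which is the form a word cloud or a frequency count can consume directly.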

Word cloud analysis of the auxiliary ingredients (that is, ingredients other than beef for a beef dish, with spices and sauces excluded) reveals that onion, pepper, tomato, cheese, and egg dominate across all types of dishes. This information could serve as a reference from several perspectives: a) for customers, keeping those ingredients on hand maximizes the number of recipes you can make; b) for stores, consider stocking up on those ingredients and placing them somewhere easily accessible, and consider bundling some less popular ingredients with the most popular ones as a sale package; c) for the recipe website, this could reflect American dietary preferences, or it might be a sign of a lack of diversity in the recipes they collect.

The same dominant auxiliary ingredients appear at the top of the list for each individual dish type, although their ranks shift. Lamb appears to be cooked most often with tomato, beef with onion, and seafood with lemon and onion. So if you don’t have access to a recipe at some point, go with those general rules and you probably won’t stray too far.
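A word cloud is essentially a picture of word frequencies, so the per-dish ranking can be approximated with plain counting. The sketch below assumes each recipe is a `(dish_type, ingredients)` pair and uses made-up recipes; the function name is hypothetical.

```python
from collections import Counter, defaultdict


def top_ingredients_by_dish(recipes, n=3):
    """Rank auxiliary ingredients for each dish type by how many recipes use them."""
    counts = defaultdict(Counter)
    for dish_type, ingredients in recipes:
        # Count each ingredient once per recipe and skip the main ingredient itself
        auxiliary = sorted(set(ingredients) - {dish_type})
        counts[dish_type].update(auxiliary)
    return {dish: [name for name, _ in c.most_common(n)] for dish, c in counts.items()}


# Made-up recipes, for illustration only
recipes = [
    ("beef", ["beef", "onion", "tomato"]),
    ("beef", ["beef", "onion", "cheese"]),
    ("lamb", ["lamb", "tomato", "onion"]),
    ("lamb", ["lamb", "tomato"]),
]
top = top_ingredients_by_dish(recipes, n=1)
```

With this toy data, onion comes out on top for beef and tomato for lamb, mirroring the kind of pairing described above.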


Finally, I wrote an interactive Python script to help the user choose recipes based on the criteria provided. A user can choose the type of dish (beef, chicken, pork, seafood, lamb, or vegetarian), the maximum prep and cook time in minutes, and the ingredients on hand. Any step can be skipped if it is not a concern. For ingredients, the script considers a recipe a match if half of the ingredients it requires are matched. After all requirements are entered, the script outputs a table of recipes that meet the user’s needs, filtered from over 700 recipes in the original database.
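The filtering logic can be sketched as below. This is an illustration rather than the original script: the field names and sample recipes are hypothetical, the time limit is assumed to apply to prep and cook time combined, and "half" is read as "at least half". Passing `None` for a criterion skips it.

```python
def matches(recipe, dish_type=None, max_minutes=None, pantry=None):
    """Return True if the recipe satisfies every criterion that was provided."""
    if dish_type is not None and recipe["dish_type"] != dish_type:
        return False
    # Assumption: the limit covers prep time plus cook time
    if max_minutes is not None and recipe["prep_min"] + recipe["cook_min"] > max_minutes:
        return False
    if pantry is not None:
        needed = set(recipe["ingredients"])
        on_hand = needed & set(pantry)
        # Assumption: at least half of the required ingredients must be on hand
        if needed and len(on_hand) * 2 < len(needed):
            return False
    return True


def recommend(recipes, **criteria):
    """Return the names of all recipes that match the given criteria."""
    return [r["name"] for r in recipes if matches(r, **criteria)]


# Made-up recipes, for illustration only
sample = [
    {"name": "Lemon shrimp", "dish_type": "seafood", "prep_min": 10, "cook_min": 10,
     "ingredients": ["shrimp", "lemon", "onion", "butter"]},
    {"name": "Beef stew", "dish_type": "beef", "prep_min": 20, "cook_min": 120,
     "ingredients": ["beef", "onion", "carrot", "potato"]},
    {"name": "Chicken tacos", "dish_type": "chicken", "prep_min": 15, "cook_min": 20,
     "ingredients": ["chicken", "tortilla", "tomato", "cheese"]},
]
quick = recommend(sample, max_minutes=40, pantry=["lemon", "onion", "tomato", "cheese"])
```

In this example the stew is dropped for taking too long, while the other two recipes pass because two of their four ingredients are on hand.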

Future Developments

  1. The database could be expanded to include other recipe websites. For websites that provide nutrition and calorie information, those parameters could be added to the filtering script as well.
  2. Export the cleaned database to SQL so other users can run queries.
  3. Based on analysis of recipe contents and descriptions, a recipe-writing algorithm could be developed; newly created recipes could be validated by chefs before being announced to users.
  4. The filtering system could be integrated with the recipe website and an online grocery shopping app such as Amazon Fresh. The user could select next week’s recipes, the app could calculate the kinds and amounts of ingredients needed and add them to a shopping list on the recipe website, and that list could connect to the shopping cart of, say, Amazon Fresh and place the order, so that all ingredients for next week’s dinners would be delivered in 2 hours.
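As a rough idea of what the SQL export in item 2 might look like, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout, column names, and rows are assumptions made for illustration, not an existing schema.

```python
import sqlite3


def export_recipes(conn, rows):
    """Create a recipes table if needed and load (name, dish_type, prep_min, cook_min) rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS recipes ("
        "name TEXT PRIMARY KEY, dish_type TEXT, prep_min INTEGER, cook_min INTEGER)"
    )
    conn.executemany("INSERT OR REPLACE INTO recipes VALUES (?, ?, ?, ?)", rows)
    conn.commit()


# In-memory database and made-up rows, for illustration only
conn = sqlite3.connect(":memory:")
export_recipes(conn, [("Lemon shrimp", "seafood", 10, 10), ("Beef stew", "beef", 20, 120)])

# Example query another user might run: dinners ready in 30 minutes or less
quick_dinners = conn.execute(
    "SELECT name FROM recipes WHERE prep_min + cook_min <= 30"
).fetchall()
```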

About Author

Lu Yu

Certified data scientist with a Ph.D. in biology and experience in genomic sequencing data analysis. Specialized in machine learning, big data, and deep learning. A detail-oriented and goal-driven researcher that is also organized in project management. Confident in...
