All the recipes: Scraping the top 20 recipes of allrecipes

Yannick Kimmel
Posted on May 30, 2016

Contributed by Yannick Kimmel. He is currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program, taking place from April 11th to July 1st, 2016. This post is based on his third class project, Web scraping (due in the 6th week of the program).

Introduction

The internet is a great resource for chefs and aspiring cooks alike. There is a plethora of cooking videos and recipes. A website that often comes up is allrecipes.com. This is not surprising, as allrecipes is the most popular recipe website (source) and has been around since 1997. Allrecipes is a go-to source for the average American. Every year allrecipes releases the top 20 most popular recipes of that year in what is known as the hall of fame recipes. For my web scraping project, I thought this list would give me insight into food trends in the USA over the last 20 years. Code for the project can be found here.

Hall of fame recipes

Shown below is a screenshot of the overall list of hall of fame recipes by year (page).

halloffame

Clicking on one of the years brings up the list of 20 recipes. The website is heavily visual and has a Pinterest-like look. The design relies on Asynchronous JavaScript and XML (AJAX), so much of the content is loaded by the browser after the initial page request. This makes the site difficult to scrape with tools that parse the static HTML, such as BeautifulSoup and Scrapy. Therefore, I chose the Selenium package in Python, which drives a real browser and can therefore handle AJAX-loaded content.

The URLs for the pages on allrecipes are structured in this generic format: allrecipes.com/recipe/[Unique ID number]. For example, the recipe page for Outrageous Chocolate Cookies has a unique ID number of 10141 and can be accessed at http://allrecipes.com/recipe/10141. The screenshot below shows the page for the Hall of Fame recipes of 1997. After inspecting the page source, I determined that an attribute inside the heart image (boxed in red) for each recipe lists this unique ID number. I then extracted the unique ID numbers (and therefore the unique URLs) for all 20 recipes on each yearly page.

1997halloffame

 

The snippet of code below uses the Selenium package in Python to extract the URLs for all the top 20 recipes of a given year.
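A minimal sketch of this step might look like the following. It is not the project's original code: the CSS selector and the `data-id` attribute name are placeholders, since the post only says that the ID sits in an attribute of the heart image, and the function name `collect_recipe_urls` is made up for illustration. The `driver` argument is expected to be a Selenium WebDriver instance.

```python
BASE_URL = "http://allrecipes.com/recipe/"

# Hypothetical selector and attribute name. The real ones have to be read off
# the page source; the post only says the ID is an attribute of the heart image.
HEART_SELECTOR = "[data-id]"
ID_ATTRIBUTE = "data-id"


def collect_recipe_urls(driver, year_page_url):
    """Load one yearly Hall of Fame page and return the recipe URLs found on it."""
    driver.get(year_page_url)
    urls = []
    # "css selector" is the string value behind Selenium's By.CSS_SELECTOR.
    for element in driver.find_elements("css selector", HEART_SELECTOR):
        recipe_id = element.get_attribute(ID_ATTRIBUTE)
        # Keep purely numeric IDs only, and skip duplicates while preserving order.
        if recipe_id and recipe_id.isdigit():
            url = BASE_URL + recipe_id
            if url not in urls:
                urls.append(url)
    return urls
```

With Selenium installed, a driver such as `webdriver.Firefox()` can be passed in. Because the page content is loaded asynchronously, an explicit wait may be needed before the elements are read.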

Extraction of recipe and ingredient details

Shown below are screenshots of a typical recipe page on allrecipes. Highlighted in red boxes are all the data I collected. This includes the recipe title, user rating, number of people who "made it", number of people who reviewed the recipe, the calories per serving of the recipe, the ingredients, preparation time, cook time, and total time.

exrecipe

The snippet of code used to scrape recipe information from a recipe page is shown below. The extracted data is saved to MongoDB. One collection was created for all the recipe information and another collection was created for the ingredients.
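As a rough illustration of the storage step only, the sketch below splits one scraped recipe into a recipe document and a list of ingredient documents that share the recipe's ID, then writes them to two collections. The function names and the field names (`id`, `ingredient`) are assumptions made for this example, not the project's actual schema. The collection arguments are expected to behave like pymongo collections, which provide `insert_one` and `insert_many`.

```python
def build_documents(recipe_id, details, ingredients):
    """Split one scraped recipe into a recipe document and ingredient documents."""
    # details is a dict of scraped fields (title, rating, calories, ...).
    recipe_doc = dict(details, id=recipe_id)
    # One document per ingredient line, tagged with the recipe ID for later joins.
    ingredient_docs = [{"id": recipe_id, "ingredient": text} for text in ingredients]
    return recipe_doc, ingredient_docs


def save_recipe(recipes_collection, ingredients_collection,
                recipe_id, details, ingredients):
    """Write one recipe to the two collections (pymongo-style collection objects)."""
    recipe_doc, ingredient_docs = build_documents(recipe_id, details, ingredients)
    recipes_collection.insert_one(recipe_doc)
    if ingredient_docs:  # pymongo's insert_many rejects an empty list
        ingredients_collection.insert_many(ingredient_docs)
```

Keeping the ID on every ingredient document is what lets the two collections be joined again later.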

 

Further Python code was needed to wrangle the data into a cleaner format. This was done in a Python notebook using a pandas DataFrame. The cleaning included converting all nonstandard day/hour/minute times into minutes and expanding counts abbreviated with a K suffix into thousands. Lastly, the data was exported to CSV and loaded into R for data visualization.
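The two conversions could be sketched as below. The exact text formats on the site are not shown in this post, so the sketch assumes durations written like `1 d 2 h 30 m` and counts written like `12K` or `845`. The helper names `to_minutes` and `to_count` are made up for illustration.

```python
import re

# Minutes contained in one unit of each (assumed) duration suffix.
_UNIT_MINUTES = {"d": 24 * 60, "h": 60, "m": 1}


def to_minutes(text):
    """Convert a duration such as '1 d 2 h 30 m' into total minutes."""
    total = 0
    for amount, unit in re.findall(r"(\d+)\s*([dhm])", text.lower()):
        total += int(amount) * _UNIT_MINUTES[unit]
    return total


def to_count(text):
    """Convert a counter such as '12K' or '845' into an integer."""
    cleaned = text.strip().upper().replace(",", "")
    if cleaned.endswith("K"):
        return int(round(float(cleaned[:-1]) * 1000))
    return int(cleaned)
```

In a pandas DataFrame, helpers like these can be applied to a column with `Series.map`.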

The (mostly) cleaned data

Scraping 20 recipes per year over 19 years gives a total of 380 observations. The data frame of the recipes is shown below. Each column represents a different aspect of a recipe, and the data can be considered clean at this point and ready for analysis.

Recipes

Screen Shot 2016-05-30 at 8.45.53 PM

The ingredients data frame is shown below. The id number is the unique ID number that allrecipes assigns to each recipe, which allows the ingredients and recipes data frames to be joined if ever needed. I did not separate the quantity of a given ingredient from the name of the ingredient, so this data frame cannot be considered truly clean. However, this was not a problem in my analysis, as I was only interested in how often each ingredient was used and not in the quantity.

Ingredients

Screen Shot 2016-05-30 at 8.11.35 PM

Correlation among recipe variables

One of my first thoughts in my analysis was to explore how the recipe variables correlated with each other. The figure below is a correlation plot of those variables.

corrplot

For the most part, the variables do not correlate well, indicating that the features of the recipe website were well designed and were not redundant with each other. One notable exception is that the number of people who reviewed a recipe was well correlated with the number of people who indicated that they had made the dish.
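For readers who want to see what a single cell of such a plot measures, here is a small sketch of a correlation coefficient between two variables. It assumes the Pearson coefficient; the post does not state which coefficient the R plot used, and the function name is made up for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 or -1 means two columns carry largely the same information, while a value near 0 means they vary independently of each other in a linear sense.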

Word clouds

Word clouds were used to visualize word frequencies, using the R packages tm, SnowballC, and wordcloud. The clouds were produced from a corpus of stemmed words with common English words removed (such as 'the', 'a', 'I', etc.). The first word cloud was made from all the recipe titles. Words such as cooki (the stemmed form of cookie), chicken, chocolate, banana, salad, bread, potato, pie, cake, and bake are the most frequently used in the top recipe titles.
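The counting that sits behind a word cloud can be sketched in a few lines of Python. This is only an illustration of the idea, not the R pipeline used for the figures: the stop-word list here is a tiny hand-picked stand-in, stemming is left out, and the function name is made up.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; real lists (such as the one in tm) are longer.
STOP_WORDS = {"the", "a", "an", "i", "and", "of", "with", "for", "in", "to"}


def word_frequencies(titles):
    """Count how often each non-stop word appears across a list of titles."""
    counts = Counter()
    for title in titles:
        for word in re.findall(r"[a-z']+", title.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts
```

Without stemming, "cookie" and "cookies" are counted separately, which is the problem the stemmed corpus avoids.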

Recipes

recipes

In the ingredients word cloud, the most commonly used words are measurement words such as cup, teaspoon, and tablespoon, which is a consequence of not separating quantities from ingredient names in the ingredients data frame. The most commonly used non-measurement words include sugar, white, ground, butter, salt, bake, and chop. All are fundamental words in cooking.

Ingredients

ingredients

Declining frequency of baked goods ingredients

Results from the word cloud indicated that sugar, flour, eggs, and butter are some of the most frequently used ingredients in the top 20 recipe lists. From my past cooking experience, I know those four ingredients are the foundation of baked goods. The figure below plots the number of times those ingredients are mentioned in the ingredients lists per year. There appears to be a sharp decline in the use of baked-goods ingredients over time, specifically after 2009.
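Because quantities were never separated from ingredient names, counts like these come down to checking whether a keyword appears in each ingredient line. A minimal sketch, assuming each row is a `(year, ingredient_text)` pair obtained by joining ingredients to recipes on the recipe ID (the function name and row layout are assumptions for this example):

```python
from collections import defaultdict

BAKING_INGREDIENTS = ("sugar", "flour", "egg", "butter")


def mentions_per_year(rows, keywords=BAKING_INGREDIENTS):
    """Count ingredient lines per year that mention each keyword.

    rows is an iterable of (year, ingredient_text) pairs.
    Returns {keyword: {year: count}}.
    """
    counts = {keyword: defaultdict(int) for keyword in keywords}
    for year, text in rows:
        lowered = text.lower()
        for keyword in keywords:
            if keyword in lowered:  # plain substring match
                counts[keyword][year] += 1
    return counts
```

A plain substring match is crude: "butter" also matches "peanut butter" and "buttermilk", and "egg" matches "eggplant", so counts produced this way are approximate.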

baking

Median calories per serving

Healthiness of food is a major cultural topic in the USA. One metric of the healthiness of food is calories per serving, and I wanted to explore this further. The figure below plots the median calories per serving of recipes over time. There is an initial jump in calories in the first few years, which could be a reflection of the company starting up. After that, the calories seem to drop slowly over time, which could reflect an increased focus on healthy foods.
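The yearly median is a simple group-and-aggregate step. A minimal sketch, assuming each recipe is reduced to a `(year, calories)` pair and that a missing calorie value is stored as `None` (both assumptions for this example):

```python
from collections import defaultdict
from statistics import median


def median_calories_by_year(recipes):
    """Return {year: median calories per serving} from (year, calories) pairs."""
    by_year = defaultdict(list)
    for year, calories in recipes:
        if calories is not None:  # skip recipes with no calorie value
            by_year[year].append(calories)
    return {year: median(values) for year, values in sorted(by_year.items())}
```

The median is less sensitive than the mean to a single very rich recipe, which matters with only 20 recipes per year.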

calories

Increase of olive oil ingredient frequency

With the frequency of butter dropping over time, I wanted to know if there was another fat, like olive oil, that was replacing it. As shown below, the use of olive oil seems to have been increasing over time, and this could be because of the reported health benefits of olive oil.

olive oil

Limitations of dataset

In this report, I considered this data to reflect the food trends of the average American over time. I found two aspects that showed the limitation of this assumption and the need to put data into context.

a) The need for allrecipes to innovate

The figure below plots the review count of each of the top 20 recipes over time. In the 2000s, many of the top 20 recipes have thousands of reviews from the allrecipes user base. Then after 2010, there is a drop-off in review count for the top 20 recipes. Instead of thousands of reviews, only hundreds of comments are being left on the top recipes. Are Americans less interested in sharing their thoughts on food recipes after 2010? Or does this figure represent less interest among allrecipes users in leaving comments specifically on the allrecipes website? It may be the latter rather than the former. In 2015, allrecipes announced that they were redesigning the website to make it more of a Pinterest-like website (geekwire). Pinterest is designed around users sharing information with other users.

 

Review count

review count

b) The cookie trail

Another word that I noticed was frequently used in my word cloud was cookie. I wanted to understand how interest in cookies changed over time, so the figure below shows the number of recipes each year that have cookie in their title. In 1997, 14 of the top 20 recipes had the word cookie in their title, and over time there is a steep drop-off in interest in cookies. I had heard anecdotally that cookie exchanges were really popular in the late 1990s, so this did not surprise me. But when I looked at the remaining 1997 recipes that did not have the word cookie in their title, it turned out that they were also about cookies even if they did not specifically mention the word 'cookie' (e.g., a recipe for snickerdoodles). So all 20 recipes in 1997 were about cookies, and this made me suspicious. I researched the history of allrecipes.com (Wikipedia article) and found out that it was started in 1997 as cookierecipe.com. This explains why all of the top 1997 recipes are about cookies. I also learned the limits of assuming that the history of allrecipes.com reflects the cooking habits of Americans, at least when the website was only known for cookies.

Number of recipes that mention cookie in recipe title

cookie

Conclusions

In conclusion, my web scraping of the top 20 recipes on allrecipes.com was successful. There were some interesting trends in the use of ingredients over time. However, with only 380 observations, it was difficult to interpret trends that may be more apparent with a larger dataset.

About Author

Yannick Kimmel

Yannick is drawn to solving a wide range of problems - from the traditional sciences to current challenges in data science and machine learning. Yannick holds a PhD in chemical engineering from the University of Delaware, and a...

Leave a Comment

Laurie June 4, 2017
I really hate the new layout, it's one of the reasons I've never used Pinterest. After the redesign I no longer use Allrecipes either, and it was the #1 site I would look for new recipes on before. I know I'm not alone in this. That may explain some of the declining comments, just declining membership in general.