All the recipes: Scraping the top 20 recipes of allrecipes
Contributed by Yannick Kimmel. He is currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program, taking place from April 11th to July 1st, 2016. This post is based on his third class project, web scraping (due in the 6th week of the program).
The internet is a great resource for chefs and aspiring cooks alike. There is a plethora of cooking videos and recipes. A website that often comes up is allrecipes.com. This is not surprising, as allrecipes is the most popular recipe website (source) and has been around since 1997. Allrecipes is a go-to source for the average American. Every year allrecipes releases the top 20 most popular recipes of that year in what is known as the hall of fame recipes. For my web scraping project, I thought this list would give me insight into food trends in the USA over the last 20 years. Code for the project can be found here.
Hall of fame recipes
Shown below is a screen shot of the overall list of hall of fame recipes by year (page).
The URLs for the pages on allrecipes are structured in this generic format: allrecipes.com/recipe/[Unique ID number]. For example, the recipe page for Outrageous Chocolate Cookies has a unique ID number of 10141 and can be accessed at http://allrecipes.com/recipe/10141. The screenshot below shows the page for the Hall of Fame recipes of 1997. After inspecting the page source, I determined that an attribute inside the heart image (boxed in red) for each recipe lists this unique ID number. I then extracted the unique ID numbers (and therefore the unique URLs) for all 20 recipes on each yearly page.
The snippet of code below uses the Selenium package in Python to extract the URLs for all the top 20 recipes of a given year.
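Independently of Selenium, the ID-to-URL step can be sketched with Python's standard library alone. This is only an illustration under stated assumptions, not the project's code: the attribute name `data-id`, the sample markup, and the ID 9999 are made up for the example, and the real page markup may differ.

```python
import re

BASE_URL = "http://allrecipes.com/recipe/"

def recipe_urls(html, attr="data-id"):
    """Return one recipe URL per distinct numeric ID found in `attr`.

    `attr` is a hypothetical attribute name; inspect the real page to find
    the attribute that actually carries the recipe ID.
    """
    pattern = re.compile(re.escape(attr) + r'="(\d+)"')
    ids = []
    for recipe_id in pattern.findall(html):
        if recipe_id not in ids:  # drop repeats, keep page order
            ids.append(recipe_id)
    return [BASE_URL + recipe_id for recipe_id in ids]

# Made-up markup standing in for a yearly hall of fame page.
sample_html = (
    '<span data-id="10141"></span>'
    '<span data-id="9999"></span>'
    '<span data-id="10141"></span>'
)
urls = recipe_urls(sample_html)
```

With a real browser session, the same function could be applied to the page source once the page has loaded.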
Extraction of recipe and ingredient details
Shown below are screenshots of a typical recipe page on allrecipes. Highlighted in red boxes are all the data I collected. This includes the recipe title, user rating, number of people who "made it", number of people who reviewed the recipe, the calories per serving of the recipe, the ingredients, preparation time, cook time, and total time.
The snippet of code used to scrape recipe information from a recipe page is shown below. The extracted data is saved to MongoDB. One collection was created for all the recipe information and another collection was created for the ingredients.
Further Python code was needed to wrangle the data into a cleaner format. This was done in a Python notebook using a pandas DataFrame. The cleaning included converting all nonstandard day/hour/minute times into minutes and converting counts abbreviated with a K into thousands. Lastly, the data was converted into CSV and exported to R for data visualization.
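The two conversions described above might look roughly like the sketch below. The exact strings that allrecipes displays are an assumption here (times such as "1 h 30 m", counters such as "2K"), and the function names are hypothetical, so treat this as an outline rather than the project's cleaning code.

```python
import re

# Minutes per unit; the unit letters (d, h, m) are assumed for illustration.
MINUTES_PER_UNIT = {"d": 24 * 60, "h": 60, "m": 1}

def to_minutes(text):
    """Convert a string such as '1 d 2 h 30 m' into total minutes."""
    total = 0
    for amount, unit in re.findall(r"(\d+)\s*([dhm])", text.lower()):
        total += int(amount) * MINUTES_PER_UNIT[unit]
    return total

def parse_count(text):
    """Convert a counter such as '2K' or '312' into an integer."""
    text = text.strip().upper().replace(",", "")
    if text.endswith("K"):
        # '1.5K' means roughly 1500; precision lost on the site is not recoverable.
        return int(round(float(text[:-1]) * 1000))
    return int(text)
```

For example, `to_minutes("1 h 30 m")` gives 90 and `parse_count("2K")` gives 2000. Both functions can be applied column-wise to a data frame before exporting to CSV.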
The (mostly) cleaned data
Scraping 20 recipes over 19 years gives a total of 380 observations. The data frame of the recipes is shown below. Each column represents a different aspect of a recipe, and the data can be considered clean at this point and ready for analysis.
The id number is the unique ID number allrecipes uses for each recipe created. It also allows the ingredients and recipes data frames to be joined if ever needed.
The ingredients data frame is shown below. I did not separate the quantity of a given ingredient from the name of the ingredient, so this data frame cannot be considered truly clean. However, this was not a problem in my analysis, as I was only interested in how often ingredients were used and not in their quantities.
Correlation among recipe variables
One of my first thoughts in my analysis was to explore how the recipe variables correlate with each other. The figure below is a correlation plot of those variables.
For the most part, the variables do not correlate well, indicating that the features of the recipe website were well designed and not redundant with each other. One notable exception is that the number of people who reviewed a recipe was well correlated with the number of people who indicated that they had made the dish.
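To make the idea of a pairwise correlation concrete, here is a minimal sketch that computes a Pearson coefficient by hand. The post does not say which correlation measure the plot used, so Pearson is an assumption, and the numbers are invented sample values rather than scraped data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented values: review counts and "made it" counts for four recipes.
reviews = [100, 200, 300, 400]
made_it = [150, 310, 440, 610]
r = pearson(reviews, made_it)  # close to 1 for these made-up numbers
```

A value near 1 or -1 signals a strong linear relationship, while a value near 0 signals little linear relationship, which is how most variable pairs behaved here.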
Word clouds were used to visualize word frequencies using the R packages tm, SnowballC, and wordcloud. The clouds were produced from a corpus of stemmed words with common English words removed (such as 'the', 'a', 'I', etc.). The first word cloud was made from all the recipe titles. Words such as cooki (the stemmed form of cookie), chicken, chocolate, banana, salad, bread, potato, pie, cake, and bake are the most frequently used in the top recipe titles.
In the ingredients word cloud, by contrast, the most commonly used words are measurement words such as cup, teaspoon, and tablespoon, which is a limitation of the uncleaned ingredients data frame. The most commonly used non-measurement words include sugar, white, ground, butter, salt, bake, and chop. All are fundamental words in cooking.
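The counting that sits behind a word cloud can be sketched in a few lines of Python. This is not the R pipeline used for the figures: stemming is left out, the stop-word list is a tiny hand-picked assumption, and two of the three sample titles are made up.

```python
import re
from collections import Counter

# A tiny, hand-picked stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "i", "and", "of", "with"}

def word_frequencies(texts, stop_words=STOP_WORDS):
    """Count lowercase words across texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in stop_words:
                counts[word] += 1
    return counts

# Sample titles; only the first comes from the post, the rest are invented.
titles = ["Outrageous Chocolate Cookies", "The Best Chocolate Cake", "Banana Bread"]
freq = word_frequencies(titles)
```

Adding measurement words such as cup and teaspoon to the stop-word set would be one way to keep them from dominating the ingredients counts.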
Declining frequency of baked goods ingredients
Results from the word cloud indicated that sugar, flour, eggs, and butter are some of the most frequently used ingredients in the top 20 recipe lists. From my past cooking experience, I know those four ingredients are the foundation of baked goods. The figure below plots the number of times those ingredients are mentioned in the ingredients lists per year. There appears to be a sharp decline in the use of baked-good ingredients over time, specifically after 2009.
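A per-year tally like the one plotted can be sketched as follows. The (year, ingredient text) row layout and the sample rows are assumptions for illustration, and plain substring matching is a deliberate simplification: "butter" would also match "buttermilk" or "peanut butter".

```python
from collections import defaultdict

BAKING_WORDS = ("sugar", "flour", "egg", "butter")

def mentions_per_year(rows, words=BAKING_WORDS):
    """Count ingredient lines per year that mention each word.

    rows: iterable of (year, ingredient_text) pairs (assumed layout).
    """
    counts = defaultdict(lambda: dict.fromkeys(words, 0))
    for year, text in rows:
        lowered = text.lower()
        for word in words:
            if word in lowered:  # substring match, so 'egg' also hits 'eggs'
                counts[year][word] += 1
    return dict(counts)

# Invented ingredient lines, not scraped data.
rows = [
    (1997, "1 cup white sugar"),
    (1997, "2 eggs"),
    (1997, "1 cup butter, softened"),
    (2012, "1 tablespoon olive oil"),
    (2012, "1 cup all-purpose flour"),
]
tally = mentions_per_year(rows)
```

Because quantities were never separated from ingredient names, matching on the raw text like this is sufficient for frequency counts.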
Median calories per serving
Healthiness of food is a major cultural topic in the USA. One metric of the healthiness of food is calories per serving, and I wanted to explore this further. The figure below plots the median calories per serving of recipes over time. There is an initial jump in calories in the first few years, which could reflect the company starting up. After that, the calories seem to drop slowly over time, which could reflect an increased focus on healthy foods.
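The yearly median behind such a plot can be computed with the standard library, as in this small sketch. The (year, calories) pairs are invented sample values and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import median

def median_calories_by_year(recipes):
    """Return {year: median calories per serving}, with years in ascending order.

    recipes: iterable of (year, calories_per_serving) pairs (assumed layout).
    """
    by_year = defaultdict(list)
    for year, calories in recipes:
        by_year[year].append(calories)
    return {year: median(values) for year, values in sorted(by_year.items())}

# Invented values, not scraped data.
sample = [(1997, 120), (1997, 180), (1997, 150), (1998, 200), (1998, 300)]
medians = median_calories_by_year(sample)
```

The median is less sensitive than the mean to a single very rich recipe, which matters with only 20 recipes per year.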
Increase of olive oil ingredient frequency
With the frequency of butter dropping over time, I wanted to know if there was a fat, like olive oil, that was replacing it. As shown below, the use of olive oil seems to have increased over time, which could be because of the reported health benefits of olive oil.
Limitations of dataset
In this report, I considered this data to reflect the food trends of the average American over time. I found two aspects that showed the limitation of this assumption and the need to put data into context.
a) The need for allrecipes to innovate
The figure below plots the review count of each of the top 20 recipes over time. In the 2000s, many of the top 20 recipes had thousands of reviews from the allrecipes user base. After 2010, there is a drop-off in review count for the top 20 recipes: instead of thousands of reviews, the top recipes receive only hundreds. Are Americans less interested in sharing their thoughts on food recipes after 2010? Or does this figure reflect less interest among allrecipes users in leaving comments specifically on the allrecipes website? It may be the latter rather than the former. In 2015, allrecipes announced that they were redesigning the website to make it more of a Pinterest-like website (geekwire). Pinterest is designed around users sharing information with other users.
b) The cookie trail
Another word that I noticed was frequently used in my word cloud was cookie. I wanted to understand how interest in cookies changed over time, so the figure below shows the number of recipes that have cookie in their title every year. In 1997, 14 of the top 20 recipes had the word cookie in them, and over time there is a great drop-off in interest in cookies. I had heard anecdotally that cookie exchanges were really popular in the late 1990s, so this did not surprise me. But when I looked at the remaining 1997 recipes that did not have the word cookie in them, it turned out that they were also about cookies even if they did not specifically mention the word 'cookie' (e.g., a recipe for snickerdoodles). So all 20 recipes in 1997 were about cookies, and this made me suspicious. I researched the history of allrecipes.com (wikipedia article) and found out that it was started in 1997 as cookierecipe.com. This explains why all of the top 1997 recipes are about cookies. I also learned the limits of assuming that the history of allrecipes.com reflects the cooking habits of Americans, at least when the website was only known for cookies.
Number of recipes that mention cookie in recipe title
In conclusion, my web scraping of the top 20 recipes on allrecipes.com was successful. There were some interesting trends in the use of ingredients over time. However, with only 380 observations, it was difficult to interpret trends that might be clearer with a larger dataset.