Data Study on SeriousEats

Posted on Feb 2, 2020

Introduction

Cooking dinner at home is healthier and more affordable than eating out, but after a long day at work it can be difficult to summon the energy to get dinner on the table. SeriousEats is a recipe blog with high-quality content whose data can be leveraged to make it easier for its users to cook at home.

If we use the total number of ratings as an indicator of popularity, SeriousEats was most popular in 2011 after receiving two James Beard awards. Since then, popularity has declined even though the recipe quality has remained high (Figure 1). By developing new tools and recipes, SeriousEats could rejuvenate their user base and see an increase in their advertising revenue.

Data

Figure 1. The number of ratings (left) suggests that SeriousEats' popularity may be declining, despite recipe quality (right) staying high.

I scraped the SeriousEats website for recipe data with two goals: to identify areas for new content and to build a recipe suggestion tool that makes recipes easier to find. I used the Scrapy Python package to extract data from individual recipes (Figure 2). In total, I collected 13 features, including ingredient lists and ratings, from 12,620 recipes.

Figure 2. My Scrapy spider started at seriouseats.com and parsed through each topic until it reached the individual recipes. Numbers indicate the number of pages scraped at each level.

My first goal was to identify areas to generate new content. SeriousEats has over 50 recipes for French, Mexican, and Italian cuisines. While these recipes have very high ratings, adding a few more of them may not attract more users. Instead, adding more recipes from African or Caribbean cuisines may be more promising. These cuisines are underrepresented, with fewer than 10 recipes, but have high average (and median) ratings (Figure 3).
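The comparison above amounts to grouping recipes by cuisine and looking at the count alongside the mean and median rating. The following is only a minimal sketch of that idea, not the code used for the analysis: the rows are placeholder values rather than scraped data, and the function names and thresholds are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median

# Placeholder (cuisine, rating) rows; NOT values from the scraped dataset.
ROWS = [
    ("Italian", 4.5), ("Italian", 4.0), ("Italian", 5.0),
    ("Mexican", 4.5), ("Mexican", 4.0),
    ("Caribbean", 4.5),
]

def summarize_by_cuisine(rows):
    """Return recipe count, mean rating, and median rating per cuisine."""
    ratings = defaultdict(list)
    for cuisine, rating in rows:
        ratings[cuisine].append(rating)
    return {
        cuisine: {
            "n_recipes": len(values),
            "mean": mean(values),
            "median": median(values),
        }
        for cuisine, values in ratings.items()
    }

def underrepresented(summary, max_recipes, min_mean):
    """Cuisines with fewer than max_recipes recipes but a mean rating of at least min_mean."""
    return sorted(
        cuisine
        for cuisine, stats in summary.items()
        if stats["n_recipes"] < max_recipes and stats["mean"] >= min_mean
    )

summary = summarize_by_cuisine(ROWS)
candidates = underrepresented(summary, max_recipes=2, min_mean=4.0)
```

With the real data, the thresholds would presumably be chosen to match the figures quoted above (for example, fewer than 10 recipes).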

Figure 3. Looking at the number of recipes per cuisine (left), we see there are few African and Caribbean recipes. However, these cuisines are just as popular according to their average rating (right).

New content may not be enough to draw more users. Tools that make cooking easier could also increase readership, so I built a prototype recipe suggestion tool to make it easier to decide what to make for dinner. A user can type in whatever ingredients are in their fridge, or whatever they are craving, and the tool gives them a link to the recipe that most closely matches those ingredients.

Cosine Similarity

To build my tool, I used a common technique in Natural Language Processing called cosine similarity. If you think of every word as an axis, you can create a vector for each sentence. Now that your sentence is a vector, you can calculate how close it is to another vector by taking the cosine between the two (Figure 4). This elegant approach allows us to easily quantify the similarity between two collections of words.
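As a rough illustration of the idea, the sketch below treats each sentence as a vector of word counts and computes the cosine of the angle between two such vectors. The function name and the example sentences are made up for this illustration and are not taken from the project code.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors (dicts or Counters)."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence is similar to nothing
    return dot / (norm_a * norm_b)

# Each word is an axis; the count is the coordinate along that axis.
s1 = Counter("the cat sat on the mat".split())
s2 = Counter("the cat lay on the rug".split())
s3 = Counter("stocks fell sharply today".split())

similar = cosine_similarity(s1, s2)    # sentences share several words
unrelated = cosine_similarity(s1, s3)  # no shared words, so the score is 0.0
```

A score of 1 means the two vectors point in the same direction, and a score of 0 means the sentences share no words.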

Figure 4. Cosine Similarity can be used to determine how similar two sentences are to each other (left, middle). Image Credit: Vit Novotny and Michael Penkov.

I applied this approach to determine the similarity between ingredient lists (Figure 5).

  1. Make a Bag of Words that has every ingredient from every recipe.
  2. Count the number of times each ingredient appears in a given recipe.
  3. Calculate the cosine similarity between your search and each recipe.
  4. Return the link of the recipe with the highest score.
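The four steps above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the project's actual code: the recipes and `example.com` links are made-up toy data, and the helper names are hypothetical.

```python
import math
from collections import Counter

# Made-up toy data: (link, list of ingredient words). Not scraped recipes.
RECIPES = [
    ("https://example.com/shrimp-curry",
     ["shrimp", "coconut", "curry", "cilantro", "rice", "onion"]),
    ("https://example.com/tomato-pasta",
     ["pasta", "tomato", "garlic", "basil"]),
    ("https://example.com/guacamole",
     ["avocado", "lime", "cilantro", "onion"]),
]

def build_vocabulary(recipes):
    # Step 1: bag of words holding every ingredient from every recipe.
    return sorted({word for _, words in recipes for word in words})

def count_vector(words, vocabulary):
    # Step 2: how many times each vocabulary word appears in this list.
    counts = Counter(words)
    return [counts[word] for word in vocabulary]

def cosine(u, v):
    # Step 3: cosine similarity between two equal-length count vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def suggest(query_words, recipes):
    # Step 4: return the link of the highest-scoring recipe.
    vocabulary = build_vocabulary(recipes)
    query_vec = count_vector(query_words, vocabulary)
    scored = [
        (cosine(query_vec, count_vector(words, vocabulary)), link)
        for link, words in recipes
    ]
    best_score, best_link = max(scored, key=lambda pair: pair[0])
    return best_link

link = suggest(["shrimp", "coconut", "cilantro", "curry"], RECIPES)
```

Query words that never appear in any recipe are simply ignored here, because they have no axis in the vocabulary.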

On my first test run, I entered ingredients which I thought would return a shrimp curry: shrimp, coconut, cilantro, curry. Instead, the search tool recommended a coconut and chocolate dessert! Not even close! So why did my test fail?

 

Figure 5. Ingredient lists can be converted to vectors and compared in the same way (left). Using the L2 normalization prevents recipe length from biasing our search results (right).

Looking back at my code, I realized I wasn’t properly normalizing my similarity score. The dessert recipe used coconut 12 times, which skewed my results. To account for this, I took advantage of Python's set data type, which stores each item only once, so an ingredient counts once per recipe no matter how often it is repeated. I also added L2 normalization to prevent recipe length from affecting the cosine similarity score.
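To make the failure mode concrete, here is a small sketch that contrasts an unnormalized count-based score with a set-based, L2-normalized one. It assumes the original score behaved like a raw dot product of ingredient counts, which the post does not spell out; the ingredient lists and function names are invented for illustration.

```python
import math
from collections import Counter

# Invented toy ingredient lists; the dessert repeats "coconut" 12 times.
QUERY = ["shrimp", "coconut", "cilantro", "curry"]
DESSERT = ["coconut"] * 12 + ["chocolate", "sugar", "butter"]
CURRY = ["shrimp", "coconut", "curry", "cilantro", "rice", "onion"]

def raw_count_score(query, recipe):
    """Unnormalized score (assumed behavior): repeated ingredients inflate it."""
    counts = Counter(recipe)
    return sum(counts[word] for word in set(query))

def set_cosine_score(query, recipe):
    """Cosine similarity on binary vectors built from sets of unique ingredients."""
    q, r = set(query), set(recipe)
    if not q or not r:
        return 0.0
    # For 0/1 vectors, the dot product is the overlap size and each
    # L2 norm is the square root of the number of unique ingredients.
    return len(q & r) / (math.sqrt(len(q)) * math.sqrt(len(r)))

raw_winner = max([DESSERT, CURRY], key=lambda r: raw_count_score(QUERY, r))
fixed_winner = max([DESSERT, CURRY], key=lambda r: set_cosine_score(QUERY, r))
```

In this toy setup the raw score prefers the dessert because of the repeated coconut, while the set-based, normalized score prefers the curry, which shares all four query ingredients.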

Conclusion

After these modifications, my tool returns the expected result. As a prototype, I think this suggestion tool has a lot of promise for attracting users. The current SeriousEats search tool returns suggestions that contain some but not all of your items, and it often returns a large number of recipes, which doesn't help if you're feeling indecisive. Using my approach, SeriousEats could attract new users by making it easier to decide what’s for dinner.

About Author

Josefa Sullivan

Josefa has a PhD in Neuroscience from the Icahn School of Medicine at Mount Sinai and a BA in Biochemistry & Molecular Biology from Boston University. Her interests include applying data science to the healthcare & biotech fields,...