Using Data for a Recipe Recommendation System
Natural Language Processing for Everyday Use
Food waste is a major problem in the United States. According to data from Feeding America, up to 40% of food is wasted each year: approximately 108B pounds of food, or 130B meals, worth more than $408B.
Ingredients tend to accumulate over time: we plan meals and never make them, or we buy things on sale. We end up with a mixture of odds and ends that may not combine into a recipe we would normally make. Traditional meal planning starts by choosing the meals you want to make and then writing a list of ingredients to buy. To use up the odds and ends in your pantry, you need to reverse this process and search for recipes based on the ingredients you already have.
Objectives and Overview
The following are the primary goals of this project:
1) Produce an automated categorization of recipes
2) Produce a model capable of recommending recipes based on the ingredients available
3) Produce a metric to easily compare the results of different approaches
4) Develop an application for users to search recipes based on ingredients.
We first conducted exploratory data analysis and applied standard and project-specific cleaning techniques. We then applied the k-means, LDA, Top2Vec, BERTopic, and CorEx algorithms to the data and developed a scoring metric to compare these topic modeling techniques. Lastly, we produced a web application that allows users to select ingredients and browse potential recipes, with the ability to narrow them down by category.
The data set used here is called RecipeNLG. It is a collection of recipes derived from various websites on the internet. The data include a row ID, "title" of the recipe, "ingredients" list, "directions", a "link" to the source, a "source" description, and a Named Entity Recognition "NER" column that is essentially a clean ingredients list.
Exploratory Data Analysis
Exploratory data analysis for this project consisted of reviewing the most frequent n-grams (unigrams, bigrams, and trigrams) in the data set and the distribution of "document" (i.e. recipe) length to identify any outliers or data integrity issues. The most frequent words in the directions column highlighted stopwords as well as some words that, although not stopwords, weren't particularly useful in grouping recipes. Useful bigrams were identified, such as "sour cream," "lemon juice," and "cream cheese." These bigrams describe specific ingredients whose meaning is lost when only their component unigrams are examined. Similarly, trigrams such as "virgin olive oil" were identified.
The frequency of the top 20 unigrams in the directions column from the sample data set after cleaning. Cleaning consisted of removing stop words and the majority of verbs and adverbs.
The frequency of the top 20 bigrams in the directions column from the sample data set after cleaning. Cleaning consisted of removing stop words and the majority of verbs and adverbs.
The frequency of the top 20 trigrams in the directions column from the sample data set after cleaning. Cleaning consisted of removing stop words and the majority of verbs and adverbs.
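As a rough illustration of the n-gram counts behind these charts, the sketch below tallies n-grams with Python's standard library. The function name top_ngrams, the whitespace tokenization, and the three sample directions are assumptions made for this example; they are not the project's actual code or data.

```python
from collections import Counter

def top_ngrams(documents, n, k=20):
    """Return the k most frequent n-grams across whitespace-tokenized documents."""
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        # Zip n staggered views of the token list to form consecutive n-grams.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return [(" ".join(gram), count) for gram, count in counts.most_common(k)]

# Hypothetical, already-cleaned directions text.
directions = [
    "mix sour cream and lemon juice",
    "fold cream cheese into sour cream",
    "drizzle virgin olive oil and lemon juice",
]

print(top_ngrams(directions, n=2, k=3))
print(top_ngrams(directions, n=3, k=3))
```

Counting bigrams and trigrams as units is what lets "sour cream" surface as one ingredient instead of as the separate words "sour" and "cream."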
We also analyzed the NER column to find the most commonly used ingredients across all recipes in the data set. The figure below shows the 50 ingredients that appear in the most recipes.
Frequencies of top 50 most common ingredients in different recipes
The figure below shows the ingredient network, where each node represents an ingredient and an edge between two ingredients indicates that they appear together in a recipe. The size of each node represents the frequency with which that ingredient is used. Common ingredients such as salt, garlic, sugar, and flour are used in many recipes and are therefore larger.
Figure. Ingredient Network
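The data behind a network like this can be assembled in a few lines. The sketch below is a minimal version under stated assumptions: node size is taken to be the number of recipes containing the ingredient, and each edge carries a co-occurrence count (the text only says that an edge marks two ingredients appearing together). The function name and the sample NER lists are hypothetical.

```python
from collections import Counter
from itertools import combinations

def build_ingredient_network(recipes):
    """Return (node_sizes, edge_weights) for an ingredient co-occurrence graph."""
    node_sizes = Counter()
    edge_weights = Counter()
    for ingredients in recipes:
        # Deduplicate and sort so each pair is stored in one canonical order.
        unique = sorted(set(ingredients))
        node_sizes.update(unique)
        edge_weights.update(combinations(unique, 2))
    return node_sizes, edge_weights

# Hypothetical NER lists for three recipes.
ner_lists = [
    ["salt", "garlic", "olive oil"],
    ["salt", "sugar", "flour"],
    ["salt", "garlic", "flour"],
]

nodes, edges = build_ingredient_network(ner_lists)
print(nodes.most_common(3))
print(edges[("garlic", "salt")])
```

The two counters can be handed to any graph plotting library; the choice of library is left open here.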
We cleaned the directions and ingredients columns using the results of the exploratory analysis as a guide. Stop words and punctuation were removed using texthero. Parts of speech were tagged, and verbs and adverbs were removed to try to isolate the key ingredients that separated the recipes into distinct topics. The remaining words were lemmatized.
As further words were identified as potentially interfering with successful topic separation, they were added to the stop word list. These included spices that spanned topics, such as salt, which appears (in varying quantities) in both sweet and savory recipes.
Although a good portion of time was spent cleaning the data, the NLP-specific models tended to perform better with the raw directions text as opposed to the altered/cleaned version.
Topic Modeling/Clustering Methods Applied
K-means Clustering
K-means clustering is an unsupervised machine learning algorithm used for grouping similar data points. K-means looks for a fixed number of clusters (defined by the user), identifies the centroid of each cluster, and then assigns each data point to the cluster with the nearest centroid.
Because k-means relies on distances between numerical vectors, it is better suited to numerical data than to text, and it is therefore not often used for NLP analysis. However, we wanted to consider many methods in our analysis in order to establish how well each performs and to compare them.
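To make the assign-and-update loop concrete, here is a minimal sketch of k-means (Lloyd's algorithm) on plain numeric tuples. It is an illustration only, not the implementation used in the project; the random initialization from data points, the seed, and the toy points are all assumptions.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Cluster numeric vectors with Lloyd's algorithm; returns (centroids, labels)."""
    points = [tuple(p) for p in points]
    rng = random.Random(seed)
    # Assumption: initialize the centroids from k randomly chosen data points.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels

# Two well-separated toy groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

Every step depends on a distance between vectors, which is why text has to be converted to numbers before k-means can be applied.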
To cluster text with k-means, we must first transform it into numerical vectors based on word importance, a step known as feature extraction. For k-means, we used the cleaned ‘NER’ column. TF-IDF, short for Term Frequency-Inverse Document Frequency, is a numerical statistic intended to reflect the importance of a word in our ingredients list (NER); a term with a high weight is considered relevant. The TF-IDF vectors were then used as the input to k-means clustering.
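The weighting can be sketched with the textbook formula, tf * log(N / df). This is a simplified illustration: library implementations such as scikit-learn's TfidfVectorizer add smoothing and normalization, so their numbers will differ. The function name and the sample NER lists are assumptions.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(documents)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        term_counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return weights

# Hypothetical NER lists; each ingredient string is treated as one term.
ner_lists = [
    ["salt", "flour", "sugar"],
    ["salt", "garlic", "olive oil"],
    ["salt", "flour", "butter"],
]

weights = tfidf(ner_lists)
print(weights[0])
```

In this toy data, "salt" appears in every recipe, so its weight is zero, while "sugar," which appears in only one recipe, gets the highest weight in the first recipe.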
Latent Dirichlet Allocation (LDA)
We encountered the following challenges when applying LDA:
- The inability to treat bigrams and trigrams (for example, "black pepper," "fresh salad," and "olive oil") as single entities makes statistical language identification hard.
- Training the model on the 2GB data set requires an extended execution time.
- Constant supervision and cleaning of the data are necessary because certain terms are ambiguous and belong to more than one topic.
Top2Vec with Doc2Vec
Top2Vec is a fully integrated approach to analyzing NLP data (Angelov, 2020). It includes a step where the "documents" are embedded into vectors. The specific algorithm used to embed the documents is interchangeable. The dimensionality is then reduced using UMAP, and dense pockets of documents are located using HDBSCAN. Once the dense areas are located, their centroids are calculated and the most relevant documents can be returned.
Here we used Doc2Vec to embed the documents. Doc2Vec (Le & Mikolov, 2014) uses a neural network trained to predict words in a sentence while also learning a vector that encodes the overall context or theme of the entire document.
We applied Top2Vec with Doc2Vec embeddings to the "clean" versions of the ingredients and directions columns, as well as to the unaltered ingredients and directions columns. The model appeared to perform better with the full text than with the "cleaned" versions. This makes sense given how the embedding is produced: Doc2Vec learns from the words surrounding each term, and much of that context is removed during cleaning.
BERTopic
BERTopic uses clustering techniques similar to those of Top2Vec. However, it uses sentence-transformers and c-TF-IDF to generate dense clusters and interpretable topics (Grootendorst, 2022). The algorithm works in the following three phases:
- Embed Documents: BERTopic extracts document embeddings using Sentence Transformers, a form of transfer learning. This phase encodes a sentence or short paragraph into a fixed-length vector.
- Cluster Documents: As in Top2Vec, UMAP is used to reduce the dimensionality of the embeddings, and HDBSCAN is used to identify and cluster semantically similar recipes.
- Create Topic Representation: BERTopic uses a class-based TF-IDF (c-TF-IDF) technique to identify the most important terms in each cluster. The c-TF-IDF method treats all documents in a cluster as a single document and then computes TF-IDF, which returns the most important keywords within that cluster.
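The third phase can be sketched in plain Python. The weighting below follows the formula given in the BERTopic paper (Grootendorst, 2022) as we understand it: the weight of term t in class c is tf(t, c) * log(1 + A / tf(t)), where A is the average number of words per class and tf(t) is the term's frequency across all classes. The BERTopic library's own implementation differs in normalization details, and the function names and toy clusters here are assumptions.

```python
import math
from collections import Counter

def c_tf_idf(clusters):
    """clusters: {cluster_id: [tokenized documents]} -> {cluster_id: {term: weight}}."""
    # Treat every cluster as one long document by pooling its members' tokens.
    class_counts = {
        cid: Counter(tok for doc in docs for tok in doc)
        for cid, docs in clusters.items()
    }
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    avg_words = sum(total_counts.values()) / len(class_counts)  # A in the formula
    return {
        cid: {
            term: tf * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
        for cid, counts in class_counts.items()
    }

def top_terms(weights, k=3):
    """Return the k highest-weighted terms of one cluster."""
    return sorted(weights, key=weights.get, reverse=True)[:k]

# Hypothetical clusters of tokenized directions.
clusters = {
    "soup": [["simmer", "broth", "add"], ["broth", "onion", "add"]],
    "cake": [["bake", "flour", "add"], ["flour", "sugar", "add"]],
}

scores = c_tf_idf(clusters)
print(top_terms(scores["soup"], k=1), top_terms(scores["cake"], k=1))
```

A word such as "add" is frequent in both clusters, so it is down-weighted relative to "broth" or "flour," which are frequent in only one.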
Correlation Explanation (CorEx)
Correlation Explanation (CorEx) is a relatively recent development in the NLP space. It functions by correlating words based on their co-appearance within documents (Gallagher et al., 2017). CorEx allows for semi-supervised training by letting the user provide "anchor" words to the model. These anchor words are weighted and allow the user to guide the model toward topics that they deem important and that the model may otherwise have missed.
A few things are worth noting. CorEx doesn't yet support n-grams, and keywords can only belong to a single topic (overridable by using anchor words). The assignment of documents to topics is probabilistic, and documents aren't restricted to a single topic: a document can be assigned to all topics, no topics, or anything in between.
Developing Intuitive Categories
One of the objectives of this project was to automatically sort the recipes into intuitive categories. The recipes can be grouped using clustering algorithms. However, assigning a meaningful name to each cluster or category is challenging in unsupervised learning. Several unsupervised methods, such as BERTopic, word2vec, and LDA, suggest a list of potential topics for a cluster.
However, assigning a meaningful name to each cluster usually requires manual intervention and domain expertise. This work proposes an automated method for assigning a meaningful name to each recipe category (cluster), which proceeds in the following steps:
- We created a clean_title column, which contains only the name of the dish, not its ingredients. For example, a recipe titled "Apple pie" becomes "Pie" in the clean_title column. This is done by removing all words in the title that are also present in the corresponding NER column (i.e. the clean ingredients list).
- We combined the directions and clean_title columns and used the BERTopic model to cluster similar recipes together. The figure below shows all the clusters (topics) produced by BERTopic. As we can see, some of the most important terms for the clusters are add, heat, drain, and butter, which do not represent all the recipes in the topics.
Figure. Results obtained from BERTopic on the directions and clean_title columns
- Hence, we identified the most frequent clean_title values in each cluster and selected the category name for each cluster based on the following rules:
- - If the difference in frequency between the most frequent and second most frequent clean_title is less than 50%, the most frequent clean_title represents the category of every recipe in that cluster.
- - If the difference in frequency between the two most frequent clean_title values is greater than or equal to 50%, both titles, joined with "and," represent the category of every recipe in that cluster.
- - If two or more clusters share the same most frequent clean_title, they are combined into one cluster.
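The first and third steps above can be sketched as follows. This is a hedged illustration, not the authors' code: the function names are hypothetical, title words are matched against ingredient words exactly (so "apples" would not remove "apple"), and the "difference in frequency" is assumed to be measured relative to the most frequent title. The threshold direction is taken literally from the rules as written. The merge step for clusters that share a top title is left out for brevity.

```python
from collections import Counter

def clean_title(title, ner):
    """Drop title words that also occur in the recipe's NER ingredient list."""
    ingredient_words = {word for item in ner for word in item.lower().split()}
    kept = [word for word in title.lower().split() if word not in ingredient_words]
    return " ".join(kept)

def name_cluster(clean_titles, threshold=0.5):
    """Name a cluster from its two most frequent clean titles."""
    ranked = Counter(title for title in clean_titles if title).most_common(2)
    if not ranked:
        return ""
    if len(ranked) == 1:
        return ranked[0][0]
    (first, n_first), (second, n_second) = ranked
    # Assumption: the "difference" is relative to the most frequent title's count.
    relative_gap = (n_first - n_second) / n_first
    if relative_gap < threshold:
        return first  # rule 1 as written: gap below 50% keeps only the top title
    return f"{first} and {second}"  # rule 2 as written: gap of 50% or more joins both

print(clean_title("Apple Pie", ["apple", "sugar", "flour"]))
print(name_cluster(["cake"] * 10 + ["bread"] * 4 + ["loaf"]))
print(name_cluster(["pie"] * 10 + ["tart"] * 8))
```

Stripping ingredient words from the title first is what keeps a cluster from being named after "apple" or "chocolate" rather than after the dish type.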
Two-dimensional visualization of the clusters produced by the HDBSCAN algorithm during the BERTopic computation
The categories identified were cake and bread, cookie, pie, soup, dip, meat, muffin, salad, fudge, casserole, cocktail, ice cream, chili, brownie, meatball, and fish. These categories were derived to allow users to choose a subcategory of food when selecting recipes.
Developing a Comparison Metric for Evaluating Topic Modeling
Unsupervised learning techniques are notoriously difficult to evaluate, because unlike supervised learning the data are not labeled and therefore it is difficult to assess how well the technique performed without manually reviewing the result. The models used in this project had various parameters that we had the option of tuning. However, these parameters were often unique to each method and didn't necessarily have an analogue in the other models.
Therefore, comparing the models while holding variables constant was virtually impossible. Lastly, assessing topic modeling is especially difficult. The model returns specific themes as topics: how do you define one topic as "good" and another as "bad"? Further, what do you do when your model generates hundreds of topics? Rating each one manually can be time consuming.
We addressed this problem in the current project by leveraging the "NER" (clean ingredients) column. This column, which was provided with the data set, contains a list of strings (mostly unigrams and bigrams) with the units and non-noun parts of speech stripped away. For each topic, we combined this column across every recipe in the topic and then measured the amount of ingredient overlap between those recipes.
Specifically, we divided the number of unique strings by the total number of strings (with duplicates retained) to arrive at a score we termed "topic concentration." This number, which ranges from 0 to 1, indicates how pure each topic is. The more the recipes' ingredients overlap, the closer the score is to zero and the better, or more concentrated, the topic. Topics whose recipes have few ingredients in common score closer to 1 and are less well concentrated.
From top to bottom: formula for calculating topic concentration, hypothetical example of a reasonably well concentrated topic, and hypothetical example of a poorly concentrated topic.
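The score is simple enough to express directly: unique ingredient strings divided by total ingredient strings. The sketch below follows that definition; the function name and the two toy topics are hypothetical and stand in for the examples in the figure.

```python
def topic_concentration(recipes):
    """Unique ingredient strings divided by total ingredient strings in a topic."""
    all_ingredients = [item for ner in recipes for item in ner]
    if not all_ingredients:
        raise ValueError("topic contains no ingredients")
    return len(set(all_ingredients)) / len(all_ingredients)

# Hypothetical topic with heavy ingredient overlap: 4 unique strings out of 9.
well_concentrated = [
    ["flour", "sugar", "butter"],
    ["flour", "sugar", "egg"],
    ["flour", "butter", "egg"],
]

# Hypothetical topic with no overlap: 6 unique strings out of 6.
poorly_concentrated = [
    ["flour", "sugar"],
    ["beef", "onion"],
    ["rum", "lime"],
]

print(round(topic_concentration(well_concentrated), 3))
print(topic_concentration(poorly_concentrated))
```

Because at least one unique string always exists, the score never reaches exactly zero; its lower bound is one divided by the total number of strings in the topic.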
Evaluation of Topic Modeling/Clustering Approaches
Using the "topic concentration" score, we evaluated the topics produced by BERTopic, Top2Vec, and CorEx. We plan to add k-means clustering, and potentially LDA, to this evaluation. BERTopic produced 256 topics and had a median topic concentration of 0.484 with a range of 0.082 to 0.898. Top2Vec using Doc2Vec embedding produced 228 topics and had a median topic concentration of 0.375 with a range of 0.123 to 0.65. Lastly, CorEx was specified to produce 50 topics and had a median topic concentration of 0.193 with a range of 0.079 to 0.467.
Figure. Violin plots of the topic concentration scores for BERTopic, Top2Vec and CorEx. Lower scores indicate more overlap in recipe ingredients and therefore better, more cohesive topics.
R Shiny App Development
A recipe finder R Shiny app was developed in RStudio. As mentioned earlier, food waste is a major problem in the United States and in many other developed nations. Even though individual food waste is a small fraction of that, we can adjust our own practices to minimize it and contribute to the overall decrease in food waste.
One way of doing that would be to plan meals ahead of time and buy only what we need. However, this is often not possible, and more often than not we end up with ingredients in the fridge that do not fit the recipes we already know. As a solution, our app lets the user enter these odd, ill-fitting ingredients and suggests the recipes that fit them best, based on the algorithm we used for our classification model.
“What’s for dinner?” Tab
In our RECIPES app, the “What’s for dinner?” tab takes ingredients as input and, based on the ML algorithm we used for classification, returns the most relevant recipes.
A separate algorithm was used for topic modeling, which gave us accurate and distinct categories. The user inputs ingredients first and the app returns recipes from all categories; the user can then narrow down the recipes further by category. Someone who’s in the mood for a snack might prefer the “salad” category over the “casserole” category.
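The interaction described above (search by ingredients, then narrow by category) can be sketched as below. This is a stand-in written in Python for illustration only: the app itself was built in R Shiny, and its ranking relies on the project's model, whereas this sketch ranks recipes by the share of their ingredients found in the pantry. The function name, the dictionary keys, and the sample recipes are all hypothetical.

```python
def search_recipes(recipes, pantry, category=None, limit=5):
    """Rank recipes by the share of their ingredients found in the pantry."""
    pantry = {item.lower() for item in pantry}
    scored = []
    for recipe in recipes:
        if category is not None and recipe["category"] != category:
            continue  # optional narrowing by category
        needed = {item.lower() for item in recipe["ner"]}
        if not needed:
            continue
        coverage = len(needed & pantry) / len(needed)
        if coverage > 0:
            scored.append((coverage, recipe["title"]))
    # Highest coverage first; ties are broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored[:limit]]

# Hypothetical recipe records.
recipes = [
    {"title": "Greek Salad", "category": "salad", "ner": ["cucumber", "tomato", "feta"]},
    {"title": "Tuna Casserole", "category": "casserole", "ner": ["tuna", "noodles", "cheese", "peas"]},
    {"title": "Tomato Soup", "category": "soup", "ner": ["tomato", "onion", "broth"]},
]
pantry = ["tomato", "cucumber", "feta", "onion"]

print(search_recipes(recipes, pantry))
print(search_recipes(recipes, pantry, category="soup"))
```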
“Soups”, “Salads and Dressings”, “Main”, “Baked Goods” Tabs
The remaining tabs in the app contain pre-categorized recipes. These tabs are for users who already have an idea of what type of dish they want, whether it be a soup, a salad, or a main dish, or for someone who is throwing a dinner party and would like to explore all categories. After selecting a tab, users can narrow down the list of recipes based on the ingredients selected.
Examining the topic concentration scores, it appears CorEx performed the best (raw score only; no statistic was computed). However, because the number of topics has to be specified in CorEx, and because it produced roughly one-fifth as many topics as the other two models, its score may not be comparable. Additionally, CorEx produces the probability that a recipe belongs to each topic.
Therefore, even when only examining the maximum probability produced for each recipe, a recipe can be placed into more than one topic or no topics at all. As a result, the total number of recipes across the topics, counting a recipe each time it appears in a topic, was 43,646 even though the number of unique recipes in the sample was 22,311. This inflates the denominator and artificially suppresses the score. The figure below displays the number of topics each recipe was assigned to.
Frequency counts showing how many recipes were assigned to zero topics, one topic, or more than one topic. The peak (i.e. the mode) is 4 topics with many being assigned to as many as 16 and some assigned to 37 of the 50 topics.
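The gap between unique recipes and total assignments can be reproduced on toy data. The sketch below assumes a recipe counts as a member of a topic when its probability meets a cutoff; the 0.5 cutoff, the function name, and the probabilities are assumptions made for illustration, not the rule or the values used in the project.

```python
from collections import Counter

def assignment_summary(topic_probs, threshold=0.5):
    """topic_probs holds one list of per-topic probabilities for each recipe."""
    # Number of topics each recipe is assigned to under the assumed cutoff.
    topics_per_recipe = [sum(p >= threshold for p in probs) for probs in topic_probs]
    return {
        "unique_recipes": len(topic_probs),
        "total_assignments": sum(topics_per_recipe),
        "histogram": Counter(topics_per_recipe),
    }

# Hypothetical probabilities for four recipes across three topics.
topic_probs = [
    [0.90, 0.80, 0.10],  # assigned to two topics
    [0.20, 0.10, 0.30],  # assigned to no topic
    [0.70, 0.60, 0.90],  # assigned to three topics
    [0.10, 0.95, 0.20],  # assigned to one topic
]

summary = assignment_summary(topic_probs)
print(summary["unique_recipes"], summary["total_assignments"])
print(sorted(summary["histogram"].items()))
```

When one recipe's ingredients are counted in several topics, every topic's total string count grows, which is the denominator inflation described above.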
The second-best performing topic modeling technique was Top2Vec (raw score only; no statistic was computed). Top2Vec also features built-in search functionality, which we used to test a few sample ingredient searches. These test searches returned recipes that logically fit the ingredients provided and, perhaps more importantly, didn't return any recipes that seemed out of place given the ingredient list.
Example search results from Top2Vec topic modeling output. The list of input ingredients returned reasonable recipes, while not returning odd or unrelated recipes.
BERTopic performed the least well of the three models (raw score only; no statistic was computed). Because BERTopic uses "sentence transformers," a transfer learning technique, it is possible that the narrowness of the subject area (i.e. recipes) stands in contrast to the breadth of text the model was trained on. It is also worth noting that the range of topic concentration scores narrowed in the same order: BERTopic had the widest range, followed by Top2Vec and then CorEx.
Main Ingredient Identification
Identifying the main ingredient of a recipe would help in grouping similar recipes together. We would like to continue working on this data set and quantify the percentage of the overall recipe that each ingredient makes up by extracting the unit of measurement and standardizing it across all ingredients. This would also help in returning more relevant recipes, as it would preclude returning a recipe whose main ingredient is not in the searched list.
We noted during the project that recipes for those with dietary restrictions (e.g. vegetarian, vegan, gluten free) were often difficult to separate from other recipes, as the overlap of spices may have caused them to be grouped together. We would like to make these specific recipes identifiable. Additionally, allowing users to search by the genre of food they desire would make the app more user friendly.
Lastly, returning nutritional information for each recipe could help users select recipes that best fit their dietary needs.
References
Angelov, Dimo (2020). Top2Vec: Distributed Representations of Topics. arXiv:2008.09470.
Gallagher, Ryan J. et al. (2017). Anchored Correlation Explanation: Topic Modeling with Minimal Domain Knowledge. Transactions of the Association for Computational Linguistics, 5, 529-542. https://transacl.org/ojs/index.php/tacl/article/view/1244
Grootendorst, Maarten (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv:2203.05794.
Le, Quoc & Mikolov, Tomas (2014). Distributed Representations of Sentences and Documents. arXiv:1405.4053v2.
Ver Steeg, Greg (2015). corex_topic. https://github.com/gregversteeg/corex_topic (2022).