Using Data for A Recipe Recommendation System


Natural Language Processing for Everyday Use


Food waste is a major problem in the United States. According to data from Feeding America, up to 40% of food is wasted: approximately 108 billion pounds of food, or 130 billion meals, worth roughly $408 billion annually.


When we plan meals, we tend to accumulate ingredients over time, either because we never get around to making the meal or because we buy things on sale. We end up with a mixture of odds and ends that may not combine into a recipe we would normally make. Traditionally, meal planning means identifying the meals you want to make and writing a list of ingredients to buy. Using up the odds and ends in your pantry requires reversing this process: searching for recipes based on the ingredients you already have.


Objectives and Overview

The following are the primary goals of this project:

1) Produce an automated categorization of recipes.

2) Produce a model capable of recommending recipes based on the ingredients available.

3) Produce a metric to easily compare the results of different approaches.

4) Develop an application for users to search recipes based on ingredients.

We first conducted exploratory data analysis and applied standard and project-specific cleaning techniques. We then applied the k-means, LDA, Top2Vec, BERTopic, and CorEx algorithms to the data and developed a scoring metric to compare these topic modeling techniques. Lastly, we produced a web application that allows users to select ingredients and browse potential recipes, with the ability to narrow them down by category.

Data set

The data set used here is RecipeNLG, a collection of recipes derived from various websites. The data include a row ID, the recipe "title", an "ingredients" list, "directions", a "link" to the source, a "source" description, and a Named Entity Recognition ("NER") column that is essentially a cleaned ingredients list.

Exploratory Data Analysis

Exploratory data analysis for this project consisted of reviewing the most frequent words (i.e. unigrams, bigrams, and trigrams) in the data set and the distribution of "document" (i.e. recipe) length to identify any outliers or data integrity issues. The most frequent words in the directions column highlighted stopwords, as well as some words that, although not stopwords, weren't particularly useful in grouping recipes. Useful bigrams were identified, such as "sour cream," "lemon juice," and "cream cheese." These bigrams describe specific ingredients whose meanings differ from those of the unigrams they are composed of. Similarly, trigrams such as "virgin olive oil" were identified.
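This kind of n-gram frequency check can be sketched with the standard library alone (the function and sample directions below are illustrative, not the project's actual code):

```python
from collections import Counter

def top_ngrams(docs, n, k=3):
    """Return the k most frequent n-grams across a list of documents."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        # zip n staggered copies of the token list to produce n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return [(" ".join(gram), c) for gram, c in counts.most_common(k)]

directions = [
    "mix sour cream and lemon juice",
    "beat cream cheese with sour cream",
    "add lemon juice to the sour cream mixture",
]
print(top_ngrams(directions, 2, k=2))  # → [('sour cream', 3), ('lemon juice', 2)]
```

The same function with n=1 or n=3 produces the unigram and trigram counts shown in the figures below.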

Figure. Frequency of the top 20 unigrams in the directions column of the sample data set after cleaning (removal of stop words and most verbs and adverbs).

Figure. Frequency of the top 20 bigrams in the directions column of the sample data set after cleaning (removal of stop words and most verbs and adverbs).

Figure. Frequency of the top 20 trigrams in the directions column of the sample data set after cleaning (removal of stop words and most verbs and adverbs).


We also analyzed the NER column to find the most commonly used ingredients across all the recipes in the data set. The figure below shows the top 50 ingredients appearing in the most recipes.

Figure. Frequencies of the top 50 most common ingredients across recipes.


The figure below shows the ingredient network, where each node represents an ingredient and an edge between two ingredients indicates that they appear together in a recipe. The size of each node represents the frequency with which that ingredient is used; common ingredients such as salt, garlic, sugar, and flour appear in many recipes and are therefore larger.
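The counts behind such a network can be sketched with co-occurrence tallies (standard library only; the ingredient lists are toy examples, and a library such as networkx would handle the actual graph drawing):

```python
from collections import Counter
from itertools import combinations

# Toy NER-style ingredient lists; the real data set holds millions of recipes
recipes = [
    ["salt", "flour", "sugar", "butter"],
    ["salt", "garlic", "chicken"],
    ["flour", "sugar", "eggs"],
]

# Node size: how many recipes each ingredient appears in
node_size = Counter(ing for r in recipes for ing in r)

# Edge weight: how many recipes each ingredient pair co-occurs in
edges = Counter()
for r in recipes:
    edges.update(combinations(sorted(set(r)), 2))

print(node_size["flour"], edges[("flour", "sugar")])  # → 2 2
```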


Figure. Ingredient Network 

Data Cleaning

We attempted to clean the directions and ingredients columns using this as a guide. Stop words and punctuation were removed using texthero. Parts of speech were tagged, and verbs and adverbs were removed to isolate the key ingredients that separate the recipes into distinct topics. The remaining words were lemmatized.
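A stdlib-only approximation of this cleaning step (texthero and the POS-based verb removal are replaced by a simple stop-word filter here; the stop-word list and sample sentence are illustrative):

```python
import string

# Toy stop-word list; the project extended a standard list with domain terms such as "salt"
STOPWORDS = {"the", "a", "and", "to", "in", "until", "salt"}

def clean(text):
    """Lowercase, strip punctuation, and drop stop words."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in text.split() if w not in STOPWORDS)

print(clean("Stir in the flour, salt and sugar until combined."))  # → stir flour sugar combined
```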

As further words were identified as potentially interfering with successful topic separation, they were added to the stop word list. These included spices that spanned topics, such as salt, which appears (in varying quantities) in both sweet and savory recipes.

Although a good portion of time was spent cleaning the data, the NLP-specific models tended to perform better with the raw directions text as opposed to the altered/cleaned version.

Topic Modeling/Clustering Methods Applied

K-Means Clustering 

K-means clustering is an unsupervised machine learning algorithm for grouping similar data points. K-means looks for a fixed, user-defined number of clusters, identifies the centroid of each cluster, and then assigns data points based on proximity.

From this definition we can see that k-means is better suited to numerical data than to text, and it is therefore not often used for NLP analysis. However, we wanted to consider many methods in order to establish their effectiveness and draw comparisons among them.

To use the k-means algorithm to cluster text, we must first transform it into numerical vectors based on word importance, a step known as feature extraction. For k-means, the cleaned "NER" column was used. TF-IDF, short for Term Frequency-Inverse Document Frequency, is a numerical statistic intended to reflect the importance of a word in our ingredients list (NER), where a term with a high weight is considered relevant. The matrix produced by TF-IDF was then used for k-means clustering.
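The TF-IDF weighting can be illustrated from scratch (a minimal stand-in for a library vectorizer such as scikit-learn's TfidfVectorizer; no smoothing, and the NER strings are toy data):

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocab, matrix) of raw TF-IDF weights: tf/len(doc) * log(N/df)."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for t in tokenized for w in t})
    df = Counter(w for t in tokenized for w in set(t))  # document frequency
    n = len(docs)
    matrix = []
    for t in tokenized:
        tf = Counter(t)
        matrix.append([tf[w] / len(t) * math.log(n / df[w]) for w in vocab])
    return vocab, matrix

ner = ["flour sugar butter", "chicken garlic onion", "flour sugar eggs"]
vocab, X = tfidf(ner)
# "butter" (rare) outweighs "flour" (common) in the first recipe's vector
```

The resulting rows are the numerical vectors that k-means then clusters; in practice scikit-learn's TfidfVectorizer and KMeans do this at scale.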

Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) is an older, probabilistic approach to topic modeling that identifies related terms that may represent a theme. In LDA, topics are represented as distributions over words, and documents are modeled as mixtures of these topics. For a given document, LDA estimates the probability of each word belonging to each topic; these word probabilities define the topics, which are then assigned back to the documents. However, for this project, we decided not to move forward with LDA for the following reasons:
  1. Its inability to treat bigrams and trigrams (e.g., black pepper, fresh salad, olive oil) as single entities makes statistical language identification hard.
  2. The extended execution time required to train the model on the 2 GB data set.
  3. The need for constant supervision and cleaning of the data, because certain terms are ambiguous and belong to more than one topic.
The LDA model used in this project showed the best results when the number of topics was set to 10. Any change in the number of topics led to results that were hard to interpret, combining recipes from different categories (e.g., desserts, soups, meat dishes) into one topic.

Top2Vec with Doc2Vec


Top2Vec is a fully integrated approach to analyzing NLP data (Angelov, 2020). It includes a step where the "documents" are embedded into vectors; the specific algorithm used to embed the documents is interchangeable. The dimensionality is then reduced using UMAP, and dense pockets of documents are located using HDBSCAN. Once the dense areas are located, their centroids are calculated and the most relevant documents can be returned.

Here we used Doc2Vec to embed the documents. Doc2Vec is a technique that uses neural networks to derive semantic knowledge by training neurons to predict subsequent words in a sentence as well as encoding an overall context or theme of the entire document.

We applied Top2Vec with Doc2Vec embeddings to both the "clean" and the unaltered versions of the ingredients and directions columns. The model appeared to perform better with the full text than with the "cleaned" versions. This makes sense given how the embedding works: Doc2Vec learns from word order and surrounding context, much of which the cleaning removes.
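The retrieval step at the heart of Top2Vec, finding the documents nearest a topic centroid in embedding space, can be sketched as follows (2-d toy vectors stand in for real Doc2Vec embeddings, and the recipe names and numbers are invented):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy 2-d document embeddings (real Doc2Vec vectors have hundreds of dimensions)
doc_vecs = {"beef stew": [0.9, 0.1], "pot roast": [0.8, 0.2], "lemon cake": [0.1, 0.9]}
centroid = [0.85, 0.15]  # centroid of a dense "savory" cluster found by HDBSCAN

# Documents ranked by proximity to the topic centroid
ranked = sorted(doc_vecs, key=lambda d: cosine(doc_vecs[d], centroid), reverse=True)
print(ranked[0])  # → beef stew
```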


BERTopic

BERTopic uses clustering techniques similar to Top2Vec. However, it uses sentence-transformers and c-TF-IDF to generate dense clusters and interpretable topics (Grootendorst, 2022). The algorithm works in the following three phases:

  • Embed Documents: BERTopic extracts document embeddings using Sentence Transformers, a form of transfer learning. This phase encodes a sentence or short paragraph into a fixed-length vector.
  • Cluster Documents: As in Top2Vec, UMAP is used to reduce the dimensionality of the embeddings, and HDBSCAN to identify clusters of semantically similar recipes.
  • Create Topic Representation: BERTopic uses a class-based TF-IDF (c-TF-IDF) technique to identify the most important terms in each cluster. c-TF-IDF treats all documents in a cluster as a single document and then computes TF-IDF, returning the most important terms (i.e. keywords) within each cluster.
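The third phase can be illustrated with a simplified c-TF-IDF (each cluster's documents are concatenated into one "class document"; BERTopic's exact weighting differs slightly, and the clusters here are toy data):

```python
import math
from collections import Counter

def c_tf_idf(clusters):
    """Return the top term per cluster under a simplified class-based TF-IDF:
    term frequency within the class, scaled by rarity across classes."""
    class_docs = [" ".join(docs).lower().split() for docs in clusters]
    n = len(class_docs)
    df = Counter(w for doc in class_docs for w in set(doc))  # class frequency per term
    top_terms = []
    for doc in class_docs:
        tf = Counter(doc)
        scores = {w: tf[w] / len(doc) * math.log(1 + n / df[w]) for w in tf}
        top_terms.append(max(scores, key=scores.get))
    return top_terms

clusters = [
    ["whisk flour sugar", "cream sugar butter", "sugar flour eggs"],
    ["simmer broth onion", "add broth garlic", "season broth thyme"],
]
print(c_tf_idf(clusters))  # → ['sugar', 'broth']
```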

Correlation Explanation (CorEx)


Correlation Explanation (CorEx) is a relatively recent development in the NLP space. It functions by correlating words based on their co-appearance within documents (Gallagher, 2017). CorEx allows for semi-supervised training by letting the user provide "anchor" words to the model. These anchor words are weighted and guide the model toward topics the user deems important that the model might otherwise have missed.

A few things of note: CorEx doesn't yet support n-grams; keywords can belong to only a single topic (overridable using anchor words); and the assignment of documents to topics is probabilistic, so documents aren't restricted to a single topic. A document can be assigned to all topics, no topics, or anything in between.
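That multi-topic behavior can be made concrete with a threshold over per-topic probabilities (the probabilities, recipe names, and 0.5 cut-off are all hypothetical; CorEx exposes its own soft document-topic labels):

```python
# Toy per-recipe topic probabilities from a CorEx-style model (hypothetical values)
probs = {
    "chili con carne": [0.9, 0.6, 0.1],   # crosses the threshold for two topics
    "fruit salad":     [0.2, 0.1, 0.3],   # crosses it for none
}

THRESHOLD = 0.5  # assumed cut-off for illustration

# Assign each recipe to every topic whose probability clears the threshold
assignments = {r: [t for t, p in enumerate(ps) if p >= THRESHOLD] for r, ps in probs.items()}
print(assignments)  # → {'chili con carne': [0, 1], 'fruit salad': []}
```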


Developing Intuitive Categories

One of the objectives of this project was to identify and automatically categorize the recipes into intuitive categories. The recipes can be categorized using clustering algorithms. However, assigning a meaningful name to each cluster or category is challenging in unsupervised learning. Several unsupervised methods suggest a list of potential topic terms for a cluster, such as BERTopic, word2vec, and LDA.

However, assigning a meaningful name to each cluster usually requires manual intervention and domain expertise. This work proposes an automated method to assign a meaningful name to each recipe category (cluster). This is done in the following steps:

  • We created a clean_title column, which contains only the name of the dish type, not the ingredients. For example, a recipe titled "Apple Pie" becomes "Pie" in the clean_title column. This is done by removing all words in the title that are also present in the corresponding NER column (i.e. the clean ingredients list).
  • We combined the directions and clean_title columns and used the BERTopic model to cluster similar recipes together. The figure below shows all the clusters (topics) produced by BERTopic. Some of the most important terms for the clusters are add, heat, drain, and butter, which do not represent all the recipes in those topics.

Figure. Results obtained from BERTopic on directions and clean_title.

  • Hence, we identified the most frequent clean_title in each cluster and selected the category name for each cluster based on the following rule:
      - The most frequent clean_title represents the category of every recipe in the cluster if the difference in frequency between the most and second most frequent clean_title is less than 50%.
      - If the difference in frequency between the two most frequent clean_title values is greater than or equal to 50%, both titles, concatenated with an 'and', represent the category of every recipe in that cluster.
      - If two or more clusters share the same most frequent clean_title, they are combined into one cluster.
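The naming rule above can be sketched as follows (interpreting "difference" as the relative drop (f1 - f2) / f1 between the two top frequencies is an assumption, since the post does not define it precisely; the titles are toy data):

```python
from collections import Counter

def category_name(clean_titles):
    """Name a cluster from its clean_title values, following the rule above.
    The 'difference' between the two most frequent titles is taken here as
    the relative drop (f1 - f2) / f1 -- an interpretive assumption."""
    (t1, f1), (t2, f2) = Counter(clean_titles).most_common(2)
    if (f1 - f2) / f1 < 0.5:
        return t1
    return f"{t1} and {t2}"

print(category_name(["pie", "pie", "cake", "pie", "cake"]))      # → pie
print(category_name(["soup", "soup", "soup", "soup", "chili"]))  # → soup and chili
```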

Figure. Two-dimensional visualization of the clusters produced by the HDBSCAN step of the BERTopic computation.

The categories identified were cake and bread, cookie, pie, soup, dip, meat, muffin, salad, fudge, casserole, cocktail, ice cream, chili, brownie, meatball, and fish. These categories were derived to allow users to choose a subcategory of food when selecting recipes.


Developing a Comparison Metric for Evaluating Topic Modeling

Unsupervised learning techniques are notoriously difficult to evaluate, because unlike supervised learning the data are not labeled and therefore it is difficult to assess how well the technique performed without manually reviewing the result. The models used in this project had various parameters that we had the option of tuning. However, these parameters were often unique to each method and didn't necessarily have an analogue in the other models.

Therefore, comparing the models while holding variables constant was virtually impossible. Assessing topic modeling is especially difficult: the model returns specific themes as topics, so how do you define one topic as "good" and another as "bad?" Further, what do you do when your model generates hundreds of topics? Rating each one manually is time consuming.

We addressed this problem in the current project by leveraging the "NER" or clean ingredients column. This column, provided with the data set, contains a list of strings (mostly unigrams and bigrams) with the units and non-noun parts of speech stripped away. We combined this column for every recipe included in each topic and then measured the amount of ingredient overlap between the recipes within each topic.

Specifically, we divided the number of unique strings by the total number of strings (duplicates retained) to arrive at a score we termed "topic concentration." This number, ranging from 0 to 1, indicates how pure each topic is: more overlap in ingredients between recipes yields a score closer to 0 and a better, more concentrated topic, while topics whose recipes share few ingredients score closer to 1 and are less well concentrated.
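The topic concentration score reduces to a one-line computation (the ingredient lists below are toy examples; lower is better):

```python
def topic_concentration(recipes_ingredients):
    """Unique ingredient strings / total ingredient strings (duplicates kept)
    across all recipes in a topic; lower = more overlap = a tighter topic."""
    all_strings = [s for recipe in recipes_ingredients for s in recipe]
    return len(set(all_strings)) / len(all_strings)

# A tight topic (much ingredient overlap) vs. a loose one (no overlap)
tight = [["flour", "sugar"], ["flour", "sugar"], ["flour", "butter"]]
loose = [["flour", "sugar"], ["beef", "onion"], ["tofu", "ginger"]]
print(topic_concentration(tight))  # → 0.5
print(topic_concentration(loose))  # → 1.0
```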


Figure. From top to bottom: the formula for calculating topic concentration, a hypothetical example of a reasonably well concentrated topic, and a hypothetical example of a poorly concentrated topic.

Evaluation of Topic Modeling/Clustering Approaches

Utilizing the "topic concentration" score, we evaluated the topics produced by BERTopic, Top2Vec, and CorEx. We plan to add k-means clustering to this evaluation, and potentially LDA. BERTopic produced 256 topics with a median topic concentration of 0.484 and a range of 0.082 to 0.898. Top2Vec with Doc2Vec embeddings produced 228 topics with a median of 0.375 and a range of 0.123 to 0.65. Lastly, CorEx, which was specified to produce 50 topics, had a median of 0.193 and a range of 0.079 to 0.467.

Figure. Violin plots of the topic concentration scores for BERTopic, Top2Vec, and CorEx. Lower scores indicate more overlap in recipe ingredients and therefore better, more cohesive topics.

R Shiny App Development


A recipe finder R Shiny app was developed in RStudio. As mentioned earlier, food waste is a major problem in the United States and many other developed nations. Even though individual food waste is a small fraction of the total, we can optimize our habits to minimize it and contribute to the overall decrease in food waste.

One way of doing that is to plan meals ahead of time and buy only what we need. However, that is not always possible, and more often than not we end up with ingredients in the fridge that do not fit the recipes we already know. As a solution, our app takes these odd leftover ingredients as input and suggests the recipes that best fit them, based on the classification model described above.


“What’s for dinner?” Tab


In our RECIPES app, the "What's for dinner?" tab takes ingredients as input and, using the classification algorithm described above, returns the most relevant recipes.

A separate algorithm was used for topic modeling, which gave us accurate and distinct categories. The user inputs ingredients first, and the app returns recipes from all categories; the user can then narrow the recipes down by category. Someone in the mood for a snack might prefer the "salad" category over the "casserole" category.


“Soups”, “Salads and Dressings”, “Main”, “Baked Goods” Tabs


The remaining tabs in the app contain pre-categorized recipes. These tabs are for users who already have an idea of what type of dish they want, whether it be a soup, salad, or main dish, or perhaps someone throwing a dinner party who would like to explore all categories. After selecting a tab, they can narrow down the list of recipes based on the selected ingredients.


Examining the topic concentration scores, CorEx appears to have performed best (raw scores only; no statistic was computed). However, because the number of topics must be specified in CorEx, and because its 50 topics were roughly one-fifth the number produced by the other two models, the scores may not be comparable. Additionally, CorEx produces, for each recipe, the probability that it belongs to each topic.

Therefore, even when examining only the maximum probability produced for each recipe, a recipe can be placed into more than one topic or into no topics at all. As a result, the total number of recipes included in the topics, counting each recipe once per topic it appears in, was 43,646, even though the number of unique recipes in the sample was 22,311. This inflates the denominator and artificially suppresses the score. The figure below displays the number of topics each recipe was assigned to.

Figure. Frequency counts of the number of topics each recipe was assigned to (zero, one, or more than one). The mode is 4 topics, with many recipes assigned to as many as 16 and some to 37 of the 50 topics.

The second-best performing topic modeling technique was Top2Vec (raw scores only; no statistic was computed). Top2Vec also features built-in search functionality, which was used to test a few sample ingredient searches. These searches returned recipes that logically fit the ingredients provided and, perhaps more importantly, did not return any recipes that seemed out of place given the ingredient list.

Figure. Example search results from the Top2Vec topic modeling output. The list of input ingredients returned reasonable recipes while not returning odd or unrelated ones.

BERTopic performed the least well of the three models (raw scores only; no statistic was computed). Because BERTopic uses sentence transformers, a transfer learning technique, it is possible that the narrowness of the subject area (i.e. recipes) stands in contrast to the breadth of the data the model was trained on. It is also important to note that the models' score ranges decreased in size in the same order.

Future Directions

Main Ingredient Identification

Identifying the main ingredient of a recipe would help in grouping similar recipes together. We would like to continue working on this data set and quantify the percentage of the overall ingredients that each ingredient makes up, by extracting the units of measurement and standardizing them across all ingredients. This would also help in returning more relevant recipes, as it would preclude returning a recipe whose main ingredient is not in the searched list.

Additional Categorizing

We noted during the project that recipes for those with dietary restrictions (e.g. vegetarian, vegan, gluten free) were often inextricable from other recipes, as the overlap of spices may have caused them to be grouped together. We would like to make these specific recipes identifiable. Additionally, adding a genre attribute to allow users to search by the genre of food they desire would make the app more user friendly.

Nutritional Information

Lastly, returning nutritional information for each recipe could help users select recipes that best fit their dietary needs.


References

Angelov, Dimo (2020). Top2Vec: Distributed Representations of Topics. arXiv:2008.09470.

Gallagher, Ryan J. et al. (2017). Anchored Correlation Explanation: Topic Modeling with Minimal Domain Knowledge. Transactions of the Association for Computational Linguistics, vol. 5, pp. 529-542. ISSN 2307-387X.

Grootendorst, Maarten (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv:2203.05794.

Le, Quoc & Mikolov, Tomas (2014). Distributed Representations of Sentences and Documents. arXiv:1405.4053v2.

Ver Steeg, Greg (2015). corex_topic [software].

About Authors

Monika Singh

Solution-oriented data scientist with a Ph.D. in computer science and five years of experience working as a researcher in digital forensic and IoT security. Passionate about data science with skills in data analytics, model building, and projections. Relevant...

Aleksandra Galczak

Ambitious Data Scientist with a passion for machine learning and data analytics, with former experience and skills developed as a mechanical engineer, that include analytics and problem solving, as well as new found skills of coding in python,...

James Reno, Ph.D.

Curious scientist ready for a new challenge and growth opportunity. Insatiable problem-solver with a record of quickly orienting to new domains, and asking the right questions to identify key factors. Team player with a knack for explaining data...

Stefan Nachtigall

NLP researcher with a master degree in comparative linguistics ( English, German, Russian, localization ). Currently working in ML and neuroscience fields.
