Recipes and Generative AI

Hasnain Raza · Published in HelloTech · Oct 24, 2023


Disclaimer: HelloFresh does not currently use generative AI models anywhere in the recipe creation pipeline. Everything in this article is purely research and exploration.

Our HelloFresh menus are created by local recipe developers, who carefully curate the meal selection taking local preferences and seasonal ingredients into consideration. Our core value is to listen to our customers and use their feedback to continually optimise our service and develop new recipes.

An assortment of prepared HelloFresh recipes. These need to be liked by our customers, but also created with as little waste (of time, food and money) as possible.

If a recipe is suboptimal, we lose customer satisfaction, time and revenue in refining it. How might one go about tackling these challenges? We’re going to explore one way in this blog: generative AI — more specifically: generative language models.

We will explore how training language models on the order of ~1M parameters can be relatively easy. Such models are fit for purpose, inexpensive to train and host, and can run in real time on CPUs without heavy investment in infrastructure or setup. They also carry a significantly lower risk of hallucination, since the vocabulary (the "action space") is limited to the culinary domain. These are substantial advantages compared to large language models, whose parameters number in the billions.

Generative AI for Recipes?

The idea of AI generated food isn’t new. You can log into ChatGPT now and ask it to give you an idea of what to cook, or give it a list of ingredients and tell you what to make. It might generate something that tastes great, or not. But these are not necessarily aligned to the HelloFresh brand, style, ingredients or recipe philosophy.

ChatGPT’s version of a “Teriyaki Chicken” recipe (full recipe not shown). Not necessarily aligned to the HelloFresh brand, assortment, style or flavour profile. It also lacks the actual creation, refinement and testing by a chef — the much needed “human touch”.

In this blog post, we will specifically explore the creation of recipe ideas (titles or headlines, not full recipes) through a language model that conforms to HelloFresh's business domain and style via some special prompts. Notably, we will not explore "large" (billions of parameters) language models for this task, but a "tiny" (hundreds of thousands of parameters) language model.

Why not use LLMs for this purpose?

This is a good question: if ChatGPT can already give you ideas about what to cook, why not use the API, or self-host an open-source LLM? Well, there are multiple reasons:

  1. These pre-trained LLMs may not conform to the HelloFresh brand, guidelines, style, ingredients, taste profile or other constraints without significant prompt engineering or fine-tuning.
  2. Hosting and fine-tuning LLMs for such a specific and simple machine learning problem may be overkill, especially in terms of costs and time for experimentation. It may make sense later though.
  3. Using APIs (like ChatGPT) raises concerns around pricing, handling of PII, and model tuning. A small, internally trained language model doesn't have the same drawbacks.
  4. Some of the licenses of open source LLMs and APIs don’t allow for commercial use.
  5. We want to explore how many current LLM ideas can be brought to tinier language models for our niche language domain.

With that out of the way, let’s start by understanding language modelling.

Language Modelling

Language modelling primarily refers to one specific task in machine learning: modelling the probability distribution of the next word (or token) given the current and previous words (or tokens).

A token simply refers to the unit of language we choose to work with (characters of the alphabet, words, sub-words, parts of sentences, etc. are all valid "tokenisations").
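As a concrete toy example, here is what a word-level tokenisation of one of our titles could look like; the vocabulary and ids below are made up purely for illustration:

# Illustrative only: a toy word-level tokenisation of one recipe title.
# (A character-level or sub-word scheme would be equally valid.)
title = "creamy aioli potato salad with dill and spring onion"

tokens = title.split()                                       # word-level tokens
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
token_ids = [vocab[tok] for tok in tokens]                   # integer ids the model consumes

print(tokens)     # ['creamy', 'aioli', 'potato', 'salad', 'with', 'dill', 'and', 'spring', 'onion']
print(token_ids)  # [2, 0, 5, 6, 8, 3, 1, 7, 4]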

The probability of a sentence composed of three tokens t₁, t₂, t₃ can then be expressed (via the chain rule of probability) as:

P(t₁, t₂, t₃) = P(t₁)P(t₂|t₁)P(t₃|t₁, t₂)

Nowadays, the most common way to model these probability distributions is via some neural network (or function) “Fₚ”, where “p” are the function’s parameters. This can be expressed as:

P(t₂ | t₁) = Fₚ(t₁)

or more generally as:

P(tₙ₊₁ | tₙ, tₙ₋₁, tₙ₋₂, …, t₁) = Fₚ(tₙ, tₙ₋₁, tₙ₋₂, …, t₁)

The idea is that the set of parameters "p" needs to be tuned such that these probabilities are maximised over the entire training corpus, which in our case is HelloFresh recipe titles.
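In practice, maximising these probabilities amounts to minimising the negative log likelihood (a cross-entropy loss) of each next token. A toy, self-contained sketch of that objective follows; the tiny network here is just a stand-in, not the model we actually train:

import torch
import torch.nn.functional as F

# Toy stand-in for F_p: an embedding table followed by a linear layer over the vocabulary.
# (The real model described below is a small GPT-2-style decoder.)
vocab_size, embed_dim, seq_len, batch = 64, 32, 10, 4
embed = torch.nn.Embedding(vocab_size, embed_dim)
head = torch.nn.Linear(embed_dim, vocab_size)

input_ids = torch.randint(0, vocab_size, (batch, seq_len))   # tokens t1 ... tn
target_ids = torch.randint(0, vocab_size, (batch, seq_len))  # tokens t2 ... tn+1

logits = head(embed(input_ids))  # (batch, seq_len, vocab_size) next-token scores

# Negative log likelihood of the true next token at every position.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), target_ids.reshape(-1))
loss.backward()  # gradients are used to tune the parameters p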

The function F these days is usually some variation of a transformer decoder, where p are the billions (in case of large language models) of parameters that need to be tuned to fit this distribution.

Transformers:

The transformer architecture was initially introduced in the paper "Attention is all you need". The paper details an encoder-decoder architecture where the encoder encodes information from one domain, while the decoder tries to use this information to generate the most probable tokens in another domain. This can be highly useful for tasks like language translation, where you might want to encode a French sentence and use that to inform the translation to English in the decoder.

Attention refers to the model applying different weights to different tokens or features, such that higher weights mean more "attention" is paid to those tokens (a crude but sufficient explanation for our purposes).

The original transformer architecture proposed in the paper “Attention is all you need” by Vaswani et al. The block on the left is generally referred to as the “encoder”, while the block on the right is a “decoder”.

Observant readers will realise here that encoder-decoder architectures aren't necessarily suitable for our "next token prediction" task, since we aren't trying to do something like translation. This is where the concepts of causal self-attention and GPTs (generative pre-trained transformers) come in.

Causal self-attention is the operation of applying attention across a series of tokens without letting future tokens affect the attention on the current token. This is precisely how we've formulated the language modelling task above (the probability of tᵢ depends on all tokens up to and including the iᵗʰ token, but not on any token after it).
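A minimal sketch of the masking involved, using plain tensors rather than a full attention layer (the shapes and values are purely illustrative):

import math
import torch

# Illustrative tensors: one sequence of 5 tokens with 8-dimensional features.
x = torch.randn(1, 5, 8)
q, k, v = x, x, x  # in self-attention, queries, keys and values come from the same tokens

scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))  # (1, 5, 5) pairwise scores

# Causal mask: position i may only attend to positions <= i.
mask = torch.tril(torch.ones(5, 5, dtype=torch.bool))
scores = scores.masked_fill(~mask, float("-inf"))

weights = scores.softmax(dim=-1)  # each row sums to 1 over past-and-current tokens only
out = weights @ v                 # (1, 5, 8) attended representation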

It turns out that this is a really appropriate task for the decoder in the transformer architecture. This, combined with huge amounts of language data scraped from the internet, forms the basis for pre-training these generative transformers (which is literally where the name GPT comes from). It is a very powerful scheme for learning a robust language model, and the idea is explored in the "GPT-1" and "GPT-2" papers by OpenAI.

The GPT architecture we choose to train for recipe titles corresponds to the GPT-2 architecture. Curious readers can also look at the original codebase of GPT-2.

Modified GPT-2 decoder without positional encoding used in this post.

For our version of the GPT, we remove the positional encodings — this is heavily motivated by the findings from “Transformer language models without positional encoding still learn positional information” by Haviv et al. The paper shows that for small context lengths, transformers with and without positional encodings perform almost identically. This is great for us, since the average number of tokens in our recipe titles is about 60.
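For illustration, a stripped-down version of such a decoder might look like the sketch below. This is a hypothetical configuration rather than our production code, and it uses PyTorch's built-in encoder layer with a causal mask as a stand-in for a GPT-2 block; note that no positional embeddings are added:

import torch
import torch.nn as nn

class TinyGPT(nn.Module):
    """A rough stand-in: a stack of causally masked self-attention blocks
    with token embeddings but no positional embeddings."""

    def __init__(self, vocab_size=512, n_layer=4, n_head=4, n_embd=64, dropout=0.1):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, n_embd)
        # An encoder layer with a causal mask behaves like a decoder-only (GPT-style)
        # block; norm_first + GELU brings it closer to GPT-2's pre-norm blocks.
        block = nn.TransformerEncoderLayer(
            d_model=n_embd, nhead=n_head, dim_feedforward=4 * n_embd,
            dropout=dropout, activation="gelu", batch_first=True, norm_first=True,
        )
        self.blocks = nn.TransformerEncoder(block, num_layers=n_layer)
        self.head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx):
        # idx: (batch, seq_len) token ids. No positional embeddings are added.
        seq_len = idx.size(1)
        causal_mask = torch.full((seq_len, seq_len), float("-inf")).triu(diagonal=1)
        x = self.tok_emb(idx)
        x = self.blocks(x, mask=causal_mask)
        return self.head(x)  # (batch, seq_len, vocab_size) next-token logits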

Data Collection and Preprocessing:

HelloFresh has been around for almost 12 years now and today operates 7 brands across 18 markets. This means that it has a significant database of customer-approved recipe data across countries, cuisines, dietary preferences, food trends and regional ingredients. For now, we used only English-language recipes.

This is an encouraging sign: diverse data is a key ingredient for machine learning. HelloFresh believes in data-driven decision making, and to enable this across the company it has implemented a "data mesh" that lets anyone consume or produce most types of data. This works to our advantage, as data collection is simply a matter of writing one query.

After preprocessing this data, we get recipe titles that have been standardised to look like this:

balsamic glazed dutch carrots with thyme and pepitas
creamy aioli potato salad with dill and spring onion
one-pan cheesy black bean tacos with green pepper and smoky red pepper crema

For training, we need to add some special tokens to these recipe titles. These tokens serve as prompts to the model. They are:

  1. “[CUI]” — represents the start of a cuisine.
  2. “[PROT]” — represents the primary protein in the recipe.
  3. “[SOT]” — represents the start of the recipe title.
  4. “[EOT]” — represents the end of the title.

So the recipe titles are transformed to look like:

[CUI] fusion [PROT] none [SOT] balsamic glazed dutch carrots with thyme and pepitas [EOT]
[CUI] fusion [PROT] veggie [SOT] creamy aioli potato salad with dill and spring onion [EOT]
[CUI] mexican [PROT] veggie [SOT] one-pan cheesy black bean tacos with green pepper and smoky red pepper crema [EOT]
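As a small illustration, the transformation itself is just string formatting. The helper below is hypothetical, with the cuisine and protein labels assumed to come from our recipe metadata:

def add_prompt_tokens(title: str, cuisine: str, protein: str) -> str:
    """Wrap a standardised recipe title with the special prompt tokens."""
    return f"[CUI] {cuisine} [PROT] {protein} [SOT] {title} [EOT]"

add_prompt_tokens(
    "one-pan cheesy black bean tacos with green pepper and smoky red pepper crema",
    cuisine="mexican",
    protein="veggie",
)
# -> "[CUI] mexican [PROT] veggie [SOT] one-pan cheesy black bean tacos with ... [EOT]"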

This representation lets us generate recipe titles in many ways. For example, if we just want a recipe title without caring about the cuisine or protein, we can feed the “[SOT]” token to the model and wait for the “[EOT]” token. Similarly, if we want to condition the generation of the title on some protein and cuisine, we can prompt with “[CUI] Japanese [PROT] chicken [SOT]” and wait for the “[EOT]” to get a Japanese recipe that uses chicken. This works because of the conditional probabilities explained above:

P(t₆ | t₁=[CUI], t₂=Japanese, t₃=[PROT], t₄=chicken, t₅=[SOT]) = Fₚ(t₁=[CUI], t₂=Japanese, t₃=[PROT], t₄=chicken, t₅=[SOT])
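A sketch of how this prompted generation might look in code, assuming a trained model (as in the earlier sketches) and a tokenizer that maps between text and token ids; the names and sampling settings here are hypothetical:

import torch

@torch.no_grad()
def generate_title(model, tokenizer, prompt="[CUI] japanese [PROT] chicken [SOT]",
                   max_new_tokens=60, temperature=0.8):
    """Sample tokens from the model until it emits [EOT] (or hits the length cap)."""
    ids = tokenizer.encode(prompt)               # list of prompt token ids
    eot_id = tokenizer.encode("[EOT]")[0]
    for _ in range(max_new_tokens):
        x = torch.tensor(ids).unsqueeze(0)       # (1, seq_len)
        logits = model(x)[0, -1] / temperature   # logits for the next token
        next_id = torch.multinomial(logits.softmax(dim=-1), num_samples=1).item()
        ids.append(next_id)
        if next_id == eot_id:
            break
    return tokenizer.decode(ids)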

Let’s train a model:

The process of training on the tokenised recipe titles involves a data loader that generates pairs of sequences: the input sequence fed to the model, and the target sequence the model should predict, which is simply the input shifted by one token.
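A minimal sketch of such a data loader (the class and variable names are hypothetical, and batching/padding details are omitted):

import torch
from torch.utils.data import Dataset

class RecipeTitleDataset(Dataset):
    """Yields (input, target) pairs where the target is the input shifted by one token."""

    def __init__(self, tokenised_titles):
        # tokenised_titles: a list of token-id lists, e.g. the wrapped titles from above
        self.titles = tokenised_titles

    def __len__(self):
        return len(self.titles)

    def __getitem__(self, i):
        ids = self.titles[i]
        x = torch.tensor(ids[:-1])  # t1 ... tn, fed to the model
        y = torch.tensor(ids[1:])   # t2 ... tn+1, what the model should predict
        return x, y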

Training on the corpus of recipes from HelloFresh, we can visualize the loss going down and how the model’s generation performance evolves through the course of training.

The training and validation negative log likelihood error (lower is better).
The training and validation perplexity (lower is better).

The training progresses as expected, with some overfitting which we don’t investigate in this blog. Now let’s see how the generation quality evolves over the course of training:

At Epoch 0 (One pass of training over the entire data):

Prompt: [CUI]
Model's Output: mediterranean [SOT] harissa chicken over couscous bowls
with pickled onion, feta cheese [EOT]
Prompt: [PROT]
Model's Output: pork [SOT] sausage and creamy pesto gnocchi with wilted
spinach [EOT]
Prompt: [SOT]
Model's Output: pan - fried beef fusilli with fresh fennel salad and
parmesan [EOT]

Already not bad. By the time we reach Epoch 39 (40 passes over the training data):

Prompt: [CUI]
Model's Output: fusion [PROT] pork [SOT] crispy chorizo and onion pizza with
rocket and parmesan salad [EOT]
Prompt: [PROT]
Model's Output: pork [CUI] fusion [SOT] crumbed pork and creamy slaw with
sweet potato chunks and peppercorn sauce [EOT]
Prompt: [SOT]
Model's Output: creamy lemon chicken breasts with mashed potatoes and
roasted zucchini [EOT]

This looks reasonable, so we integrate the model into an app to make it usable and build some extra features to help with recipe ideation.

Streamlit App:

We use Streamlit to quickly set up an app that uses the model to generate recipe titles.

What the Streamlit app looks like. It allows a user to prompt the model easily and optionally by selecting a cuisine, a primary protein, and even some starting words in the recipe title.
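For illustration, a rough sketch of how such an interface might be wired up; the widget options and the generate_title helper from the earlier sketch are assumptions, not the actual app code:

import streamlit as st

st.title("Recipe title ideas")

# The model and tokenizer would be loaded once elsewhere, e.g. with st.cache_resource.
cuisine = st.selectbox("Cuisine (optional)", ["any", "fusion", "italian", "japanese", "mexican"])
protein = st.selectbox("Primary protein (optional)", ["any", "chicken", "pork", "beef", "veggie", "none"])
starting_words = st.text_input("Starting words of the title (optional)", "")

if st.button("Generate recipe title"):
    # Build the special-token prompt from whichever options the user picked.
    prompt = ""
    if cuisine != "any":
        prompt += f"[CUI] {cuisine} "
    if protein != "any":
        prompt += f"[PROT] {protein} "
    prompt = (prompt + f"[SOT] {starting_words}").strip()
    st.write(generate_title(model, tokenizer, prompt=prompt))  # from the earlier sketch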

Alongside this, the app uses the embeddings generated from the model to find existing HelloFresh recipes that are closest to the generated one. This lets us gauge how “novel” a generated recipe title is. For these “closest” recipes, we also show some metrics (if available), such as the recipe rating, the last time it was on the menu and its cost. This helps us judge whether a generated recipe is a good, sufficiently novel candidate to iterate on.
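One simple way to do this is cosine similarity between a pooled embedding of the generated title and pooled embeddings of the existing catalogue. A hedged sketch, reusing the tok_emb table from the TinyGPT sketch above (mean pooling is just one possible choice, and how the catalogue vectors are stored is glossed over):

import torch
import torch.nn.functional as F

def embed_title(model, tokenizer, title):
    """One possible pooling: average the model's token embeddings for the title."""
    ids = torch.tensor(tokenizer.encode(title)).unsqueeze(0)
    with torch.no_grad():
        token_vectors = model.tok_emb(ids)       # (1, seq_len, n_embd)
    return token_vectors.mean(dim=1).squeeze(0)  # (n_embd,)

def closest_existing_recipes(generated_vec, catalogue_vecs, catalogue_titles, k=5):
    """Return the k existing titles whose embeddings are most similar to the generated one."""
    sims = F.cosine_similarity(generated_vec.unsqueeze(0), catalogue_vecs)  # (num_recipes,)
    top = sims.topk(k).indices
    return [(catalogue_titles[i], sims[i].item()) for i in top]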

Clicking on a recipe shows its closest existing neighbours along with their available metrics.

Conclusion

There we have it: a fast generator of novel recipe ideas for HelloFresh. The tool is powered by a “tiny” GPT model served through Streamlit, running in near real time on a CPU. The performance is sufficient even without pre-training or a larger model.

The use of “similar” recipes to judge performance can be improved. We also want to investigate whether reinforcement learning can be applied to learn policies that generate titles maximising (or minimising) certain metrics, such as recipe ratings, so that we only generate recipes likely to score highly with customers, using approaches like Direct Preference Optimisation or Proximal Policy Optimisation. Alongside this, there will be more work on deep learning for recipe translation, generative AI for recipe images, and title and step generation and curation. These will be discussed in possible follow-ups to this post.
