Use large language models (LLM) to perform ingredient list spellcheck #314

raphael0202 · 2023-05-23T10:10:03Z

Problem

Many data quality errors in ingredient lists come from spelling errors. Most are introduced during OCR, either because the image is blurry or because of limitations of the OCR model (we use Google Cloud Vision). As a result, we have:

(1) ingredients with spelling mistakes
(2) ingredients not separated by a comma (or another ingredient list separator), resulting in an "unknown ingredient" warning
(3) incorrect line continuation (the way Google Cloud Vision joins words to make paragraphs): the ingredient list contains unrelated words.

I think (1) and (2) can be corrected using language models for spelling correction; (3) is trickier.
We implemented a spellcheck module in Robotoff using Elasticsearch, but it is not good enough to be used without human supervision. It is currently unused and will soon be removed from the codebase.

Proposed solution

Explore the use of large language models for ingredient spellcheck. We must ensure the model does not hallucinate new ingredients or modify ingredients that were already valid. ChatGPT (GPT-3.5) seems like a good starting point.
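
One way to check the "no hallucination, no modification of valid ingredients" requirement is a post-hoc comparison of the original and corrected text. The sketch below is only an illustration under assumptions, not an existing Robotoff module: the function names, the `known_ingredients` set (standing in for an ingredient taxonomy lookup) and the 0.6 similarity cutoff are all hypothetical.

```python
import difflib
import re


def words(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"\w+", text.lower())


def contains_phrase(text, phrase):
    """True if `phrase` occurs in `text` as a whole word or word sequence."""
    pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
    return re.search(pattern, text.lower()) is not None


def check_correction(original, corrected, known_ingredients, cutoff=0.6):
    """Return (lost_valid, new_words) for a proposed correction.

    lost_valid: known ingredients present in the original but missing
                from the corrected text (a valid ingredient was modified).
    new_words:  words of the corrected text with no similar word in the
                original (possible hallucinations).
    `known_ingredients` and `cutoff` are assumptions of this sketch.
    """
    lost_valid = [
        ing
        for ing in sorted(known_ingredients)
        if contains_phrase(original, ing) and not contains_phrase(corrected, ing)
    ]
    original_words = words(original)
    new_words = [
        w
        for w in words(corrected)
        if not difflib.get_close_matches(w, original_words, n=1, cutoff=cutoff)
    ]
    return lost_valid, new_words
```

A correction would be accepted automatically only when both returned lists are empty; anything else could be sent to human review. Working at word level means that inserting a missing comma is not flagged as a change.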

If it works well, we can use ChatGPT to generate a high-quality spellcheck dataset (pairs mapping the text to correct to its corrected version), then fine-tune an open-source large language model that we can host on our own servers to replicate this feature.

Where to get the data?

The best way to get a list of products whose ingredient list contains errors is to download the Open Food Facts JSONL dataset and look for products with ingredient quality warnings.
The data quality warning tags are available in the data_quality_warnings_tags field. Relevant tags for spotting ingredient lists with errors:

en:ingredients-unknown-score-above-0
en:ingredients-50-percent-unknown
en:ingredients-60-percent-unknown
en:ingredients-70-percent-unknown
en:ingredients-80-percent-unknown
en:ingredients-90-percent-unknown
...
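
A minimal sketch of this filtering step, assuming the dataset is read line by line as JSON objects. Only the data_quality_warnings_tags field and the tag names come from the text above; the function name, the file name in the usage comment and the `code` field used in the example are assumptions.

```python
import json

# Tags listed above. The source list is open-ended ("..."), so this set
# is deliberately partial and can be extended.
UNKNOWN_INGREDIENT_TAGS = {
    "en:ingredients-unknown-score-above-0",
    "en:ingredients-50-percent-unknown",
    "en:ingredients-60-percent-unknown",
    "en:ingredients-70-percent-unknown",
    "en:ingredients-80-percent-unknown",
    "en:ingredients-90-percent-unknown",
}


def iter_products_with_warnings(lines, tags=UNKNOWN_INGREDIENT_TAGS):
    """Yield products (dicts) that carry at least one of the given tags.

    `lines` is any iterable of JSONL lines, for example an open file.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        product = json.loads(line)
        warnings = product.get("data_quality_warnings_tags") or []
        if tags.intersection(warnings):
            yield product


# Example usage (hypothetical file name):
# import gzip
# with gzip.open("products.jsonl.gz", "rt", encoding="utf-8") as f:
#     for product in iter_products_with_warnings(f):
#         print(product.get("code"))
```

Reading the file as a stream avoids loading the whole dataset into memory.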

Additional resources

Wiki page about ingredient data quality.

To check whether the Open Food Facts server recognizes the corrected text, use this page:
https://world.openfoodfacts.org/cgi/test_ingredients_analysis.pl?lc=it

Note that the lc parameter (lc=it in the link above, lc=fr for French) gives the language of the ingredient list, which is used to parse it. Unknown ingredients do not necessarily indicate a spelling error: some ingredients are not recognized simply because they are missing from our ingredient taxonomy. Ingredient coverage depends on the language (good for English and French, poor for low-resource languages).

Part of

Leverage Generative AI across Open Food Facts (tracker) #289


raphael0202 added ✨ enhancement ingredients spellcheck labels May 23, 2023

teolemon mentioned this issue May 23, 2023

Leverage Generative AI across Open Food Facts (tracker) #289

Open

raphael0202 transferred this issue from openfoodfacts/robotoff Aug 11, 2023

raphael0202 mentioned this issue Aug 11, 2023

Use LLMs to spellcheck/lowercase ingredient lists #297

Closed

jeremyarancio linked a pull request Apr 11, 2024 that will close this issue

feat: Spellcheck benchmark dataset and evaluation algorithm #340

Merged

jeremyarancio removed a link to a pull request Apr 11, 2024

feat: Spellcheck benchmark dataset and evaluation algorithm #340

Merged

teolemon added the LLMs label Apr 25, 2024

jeremyarancio self-assigned this May 13, 2024

raphael0202 commented May 23, 2023 • edited by teolemon