Regex FSM fails with some tokenizers #762

saattrupdan · 2024-03-20T15:53:41Z

Describe the issue as clearly as possible:

When running a regex FSM with some tokenizers, the reduced_vocabulary function fails due to the existence of a token ▁� in the vocabulary. This includes, but is not limited to, the following model types:

All Gemma models
All GPT-Sw3 models

The following models are not affected by this:

Mistral models
LLaMA models
OLMo models
Qwen models
Phi models
Yi models
GPT-2 models

This was implemented in this commit as part of this PR. Tagging @ksvladimir here as he might be able to help resolve this 🙂

Note that this bug happens when installing from the main branch and not any release, but I include it here to be fixed before the next release (ideally)

Steps/code to reproduce the bug:

from outlines.fsm.regex import reduced_vocabulary
from outlines.integrations.utils import adapt_tokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("AI-Sweden-Models/gpt-sw3-6.7b-v2")
tokenizer = adapt_tokenizer(tokenizer)
vocabulary = reduced_vocabulary(tokenizer)

Expected result:

No error message.

Error message:

RuntimeError                              Traceback (most recent call last)
Cell In [45], line 1
----> 1 vocab = reduced_vocabulary(tok)

File ~/.venv/lib/python3.10/site-packages/outlines/fsm/regex.py:793, in reduced_vocabulary(tokenizer)
    789         token_bytes = cast(
    790             List[int], [gpt2_unicode_to_bytes().get(c) for c in token]
    791         )
    792         if None in token_bytes:
--> 793             raise RuntimeError(
    794                 f"Cannot convert token `{token}` ({token_idx}) to bytes: {token_str}"
    795             )
    796     token_str = tuple(byte_symbol(b) for b in token_bytes)
    798 vocabulary.setdefault(token_str, []).append(token_idx)

RuntimeError: Cannot convert token `▁�` (6546) to bytes:  �

Outlines/Python version information:

Version information

0.0.37.dev8+gf7cafe5
Python 3.10.13
annotated-types==0.6.0
asttokens==2.4.1
attrs==23.2.0
certifi==2024.2.2
charset-normalizer==3.3.2
cloudpickle==3.0.0
decorator==5.1.1
diskcache==5.6.3
exceptiongroup==1.2.0
executing==2.0.1
filelock==3.13.1
fsspec==2024.3.1
huggingface-hub==0.21.4
idna==3.6
interegular==0.3.3
ipython==8.22.2
jedi==0.19.1
Jinja2==3.1.3
joblib==1.3.2
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
lark==1.1.9
llvmlite==0.42.0
MarkupSafe==2.1.5
matplotlib-inline==0.1.6
mpmath==1.3.0
nest-asyncio==1.6.0
networkx==3.2.1
numba==0.59.1
numpy==1.26.4
outlines @ git+https://github.com/outlines-dev/outlines@f7cafe5f60b26af2fa62f9dec19ee6dd84745ced
packaging==24.0
parso==0.8.3
pexpect==4.9.0
prompt-toolkit==3.0.43
ptyprocess==0.7.0
pure-eval==0.2.2
pydantic==2.6.4
pydantic_core==2.16.3
Pygments==2.17.2
PyYAML==6.0.1
referencing==0.34.0
regex==2023.12.25
requests==2.31.0
rpds-py==0.18.0
safetensors==0.4.2
scipy==1.12.0
sentencepiece==0.2.0
six==1.16.0
stack-data==0.6.3
sympy==1.12
tokenizers==0.15.2
torch==2.2.1
tqdm==4.66.2
traitlets==5.14.2
transformers==4.38.2
typing_extensions==4.10.0
urllib3==2.2.1
wcwidth==0.2.13

Context for the issue:

Prevents the use of RegexGuide with the above-mentioned models.

The text was updated successfully, but these errors were encountered:

ksvladimir · 2024-03-20T18:17:10Z

Looks like such models use different encoding for incomplete UTF-8 sequences. I will do my best to try and find some time to take a look in the near future.

The ideal and universal solution for this would be to have huggingface tokenizers expose decode_bytes function, which would simply decode incomplete sequences as raw bytes (I assume this is what underlying tokenizers do anyway - followed by decoding bytes from utf8). In the absence of that, we have to revert to hacks that are specific to tokenizer types.

This PR is an extension of #763, related to extending the `re_replacement_seq` regex. The new [NorwAI models](https://huggingface.co/NorwAI) use a tokenizer that has the token `�.`, which leads to the same error as was described in the previous issue #762. This PR extends the fix from #763 to deal with this case, as well as adding a unit test to test various tokenizers, and a comment describing why we need the prefix and suffix in the regex.

This PR is an extension of dottxt-ai#763, related to extending the `re_replacement_seq` regex. The new [NorwAI models](https://huggingface.co/NorwAI) use a tokenizer that has the token `�.`, which leads to the same error as was described in the previous issue dottxt-ai#762. This PR extends the fix from dottxt-ai#763 to deal with this case, as well as adding a unit test to test various tokenizers, and a comment describing why we need the prefix and suffix in the regex.

saattrupdan added the bug label Mar 20, 2024

ai-and-i mentioned this issue Mar 20, 2024

fixed parsing token vocabularies for gemma and gpt-sw3 models #763

Merged

rlouf closed this as completed in #763 Mar 25, 2024

This was referenced Jun 9, 2024

[MODEL EVALUATION REQUEST] NorwAI/NorwAI-Mistral-7B-instruct ScandEval/ScandEval#453

Closed

Fix/extend re replacement seq #948

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regex FSM fails with some tokenizers #762

Regex FSM fails with some tokenizers #762

saattrupdan commented Mar 20, 2024 •

edited

Loading

ksvladimir commented Mar 20, 2024

Regex FSM fails with some tokenizers #762

Regex FSM fails with some tokenizers #762

Comments

saattrupdan commented Mar 20, 2024 • edited Loading

Describe the issue as clearly as possible:

Steps/code to reproduce the bug:

Expected result:

Error message:

Outlines/Python version information:

Context for the issue:

ksvladimir commented Mar 20, 2024

saattrupdan commented Mar 20, 2024 •

edited

Loading