
Convert tokens to strings before partial matching #194

Merged · 1 commit · Jul 23, 2023

Conversation

rlouf
Member

@rlouf rlouf commented Jul 17, 2023

BPE tokenizers encode whitespace as the special character Ġ, and other tokenizers may encode one or several strings as special characters. Partially matching against the raw tokens therefore prevents relevant matches from happening.

We thus add a `convert_token_to_string` method that converts these special characters back to the strings they correspond to, so we can partially match the corresponding tokens against the regex.
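For intuition, here is a minimal sketch of what such a conversion could look like for a GPT-2-style byte-level BPE vocabulary. The simple character replacement is an illustrative assumption, not the actual implementation, which maps bytes back through the tokenizer's byte decoder:

```python
def convert_token_to_string(token: str) -> str:
    """Hypothetical sketch: map GPT-2-style BPE markers back to raw text.

    GPT-2's byte-level BPE renders a leading space as "Ġ" and a newline
    as "Ċ"; a simple replacement conveys the idea.
    """
    return token.replace("Ġ", " ").replace("Ċ", "\n")

print(convert_token_to_string("Ġday"))
# " day" (with the leading space restored)
```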

The following snippet illustrates the problem and its solution:

```python
import interegular

from outlines.text.parsing import find_partial_matches
from outlines.models.transformers import transformers

model = transformers("gpt2")

fsm = interegular.parse_pattern("This is a new day").to_fsm()
pmatch = find_partial_matches(fsm, "Ġday")
print(pmatch)
# set()

pmatch = find_partial_matches(fsm, model.tokenizer.convert_token_to_string("Ġday"))
print(pmatch)
# {(3, (13, 14, 15, 16, 17))}
```

Note: we could also iterate over the token ids and use `tokenizer.decode`. This may be simpler than adding a new function, especially since we wouldn't have to sort the vocabulary, but it is arguably less explicit, as it may leave people wondering why we don't use `vocabulary.keys()` directly.
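A sketch of that decode-based alternative, using a stub tokenizer so the example is self-contained (the stub class and its id-to-string mapping are made up for illustration):

```python
class StubTokenizer:
    """Minimal stand-in for a Hugging Face tokenizer (illustrative only)."""

    def __init__(self, id_to_string):
        self.id_to_string = id_to_string

    def decode(self, token_ids):
        return "".join(self.id_to_string[i] for i in token_ids)


# The vocabulary maps token strings to ids, as a real tokenizer's would.
vocabulary = {"This": 1212, "Ġday": 1110}
tokenizer = StubTokenizer({1212: "This", 1110: " day"})

# Decode each id individually instead of converting the token strings.
strings = [tokenizer.decode([token_id]) for token_id in vocabulary.values()]
print(strings)
# ['This', ' day']
```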

BPE tokenizers encode whitespace as a special character, and it is
possible that another tokenizer may encode one or several strings as
special character(s).

We thus add a `convert_token_to_string` method that converts these special
characters back to the strings they correspond to so we can partially
match the corresponding tokens with the regex.
@rlouf rlouf added text Linked to text generation bug structured generation Linked to structured generation labels Jul 17, 2023
@rlouf rlouf added this to the 0.1 milestone Jul 19, 2023
@rlouf rlouf merged commit a4a0868 into outlines-dev:main Jul 23, 2023
4 checks passed
@rlouf rlouf deleted the convert-tokens-to-strings branch July 23, 2023 11:26