
Convert tokens to strings before partial matching #194

Merged · 1 commit · Jul 23, 2023

Conversation

rlouf
Member

@rlouf rlouf commented Jul 17, 2023

BPE tokenizers encode whitespace as the special character Ġ, and other tokenizers may encode one or several strings as special characters. Partially matching against the raw tokens therefore prevents relevant matches from happening.

We thus add a `convert_token_to_string` method that converts these special characters back to the strings they correspond to, so we can partially match the corresponding tokens against the regex.
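For intuition, here is a minimal sketch of what such a conversion could look like for a GPT-2-style byte-level BPE vocabulary. The simple character replacement is an illustrative assumption, not the actual implementation, which maps bytes back through the tokenizer's byte decoder:

```python
def convert_token_to_string(token: str) -> str:
    """Hypothetical sketch: map GPT-2-style BPE markers back to raw text.

    GPT-2's byte-level BPE renders a leading space as "Ġ" and a newline
    as "Ċ"; a simple replacement conveys the idea.
    """
    return token.replace("Ġ", " ").replace("Ċ", "\n")

print(convert_token_to_string("Ġday"))
# " day" (with the leading space restored)
```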

The following snippet illustrates the problem and its solution:

```python
import interegular

from outlines.text.parsing import find_partial_matches
from outlines.models.transformers import transformers

model = transformers("gpt2")

fsm = interegular.parse_pattern("This is a new day").to_fsm()
pmatch = find_partial_matches(fsm, "Ġday")
print(pmatch)
# set()

pmatch = find_partial_matches(fsm, model.tokenizer.convert_token_to_string("Ġday"))
print(pmatch)
# {(3, (13, 14, 15, 16, 17))}
```

Note: we could also iterate over the token ids and use `tokenizer.decode`. This may be simpler than adding a new function, especially since we wouldn't have to sort the vocabulary, but it is arguably less explicit, as it may leave people wondering why we don't use `vocabulary.keys()` directly.
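A sketch of that decode-based alternative, using a stub tokenizer so the example is self-contained (the stub class and its id-to-string mapping are made up for illustration):

```python
class StubTokenizer:
    """Minimal stand-in for a Hugging Face tokenizer (illustrative only)."""

    def __init__(self, id_to_string):
        self.id_to_string = id_to_string

    def decode(self, token_ids):
        return "".join(self.id_to_string[i] for i in token_ids)


# The vocabulary maps token strings to ids, as a real tokenizer's would.
vocabulary = {"This": 1212, "Ġday": 1110}
tokenizer = StubTokenizer({1212: "This", 1110: " day"})

# Decode each id individually instead of converting the token strings.
strings = [tokenizer.decode([token_id]) for token_id in vocabulary.values()]
print(strings)
# ['This', ' day']
```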

BPE tokenizers encode whitespace as a special character, and it is
possible that another tokenizer may encode one or several strings as
special character(s).

We thus add a `convert_token_to_string` method that converts these special
characters back to the strings they correspond to so we can partially
match the corresponding tokens with the regex.
@rlouf rlouf added text Linked to text generation bug structured generation Linked to structured generation labels Jul 17, 2023
@rlouf rlouf added this to the 0.1 milestone Jul 19, 2023
@rlouf rlouf merged commit a4a0868 into outlines-dev:main Jul 23, 2023
4 checks passed
@rlouf rlouf deleted the convert-tokens-to-strings branch July 23, 2023 11:26