
Fix error when decoding a token in the id gap (or out of range) in a tiktoken tokenizer #841

Merged (5 commits) on Jan 8, 2024

Conversation

@dakinggg (Collaborator) commented Jan 7, 2024

Previously, if you tried to decode an invalid token id, the tokenizer would crash. For a trained model this would never be an issue, because it would not produce invalid tokens, but an untrained model can randomly produce any token id up to the embedding size. This PR instead has the tokenizer return an empty string for these invalid tokens.

This behavior matches some Hugging Face fast tokenizers, but not slow tokenizers, which raise an error on out-of-range indices. Even though the tiktoken wrapper is technically a slow tokenizer, the fast-tokenizer behavior seems better here, as it avoids crashes when sampling from random models.
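The gap-tolerant decoding described above can be sketched as follows. This is a minimal illustration, not the actual llm-foundry implementation: the `id_to_bytes` dict and the `safe_decode` helper are hypothetical stand-ins for the tokenizer's id-to-bytes table and its decode path.

```python
def safe_decode(id_to_bytes: dict[int, bytes], token_ids: list[int]) -> str:
    """Decode token ids, substituting an empty byte string for any id
    that falls in a vocabulary gap or beyond the vocabulary range,
    instead of raising (the behavior this PR adds)."""
    pieces = [id_to_bytes.get(tid, b"") for tid in token_ids]
    return b"".join(pieces).decode("utf-8", errors="replace")


# Toy vocabulary with a gap: id 2 is unassigned, id 99 is out of range.
vocab = {0: b"Hello", 1: b", ", 3: b"world"}
print(safe_decode(vocab, [0, 1, 2, 99, 3]))  # -> "Hello, world"
```

A slow-tokenizer-style implementation would instead raise `KeyError` on ids 2 and 99; the `dict.get` default is what makes the decode crash-proof for untrained models.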

It also adds a utf-8 encoding to two file open calls. See mosaicml/composer#2824 for more details.
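The encoding fix amounts to passing `encoding="utf-8"` explicitly to `open()`; the file name and contents below are illustrative, not taken from the PR. Without an explicit encoding, Python falls back to the platform's locale default (e.g. cp1252 on Windows), which can corrupt or fail to read non-ASCII vocabulary entries.

```python
import json
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "vocab.json")

# Write and read with an explicit utf-8 encoding so the result does not
# depend on the platform's default locale encoding.
with open(path, "w", encoding="utf-8") as f:
    json.dump({"café": 0}, f, ensure_ascii=False)

with open(path, encoding="utf-8") as f:
    data = json.load(f)

print(data)  # -> {'café': 0}
```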

@dakinggg dakinggg marked this pull request as ready for review January 7, 2024 23:09
@sashaDoubov (Contributor) left a comment

LGTM! This is very useful for debugging. Thank you for digging into the utf-8 encoding issue.

llmfoundry/tokenizers/tiktoken.py (review thread resolved)
@dakinggg dakinggg merged commit 5b99488 into mosaicml:main Jan 8, 2024
10 checks passed
@dakinggg dakinggg deleted the tiktoken-gap branch February 10, 2024 07:30