Draft: Update Tokenizer Overrides Handling in models.py #1549

mhenrichsen · 2024-04-19T12:43:40Z

Example:

tokenizer_overrides:
  - 28006: <|im_start|>
  - 28007: <|im_end|>

Description:

This PR introduces an enhancement to the way we handle tokenizer overrides in our models.py file.

Previously, the code did not account for a token override that is not found in the tokenizer's special tokens. This could lead to silent failures, where an override is skipped without any indication to the user.

The updated code now checks each key in the cfg.tokenizer_overrides dictionary against the tokenizer.all_special_tokens list. If a match is found, the corresponding token's content is updated with the override value.

This change ensures that our tokenizer correctly applies all specified overrides, improving the robustness and reliability of our tokenization process.
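To make the described check concrete, here is a minimal sketch of the idea, not the code from this PR. It assumes overrides are keyed by token ID (as in the example above) and uses a plain dict as a stand-in for the tokenizer's special-token table. The names `normalize_overrides` and `apply_overrides`, and the `<|reserved_*|>` placeholder tokens, are hypothetical.

```python
import logging
from typing import Dict, List, Tuple, Union

LOG = logging.getLogger(__name__)

# Overrides may arrive as one mapping or as a list of single-key mappings,
# which is how the YAML example above parses.
Overrides = Union[Dict[int, str], List[Dict[int, str]]]


def normalize_overrides(overrides: Overrides) -> Dict[int, str]:
    """Flatten either accepted shape into a single {token_id: content} dict."""
    if isinstance(overrides, dict):
        return dict(overrides)
    merged: Dict[int, str] = {}
    for entry in overrides:
        merged.update(entry)
    return merged


def apply_overrides(
    special_tokens: Dict[int, str], overrides: Overrides
) -> Tuple[Dict[int, str], List[int]]:
    """Return an updated copy of the table plus the IDs that had no match.

    `special_tokens` is a hypothetical stand-in for the tokenizer's
    special-token table; a real tokenizer object is not used here.
    """
    updated = dict(special_tokens)
    missing: List[int] = []
    for token_id, content in normalize_overrides(overrides).items():
        if token_id in updated:
            updated[token_id] = content
        else:
            # Report the skipped override instead of dropping it silently.
            LOG.warning("tokenizer override for id %s has no matching special token", token_id)
            missing.append(token_id)
    return updated, missing


# Usage with placeholder data: one override matches, one does not.
table = {28006: "<|reserved_0|>", 28007: "<|reserved_1|>"}
updated, missing = apply_overrides(
    table, [{28006: "<|im_start|>"}, {99999: "<|unused|>"}]
)
```

Returning the unmatched IDs alongside the warning lets a caller decide whether a missing override should be fatal. Whether the real implementation should warn or raise is left open here.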

winglian

thanks!! this was on my todo list too.

NanoCode012 · 2024-04-30T14:26:04Z

Could a unit test be added for this?

winglian · 2024-05-30T17:40:24Z

@mhenrichsen played around with this today. The tokenizer seems to be frozen, especially for llama 3 where this would be more useful.

override special tokens mock code

52f6fa2

mhenrichsen changed the title from "Update Tokenizer Overrides Handling in models.py" to "MOCK: Update Tokenizer Overrides Handling in models.py" · Apr 19, 2024

mhenrichsen changed the title from "MOCK: Update Tokenizer Overrides Handling in models.py" to "Draft: Update Tokenizer Overrides Handling in models.py" · Apr 19, 2024

winglian approved these changes Apr 19, 2024
