Tokenizer: fix issue when decoding a single token at a time #1559

Andrei-Aksionov · 2024-07-07T14:40:12Z

Hi there 👋

While working on the Phi-3 integration, I spotted an issue with the HF tokenizer: when decoding a single token at a time (as in the chat script), it drops all the spaces, so the output looks like a single looooooong string (huggingface/transformers#31643).
So, instead of "This is a test string", the code prints "Thisisateststring".

As @itazap mentioned in the issue, the problem comes from LlamaTokenizer and thus affects not only Phi-3 but other models too.

This PR applies the hack from Phi-3 to other models.
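
To illustrate the idea, here is a minimal, self-contained sketch. It does not use the real HF tokenizer or the actual code in litgpt/tokenizer.py: `ToyTokenizer`, its vocabulary, the `<dummy>` token and `decode_single` are all hypothetical names made up for this example. The toy decoder only mimics the behaviour described above (the leading space of a decoded sequence is dropped), and the workaround shown, decoding the token together with a dummy prefix token and then cutting the prefix off, is one plausible way such a hack could look.

```python
class ToyTokenizer:
    """Toy stand-in for a SentencePiece-style tokenizer (NOT the HF API)."""

    # Hypothetical vocabulary: "▁" marks a word boundary, id 5 is a dummy token.
    VOCAB = {0: "▁This", 1: "▁is", 2: "▁a", 3: "▁test", 4: "▁string", 5: "<dummy>"}

    def decode(self, ids):
        text = "".join(self.VOCAB[i] for i in ids).replace("▁", " ")
        # Mimic the problematic behaviour: the leading space of a sequence is dropped.
        return text[1:] if text.startswith(" ") else text


def decode_single(tokenizer, token_id, dummy_id=5):
    """Decode one token with a dummy prefix so its leading space survives."""
    prefix = tokenizer.decode([dummy_id])
    # Decode "dummy + token", then strip the decoded dummy from the front.
    return tokenizer.decode([dummy_id, token_id])[len(prefix):]


tok = ToyTokenizer()
ids = [0, 1, 2, 3, 4]

# Naive streaming: every token is decoded on its own and loses its space.
naive = "".join(tok.decode([i]) for i in ids)
# Streaming with the workaround: spaces between words are preserved.
fixed = "".join(decode_single(tok, i) for i in ids)

print(repr(naive))  # 'Thisisateststring'
print(repr(fixed))  # ' This is a test string'
```

Note that in this toy version the very first token also keeps its leading space, so a caller may want to strip the start of the streamed text.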

litgpt/tokenizer.py

rasbt

Thanks for the fix!

Assert that the issue comes from LlamaTokenizerFast

4bb219e

Andrei-Aksionov added the tokenization label Jul 7, 2024

Andrei-Aksionov added 2 commits July 7, 2024 18:29

Apply decoding fix to all models with a LlaMA tokenizer

5f01a7a

Fix when tokenizer_config.json doesn't exist

adb1aa9

Andrei-Aksionov marked this pull request as ready for review July 7, 2024 16:35

Andrei-Aksionov requested review from awaelchli and lantiga as code owners July 7, 2024 16:35

Andrei-Aksionov commented Jul 7, 2024

Andrei-Aksionov requested a review from rasbt July 7, 2024 19:54

rasbt reviewed Jul 8, 2024

rasbt approved these changes Jul 8, 2024

Andrei-Aksionov merged commit 7f75e01 into main Jul 8, 2024
9 checks passed

Andrei-Aksionov deleted the tokenizer_single_token_decoding branch July 8, 2024 06:02
