PreTrainedTokenizer (slow) strip tokens that are around unique_no_split_tokens #21120

Closed
Gompyn opened this issue Jan 14, 2023 · 15 comments · Fixed by #23909
Labels: Core: Tokenization (Internals of the library; Tokenization.)

Comments

@Gompyn

Gompyn commented Jan 14, 2023

System Info

  • transformers version: 4.24.0
  • Platform: Linux-5.4.0-135-generic-x86_64-with-glibc2.31
  • Python version: 3.10.8
  • Huggingface_hub version: 0.11.1
  • PyTorch version (GPU?): 1.13.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Steps to reproduce the behavior:

  1. Load a PreTrainedTokenizer that contains unique_no_split_tokens, e.g. EleutherAI/gpt-j-6B:
     tokenizer = transformers.GPT2Tokenizer.from_pretrained('EleutherAI/gpt-j-6B')
  2. Use the tokenizer to encode a string that contains one of the unique_no_split_tokens, e.g. " <|extratoken_1|> ":
     print(tokenizer(" <|extratoken_1|> ").input_ids)

Expected behavior

The tokenizer splits the string into 3 tokens (" ", "<|extratoken_1|>" and " "), and gives their ids ([220, 50257, 220]). This is the behavior of PreTrainedTokenizerFast.

But the actual behavior is that the slow PreTrainedTokenizer only gives the id of "<|extratoken_1|>", i.e. [50257]; the surrounding spaces are dropped.
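
A self-contained comparison of the two tokenizers (the ids are the ones reported above and assume the EleutherAI/gpt-j-6B vocabulary; GPT2TokenizerFast is used here as the fast counterpart):

import transformers

text = " <|extratoken_1|> "

slow = transformers.GPT2Tokenizer.from_pretrained("EleutherAI/gpt-j-6B")
fast = transformers.GPT2TokenizerFast.from_pretrained("EleutherAI/gpt-j-6B")

print(fast(text).input_ids)  # [220, 50257, 220] -> spaces kept as separate tokens
print(slow(text).input_ids)  # [50257]           -> spaces around the token are stripped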

@Gompyn
Author

Gompyn commented Jan 14, 2023

This is probably due to the following lines in PreTrainedTokenizer.tokenize (tokenization_utils.py), which are still present at HEAD and strip the text on both sides of a matched no-split token:

else:
    # We strip left and right by default
    if right:
        tokens[i + 1] = right.lstrip()
    if left:
        tokens[i - 1] = left.rstrip()

@Gompyn
Author

Gompyn commented Jan 14, 2023

This bug strips away \n around my special token, making my model believe that there is no newline in my text.
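
For instance, a round trip with the slow tokenizer drops the newlines around the no-split token (a sketch; the decoded output is what the stripping logic quoted above produces on the affected versions):

import transformers

slow = transformers.GPT2Tokenizer.from_pretrained("EleutherAI/gpt-j-6B")

text = "line one\n<|extratoken_1|>\nline two"
print(slow.decode(slow(text).input_ids))
# expected round trip:                "line one\n<|extratoken_1|>\nline two"
# observed on the affected versions:  "line one<|extratoken_1|>line two"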

@raghavanone
Contributor

@ArthurZucker I can pick this up. Let me know what a possible fix would be.

@ArthurZucker
Collaborator

There is indeed a discrepancy between the fast and slow versions.
The problem here is that the tokens are part of the no_split_tokens, but they are not AddedTokens.
I am not really sure whether the fast or the slow tokenizer has the correct behavior 😅

@ArthurZucker
Collaborator

The cleanest way is to have the tokens as AddedTokens, because then you can handle the rstrip and lstrip arguments.
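
A sketch of that suggestion (the token name <my_marker> is made up for illustration; note that on the affected versions the slow tokenizer only consults these flags for special tokens registered as AddedToken objects, which is part of the inconsistency):

from transformers import AddedToken, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer.add_special_tokens(
    {"additional_special_tokens": [AddedToken("<my_marker>", lstrip=False, rstrip=False)]}
)

print(tokenizer.tokenize(" <my_marker> "))
# with lstrip=False and rstrip=False the surrounding spaces should be preserved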

@Gompyn
Author

Gompyn commented Jan 23, 2023

@ArthurZucker I think decode(encode(text)) == text should be true by default, because some use cases (e.g. code generation) require the correct formatting of text. "Automatic formatting" should not be done by default to avoid breaking such use cases.
From another point of view, I guess most pre-trained models use a fast tokenizer (as the name fast implies), so these models also expect the behavior of the fast version.

@sgugger
Collaborator

sgugger commented Jan 23, 2023

I think decode(encode(text)) == text should be true by default

This is untrue for pretty much all tokenizers, since tokenization is a destructive operation. At the very least you get back the normalized text (with some minimal unicode clean up) but for some tokenizers like BERT you will have whitespace simplified or text lowercased.
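
For illustration, an uncased BERT tokenizer already changes the text on a round trip (bert-base-uncased chosen here just as a common example):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Hello   WORLD"
print(tok.decode(tok.encode(text), skip_special_tokens=True))
# -> "hello world": lowercased and whitespace collapsed, so decode(encode(text)) != text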

@Gompyn
Author

Gompyn commented Jan 24, 2023

I think decode(encode(text)) == text should be true by default

This is untrue for pretty much all tokenizers, since tokenization is a destructive operation. At the very least you get back the normalized text (with some minimal unicode clean up) but for some tokenizers like BERT you will have whitespace simplified or text lowercased.

I agree that minimal unicode cleanup is acceptable (mostly because it does not break my use cases), but whitespace simplification and text lowercasing are not enabled by default, so by default users do get a mostly conservative tokenizer.
However, when adding new tokens, the simplest way (add_tokens('mytoken'), with special_tokens=False by default) in a slow tokenizer accidentally (from the user's point of view) breaks this conservative behavior, which I think is unexpected.
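
A sketch of that path with a plain slow GPT-2 tokenizer and a made-up token:

from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")  # slow tokenizer
tok.add_tokens("mytoken")                    # special_tokens=False by default

print(tok.decode(tok(" mytoken ").input_ids))
# on the affected versions this prints "mytoken": the surrounding spaces are
# silently stripped, even though nothing in the add_tokens call asks for that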

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@Gompyn
Author

Gompyn commented Feb 18, 2023

Is there any progress on this issue? @ArthurZucker

@ArthurZucker
Collaborator

Not yet! I finally have time so this week should be good!

@ArthurZucker added the Core: Tokenization (Internals of the library; Tokenization.) label Mar 14, 2023
@Gompyn
Author

Gompyn commented Apr 10, 2023

Is there any progress on this issue?

@huggingface deleted a comment from github-actions bot Apr 11, 2023
@huggingface deleted a comment from github-actions bot May 25, 2023
@ArthurZucker reopened this May 25, 2023
@ArthurZucker
Collaborator

Hey, to follow progress I suggest you check #23909, which should address this.

@ArthurZucker
Collaborator

Quick update, this is gonna take a bit more time as a more in-depth refactoring is needed

@ArthurZucker
Collaborator

PR will be merged this week! 🤗
