PreTrainedTokenizer (slow) strip tokens that are around unique_no_split_tokens #21120

Closed
Gompyn opened this issue Jan 14, 2023 · 15 comments · Fixed by #23909
Labels: Core: Tokenization (Internals of the library; Tokenization.)

Comments

@Gompyn

Gompyn commented Jan 14, 2023

System Info

  • transformers version: 4.24.0
  • Platform: Linux-5.4.0-135-generic-x86_64-with-glibc2.31
  • Python version: 3.10.8
  • Huggingface_hub version: 0.11.1
  • PyTorch version (GPU?): 1.13.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Steps to reproduce the behavior:

  1. Load a PreTrainedTokenizer that contains unique_no_split_tokens, e.g. EleutherAI/gpt-j-6B:
     tokenizer = transformers.GPT2Tokenizer.from_pretrained('EleutherAI/gpt-j-6B')
  2. Use the tokenizer to encode a string that contains one of the unique_no_split_tokens, e.g. " <|extratoken_1|> ":
     print(tokenizer(" <|extratoken_1|> ").input_ids)

Expected behavior

The tokenizer splits the string into 3 tokens (" ", "<|extratoken_1|>" and " "), and gives their ids ([220, 50257, 220]). This is the behavior of PreTrainedTokenizerFast.

But the actual behavior is that the slow PreTrainedTokenizer only gives the id of "<|extratoken_1|>", i.e. [50257]; the surrounding spaces are dropped.
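
A self-contained comparison of the two tokenizers (the ids are the ones reported above and assume the EleutherAI/gpt-j-6B vocabulary; GPT2TokenizerFast is used here as the fast counterpart):

import transformers

text = " <|extratoken_1|> "

slow = transformers.GPT2Tokenizer.from_pretrained("EleutherAI/gpt-j-6B")
fast = transformers.GPT2TokenizerFast.from_pretrained("EleutherAI/gpt-j-6B")

print(fast(text).input_ids)  # [220, 50257, 220] -> spaces kept as separate tokens
print(slow(text).input_ids)  # [50257]           -> spaces around the token are stripped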

@Gompyn
Author

Gompyn commented Jan 14, 2023

This is probably due to the following lines in PreTrainedTokenizer.tokenize (tokenization_utils.py), which are still present at HEAD and strip the text on both sides of a matched no-split token:

else:
    # We strip left and right by default
    if right:
        tokens[i + 1] = right.lstrip()
    if left:
        tokens[i - 1] = left.rstrip()

@Gompyn
Author

Gompyn commented Jan 14, 2023

This bug strips away \n around my special token, making my model believe that there is no newline in my text.
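
For instance, a round trip with the slow tokenizer drops the newlines around the no-split token (a sketch; the decoded output is what the stripping logic quoted above produces on the affected versions):

import transformers

slow = transformers.GPT2Tokenizer.from_pretrained("EleutherAI/gpt-j-6B")

text = "line one\n<|extratoken_1|>\nline two"
print(slow.decode(slow(text).input_ids))
# expected round trip:                "line one\n<|extratoken_1|>\nline two"
# observed on the affected versions:  "line one<|extratoken_1|>line two"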

@raghavanone
Contributor

@ArthurZucker I can pick this up. Let me know what a possible fix would be.

@ArthurZucker
Collaborator

There is indeed a discrepancy between the fast and slow versions.
The problem here is that the tokens are part of the no_split_tokens, but they are not AddedTokens.
I am not really sure whether the fast or the slow tokenizer has the correct behavior 😅

@ArthurZucker
Collaborator

The cleanest way is to have the tokens as AddedTokens, because then you can handle the rstrip and lstrip arguments.
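
A sketch of that suggestion (the token name <my_marker> is made up for illustration; note that on the affected versions the slow tokenizer only consults these flags for special tokens registered as AddedToken objects, which is part of the inconsistency):

from transformers import AddedToken, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer.add_special_tokens(
    {"additional_special_tokens": [AddedToken("<my_marker>", lstrip=False, rstrip=False)]}
)

print(tokenizer.tokenize(" <my_marker> "))
# with lstrip=False and rstrip=False the surrounding spaces should be preserved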

@Gompyn
Author

Gompyn commented Jan 23, 2023

@ArthurZucker I think decode(encode(text)) == text should be true by default, because some use cases (e.g. code generation) require the correct formatting of text. "Automatic formatting" should not be done by default to avoid breaking such use cases.
From another point of view, I guess most pre-trained models use a fast tokenizer (as the name fast implies), so these models also expect the behavior of the fast version.

@sgugger
Collaborator

sgugger commented Jan 23, 2023

I think decode(encode(text)) == text should be true by default

This is untrue for pretty much all tokenizers, since tokenization is a destructive operation. At the very least you get back the normalized text (with some minimal unicode clean up) but for some tokenizers like BERT you will have whitespace simplified or text lowercased.
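
For illustration, an uncased BERT tokenizer already changes the text on a round trip (bert-base-uncased chosen here just as a common example):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Hello   WORLD"
print(tok.decode(tok.encode(text), skip_special_tokens=True))
# -> "hello world": lowercased and whitespace collapsed, so decode(encode(text)) != text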

@Gompyn
Author

Gompyn commented Jan 24, 2023

I think decode(encode(text)) == text should be true by default

This is untrue for pretty much all tokenizers, since tokenization is a destructive operation. At the very least you get back the normalized text (with some minimal unicode clean up) but for some tokenizers like BERT you will have whitespace simplified or text lowercased.

I agree that minimal unicode cleanup is acceptable (mostly because it does not break my use cases), but whitespace simplification and text lowercasing are not enabled by default, so by default users do get a mostly conservative tokenizer.
However, when adding new tokens, the simplest way (add_tokens('mytoken'), with special_tokens=False by default) in a slow tokenizer accidentally (from the user's point of view) breaks this conservative behavior, which I think is unexpected.
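
A sketch of that path with a plain slow GPT-2 tokenizer and a made-up token:

from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")  # slow tokenizer
tok.add_tokens("mytoken")                    # special_tokens=False by default

print(tok.decode(tok(" mytoken ").input_ids))
# on the affected versions this prints "mytoken": the surrounding spaces are
# silently stripped, even though nothing in the add_tokens call asks for that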

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@Gompyn
Author

Gompyn commented Feb 18, 2023

Is there any progress on this issue? @ArthurZucker

@ArthurZucker
Collaborator

Not yet! I finally have time so this week should be good!

@ArthurZucker added the Core: Tokenization (Internals of the library; Tokenization.) label Mar 14, 2023
@Gompyn
Author

Gompyn commented Apr 10, 2023

Is there any progress on this issue?

@huggingface deleted a comment from github-actions bot Apr 11, 2023
@huggingface deleted a comment from github-actions bot May 25, 2023
@ArthurZucker reopened this May 25, 2023
@ArthurZucker
Collaborator

Hey, to follow progress I suggest you check #23909, which should address this.

@ArthurZucker
Collaborator

Quick update, this is gonna take a bit more time as a more in-depth refactoring is needed

@ArthurZucker
Collaborator

PR will be merged this week! 🤗
