
[TokenizerSlow] replace_additional_special_tokens is not doing much #24276

Closed
ArthurZucker opened this issue Jun 14, 2023 · 5 comments
Labels: Core: Tokenization (Internals of the library; Tokenization.)

@ArthurZucker (Collaborator)

Just flagging this because the add_special_tokens method has gotten pretty complicated: it gained a kwarg, replace_additional_special_tokens, that is supposed to control whether the self._additional_special_tokens attribute gets replaced. For any slow tokenizer, replacing the list removes a previous token from additional_special_tokens, but does not update the internal trie, so the replacement has no effect on tokenization at all:

>>> from transformers import XLMRobertaTokenizer
>>> tokenizer_a = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
>>> tokenizer_a.add_special_tokens({"additional_special_tokens": ["<//s>"]})
1
>>> tokenizer_a.additional_special_tokens
['<//s>']
>>> print(tokenizer_a.tokenize("This is a <//s>"))
['▁This', '▁is', '▁a', '<//s>']
>>> # replacing the list should mean '<//s>' is no longer special...
>>> tokenizer_a.add_special_tokens({"additional_special_tokens": ["<///s>"]}, replace_additional_special_tokens=True)
1
>>> # ...yet it is still matched as a single token
>>> print(tokenizer_a.tokenize("This is a <//s>"))
['▁This', '▁is', '▁a', '<//s>']
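
Why (assuming the slow-tokenizer internals of that era): the token stays in unique_no_split_tokens, which is what the matching trie is built from, so tokenize keeps splitting on it even though the special-tokens list was replaced. A quick check:

>>> # '<//s>' was dropped from the list but not from the no-split set
>>> "<//s>" in tokenizer_a.additional_special_tokens
False
>>> "<//s>" in tokenizer_a.unique_no_split_tokens
True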

This will be addressed in #23909

@ArthurZucker ArthurZucker self-assigned this Jun 14, 2023
@ArthurZucker ArthurZucker added the Core: Tokenization Internals of the library; Tokenization. label Jun 14, 2023
@ArthurZucker (Collaborator, Author)

cc @ydshieh since you added the feature

@ydshieh (Collaborator) commented Jun 14, 2023

I don't fully understand what the code snippet above tries to demonstrate.

But self._additional_special_tokens originates from issue #20418, where added_tokens_encoder ends up including all the added tokens while additional_special_tokens gets replaced, which is really confusing behavior.

If you look at the description in #20418, your code snippet does its job (although yes, confusingly).

The replace_additional_special_tokens kwarg, with its default value of True, is just there to make the behavior less surprising while keeping backward compatibility.
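
For illustration, a minimal sketch of the intended list semantics (slow tokenizer assumed; this only shows the additional_special_tokens list, not the trie behavior flagged above):

>>> from transformers import XLMRobertaTokenizer
>>> tok = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
>>> tok.add_special_tokens({"additional_special_tokens": ["<//s>"]})
1
>>> # default replace_additional_special_tokens=True: the list is replaced
>>> tok.add_special_tokens({"additional_special_tokens": ["<///s>"]})
1
>>> tok.additional_special_tokens
['<///s>']
>>> # with False, the new token is appended instead of replacing the list
>>> tok.add_special_tokens({"additional_special_tokens": ["<////s>"]}, replace_additional_special_tokens=False)
1
>>> tok.additional_special_tokens
['<///s>', '<////s>']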

@ArthurZucker (Collaborator, Author) commented Jun 15, 2023

> It was confusing to me that the added tokens encoder is not updated.

Yeah, I know, but that's how it has been for years. (And I agree that the name of the introduced argument itself might be confusing too.)

> That's why maybe we should have a separate function, just to say that we don't want this token to be special anymore.

If you have a good idea to address issue #20418 while reducing the (naming) confusion added in #20424, go ahead :-) (a rough sketch of such a helper follows below)

(sorry, I accidentally modified your message 😭)
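
As a rough illustration of that "un-special a token" idea, a hypothetical helper. The name remove_special_token is made up, and it pokes at the slow tokenizer's internals of that era (unique_no_split_tokens and _create_trie), so treat it as a sketch, not an existing API:

def remove_special_token(tokenizer, token: str):
    # Drop the token from the additional-special-tokens list...
    tokenizer._additional_special_tokens = [
        t for t in tokenizer.additional_special_tokens if str(t) != token
    ]
    # ...and, crucially, rebuild the matching trie, which is the step
    # the replace_additional_special_tokens kwarg forgets
    tokenizer.unique_no_split_tokens = [
        t for t in tokenizer.unique_no_split_tokens if t != token
    ]
    tokenizer._create_trie(tokenizer.unique_no_split_tokens)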

@huggingface huggingface deleted a comment from github-actions bot Jul 24, 2023
@ArthurZucker ArthurZucker reopened this Jul 24, 2023
@huggingface huggingface deleted a comment from github-actions bot Aug 17, 2023
@huggingface huggingface deleted a comment from github-actions bot Sep 11, 2023
@github-actions
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@ArthurZucker (Collaborator, Author)

Closing, as this is deprecated and changing the list of additional special tokens is a lot more involved than this.
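
For anyone landing here later: on recent versions of transformers, adding a special token so that it is actually matched can be done with add_tokens and special_tokens=True. A minimal sketch; the behavior is version-dependent and assumed to follow the refactor referenced in #23909:

>>> from transformers import XLMRobertaTokenizer
>>> tok = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
>>> tok.add_tokens(["<//s>"], special_tokens=True)
1
>>> print(tok.tokenize("This is a <//s>"))
['▁This', '▁is', '▁a', '<//s>']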
