
[TokenizerSlow] replace_additional_special_tokens is not doing much #24276

Closed
ArthurZucker opened this issue Jun 14, 2023 · 5 comments
Labels: Core: Tokenization (Internals of the library; Tokenization.)

@ArthurZucker (Collaborator)

Just flagging this because the add_special_tokens method has gotten pretty complicated: it gained a kwarg, replace_additional_special_tokens, that is supposed to control whether the self._additional_special_tokens attribute gets replaced. For any slow tokenizer, replacing the list removes a previous token from additional_special_tokens, but does not update the internal trie, so the replacement has no effect on tokenization at all:

>>> from transformers import XLMRobertaTokenizer
>>> tokenizer_a = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
>>> tokenizer_a.add_special_tokens({"additional_special_tokens": ["<//s>"]})
1
>>> tokenizer_a.additional_special_tokens
['<//s>']
>>> print(tokenizer_a.tokenize("This is a <//s>"))
['▁This', '▁is', '▁a', '<//s>']
>>> # replacing the list should mean '<//s>' is no longer special...
>>> tokenizer_a.add_special_tokens({"additional_special_tokens": ["<///s>"]}, replace_additional_special_tokens=True)
1
>>> # ...yet it is still matched as a single token
>>> print(tokenizer_a.tokenize("This is a <//s>"))
['▁This', '▁is', '▁a', '<//s>']
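
Why (assuming the slow-tokenizer internals of that era): the token stays in unique_no_split_tokens, which is what the matching trie is built from, so tokenize keeps splitting on it even though the special-tokens list was replaced. A quick check:

>>> # '<//s>' was dropped from the list but not from the no-split set
>>> "<//s>" in tokenizer_a.additional_special_tokens
False
>>> "<//s>" in tokenizer_a.unique_no_split_tokens
True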

This will be addressed in #23909

@ArthurZucker ArthurZucker self-assigned this Jun 14, 2023
@ArthurZucker ArthurZucker added the Core: Tokenization Internals of the library; Tokenization. label Jun 14, 2023
@ArthurZucker (Collaborator, Author)

cc @ydshieh since you added the feature

@ydshieh (Collaborator) commented Jun 14, 2023

I don't fully understand what the code snippet above tries to demonstrate.

But self._additional_special_tokens originates from issue #20418, where added_tokens_encoder ends up including all the added tokens while additional_special_tokens gets replaced, which is really confusing behavior.

If you look at the description in #20418, your code snippet does its job (although yes, confusingly).

The replace_additional_special_tokens kwarg, with its default value of True, is just there to make the behavior less surprising while keeping backward compatibility.
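
For illustration, a minimal sketch of the intended list semantics (slow tokenizer assumed; this only shows the additional_special_tokens list, not the trie behavior flagged above):

>>> from transformers import XLMRobertaTokenizer
>>> tok = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
>>> tok.add_special_tokens({"additional_special_tokens": ["<//s>"]})
1
>>> # default replace_additional_special_tokens=True: the list is replaced
>>> tok.add_special_tokens({"additional_special_tokens": ["<///s>"]})
1
>>> tok.additional_special_tokens
['<///s>']
>>> # with False, the new token is appended instead of replacing the list
>>> tok.add_special_tokens({"additional_special_tokens": ["<////s>"]}, replace_additional_special_tokens=False)
1
>>> tok.additional_special_tokens
['<///s>', '<////s>']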

@ArthurZucker (Collaborator, Author) commented Jun 15, 2023

> It was confusing to me that the added tokens encoder is not updated.

Yeah, I know, but that's how it has been for years. (And I agree that the name of the introduced argument itself might be confusing too.)

> That's why maybe we should have a separate function, just to say that we don't want this token to be special anymore.

If you have a good idea to address issue #20418 while reducing the (naming) confusion added in #20424, go ahead :-) (a rough sketch of such a helper follows below)

(sorry, I accidentally modified your message 😭)
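
As a rough illustration of that "un-special a token" idea, a hypothetical helper. The name remove_special_token is made up, and it pokes at the slow tokenizer's internals of that era (unique_no_split_tokens and _create_trie), so treat it as a sketch, not an existing API:

def remove_special_token(tokenizer, token: str):
    # Drop the token from the additional-special-tokens list...
    tokenizer._additional_special_tokens = [
        t for t in tokenizer.additional_special_tokens if str(t) != token
    ]
    # ...and, crucially, rebuild the matching trie, which is the step
    # the replace_additional_special_tokens kwarg forgets
    tokenizer.unique_no_split_tokens = [
        t for t in tokenizer.unique_no_split_tokens if t != token
    ]
    tokenizer._create_trie(tokenizer.unique_no_split_tokens)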

@huggingface huggingface deleted a comment from github-actions bot Jul 24, 2023
@ArthurZucker ArthurZucker reopened this Jul 24, 2023
@huggingface huggingface deleted a comment from github-actions bot Aug 17, 2023
@huggingface huggingface deleted a comment from github-actions bot Sep 11, 2023
@github-actions
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@ArthurZucker (Collaborator, Author)

Closing, as this is deprecated and changing the list of additional special tokens is a lot more involved than this.
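
For anyone landing here later: on recent versions of transformers, adding a special token so that it is actually matched can be done with add_tokens and special_tokens=True. A minimal sketch; the behavior is version-dependent and assumed to follow the refactor referenced in #23909:

>>> from transformers import XLMRobertaTokenizer
>>> tok = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
>>> tok.add_tokens(["<//s>"], special_tokens=True)
1
>>> print(tok.tokenize("This is a <//s>"))
['▁This', '▁is', '▁a', '<//s>']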
