
Saving and loading a tokenizer does not produce an identical tokenizer in 4.34 #26773

Closed · dakinggg opened this issue Oct 13, 2023 · 2 comments · Fixed by #26570

Comments

dakinggg commented Oct 13, 2023

System Info

  • transformers version: 4.34.0
  • Platform: macOS-13.5-arm64-arm-64bit
  • Python version: 3.10.12
  • Huggingface_hub version: 0.17.3
  • Safetensors version: 0.4.0
  • Accelerate version: 0.20.3
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.0 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

In [1]: import transformers

In [2]: t0tt = transformers.AutoTokenizer.from_pretrained('bigscience/T0pp')
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565

In [3]: t0tt.save_pretrained('saved-tokenizer')
Out[3]: 
('saved-tokenizer/tokenizer_config.json',
 'saved-tokenizer/special_tokens_map.json',
 'saved-tokenizer/spiece.model',
 'saved-tokenizer/added_tokens.json',
 'saved-tokenizer/tokenizer.json')

In [4]: loaded_t0tt = transformers.AutoTokenizer.from_pretrained('saved-tokenizer')
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

In [6]: t0tt._eos_token
Out[6]: AddedToken("</s>", rstrip=True, lstrip=True, single_word=False, normalized=True, special=True)

In [7]: loaded_t0tt._eos_token
Out[7]: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True)

In [8]: t0tt.eos_token
Out[8]: '</s>'

In [9]: t0tt('hello </s>        goodbye')
Out[9]: {'input_ids': [21820, 1, 23281, 1], 'attention_mask': [1, 1, 1, 1]}

In [10]: loaded_t0tt('hello </s>        goodbye')
Out[10]: {'input_ids': [21820, 3, 1, 23281, 1], 'attention_mask': [1, 1, 1, 1, 1]}
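
A quick way to enumerate every added token whose flags changed across the save/load round trip is sketched below. This continues the IPython session above and is not part of the original report; it assumes `added_tokens_decoder` (available in 4.34) maps token ids to AddedToken objects on both tokenizers.

# Diagnostic sketch, continuing the session above (assumption: transformers 4.34,
# where `added_tokens_decoder` returns {token_id: AddedToken}).
def flags(tok):
    return None if tok is None else (tok.lstrip, tok.rstrip, tok.normalized)

for idx, tok in t0tt.added_tokens_decoder.items():
    other = loaded_t0tt.added_tokens_decoder.get(idx)
    if flags(tok) != flags(other):
        print(idx, tok, '->', other)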

Expected behavior

When a tokenizer is saved and reloaded, it should:
(1) behave the same
(2) have the same config details on its AddedTokens
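
A minimal sketch of the round-trip check this expectation implies; the helper name, sample string, and assertions are illustrative, not from the original report.

import transformers

def check_round_trip(name_or_path, save_dir, text='hello </s>        goodbye'):
    # Hypothetical helper: save, reload, and compare the two tokenizers.
    original = transformers.AutoTokenizer.from_pretrained(name_or_path)
    original.save_pretrained(save_dir)
    reloaded = transformers.AutoTokenizer.from_pretrained(save_dir)

    # (1) the reloaded tokenizer behaves the same on the same input
    assert original(text)['input_ids'] == reloaded(text)['input_ids']

    # (2) the AddedToken configuration survives the round trip
    assert original._eos_token.rstrip == reloaded._eos_token.rstrip
    assert original._eos_token.lstrip == reloaded._eos_token.lstrip
    assert original._eos_token.normalized == reloaded._eos_token.normalized

check_round_trip('bigscience/T0pp', 'saved-tokenizer')

On 4.34 the assertions above fail, which is the behavior reported here.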

dakinggg commented:

@ArthurZucker

dakinggg changed the title from "Saving and loading a tokenizer does not produce an identical tokenizer" to "Saving and loading a tokenizer does not produce an identical tokenizer in 4.34" on Oct 13, 2023
ArthurZucker (Collaborator) commented:

Hey! Also fixed in #26570. Thanks for reporting!
