CodeLlamaTokenizerFast encodes eos_token into separate tokens in multiprocessing mode #1343

Closed

UniverseFly opened this issue Sep 13, 2023 · 5 comments

@UniverseFly
System Info

  • transformers version: 4.33.1
  • Platform: Linux-5.15.0-82-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.16.4
  • Safetensors version: 0.3.3
  • Accelerate version: 0.22.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help?

@ArthurZucker, @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Here is a minimal reproducible example from my machine. A few things to note:

  • The assertion passes with use_fast=False.
  • The assertion passes with num_proc=1.
  • The assertion passes when the get_tokenize wrapper is removed and tokenizer is used as a global variable (a sketch of this variant follows the Expected behavior section below).

Note

The get_tokenize wrapper may look out of place in a minimal demonstration, but my actual use case is more complex and the wrapper keeps the code better structured.

from transformers import (
    PreTrainedTokenizerFast,
    AutoTokenizer,
)
from datasets import Dataset

# Load a dataset with random text
dataset = Dataset.from_dict({"text": ["random text"] * 10})
# Load the fast tokenizer
tokenizer: PreTrainedTokenizerFast = AutoTokenizer.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf", use_fast=True
)
assert isinstance(tokenizer, PreTrainedTokenizerFast)

# Define a wrapper function that returns the tokenize function
def get_tokenize(tokenizer: PreTrainedTokenizerFast):
    def tokenize(example: dict[str, list[str]]) -> dict[str, list[list[int]]]:
        text_list = [f"{text}</s>" for text in example["text"]]
        input_ids = tokenizer(text_list, add_special_tokens=False)["input_ids"]
        for ids in input_ids:
            assert ids[-1] == tokenizer.eos_token_id, ids
        return {"input_ids": input_ids}

    return tokenize

# Apply the wrapper function to tokenize the dataset and use 2 processes
tokenize = get_tokenize(tokenizer)
dataset = dataset.map(tokenize, batched=True, num_proc=2)

This code triggers AssertionError: [4036, 1426, 829, 29879, 29958], meaning that the eos token </s> was split into three tokens with ids [829, 29879, 29958], which map to '</', 's', and '>' respectively.
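
A quick way to double-check the mapping (a minimal sketch, assuming the same tokenizer object as in the script above):

# Inspect the reported ids with the tokenizer loaded above
print(tokenizer.convert_ids_to_tokens([829, 29879, 29958]))  # ['</', 's', '>']
print(tokenizer.eos_token)  # '</s>' -- the single special token the assertion expects
print(tokenizer.eos_token_id)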

Expected behavior

The assertion should pass, i.e., the </s> token should be recognized as a single token.
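
For reference, here is a sketch of the third workaround listed above (dropping the get_tokenize wrapper and using tokenizer as a module-level global); the tokenize_global name is only for illustration:

# Sketch of the global-tokenizer workaround; reuses `dataset` and `tokenizer` from the script above
def tokenize_global(example: dict[str, list[str]]) -> dict[str, list[list[int]]]:
    # refer to the module-level tokenizer directly instead of a closed-over copy
    text_list = [f"{text}</s>" for text in example["text"]]
    input_ids = tokenizer(text_list, add_special_tokens=False)["input_ids"]
    for ids in input_ids:
        assert ids[-1] == tokenizer.eos_token_id, ids
    return {"input_ids": input_ids}

# With the module-level tokenizer the assertion passes even with num_proc=2
dataset = dataset.map(tokenize_global, batched=True, num_proc=2)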

@ArthurZucker
Collaborator

Hey! Thanks for opening an issue here. I'll see if this is related to the conversion or the fast tokenizer code.

@ArthurZucker ArthurZucker transferred this issue from huggingface/transformers Sep 22, 2023
@ArthurZucker
Collaborator

Transferred it here since this is not related to the transformers library.
I think you should just use the tokenizer without setting the number of processes in map; by default it should use the tokenizer's internal parallelism.
See the following outputs:

Example script
In [4]: from transformers import (
   ...:     PreTrainedTokenizerFast,
   ...:     AutoTokenizer,
   ...: )
   ...: from datasets import Dataset
   ...:
   ...: # Load a dataset with random text
   ...: dataset = Dataset.from_dict({"text": ["random text"] * 100000})
   ...: # Load the fast tokenizer
   ...: tokenizer: PreTrainedTokenizerFast = AutoTokenizer.from_pretrained(
   ...:     "codellama/CodeLlama-7b-Instruct-hf", use_fast=True
   ...: )
   ...: assert isinstance(tokenizer, PreTrainedTokenizerFast)
   ...:
   ...: # Define a wrapper function that returns the tokenize function
   ...: def get_tokenize(tokenizer: PreTrainedTokenizerFast):
   ...:     def tokenize(example: dict[str, list[str]]) -> dict[str, list[list[int]]]:
   ...:         text_list = [f"{text}</s>" for text in example["text"]]
   ...:         input_ids = tokenizer(text_list, add_special_tokens=False)["input_ids"]
   ...:         return {"input_ids": input_ids}
   ...:
   ...:     return tokenize
   ...:
   ...: # Apply the wrapper function to tokenize the dataset and use 2 processes
   ...: tokenize = get_tokenize(tokenizer)
   ...: dataset = dataset.map(tokenize, batched=True, num_proc=1)
100%|██████████| 100/100 [00:01<00:00, 92.35ba/s]

In [5]: from transformers import (
   ...:     PreTrainedTokenizerFast,
   ...:     AutoTokenizer,
   ...: )
   ...: from datasets import Dataset
   ...:
   ...: # Load a dataset with random text
   ...: dataset = Dataset.from_dict({"text": ["random text"] * 100000})
   ...: # Load the fast tokenizer
   ...: tokenizer: PreTrainedTokenizerFast = AutoTokenizer.from_pretrained(
   ...:     "codellama/CodeLlama-7b-Instruct-hf", use_fast=True
   ...: )
   ...: assert isinstance(tokenizer, PreTrainedTokenizerFast)
   ...:
   ...: # Define a wrapper function that returns the tokenize function
   ...: def get_tokenize(tokenizer: PreTrainedTokenizerFast):
   ...:     def tokenize(example: dict[str, list[str]]) -> dict[str, list[list[int]]]:
   ...:         text_list = [f"{text}</s>" for text in example["text"]]
   ...:         input_ids = tokenizer(text_list, add_special_tokens=False)["input_ids"]
   ...:         return {"input_ids": input_ids}
   ...:
   ...:     return tokenize
   ...:
   ...: # Apply the wrapper function to tokenize the dataset and use 2 processes
   ...: tokenize = get_tokenize(tokenizer)
   ...: dataset = dataset.map(tokenize, batched=True, num_proc=2)
#0: 100%|██████████| 50/50 [00:00<00:00, 51.24ba/s]
#1: 100%|██████████| 50/50 [00:00<00:00, 50.57ba/s]
#1: 100%|██████████| 50/50 [00:00<00:00, 51.64ba/s]
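
The "tokenizer parallelism" above refers to the fast tokenizer's Rust backend, which multithreads batch encoding internally. As a general note rather than a fix for the eos issue itself, that behavior can be toggled via the TOKENIZERS_PARALLELISM environment variable when mixing the fast tokenizer with Python multiprocessing:

import os

# General note (not specific to this issue): disable the Rust-side parallelism when also
# forking Python worker processes (e.g. datasets.map with num_proc > 1) to avoid the
# fork-after-parallelism warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"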

@UniverseFly
Author

Got it. Thanks Arthur.

@sparverius

sparverius commented Oct 20, 2023

TL;DR

@UniverseFly, I can confirm that applying the changes referenced below resolves the issue above.

info

Just ran into this bug with another model that also uses the Llama fast tokenizer, and remembered one of the issues with Mistral w.r.t. the Llama tokenizer... @younesbelkada describes it in huggingface/transformers#26498 (comment).

@lewtun's fix: https://huggingface.co/mistralai/Mistral-7B-v0.1/discussions/26/files

snip

@UniverseFly
Author

Thanks @sparverius! I’ll close this issue then.
