CodeLlamaTokenizerFast encodes eos_token into separate tokens in multiprocessing mode #1343

Closed

UniverseFly opened this issue Sep 13, 2023 · 5 comments

@UniverseFly
System Info

  • transformers version: 4.33.1
  • Platform: Linux-5.15.0-82-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.16.4
  • Safetensors version: 0.3.3
  • Accelerate version: 0.22.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help?

@ArthurZucker, @younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Here is a minimal reproducible example from my machine. A few things to note:

  • The assertion passes with use_fast=False.
  • The assertion passes with num_proc=1.
  • The assertion passes when the get_tokenize wrapper is removed and tokenizer is used as a global variable (a sketch of this variant follows the Expected behavior section below).

Note

The get_tokenize wrapper may look out of place in a minimal demonstration, but my actual use case is more complex and the wrapper keeps the code better structured.

from transformers import (
    PreTrainedTokenizerFast,
    AutoTokenizer,
)
from datasets import Dataset

# Load a dataset with random text
dataset = Dataset.from_dict({"text": ["random text"] * 10})
# Load the fast tokenizer
tokenizer: PreTrainedTokenizerFast = AutoTokenizer.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf", use_fast=True
)
assert isinstance(tokenizer, PreTrainedTokenizerFast)

# Define a wrapper function that returns the tokenize function
def get_tokenize(tokenizer: PreTrainedTokenizerFast):
    def tokenize(example: dict[str, list[str]]) -> dict[str, list[list[int]]]:
        text_list = [f"{text}</s>" for text in example["text"]]
        input_ids = tokenizer(text_list, add_special_tokens=False)["input_ids"]
        for ids in input_ids:
            assert ids[-1] == tokenizer.eos_token_id, ids
        return {"input_ids": input_ids}

    return tokenize

# Apply the wrapper function to tokenize the dataset and use 2 processes
tokenize = get_tokenize(tokenizer)
dataset = dataset.map(tokenize, batched=True, num_proc=2)

This code triggers AssertionError: [4036, 1426, 829, 29879, 29958], meaning that the eos token </s> was split into three tokens with ids [829, 29879, 29958], which map to '</', 's', and '>' respectively.
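
A quick way to double-check the mapping (a minimal sketch, assuming the same tokenizer object as in the script above):

# Inspect the reported ids with the tokenizer loaded above
print(tokenizer.convert_ids_to_tokens([829, 29879, 29958]))  # ['</', 's', '>']
print(tokenizer.eos_token)  # '</s>' -- the single special token the assertion expects
print(tokenizer.eos_token_id)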

Expected behavior

The assertion should pass, i.e., the </s> token should be recognized as a single token.
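
For reference, here is a sketch of the third workaround listed above (dropping the get_tokenize wrapper and using tokenizer as a module-level global); the tokenize_global name is only for illustration:

# Sketch of the global-tokenizer workaround; reuses `dataset` and `tokenizer` from the script above
def tokenize_global(example: dict[str, list[str]]) -> dict[str, list[list[int]]]:
    # refer to the module-level tokenizer directly instead of a closed-over copy
    text_list = [f"{text}</s>" for text in example["text"]]
    input_ids = tokenizer(text_list, add_special_tokens=False)["input_ids"]
    for ids in input_ids:
        assert ids[-1] == tokenizer.eos_token_id, ids
    return {"input_ids": input_ids}

# With the module-level tokenizer the assertion passes even with num_proc=2
dataset = dataset.map(tokenize_global, batched=True, num_proc=2)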

@ArthurZucker
Collaborator

Hey! Thanks for opening an issue here. I'll see if this is related to the conversion or the fast tokenizer code.

@ArthurZucker ArthurZucker transferred this issue from huggingface/transformers Sep 22, 2023
@ArthurZucker
Collaborator

Transferred it here since this is not related to the transformers library.
I think you should just use the tokenizer without setting the number of processes in map; by default it should use the tokenizer's internal parallelism.
See the following outputs:

Example script
In [4]: from transformers import (
   ...:     PreTrainedTokenizerFast,
   ...:     AutoTokenizer,
   ...: )
   ...: from datasets import Dataset
   ...:
   ...: # Load a dataset with random text
   ...: dataset = Dataset.from_dict({"text": ["random text"] * 100000})
   ...: # Load the fast tokenizer
   ...: tokenizer: PreTrainedTokenizerFast = AutoTokenizer.from_pretrained(
   ...:     "codellama/CodeLlama-7b-Instruct-hf", use_fast=True
   ...: )
   ...: assert isinstance(tokenizer, PreTrainedTokenizerFast)
   ...:
   ...: # Define a wrapper function that returns the tokenize function
   ...: def get_tokenize(tokenizer: PreTrainedTokenizerFast):
   ...:     def tokenize(example: dict[str, list[str]]) -> dict[str, list[list[int]]]:
   ...:         text_list = [f"{text}</s>" for text in example["text"]]
   ...:         input_ids = tokenizer(text_list, add_special_tokens=False)["input_ids"]
   ...:         return {"input_ids": input_ids}
   ...:
   ...:     return tokenize
   ...:
   ...: # Apply the wrapper function to tokenize the dataset and use 2 processes
   ...: tokenize = get_tokenize(tokenizer)
   ...: dataset = dataset.map(tokenize, batched=True, num_proc=1)
100%|██████████| 100/100 [00:01<00:00, 92.35ba/s]

In [5]: from transformers import (
   ...:     PreTrainedTokenizerFast,
   ...:     AutoTokenizer,
   ...: )
   ...: from datasets import Dataset
   ...:
   ...: # Load a dataset with random text
   ...: dataset = Dataset.from_dict({"text": ["random text"] * 100000})
   ...: # Load the fast tokenizer
   ...: tokenizer: PreTrainedTokenizerFast = AutoTokenizer.from_pretrained(
   ...:     "codellama/CodeLlama-7b-Instruct-hf", use_fast=True
   ...: )
   ...: assert isinstance(tokenizer, PreTrainedTokenizerFast)
   ...:
   ...: # Define a wrapper function that returns the tokenize function
   ...: def get_tokenize(tokenizer: PreTrainedTokenizerFast):
   ...:     def tokenize(example: dict[str, list[str]]) -> dict[str, list[list[int]]]:
   ...:         text_list = [f"{text}</s>" for text in example["text"]]
   ...:         input_ids = tokenizer(text_list, add_special_tokens=False)["input_ids"]
   ...:         return {"input_ids": input_ids}
   ...:
   ...:     return tokenize
   ...:
   ...: # Apply the wrapper function to tokenize the dataset and use 2 processes
   ...: tokenize = get_tokenize(tokenizer)
   ...: dataset = dataset.map(tokenize, batched=True, num_proc=2)
#0: 100%|██████████| 50/50 [00:00<00:00, 51.24ba/s]
#1: 100%|██████████| 50/50 [00:00<00:00, 50.57ba/s]
#1: 100%|██████████| 50/50 [00:00<00:00, 51.64ba/s]
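
The "tokenizer parallelism" above refers to the fast tokenizer's Rust backend, which multithreads batch encoding internally. As a general note rather than a fix for the eos issue itself, that behavior can be toggled via the TOKENIZERS_PARALLELISM environment variable when mixing the fast tokenizer with Python multiprocessing:

import os

# General note (not specific to this issue): disable the Rust-side parallelism when also
# forking Python worker processes (e.g. datasets.map with num_proc > 1) to avoid the
# fork-after-parallelism warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"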

@UniverseFly
Author

Got it. Thanks Arthur.

@sparverius

sparverius commented Oct 20, 2023

TL;DR

@UniverseFly, I can confirm that applying the changes referenced below resolves the issue above.

info

Just ran into this bug with another model that also uses the Llama fast tokenizer, and remembered one of the issues with Mistral w.r.t. the Llama tokenizer... @younesbelkada describes it in huggingface/transformers#26498 (comment).

@lewtun's fix: https://huggingface.co/mistralai/Mistral-7B-v0.1/discussions/26/files

snip

@UniverseFly
Author

Thanks @sparverius! I’ll close this issue then.
