feat: support indicating prefix token of chat template #28473

Open · wants to merge 1 commit into base: main
Conversation

@congchan congchan commented Jan 12, 2024

What does this PR do?

In chat language model training, we sometimes need to mask the input from real users and train the model only on the assistant's outputs.

This PR adds a special prefix_token, which can be referenced in the chat_template, so that we can use it to dynamically separate user turns from assistant turns.

For example:

"""<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
<|im_start|>user
Can I ask a question?<|im_end|>
"""

Here the prefix_token could be <|im_start|>assistant\n, and we can make use of this token:

  • to set the model's chat_template, for example {% if add_generation_prompt %}{{ prefix_token }}{% endif %};
  • to split the dialog into user and assistant turns and mask the loss on the user turns, by accessing tokenizer.prefix_token and tokenizer.eos_token (see the sketch after this list).
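For illustration, here is a minimal sketch of the second point, assuming a tokenizer that exposes the proposed prefix_token attribute. The helper below is not part of this PR's diff, and splitting the rendered string this way may tokenize slightly differently from tokenizing the full conversation at once:

import re

IGNORE_INDEX = -100  # the usual label value ignored by the cross-entropy loss

def build_labels(conversation: str, tokenizer):
    """Tokenize a rendered chat string and mask every token that is not part
    of an assistant turn, i.e. not between prefix_token and eos_token."""
    prefix = tokenizer.prefix_token  # e.g. "<|im_start|>assistant\n" (proposed by this PR)
    eos = tokenizer.eos_token        # e.g. "<|im_end|>"

    input_ids, labels = [], []
    in_assistant = False
    # Split so that each piece is either a prefix/eos marker or plain text.
    for piece in re.split(f"({re.escape(prefix)}|{re.escape(eos)})", conversation):
        if not piece:
            continue
        ids = tokenizer(piece, add_special_tokens=False)["input_ids"]
        input_ids += ids
        if piece == prefix:
            in_assistant = True
            labels += [IGNORE_INDEX] * len(ids)  # do not train on the prompt marker itself
        elif piece == eos:
            # learn to emit eos only at the end of the model's own turns
            labels += ids if in_assistant else [IGNORE_INDEX] * len(ids)
            in_assistant = False
        else:
            labels += ids if in_assistant else [IGNORE_INDEX] * len(ids)
    return input_ids, labels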

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ArthurZucker
Collaborator

cc @Rocketknight1

@Rocketknight1
Member

Hi @congchan - firstly, apologies for taking so long to get to this one - it slipped past me the first time I was pinged! This seems like a clean PR, but I'm not sure we can accept it as-is: The list of special tokens that we have specific code for is very short, and I think this would make more sense as an added token in models that support it, since most models will not.

However, you're not the only user who wants a clean way to separate user and assistant messages in the tokens from apply_chat_template. Another user has suggested getting the method to return an optional mask array (similar to attention_mask), which you could use to mask assistant/user messages: #28950
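For illustration, if apply_chat_template did return such a per-token mask, training code could consume it roughly like this. The flag and key names below follow the idea discussed in #28950 and are assumptions, not a shipped API:

encoding = tokenizer.apply_chat_template(
    chat,
    tokenize=True,
    return_dict=True,
    return_assistant_tokens_mask=True,  # assumed name, per the linked issue
)
input_ids = encoding["input_ids"]
mask = encoding["assistant_masks"]  # assumed key: 1 for assistant tokens, 0 otherwise

# Build training labels: keep assistant tokens, ignore everything else.
labels = [tok if m == 1 else -100 for tok, m in zip(input_ids, mask)]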

@congchan
Author

Hi, thanks for your feedback. Indeed, it is better to keep the list of special tokens short.

Additionally, I suggest that apply_chat_template with tokenize=True take a "weight" or "mask" key in the input list of messages into account, to provide the most flexible end-to-end tokenization and unify both single-turn and multi-turn chat tuning.

The reason is that in a production environment, with curated multi-turn datasets or bad-case hot fixes, we can rewrite specific turns to be high quality without changing the rest of the turns.

Users can then choose to train their model only on the specific turns they believe to be high quality, and ignore the others.
For example:

chat = [
  {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate", "weight": 1.0},
  {"role": "user", "content": "Hello, how are you?", "weight": 0.0},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?", "weight": 1.0},
  {"role": "user", "content": "Cool, and who are you?", "weight": 0.0},
  {"role": "assistant", "content": "I'm ChatGPT.", "weight": 0.0},
  ....
  {"role": "user", "content": "Which is bigger, a virus or a bacterium?", "weight": 0.0},
  {"role": "assistant", "content": "A bacterium.", "weight": 1.0}
]

tokenizer.apply_chat_template(chat, tokenize=True) would then set the labels for the turns with "weight": 0.0 to ignore_index.
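Outside of apply_chat_template, the behaviour could be approximated today with something like the sketch below. It assumes the chat template renders messages[:i] as a prefix of the rendering of messages[:i+1]; the helper name is just for illustration:

IGNORE_INDEX = -100

def tokenize_with_weights(chat, tokenizer):
    """Tokenize a multi-turn chat and mask the labels of turns with weight 0.0."""
    input_ids, labels = [], []
    prev_len = 0
    for i, turn in enumerate(chat):
        # Render the conversation up to and including this turn, then keep only
        # the newly added tokens, so the template's own formatting is preserved.
        rendered = tokenizer.apply_chat_template(chat[: i + 1], tokenize=True)
        new_ids = rendered[prev_len:]
        prev_len = len(rendered)

        input_ids += new_ids
        if turn.get("weight", 1.0) == 0.0:
            labels += [IGNORE_INDEX] * len(new_ids)  # this turn is context only
        else:
            labels += new_ids  # this turn contributes to the loss
    return input_ids, labels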

I have already been using this input/output pipeline in my local training (though not yet via apply_chat_template).

What do you think? I can also help with it.

@ArthurZucker added the "Feature request" label on Mar 25, 2024