feat: support indicating prefix token of chat template #28473

Open · wants to merge 1 commit into base: main
Conversation

@congchan congchan commented Jan 12, 2024

What does this PR do?

In chat language model training, we sometimes need to mask the input from real users and train the model only on the assistant's outputs.

This PR adds a special prefix_token, which can be referenced in the chat_template, so that we can use it to dynamically separate user turns from assistant turns.

For example:

"""<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
<|im_start|>user
Can I ask a question?<|im_end|>
"""

Here the prefix_token could be <|im_start|>assistant\n, and we can make use of this token:

  • to set the model's chat_template, for example {% if add_generation_prompt %}{{ prefix_token }}{% endif %};
  • to split the dialog into user and assistant turns and mask the loss on the user turns, by accessing tokenizer.prefix_token and tokenizer.eos_token (see the sketch after this list).
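For illustration, here is a minimal sketch of the second point, assuming a tokenizer that exposes the proposed prefix_token attribute. The helper below is not part of this PR's diff, and splitting the rendered string this way may tokenize slightly differently from tokenizing the full conversation at once:

import re

IGNORE_INDEX = -100  # the usual label value ignored by the cross-entropy loss

def build_labels(conversation: str, tokenizer):
    """Tokenize a rendered chat string and mask every token that is not part
    of an assistant turn, i.e. not between prefix_token and eos_token."""
    prefix = tokenizer.prefix_token  # e.g. "<|im_start|>assistant\n" (proposed by this PR)
    eos = tokenizer.eos_token        # e.g. "<|im_end|>"

    input_ids, labels = [], []
    in_assistant = False
    # Split so that each piece is either a prefix/eos marker or plain text.
    for piece in re.split(f"({re.escape(prefix)}|{re.escape(eos)})", conversation):
        if not piece:
            continue
        ids = tokenizer(piece, add_special_tokens=False)["input_ids"]
        input_ids += ids
        if piece == prefix:
            in_assistant = True
            labels += [IGNORE_INDEX] * len(ids)  # do not train on the prompt marker itself
        elif piece == eos:
            # learn to emit eos only at the end of the model's own turns
            labels += ids if in_assistant else [IGNORE_INDEX] * len(ids)
            in_assistant = False
        else:
            labels += ids if in_assistant else [IGNORE_INDEX] * len(ids)
    return input_ids, labels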

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ArthurZucker
Collaborator

cc @Rocketknight1

@Rocketknight1
Member

Hi @congchan - firstly, apologies for taking so long to get to this one - it slipped past me the first time I was pinged! This seems like a clean PR, but I'm not sure we can accept it as-is: The list of special tokens that we have specific code for is very short, and I think this would make more sense as an added token in models that support it, since most models will not.

However, you're not the only user who wants a clean way to separate user and assistant messages in the tokens from apply_chat_template. Another user has suggested getting the method to return an optional mask array (similar to attention_mask), which you could use to mask assistant/user messages: #28950
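For illustration, if apply_chat_template did return such a per-token mask, training code could consume it roughly like this. The flag and key names below follow the idea discussed in #28950 and are assumptions, not a shipped API:

encoding = tokenizer.apply_chat_template(
    chat,
    tokenize=True,
    return_dict=True,
    return_assistant_tokens_mask=True,  # assumed name, per the linked issue
)
input_ids = encoding["input_ids"]
mask = encoding["assistant_masks"]  # assumed key: 1 for assistant tokens, 0 otherwise

# Build training labels: keep assistant tokens, ignore everything else.
labels = [tok if m == 1 else -100 for tok, m in zip(input_ids, mask)]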

@congchan
Author

Hi, thanks for your feedback. Indeed, it is better to keep the list of special tokens short.

Additionally, I suggest that apply_chat_template with tokenize=True take a "weight" or "mask" key in the input list of messages into account, to provide the most flexible end-to-end tokenization and unify both single-turn and multi-turn chat tuning.

The reason is that in a production environment, with curated multi-turn datasets or bad-case hot fixes, we can rewrite specific turns to be high quality without changing the rest of the turns.

Users can then choose to train their model only on the specific turns they believe to be high quality, and ignore the others.
For example:

chat = [
  {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate", "weight": 1.0},
  {"role": "user", "content": "Hello, how are you?", "weight": 0.0},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?", "weight": 1.0},
  {"role": "user", "content": "Cool, and who are you?", "weight": 0.0},
  {"role": "assistant", "content": "I'm ChatGPT.", "weight": 0.0},
  ....
  {"role": "user", "content": "Which is bigger, a virus or a bacterium?", "weight": 0.0},
  {"role": "assistant", "content": "A bacterium.", "weight": 1.0}
]

tokenizer.apply_chat_template(chat, tokenize=True) would then set the labels for the turns with "weight": 0.0 to ignore_index.
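Outside of apply_chat_template, the behaviour could be approximated today with something like the sketch below. It assumes the chat template renders messages[:i] as a prefix of the rendering of messages[:i+1]; the helper name is just for illustration:

IGNORE_INDEX = -100

def tokenize_with_weights(chat, tokenizer):
    """Tokenize a multi-turn chat and mask the labels of turns with weight 0.0."""
    input_ids, labels = [], []
    prev_len = 0
    for i, turn in enumerate(chat):
        # Render the conversation up to and including this turn, then keep only
        # the newly added tokens, so the template's own formatting is preserved.
        rendered = tokenizer.apply_chat_template(chat[: i + 1], tokenize=True)
        new_ids = rendered[prev_len:]
        prev_len = len(rendered)

        input_ids += new_ids
        if turn.get("weight", 1.0) == 0.0:
            labels += [IGNORE_INDEX] * len(new_ids)  # this turn is context only
        else:
            labels += new_ids  # this turn contributes to the loss
    return input_ids, labels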

I have already been using this input/output pipeline in my local training (though not yet via apply_chat_template).

What do you think? I can also help with it.

@ArthurZucker added the "Feature request" label on Mar 25, 2024