
fixing prompt template of chatml by removal of linebreak #922

Merged
winglian merged 1 commit into axolotl-ai-cloud:main on Dec 9, 2023

Conversation

timothylimyl
Contributor

Reference to Discord chat:
"""
hi <@208256080092856321>, went through the code step-by-step and found that there's an extra linebreak ('\n') in the ChatML separator, so the separator ends up with two linebreaks. You can see it here: https://github.com/OpenAccess-AI-Collective/axolotl/blob/a581e9f8f66e14c22ec914ee792dd4fe073e62f6/src/axolotl/prompt_strategies/sharegpt.py#L16 and https://github.com/OpenAccess-AI-Collective/axolotl/blob/a48dbf6561cc74c275a48070f397334a2c367dd5/src/axolotl/monkeypatch/fastchat_conversation_turns.py#L117. A typical silent killer of prompt templates for those not aware, but the model is most probably robust enough to still reply coherently since it's just a linebreak.
"""

@casper-hansen
Collaborator

Could you provide an example with before and after using `python -m axolotl.cli.preprocess your_config.yml --debug`? Just curious to see the actual difference on a per-token level to verify that this is the case.

@winglian
Collaborator

winglian commented Dec 7, 2023

> Could you provide an example with before and after using `python -m axolotl.cli.preprocess your_config.yml --debug`? Just curious to see the actual difference on a per-token level to verify that this is the case.

here's the current main, definitely something buggy

Screenshot_2023-12-07_at_10_21_45_AM

@casper-hansen
Collaborator

> > Could you provide an example with before and after using `python -m axolotl.cli.preprocess your_config.yml --debug`? Just curious to see the actual difference on a per-token level to verify that this is the case.
>
> here's the current main, definitely something buggy
>
> Screenshot_2023-12-07_at_10_21_45_AM

Oh that's really bad. Essentially you are putting in a sample with just noise with `<0x0A><0x0A>`?

@timothylimyl
Contributor Author

> > Could you provide an example with before and after using `python -m axolotl.cli.preprocess your_config.yml --debug`? Just curious to see the actual difference on a per-token level to verify that this is the case.
>
> here's the current main, definitely something buggy
> Screenshot_2023-12-07_at_10_21_45_AM

I think the difference is that there's just an extra token 13 (the newline, <0x0A>).
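
A small way to reproduce this outside of the preprocess output (sketch; the model name is only an example, and it assumes a Llama/Mistral-style tokenizer where the newline is token 13):

```python
# Sketch: compare how the doubled and the fixed ChatML separators tokenize.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")  # example model only
tok.add_special_tokens({"additional_special_tokens": ["<|im_start|>", "<|im_end|>"]})

old_sep = "<|im_end|>\n\n"  # what the doubled linebreak produced
new_sep = "<|im_end|>\n"    # after this fix

print(tok.encode(old_sep, add_special_tokens=False))
print(tok.encode(new_sep, add_special_tokens=False))
# The first encoding should contain one extra newline token (id 13 for this tokenizer family).
```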

@winglian seems like you made <|im_start|> and <|im_end|> special tokens; can you give me some pointers on how you did that in axolotl?

@NanoCode012
Collaborator

@timothylimyl

> seems like you made <|im_start|> and <|im_end|> special tokens; can you give me some pointers on how you did that in axolotl?

You can add them to

```yaml
# bos/eos/pad/unk
special_tokens:
  eos_token: "<|im_end|>"   # example values; adjust bos/eos/pad/unk as needed

# others
tokens:
  - "<|im_start|>"
```

@NanoCode012
Collaborator

I noticed this a while back but forgot about it. I think this is a good fix for this silent bug.

@timothylimyl
Contributor Author

> @timothylimyl
>
> > seems like you made <|im_start|> and <|im_end|> special tokens; can you give me some pointers on how you did that in axolotl?
>
> You can add them to
>
> ```yaml
> # bos/eos/pad/unk
> special_tokens:
>
> # others
> tokens:
> ```

Is axolotl robust enough to deal with the extra tokens in the vocabulary? For example, <|im_start|> and <|im_end|> will need to be added to the tokenizer config, and the model's final layer has to be extended to accommodate the extra tokens.

@winglian
Collaborator

winglian commented Dec 9, 2023

> Is axolotl robust enough to deal with the extra tokens in the vocabulary? For example, <|im_start|> and <|im_end|> will need to be added to the tokenizer config, and the model's final layer has to be extended to accommodate the extra tokens.

Yes, it handles it correctly. Are you asking specifically about LoRA? (In which case you need to manually specify the `lm_head` and `embed_tokens` layers in `lora_modules_to_save`.)
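
For illustration, a rough sketch of what handling the extra tokens involves at the Transformers level (not axolotl's actual code; the model name is only an example):

```python
# Rough sketch of what supporting new tokens involves: extend the tokenizer, then
# resize the model's embedding matrix (and the tied lm_head) to cover the new ids.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # example model only
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

num_added = tok.add_special_tokens({"additional_special_tokens": ["<|im_start|>", "<|im_end|>"]})
if num_added:
    model.resize_token_embeddings(len(tok))  # grows embed_tokens and lm_head rows

# With LoRA, those new rows are only trained if embed_tokens / lm_head are included in
# modules_to_save (exposed in the axolotl config as lora_modules_to_save).
```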

winglian merged commit 03c6318 into axolotl-ai-cloud:main on Dec 9, 2023
4 checks passed
@timothylimyl
Contributor Author

@winglian

Just noticed this for the last few tokens (at the end):

`<|im_end|>(32000, 32000) (28705, 28705) <0x0A>(13, 13) <|im_end|>(32000, 32000)`

why is there a token id 28705 there? Any clues?
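
One way to check is to decode the id directly (sketch; the model name is only an example, so verify against the tokenizer actually in use):

```python
# Sketch: look up what token id 28705 is for the tokenizer being used.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")  # example model only
print(tok.convert_ids_to_tokens([28705, 13]))  # e.g. ['▁', '<0x0A>'] for Mistral-family tokenizers
print(repr(tok.decode([28705])))               # what that id contributes to the decoded text
```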

mkeoliya pushed a commit to mkeoliya/axolotl that referenced this pull request Dec 15, 2023
…cloud#922)

Co-authored-by: Timothy  Lim <timothyyonglee.lim@kxrdev.com>
@noobmaster29

The following change seems to fix the double EOS token:

```python
register_conv_template(
    Conversation(
        name="chatml",
        system_template="<|im_start|>system\n{system_message}",
        system_message="You are a helpful assistant.",
        roles=["<|im_start|>user", "<|im_start|>assistant"],
        # sep_style=SeparatorStyle.CHATML,
        sep="<|im_end|>",
        stop_str="<|im_end|>",
    )
)
```

[screenshot: 20231215_071641]
[screenshot: 20231215_071623]

noobmaster29 added a commit to noobmaster29/axolotl that referenced this pull request Dec 18, 2023
Resolves the double EOS token issue at the end of prompts when using the ChatML template with sharegpt.py.

axolotl-ai-cloud#922 (comment)
noobmaster29 mentioned this pull request Dec 18, 2023
@timothylimyl
Contributor Author

@noobmaster29 what happened to your PR?

@noobmaster29

Someone on Discord mentioned that the change did not solve the issue of double EOS tokens. I have not had time to replicate it yet but I will try to look at it tomorrow.

@LZY-the-boys

> Someone on Discord mentioned that the change did not solve the issue of double EOS tokens. I have not had time to replicate it yet but I will try to look at it tomorrow.

It may be caused by the dataset cache. Although the code has been changed to <|im_end|>\n, the cached dataset may still contain <|im_end|>\n\n and be loaded directly for training.
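
If that is the cause, deleting the prepared-dataset cache and re-running preprocessing should pick up the corrected template (a sketch; the path is an assumption and should match dataset_prepared_path in your config, which defaults to last_run_prepared):

```python
# Sketch: remove the cached, pre-tokenized dataset so the corrected template is applied.
# Assumes the cache lives where dataset_prepared_path points (default: last_run_prepared).
import shutil
from pathlib import Path

cache_dir = Path("last_run_prepared")  # adjust to your dataset_prepared_path
if cache_dir.exists():
    shutil.rmtree(cache_dir)
    print(f"removed {cache_dir}; re-run preprocessing to rebuild it")
```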

@noobmaster29

Could you try a fresh preprocessing run and see if it still produces the double EOS and newline characters?

winglian pushed a commit to noobmaster29/axolotl that referenced this pull request Jan 6, 2024
Resolves the double EOS token issue at the end of prompts when using the ChatML template with sharegpt.py.

axolotl-ai-cloud#922 (comment)