
fixing prompt template of chatml by removal of linebreak #922

Merged
winglian merged 1 commit into axolotl-ai-cloud:main on Dec 9, 2023

Conversation

timothylimyl
Contributor

Reference to Discord chat:
"""
hi <@208256080092856321>, went through the code step-by-step and found that there's an extra linebreak ('\n') in the ChatML separator, so the separator ends up with two linebreaks. You can see it here: https://github.com/OpenAccess-AI-Collective/axolotl/blob/a581e9f8f66e14c22ec914ee792dd4fe073e62f6/src/axolotl/prompt_strategies/sharegpt.py#L16 and https://github.com/OpenAccess-AI-Collective/axolotl/blob/a48dbf6561cc74c275a48070f397334a2c367dd5/src/axolotl/monkeypatch/fastchat_conversation_turns.py#L117. A typical silent killer of prompt templates for those not aware, but the model is most probably robust enough to still reply coherently since it's just a linebreak.
"""

@casper-hansen
Collaborator

Could you provide an example with before and after using `python -m axolotl.cli.preprocess your_config.yml --debug`? Just curious to see the actual difference on a per-token level to verify that this is the case.

@winglian
Collaborator

winglian commented Dec 7, 2023

> Could you provide an example with before and after using `python -m axolotl.cli.preprocess your_config.yml --debug`? Just curious to see the actual difference on a per-token level to verify that this is the case.

here's the current main, definitely something buggy

Screenshot_2023-12-07_at_10_21_45_AM

@casper-hansen
Collaborator

> > Could you provide an example with before and after using `python -m axolotl.cli.preprocess your_config.yml --debug`? Just curious to see the actual difference on a per-token level to verify that this is the case.
>
> here's the current main, definitely something buggy
>
> Screenshot_2023-12-07_at_10_21_45_AM

Oh that's really bad. Essentially you are putting in a sample with just noise with `<0x0A><0x0A>`?

@timothylimyl
Contributor Author

> > Could you provide an example with before and after using `python -m axolotl.cli.preprocess your_config.yml --debug`? Just curious to see the actual difference on a per-token level to verify that this is the case.
>
> here's the current main, definitely something buggy
> Screenshot_2023-12-07_at_10_21_45_AM

I think the difference is that there's just an extra token 13 (the newline, <0x0A>).
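
A small way to reproduce this outside of the preprocess output (sketch; the model name is only an example, and it assumes a Llama/Mistral-style tokenizer where the newline is token 13):

```python
# Sketch: compare how the doubled and the fixed ChatML separators tokenize.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")  # example model only
tok.add_special_tokens({"additional_special_tokens": ["<|im_start|>", "<|im_end|>"]})

old_sep = "<|im_end|>\n\n"  # what the doubled linebreak produced
new_sep = "<|im_end|>\n"    # after this fix

print(tok.encode(old_sep, add_special_tokens=False))
print(tok.encode(new_sep, add_special_tokens=False))
# The first encoding should contain one extra newline token (id 13 for this tokenizer family).
```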

@winglian seems like you made <|im_start|> and <|im_end|> special tokens; can you give me some pointers on how you did that in axolotl?

@NanoCode012
Collaborator

@timothylimyl

> seems like you made <|im_start|> and <|im_end|> special tokens; can you give me some pointers on how you did that in axolotl?

You can add them to

```yaml
# bos/eos/pad/unk
special_tokens:
  eos_token: "<|im_end|>"   # example values; adjust bos/eos/pad/unk as needed

# others
tokens:
  - "<|im_start|>"
```

@NanoCode012
Collaborator

I noticed this a while back but forgot about it. I think this is a good fix for this silent bug.

@timothylimyl
Contributor Author

> @timothylimyl
>
> > seems like you made <|im_start|> and <|im_end|> special tokens; can you give me some pointers on how you did that in axolotl?
>
> You can add them to
>
> ```yaml
> # bos/eos/pad/unk
> special_tokens:
>
> # others
> tokens:
> ```

Is axolotl robust enough to deal with the extra tokens in the vocabulary? For example, <|im_start|> and <|im_end|> will need to be added to the tokenizer config, and the model's final layer has to be extended to accommodate the extra tokens.

@winglian
Collaborator

winglian commented Dec 9, 2023

> Is axolotl robust enough to deal with the extra tokens in the vocabulary? For example, <|im_start|> and <|im_end|> will need to be added to the tokenizer config, and the model's final layer has to be extended to accommodate the extra tokens.

Yes, it handles it correctly. Are you asking specifically about LoRA? (In which case you need to manually specify the `lm_head` and `embed_tokens` layers in `lora_modules_to_save`.)
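
For illustration, a rough sketch of what handling the extra tokens involves at the Transformers level (not axolotl's actual code; the model name is only an example):

```python
# Rough sketch of what supporting new tokens involves: extend the tokenizer, then
# resize the model's embedding matrix (and the tied lm_head) to cover the new ids.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # example model only
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

num_added = tok.add_special_tokens({"additional_special_tokens": ["<|im_start|>", "<|im_end|>"]})
if num_added:
    model.resize_token_embeddings(len(tok))  # grows embed_tokens and lm_head rows

# With LoRA, those new rows are only trained if embed_tokens / lm_head are included in
# modules_to_save (exposed in the axolotl config as lora_modules_to_save).
```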

winglian merged commit 03c6318 into axolotl-ai-cloud:main on Dec 9, 2023
4 checks passed
@timothylimyl
Contributor Author

@winglian

Just noticed this for the last few tokens (at the end):

`<|im_end|>(32000, 32000) (28705, 28705) <0x0A>(13, 13) <|im_end|>(32000, 32000)`

why is there a token id 28705 there? Any clues?
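
One way to check is to decode the id directly (sketch; the model name is only an example, so verify against the tokenizer actually in use):

```python
# Sketch: look up what token id 28705 is for the tokenizer being used.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")  # example model only
print(tok.convert_ids_to_tokens([28705, 13]))  # e.g. ['▁', '<0x0A>'] for Mistral-family tokenizers
print(repr(tok.decode([28705])))               # what that id contributes to the decoded text
```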

mkeoliya pushed a commit to mkeoliya/axolotl that referenced this pull request Dec 15, 2023
…cloud#922)

Co-authored-by: Timothy  Lim <timothyyonglee.lim@kxrdev.com>
@noobmaster29

The following change seems to fix the double EOS token:

```python
register_conv_template(
    Conversation(
        name="chatml",
        system_template="<|im_start|>system\n{system_message}",
        system_message="You are a helpful assistant.",
        roles=["<|im_start|>user", "<|im_start|>assistant"],
        # sep_style=SeparatorStyle.CHATML,
        sep="<|im_end|>",
        stop_str="<|im_end|>",
    )
)
```

[screenshot: 20231215_071641]
[screenshot: 20231215_071623]

noobmaster29 added a commit to noobmaster29/axolotl that referenced this pull request Dec 18, 2023
Resolves the double EOS token issue at the end of prompts when using the ChatML template with sharegpt.py.

axolotl-ai-cloud#922 (comment)
noobmaster29 mentioned this pull request Dec 18, 2023
@timothylimyl
Contributor Author

@noobmaster29 what happened to your PR?

@noobmaster29

Someone on Discord mentioned that the change did not solve the issue of double EOS tokens. I have not had time to replicate it yet but I will try to look at it tomorrow.

@LZY-the-boys

> Someone on Discord mentioned that the change did not solve the issue of double EOS tokens. I have not had time to replicate it yet but I will try to look at it tomorrow.

It may be caused by the dataset cache. Although the code has been changed to <|im_end|>\n, the cached dataset may still contain <|im_end|>\n\n and be loaded directly for training.
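
If that is the cause, deleting the prepared-dataset cache and re-running preprocessing should pick up the corrected template (a sketch; the path is an assumption and should match dataset_prepared_path in your config, which defaults to last_run_prepared):

```python
# Sketch: remove the cached, pre-tokenized dataset so the corrected template is applied.
# Assumes the cache lives where dataset_prepared_path points (default: last_run_prepared).
import shutil
from pathlib import Path

cache_dir = Path("last_run_prepared")  # adjust to your dataset_prepared_path
if cache_dir.exists():
    shutil.rmtree(cache_dir)
    print(f"removed {cache_dir}; re-run preprocessing to rebuild it")
```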

@noobmaster29

Could you try a fresh preprocessing run and see if it still produces the double EOS and newline characters?

winglian pushed a commit to noobmaster29/axolotl that referenced this pull request Jan 6, 2024
Resolves the double EOS token issue at the end of prompts when using the ChatML template with sharegpt.py.

axolotl-ai-cloud#922 (comment)