llama2_chat.py tokenize_prompt() fails #519
Comments
+1 on this. Having the exact same error. Not clear if it is as simple as @kvikk suggests, or maybe something to do with the tokenisation scheme? Surely if this were a simple +1 offset on everything it could never have worked in testing?
I think we should refactor this to use fastchat's conversation object.
@dandm1 Other things could be wrong, but not having the offset is clearly an error, because of the added <s> token.
@kvikk, yes, it fixed it for me too. It just looks from the original PR like several people were testing this, and I don't see how it is possible it passed that testing if the bug is quite so fundamental.
Hey, were you able to train the model?
I don't think that's related. It sounds like a driver or installation problem.
@kvikk Are you able to finetune the model using this prompt strategy?
@kvikk This error does not occur with any other prompting strategy, such as sharegpt:chat, so it has to be a bug in the code.
@vibhorag101 I was able to train, yes. This warning does not affect training; you can still train without fixing it, it just invalidates samples. (Unfortunately all of them, so there is no point in training.) I have not seen the error you encountered.
@vibhorag101
@kaldeberger Done.
I found the suggested fix did not work for all edge cases, while the one I found fixed both issues.
Fixed after #578 |
Please check that this issue hasn't been reported before.
Expected Behavior
Tokenizing a prompt should work: cur_len should equal total_len (the check is at https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/src/axolotl/prompt_strategies/llama2_chat.py#L129).
Current behaviour
Tokenization fails with "WARNING: tokenization mismatch ..." and the affected samples are invalidated.
Steps to reproduce
Create a prompt example file with something like:
{"conversations": [{"from": "system", "value": "sytem prompt"}, {"from": "human", "value": "Hello."}, {"from": "gpt", "value": "Nice to meet you!."}]}
Modify the yml to use this file as the dataset and set the dataset type to llama2_chat, for example:
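A minimal dataset stanza could look like this (the path is a placeholder for the file created above):
datasets:
  - path: data/example.jsonl
    type: llama2_chat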
accelerate launch scripts/finetune.py your_llama2.yml
Possible solution
The length mismatch comes from the tokenizer prepending the id of <s> to the input ids, so the result is one token longer than expected. This is handled at https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/src/axolotl/prompt_strategies/llama2_chat.py#L120 by subtracting 1, but not at https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/src/axolotl/prompt_strategies/llama2_chat.py#L113. That line should be modified to:
turn_len = len(self.tokenizer(turn).input_ids) - 1
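The off-by-one can be seen directly (a minimal sketch; the model name is only an example, and any Llama-2 tokenizer with the default add_bos_token=True behaves this way):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
ids = tokenizer("Hello.").input_ids
assert ids[0] == tokenizer.bos_token_id  # <s> is prepended to every encode call
turn_len = len(ids) - 1  # drop the implicit <s> from the per-turn count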
As a clarification, the comment at https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/src/axolotl/prompt_strategies/llama2_chat.py#L124 should also be changed. It is not the role token but sep2 (</s><s>, defined at https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/src/axolotl/prompt_strategies/llama2_chat.py#L53), and its length becomes 2 because the tokenizer removes the space at the front. A more elegant way would be to handle this the same way as instruction_length: add the separator to the turn and then tokenize, as in the lines below.
Line 113:
turn_len = len(self.tokenizer(turn + conv.sep2).input_ids) - 1  # tokenizer adds <s>; makes "</s><s>" from " </s><s>"
then line 124:
cur_len += turn_len
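Put together, the per-turn accounting would behave roughly like this (a self-contained sketch, not a drop-in patch; the function name, the example turn, and the hard-coded sep2 default " </s><s>" from the linked conversation template are assumptions here):
from transformers import AutoTokenizer

def count_conversation_tokens(tokenizer, turns, sep2=" </s><s>"):
    cur_len = 1  # the single <s> the tokenizer prepends to the full prompt
    for turn in turns:
        if turn == "":
            break
        # Tokenizing turn + sep2 folds the separator into the turn; the
        # leading space of sep2 is swallowed by the tokenizer, and the -1
        # drops the <s> it adds to this fragment.
        cur_len += len(tokenizer(turn + sep2).input_ids) - 1
    return cur_len

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # example model
print(count_conversation_tokens(tokenizer, ["[INST] Hello. [/INST] Hi!"]))  # hypothetical turn
With the fix in place, this count should line up with total_len, which is what the check at L129 verifies.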
Which Operating Systems are you using?
Python Version
3.9.0
axolotl branch-commit
main/f51c9c56c6c319e331c702949f827155ddc5e45b
Acknowledgements