llama2_chat.py tokenize_prompt() fails #519

Closed · kvikk opened this issue Sep 1, 2023 · 13 comments
Labels: bug (Something isn't working)

kvikk commented Sep 1, 2023

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Tokenizing a prompt should succeed: cur_len should equal total_len (the check is at https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/src/axolotl/prompt_strategies/llama2_chat.py#L129).

Current behaviour

Tokenizer fails with "WARNING: tokenization mismatch ..."

Steps to reproduce

Create a prompt example file with something like:
{"conversations": [{"from": "system", "value": "sytem prompt"}, {"from": "human", "value": "Hello."}, {"from": "gpt", "value": "Nice to meet you!."}]}

Modify the yml to use this file as the dataset and set the dataset type to llama2_chat, as sketched below.
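
A minimal sketch of the relevant datasets section, assuming a jsonl file (the path is a placeholder):

```yaml
datasets:
  - path: data/conversations.jsonl  # placeholder: the prompt example file created above
    type: llama2_chat
```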

accelerate launch scripts/finetune.py your_llama2.yml

Possible solution

The length mismatch arises because the tokenizer prepends the id of <s> to the input_ids, so the result is one token longer than expected. This is handled at https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/src/axolotl/prompt_strategies/llama2_chat.py#L120 by subtracting 1, but not at https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/src/axolotl/prompt_strategies/llama2_chat.py#L113, so that line should be modified to:

turn_len = len(self.tokenizer(turn).input_ids) - 1

As a clarification, the comment at https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/src/axolotl/prompt_strategies/llama2_chat.py#L124 should also be changed: it is not the role token but sep2, i.e. </s><s> (https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/src/axolotl/prompt_strategies/llama2_chat.py#L53), and the length becomes 2 because the tokenizer removes the space at the front.

A more elegant way would be to handle this the same way as instruction_length: append the separator to the turn and then tokenize. At line 113:

turn_len = len(self.tokenizer(turn + conv.sep2).input_ids) - 1  # tokenizer adds <s>; " </s><s>" becomes "</s><s>"

and then at line 124:

cur_len += turn_len
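
For reference, a minimal standalone sketch of the off-by-one (not the axolotl code itself; the checkpoint name is just an example, any locally available Llama-2 tokenizer with default settings behaves the same):

```python
from transformers import AutoTokenizer

# Example checkpoint; the Llama tokenizer prepends BOS (<s>) by default.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

turn = "Nice to meet you!"
ids = tokenizer(turn).input_ids
assert ids[0] == tokenizer.bos_token_id  # <s> was prepended

# The raw length therefore over-counts by one, so turn_len must drop the
# implicit <s>, exactly as instruction_length already does:
turn_len = len(tokenizer(turn).input_ids) - 1
```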

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.9.0

axolotl branch-commit

main/f51c9c56c6c319e331c702949f827155ddc5e45b

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
kvikk added the bug (Something isn't working) label Sep 1, 2023
dandm1 (Contributor) commented Sep 10, 2023

+1 on this. I'm having the exact same error. It's not clear whether it is as simple as @kvikk suggests, or maybe something to do with the tokenisation scheme? Surely if this were a simple +1 offset on everything it could never have worked in testing?

winglian (Collaborator) commented

I think we should refactor this to use fastchat's conversation object.
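
For illustration, a rough sketch of what that refactor could look like, assuming a recent fastchat release (the template name and the set_system_message call are assumptions about that API, not code from this repo):

```python
from fastchat.conversation import get_conv_template

# Let fastchat's Conversation object build the prompt instead of
# hand-assembling turns and separators.
conv = get_conv_template("llama-2")
conv.set_system_message("system prompt")                  # assumed available in recent fastchat
conv.append_message(conv.roles[0], "Hello.")              # human turn
conv.append_message(conv.roles[1], "Nice to meet you!")   # gpt turn
prompt = conv.get_prompt()  # separators ([INST] ... </s><s>) handled by the template
```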

kvikk (Author) commented Sep 12, 2023

@dandm1 Other things could be wrong too, but not having the offset is clearly an error because of the added <s> token. As I said, it is handled when setting the instruction_length variable; the turn_len variable should be calculated the same way.
It fixed the issue with my training.

dandm1 (Contributor) commented Sep 12, 2023

@kvikk, yes, it fixed it for me too. It just looks from the original PR like several people were testing this, and I don't see how it could have passed that testing if the bug is quite so fundamental.

vibhorag101 commented

Hey, were you able to train the model?
I am getting the following error when trying to train llama-2 using this prompt strategy, after making the changes you suggested.
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)

kvikk (Author) commented Sep 13, 2023

I don't think that's related. It sounds like a driver or installation problem.

vibhorag101 commented

@kvikk Are you able to finetune the model using this prompt strategy?

vibhorag101 commented

@kvikk This error does not occur on any other prompting strategies such as sharegpt:chat. So it has to be some bug in the code.
My dataset looks something like this. Any help would be very much appreciated!
[
  {
    "id": "identity_0",
    "conversations": [
      {
        "from": "human",
        "value": "I've been feeling so sad and overwhelmed lately. Work has become such a massive source of stress for me."
      },
      {
        "from": "gpt",
        "value": "Hey there, I'm here to listen and support you. It sounds like work has been really challenging lately. Can you tell me more about what's been going on?"
      }
    ]
  }
]

kvikk (Author) commented Sep 14, 2023

@vibhorag101 Yes, I was able to train. This warning does not stop training; you can still train without the fix, it just invalidates samples (unfortunately all of them, so there is no point in training). I have not seen the error you encountered.

kaldeberger commented

Hey, were you able to train the model? I am getting the following error when trying to train llama-2 using this prompt strategy, after making the changes you suggested. RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)

@vibhorag101 I believe I ran into the same bug. I opened a separate issue for it: #568. Can you chime in there?

vibhorag101 commented

@kaldeberger Done

dimichgh commented

I found that the suggested fix did not work for all edge cases, while the one I found fixed both issues: #568 (comment)

kvikk (Author) commented Oct 6, 2023

Fixed after #578

kvikk closed this as completed Oct 9, 2023