llama2_chat.py tokenize_prompt() fails #519
Comments
+1 on this. Having the exact same error. Not clear if it is as simple as @kvikk suggests, or maybe something to do with the tokenisation scheme? Surely if this were a simple +1 offset on everything it could never have worked in testing?
I think we should refactor this to use fastchat's conversation object.
@dandm1 Other things could be wrong, but not having the offset is clearly an error, because of the added <s> token.
@kvikk, yes, it fixed it for me too. It just looks from the original PR like several people were testing this, and I don't see how it is possible it passed that testing if the bug is quite so fundamental.
Hey, were you able to train the model?
I don't think that's related. It sounds like a driver or installation problem.
@kvikk Are you able to finetune the model using this prompt strategy?
@kvikk This error does not occur with any other prompting strategy, such as sharegpt:chat, so it has to be a bug in the code.
@vibhorag101 I was able to train, yes. This warning does not affect training; you can still train without fixing it, it just invalidates samples. (Unfortunately all of them, so there is no point in training.) I have not seen the error you encountered.
@vibhorag101
@kaldeberger Done.
I found the suggested fix did not work for all edge cases, while the one I found fixed both issues.
Fixed after #578 |
Please check that this issue hasn't been reported before.
Expected Behavior
Tokenizing a prompt should work: cur_len should equal total_len (the check is at https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/src/axolotl/prompt_strategies/llama2_chat.py#L129).
Current behaviour
Tokenization fails with "WARNING: tokenization mismatch ..." and the affected samples are invalidated.
Steps to reproduce
Create a prompt example file with something like:
{"conversations": [{"from": "system", "value": "sytem prompt"}, {"from": "human", "value": "Hello."}, {"from": "gpt", "value": "Nice to meet you!."}]}
Modify the yml to use this file as the dataset and set the dataset type to llama2_chat, for example:
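A minimal dataset stanza could look like this (the path is a placeholder for the file created above):
datasets:
  - path: data/example.jsonl
    type: llama2_chat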
accelerate launch scripts/finetune.py your_llama2.yml
Possible solution
The length mismatch comes from the tokenizer prepending the id of <s> to the input ids, so the result is one token longer than expected. This is handled at https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/src/axolotl/prompt_strategies/llama2_chat.py#L120 by subtracting 1, but not at https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/src/axolotl/prompt_strategies/llama2_chat.py#L113. That line should be modified to:
turn_len = len(self.tokenizer(turn).input_ids) - 1
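The off-by-one can be seen directly (a minimal sketch; the model name is only an example, and any Llama-2 tokenizer with the default add_bos_token=True behaves this way):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
ids = tokenizer("Hello.").input_ids
assert ids[0] == tokenizer.bos_token_id  # <s> is prepended to every encode call
turn_len = len(ids) - 1  # drop the implicit <s> from the per-turn count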
As a clarification, the comment at https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/src/axolotl/prompt_strategies/llama2_chat.py#L124 should also be changed. It is not the role token but sep2 (</s><s>, defined at https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/src/axolotl/prompt_strategies/llama2_chat.py#L53), and its length becomes 2 because the tokenizer removes the space at the front. A more elegant way would be to handle this the same way as instruction_length: add the separator to the turn and then tokenize, as in the lines below.
Line 113:
turn_len = len(self.tokenizer(turn + conv.sep2).input_ids) - 1  # tokenizer adds <s>; makes "</s><s>" from " </s><s>"
then line 124:
cur_len += turn_len
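Put together, the per-turn accounting would behave roughly like this (a self-contained sketch, not a drop-in patch; the function name, the example turn, and the hard-coded sep2 default " </s><s>" from the linked conversation template are assumptions here):
from transformers import AutoTokenizer

def count_conversation_tokens(tokenizer, turns, sep2=" </s><s>"):
    cur_len = 1  # the single <s> the tokenizer prepends to the full prompt
    for turn in turns:
        if turn == "":
            break
        # Tokenizing turn + sep2 folds the separator into the turn; the
        # leading space of sep2 is swallowed by the tokenizer, and the -1
        # drops the <s> it adds to this fragment.
        cur_len += len(tokenizer(turn + sep2).input_ids) - 1
    return cur_len

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # example model
print(count_conversation_tokens(tokenizer, ["[INST] Hello. [/INST] Hi!"]))  # hypothetical turn
With the fix in place, this count should line up with total_len, which is what the check at L129 verifies.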
Which Operating Systems are you using?
Python Version
3.9.0
axolotl branch-commit
main/f51c9c56c6c319e331c702949f827155ddc5e45b
Acknowledgements