RuntimeError: Error(s) in loading state_dict for MistralForCausalLM (Deepspeed Zero 3) #933
Comments
Are you using a model from a checkpoint folder or the output folder?
From the output folder.
I can confirm that I only experience this issue when using ZeRO 3; ZeRO 2 works fine.
I just ran into the same error; can confirm that switching from ZeRO 3 to ZeRO 2 "solved" the issue.
@mgoulao is this a transformers regression, then? Does that particular commit work with ZeRO 3?
Yes, it does work with ZeRO 3; however, you will get this problem: #1035
I had the same error; the transformers library fixes it, but now I get this one: `new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(`
I can confirm the same error when fine-tuning Mistral with the chatml format and DeepSpeed ZeRO 3.
This issue is about a full fine-tune; no LoRA involved.
I am doing a full fine-tune, no QLoRA.
+1: ZeRO 3 with bf16, full fine-tune.
EDIT: can confirm ZeRO 2 works.
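For reference, the workaround reported above amounts to pointing the trainer at a stage-2 DeepSpeed config instead of zero3.json. A minimal sketch of such a config (field values are illustrative, not taken from this repo):

```json
{
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "bf16": { "enabled": "auto" },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
```

Unlike stage 3, stage 2 partitions only optimizer state and gradients, so the saved model weights are never sharded across ranks, which is consistent with the checkpoints loading cleanly.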
I encountered this too, although mine was with Llama 3 + ZeRO 3. The model safetensors were being output as shards, but there was also a
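The failure mode described in that last comment (sharded safetensors alongside some other leftover weight file in the same output folder) is easy to check for by listing the weight files. A hypothetical helper sketch, not part of axolotl:

```python
import os

# Hypothetical helper: list every weight file in an output folder so a stale
# single-file checkpoint (e.g. a leftover .bin) sitting next to sharded
# .safetensors files is easy to spot before loading.
def weight_files(folder):
    return sorted(
        name for name in os.listdir(folder)
        if name.endswith((".bin", ".safetensors"))
    )
```

If both a `.bin` file and sharded `.safetensors` files show up, deleting or moving the stale one before calling `from_pretrained` is worth trying.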
Please check that this issue hasn't been reported before.
Expected Behavior
I fine-tune a Mistral model with the default zero3.json and
Training finishes without error. Afterwards, I expect to be able to load the fine-tuned model using
My accelerate config is
Current behaviour
yields the error
and
yields the error
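The class of error in the title can be reproduced in isolation: an unconsolidated ZeRO-3 checkpoint can contain empty placeholder parameters, and `load_state_dict` raises the same `RuntimeError: Error(s) in loading state_dict` on any shape mismatch. A minimal sketch with a hypothetical stand-in module (not Mistral):

```python
import torch
import torch.nn as nn

# Stand-in module: loading a state dict whose tensors have the wrong shape
# raises the same "Error(s) in loading state_dict" RuntimeError reported
# above. ZeRO-3 checkpoints that were not consolidated can contain empty
# (shape [0]) placeholder parameters, which triggers exactly this.
class Tiny(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 4)

model = Tiny()
bad_sd = {"fc.weight": torch.empty(0), "fc.bias": torch.empty(0)}
try:
    model.load_state_dict(bad_sd)
except RuntimeError as e:
    print("size mismatch" in str(e))  # → True
```

This does not prove the checkpoint is partitioned, but it shows the error message alone cannot distinguish a ZeRO-3 placeholder from any other shape problem.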
Steps to reproduce
and thereafter
Config yaml
Possible solution
Seems related to #705 and #709
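One way to narrow down which parameters are affected is to load with `strict=False` and inspect the result. A sketch with a hypothetical stand-in module; note that size-mismatched tensors still raise, so this surfaces only missing and unexpected keys:

```python
import torch
import torch.nn as nn

# Sketch: load_state_dict(strict=False) reports missing and unexpected keys
# instead of raising, which helps tell a truncated checkpoint apart from a
# renamed-key problem. (Stand-in module; size-mismatched tensors would
# still raise a RuntimeError even with strict=False.)
model = nn.Linear(4, 4)
ckpt = {"weight": torch.zeros(4, 4), "extra": torch.zeros(1)}
result = model.load_state_dict(ckpt, strict=False)
print(result.missing_keys)     # → ['bias']
print(result.unexpected_keys)  # → ['extra']
```

If every key shows up as a size mismatch rather than missing, that points at partitioned ZeRO-3 shards rather than a key-naming regression.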
Which Operating Systems are you using?
Python Version
3.10
axolotl branch-commit
main/3e3229e2d99bb509784ac72e6589f8a8e406247f
Acknowledgements