Fix DeepSpeed Zero 3 Saving #709
src/axolotl/train.py
Outdated
@@ -134,6 +134,22 @@ def terminate_handler(_, __, model):
    # only save on rank 0, otherwise it corrupts output on multi-GPU when multiple processes attempt to write the same file
    if cfg.fsdp:
        trainer.save_model(cfg.output_dir)
    elif cfg.deepspeed:
does this apply to all zero* stages, or just zero3 as listed in the pr title?
this error applies only to zero3. this code change will run for all zero* stages, though it should work for all of them.
if we want a smaller change, i think something like this should work:
https://github.com/lm-sys/FastChat/pull/1457/files#diff-82b734e9eda6b4bac9a28b1056d4e0e0676f904e43cede16d7aa6e2d1da3e61bR155
perhaps elif cfg.deepspeed and trainer.hf_deepspeed_config_orig.is_zero3():
just to minimize the blast radius on this change?
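A rough sketch of the gating suggested above, for illustration only. The helper names `is_zero3` and `choose_save_path` are hypothetical and are not part of axolotl or transformers; the check reads the `zero_optimization.stage` field of a DeepSpeed config dict rather than calling `trainer.hf_deepspeed_config_orig.is_zero3()`, so it can run without a trainer.

```python
from typing import Optional


def is_zero3(ds_config: dict) -> bool:
    """Return True when a DeepSpeed config dict selects ZeRO stage 3.

    Hypothetical stand-in for the is_zero3() check mentioned in the review.
    """
    return ds_config.get("zero_optimization", {}).get("stage") == 3


def choose_save_path(use_fsdp: bool, ds_config: Optional[dict]) -> str:
    """Pick which save branch would run, mirroring the if/elif in the diff."""
    if use_fsdp:
        return "fsdp"
    # Only ZeRO-3 takes the special gather-and-save branch; other stages
    # fall through to the default save, keeping the change narrow.
    if ds_config is not None and is_zero3(ds_config):
        return "zero3_gather"
    return "default"
```

With this shape, ZeRO stage 1 and 2 runs keep their existing save behaviour and only stage 3 is routed to the new code.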
Hi @winglian, sorry for the delay. This is fixed and tested.
I am using
After training
We can load the model and use it normally.
* Update train.py
* add zero3 check
* chore: lint
---------
Co-authored-by: Wing Lian <wing.lian@gmail.com>
@RicardoDominguez @winglian @tokestermw
Config
accelerate-config.yaml
config.yaml
deepspeed/zero3.json
Final Model
ls -lh model-finetuned
ls -lh model-finetuned/checkpoint-4130/
Related Issue
#705
Fix
Follow accelerate's recommendation for saving the model, which relies on the DeepSpeed option stage3_gather_16bit_weights_on_model_save.
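As a minimal sketch of what that recommendation looks like in code, assuming an accelerate `Accelerator`-like object and a Hugging Face model with `save_pretrained`: the function name `save_gathered_model` and the constant `ZERO3_CONFIG_FRAGMENT` are hypothetical, and the exact code merged in this PR may differ. The `accelerator` and `model` arguments are passed in, so nothing here imports accelerate directly.

```python
# Fragment of a DeepSpeed config, written as a Python dict for illustration.
# With this flag set, the sharded ZeRO-3 weights are gathered into a full
# 16-bit state dict at save time.
ZERO3_CONFIG_FRAGMENT = {
    "zero_optimization": {
        "stage": 3,
        "stage3_gather_16bit_weights_on_model_save": True,
    }
}


def save_gathered_model(accelerator, model, output_dir):
    """Save an unpartitioned copy of a ZeRO-3 model (sketch, see assumptions).

    `accelerator` is assumed to expose wait_for_everyone, unwrap_model,
    get_state_dict, save and is_main_process, as accelerate's Accelerator does.
    """
    # Make sure every rank has finished training before gathering weights.
    accelerator.wait_for_everyone()
    unwrapped = accelerator.unwrap_model(model)
    unwrapped.save_pretrained(
        output_dir,
        # Only the main process writes files, avoiding corrupted output.
        is_main_process=accelerator.is_main_process,
        save_function=accelerator.save,
        # get_state_dict is called on all ranks and gathers the shards.
        state_dict=accelerator.get_state_dict(model),
    )
    return unwrapped
```

Inside a Trainer-based script this would presumably be called with `trainer.accelerator` and the wrapped model, guarded by the ZeRO-3 check discussed in the review.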
Test
Config file
Run