FSDP/Accelerate: Training can't be continued from checkpoint with SHARDED_STATE_DICT #26186
Comments
cc @pacman100

I believe this will be fixed by #26180; will review.

Many thanks, very timely, and it does indeed solve the issue! I commented on the PR with a follow-up issue, but will close this one since the specific problem is solved by the PR.

I am facing this exact issue. What is the script that will consolidate the FSDP model shards into a single file? I have the checkpoint but no way to save the model.

Try out #26180 (there @pacman100 also linked to the torch methods for loading sharded state dicts directly). Unfortunately, as it currently stands, you can start training, create checkpoints, finish training, and save the model, but still run OOM when trying to continue from a checkpoint. So if you max out VRAM during your training runs, checkpoints are currently useless with SHARDED_STATE_DICT :/
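For anyone landing here, a minimal sketch of loading a sharded checkpoint directly with `torch.distributed.checkpoint` (the checkpoint directory name and the FSDP-wrapped `model` are assumptions, not from this thread):

```python
# Minimal sketch: each rank loads only the shards it owns from a sharded FSDP
# checkpoint directory, so nothing is materialized in full on a single device.
import torch.distributed.checkpoint as dist_cp
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, StateDictType

with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
    state_dict = {"model": model.state_dict()}
    dist_cp.load_state_dict(
        state_dict=state_dict,
        storage_reader=dist_cp.FileSystemReader("ckpt/pytorch_model_fsdp_0"),
    )
    model.load_state_dict(state_dict["model"])
```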
@jphme |
System Info

transformers version: 4.34.0.dev0

Who can help?
cc @pacman100
I can't continue training from checkpoints that were created with `fsdp_state_dict_type: SHARDED_STATE_DICT` via FSDP/Accelerate. The rest of the training works fine, including saving the model after calling `trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")` once training has finished (see the sketch below).
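For reference, the final save that does work looks roughly like this (a minimal sketch; the output path is a placeholder):

```python
# Sketch: switch the FSDP plugin to a full (consolidated) state dict before the
# final save, so a single pytorch_model.bin is written instead of shards.
trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")
trainer.save_model("final_model")  # placeholder output directory
```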
This is the error:
My FSDP config:
Checkpoint contents:
At first I thought this was simply because the Trainer expects a `pytorch_model.bin`, which isn't present in the directory (see `transformers/src/transformers/trainer.py`, line 2085 at commit 2518e36).
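In simplified form (my paraphrase, not the verbatim source), the guard that trips is along these lines:

```python
import os

WEIGHTS_NAME = "pytorch_model.bin"  # the monolithic weights file the Trainer expects

def check_checkpoint(resume_from_checkpoint: str) -> None:
    # A SHARDED_STATE_DICT checkpoint contains shard files instead of
    # pytorch_model.bin, so this check fails even though the checkpoint is valid.
    if not os.path.isfile(os.path.join(resume_from_checkpoint, WEIGHTS_NAME)):
        raise ValueError(f"Can't find a valid checkpoint at {resume_from_checkpoint}")
```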
However, when trying to call `load_fsdp_model(self.accelerator.state.fsdp_plugin, self.accelerator, model, resume_from_checkpoint)` directly in `_load_from_checkpoint`, I get the following error:
Content of `self.accelerator.state.fsdp_plugin`:

Any idea how to fix this? Many thanks!
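For completeness, this is roughly how the direct call is wired up (a sketch; `load_fsdp_model` is imported from `accelerate.utils`, and `trainer` plus the checkpoint path are assumed):

```python
# Sketch of calling Accelerate's FSDP loader directly instead of letting the
# Trainer look for pytorch_model.bin.
from accelerate.utils import load_fsdp_model

load_fsdp_model(
    trainer.accelerator.state.fsdp_plugin,  # carries fsdp_state_dict_type etc.
    trainer.accelerator,
    trainer.model,
    resume_from_checkpoint,  # directory containing the sharded checkpoint
)
```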
Information

Tasks

An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
see above
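A minimal repro sketch, assuming an Accelerate launch with `fsdp_state_dict_type: SHARDED_STATE_DICT` and an existing checkpoint (the path is a placeholder):

```python
# A first run trains and writes sharded checkpoints; a second run that tries to
# resume from one of them fails, while training from scratch works fine.
trainer.train(resume_from_checkpoint="output/checkpoint-500")
```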
Expected behavior
Training can be resumed from checkpoints.