Fix: SAM not working with FSDP/DeepSpeed and LR scheduler. #3259
Conversation
Thanks for the PR! Once the tests pass, you can tag me for review. Also, the PR description is really well done 🙏 -- appreciate the detailed writeup. Fix makes sense to me! Once tests pass I'll approve
@mvpatel2000 Hi! I fixed the failing tests. Now the
@mvpatel2000 yapf formatting workflows are failing, but they are not related to files I changed in this PR. How should I proceed?
I can help out with linting :) will take a look soon! |
LGTM!
* add SAM tests with FSDP and DeepSpeed
* fix SAM for distributed training
* SAMOptimizer always needs ClosureGradScaler
* lint
--------
Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>
Description
I'm tracing this bug back to Composer, coming from LLMFoundry, after we noticed that our models didn't learn anything when running distributed training with FSDP alongside SAM and an LR scheduler. Running the same setup without distributed training (on a single GPU) works fine.
Reproducibility
Just check out the source branch at commit d882d8f4, which contains only the tests (without the fix), and run them.
Problem (FSDP)
- When using an LR scheduler that initially resets `param_groups[0]['lr']` to `0.0` (such as `CosineAnnealingWithWarmupScheduler`), the scheduler varies the learning rate at `SAMOptimizer.param_groups[0]['lr']`, but not the one at `base_optimizer.param_groups[0]['lr']` (which is also reset to `0.0`).
- `base_optimizer.param_groups[0]['lr']` therefore stays at `0.0` for the whole training process, which is why we don't see any learning (the snippet just below shows how the two values diverge).
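To see the symptom directly, a small diagnostic like the following can be run after building the `Trainer`. This is not part of the PR; it assumes the first entry of `trainer.state.optimizers` is the `SAMOptimizer` wrapper and that it exposes the wrapped optimizer as `base_optimizer`, as Composer's SAM implementation does.

```python
# Diagnostic sketch: compare the LR the scheduler drives (on the SAM wrapper)
# with the LR the wrapped optimizer actually uses for the weight update.
sam = trainer.state.optimizers[0]      # the SAMOptimizer wrapper (assumed)
inner = sam.base_optimizer             # the wrapped torch.optim optimizer

print(sam.param_groups[0]['lr'])       # follows the scheduler (warms up from 0.0)
print(inner.param_groups[0]['lr'])     # stuck at 0.0 -- and this LR does the real step
```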
Cause (short answer)
`SAMOptimizer.param_groups[i]` and `base_optimizer.param_groups[i]` are not referencing the same params because of the way Composer handles the sharding.
Cause (long answer)
- On `Event.INIT`, `SAMOptimizer` binds references to the param groups in `base_optimizer`, so up to this point both optimizers reference the same param groups.
- Later (after `Event.INIT`) Composer initializes FSDP (here), which is weird since the FSDP docs warn about initializing the optimizer after doing the sharding, not the other way around.
- That FSDP initialization ends up rebuilding the param groups of `SAMOptimizer`, binding to it the new sharded params.
- So after `Trainer` initialization, `SAMOptimizer` and `base_optimizer` reference different param groups: `SAMOptimizer` the sharded ones, `base_optimizer` the original ones.
- During training, `state.schedulers[0].step()` updates the `'lr'` of the param groups of `SAMOptimizer`, which are not the ones referenced by `base_optimizer`; but the real optimizer step is done by `base_optimizer`, not `SAMOptimizer`. So no learning (the sketch after this list reproduces the break with two plain PyTorch optimizers).
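The aliasing break can be reproduced outside Composer. Below is a minimal sketch using two plain PyTorch optimizers standing in for `base_optimizer` and `SAMOptimizer`; the "sharding" step is just a stand-in for what Composer's FSDP initialization effectively does to the outer optimizer.

```python
import torch

model = torch.nn.Linear(4, 4)
inner = torch.optim.SGD(model.parameters(), lr=0.1)   # plays base_optimizer
outer = torch.optim.SGD(model.parameters(), lr=0.1)   # plays SAMOptimizer

# At Event.INIT the wrapper borrows the inner optimizer's param groups by reference,
# so a warmup scheduler resetting the LR to 0.0 reaches both optimizers:
outer.param_groups = inner.param_groups
outer.param_groups[0]['lr'] = 0.0
assert inner.param_groups[0]['lr'] == 0.0

# FSDP initialization then rebuilds the *outer* optimizer's param groups around
# new (sharded) parameters, silently breaking the alias:
sharded_params = [torch.nn.Parameter(torch.zeros(2, 4))]  # stand-in for shards
outer.param_groups = [{'params': sharded_params, 'lr': 0.0}]

# From now on the scheduler only ever updates the outer (SAM) optimizer...
outer.param_groups[0]['lr'] = 0.05
# ...while the inner optimizer, which performs the real weight update, stays at 0.0:
print(inner.param_groups[0]['lr'])                     # 0.0 -> no learning
```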
Problem (DeepSpeed)
- When using DeepSpeed, Composer initializes a `DeepSpeedEngine` (here).
- `SAMOptimizer` does the optimizer step iff a closure is passed, while `BF16_Optimizer.initialize_optimizer_states()` needs to do an optimizer step without any closure (sketched below).
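Schematically, the clash looks like this; the classes below are simplified stand-ins, not the real Composer or DeepSpeed code.

```python
class SAMLikeOptimizer:
    """Mimics SAMOptimizer.step(): it only acts when a closure is supplied."""

    def __init__(self, base_optimizer):
        self.base_optimizer = base_optimizer

    def step(self, closure=None):
        if closure is None:
            return None                 # no closure -> the step is silently skipped
        loss = closure()                # evaluate loss/grads for the ascent step
        # ... SAM ascent step, second closure evaluation, descent step ...
        self.base_optimizer.step()
        return loss


def initialize_optimizer_states(optimizer):
    """Roughly what DeepSpeed's BF16_Optimizer does during engine setup."""
    optimizer.step()                    # called without a closure -> SAM does nothing
```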
Solution
- Perform the FSDP/DeepSpeed initialization on the `base_optimizer` optimizer instance, and just after all of this wrap the resulting optimizer with a `SAMOptimizer` instance, binding references to the sharded param groups.
- This also works when the resulting `base_optimizer` is a DeepSpeed ZeRO `BF16_Optimizer` instance, since it uses the same underlying parameter groups as the original PyTorch optimizer (see here and here).

This fix is achieved by triggering the `SAMOptimizer` wrapping later in the code (see the sketch below). After this, distributed training works as expected.
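As a rough illustration of the ordering the fix enforces: the `prepare_distributed` helper below is hypothetical (Composer does this inside the `Trainer`), and the `SAMOptimizer` import path and constructor arguments are assumptions, not taken from this PR.

```python
import torch
from composer.algorithms.sam import SAMOptimizer  # import path assumed

model = torch.nn.Linear(4, 4)
base_optimizer = torch.optim.SGD(model.parameters(), lr=0.0)


def prepare_distributed(model, optimizer):
    """Hypothetical stand-in for Composer's FSDP/DeepSpeed initialization,
    which rebuilds the optimizer's param groups around the sharded parameters."""
    return model, optimizer


# 1) Distributed initialization happens on the *base* optimizer first.
model, base_optimizer = prepare_distributed(model, base_optimizer)

# 2) Only then is the SAM wrapper built, so it binds references to the
#    post-sharding param groups; scheduler updates on the wrapper now reach
#    the base optimizer as well.
sam_optimizer = SAMOptimizer(base_optimizer=base_optimizer, rho=0.05)
```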
Before submitting
- Did you run `pre-commit` on your change? (see the `pre-commit` section of prerequisites)