Introduce PartialState as the device handler in the Trainer #22752
Conversation
The documentation is not available anymore as the PR was closed or merged.
Hello, my main question is: why aren't we creating, or using, an already-passed Accelerator object? My vision was that after creating the accelerator object here muellerzr@12dcc13#diff-bfceaff300c851b8e24fc50dc6638482abaec8f7d2a718e877c3828c166bcf79R1489 you can fill in the local rank and all the other distributed setup variables from the accelerator state.
@pacman100 I'm starting small with what is minimally needed from the API. E.g. the AcceleratorState isn't needed until we get into parts such as mixed precision. The device handling can be done separately altogether, as it doesn't need to rely on an accelerator object the user passes in. This may eventually be the
There is a big problem in the changes to the tests with local_rank: local_rank == 0 means the training was launched in a distributed fashion, but possibly with only one process (it could be a setup with multiple nodes and one process per node, for instance). Those tests need to be switched to use PartialState.distributed_type and compare it to NO, not compare local_rank with 0.
src/transformers/trainer.py (Outdated)

@@ -416,8 +416,7 @@ def __init__(
            raise ValueError(
                "Using --sharded_ddp xxx together with --fsdp is not possible, deactivate one of those flags."
            )

-        if args.local_rank == -1:
+        if args.local_rank == 0:
Those checks should be on the distributed mode, i.e. PartialState.distributed_type == DistributedType.NO, as this did not error before if a user was launching a distributed training with one process.
Looking good! Before we merge this, can you run all trainer tests (on single and multi GPU) as well as the DeepSpeed tests to make sure this doesn't break anything?
Force-pushed from aa28856 to 4e30eef
PR is good for final review, tested on single + multi GPU + DeepSpeed and it seemed to work fine. Ignore the
Force-pushed from 9140c7a to 762b809
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Force-pushed from 71a1fe8 to 6bf4774
I don't know if this is the right place to report, but I think this merge is causing an error in Seq2SeqTrainingArguments. Before this merge, I was able to run this code.
Error messages
@rennn2002 Could you try updating your version of accelerate, i.e.
Hello, thank you @zach for adding this 🚀. I left a couple of comments.
if (
    torch.distributed.is_available()
    and torch.distributed.is_initialized()
    and self.distributed_state.distributed_type != DistributedType.NO
Should this be self.distributed_state.distributed_type == DistributedType.NO?
Good catch! Fixing in a follow up. Thanks! :)
    and self.distributed_state.distributed_type != DistributedType.NO
):
    logger.warning(
        "torch.distributed process group is initialized, but parallel_mode == ParallelMode.DISTRIBUTED. "
Should this be parallel_mode != ParallelMode.DISTRIBUTED?
…ingface#22752)
* Use accelerate for device management
* Add accelerate to setup
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
What does this PR do?
This PR is the start of the Trainer integration of Accelerate.

The integration will follow multiple stages to ensure small changes happen iteratively. This first one simply changes the device handler/setter to be the PartialState in Accelerate and nothing more. In a follow-up PR, I will start to include more utilities utilizing it.

Before submitting
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@sgugger @pacman100