Support broadcast_buffers in OssDdp #68
@myleott just to make sure I understand the motivation: the batch norm problem arises when using OSS in conjunction with a model parallel technique, right?
Nope, this affects any model that uses batch norm with data parallel training. In particular, batch norm keeps running stats which should be synchronized across data parallel workers. Here's an interesting discussion about a more flexible version of this (not saying we need it, but we should at least have the on/off version): pytorch/pytorch#30718
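To make the problem concrete, here is a minimal sketch of the divergence being described: two identical BatchNorm layers stand in for two data parallel workers that see differently distributed shards of the data. The trainable parameters stay identical (nothing has updated them), but the running stats drift apart because nothing syncs the buffers.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two copies of the same BN layer, standing in for two data parallel workers.
bn_rank0 = nn.BatchNorm1d(4)
bn_rank1 = nn.BatchNorm1d(4)

# Each "worker" sees a differently distributed shard of the data.
bn_rank0(torch.randn(32, 4))
bn_rank1(torch.randn(32, 4) + 5.0)

# The trainable params are still identical (no optimizer step ran)...
params_identical = torch.equal(bn_rank0.weight, bn_rank1.weight)

# ...but the running stats have drifted apart: nothing synced the buffers.
stats_diverged = not torch.allclose(bn_rank0.running_mean, bn_rank1.running_mean)
```

At eval time each worker would then normalize with different statistics, which is exactly what broadcasting the buffers prevents.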
Finally got the time to get back to this @myleott, sorry for being slow. There's still something I don't get: we already forcefully sync the model between the ranks after each step, by virtue of each rank being responsible for a shard's worth of the update (so it needs to be synced to the other ranks; for now OSS only shards the optimizer state, and the full model state lives on each rank). It feels like that already covers this broadcast_buffers need, but I must be missing something.
I think the module buffers are a separate list of tensors, distinct from the module params, and they are not updated by the optimizer. Check the source of the original DDP code linked by Myle. The buffers are part of the module's state but NOT in the params list, so the optimizer never touches them. They are part of the checkpoint and are updated by the layers themselves (like BN, but without backprop).
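The params/buffers split described above can be inspected directly on a BatchNorm layer: the optimizer only ever sees `named_parameters()`, while the running stats live in `named_buffers()`, update during `forward()` with no backward pass, and still land in the checkpoint via `state_dict()`.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)

param_names = {name for name, _ in bn.named_parameters()}
buffer_names = {name for name, _ in bn.named_buffers()}
# param_names:  {'weight', 'bias'}              -> updated by the optimizer
# buffer_names: {'running_mean', 'running_var',
#                'num_batches_tracked'}         -> updated by the layer itself

# Buffers update during forward(), with no optimizer and no backward pass:
before = bn.running_mean.clone()
bn(torch.randn(16, 4))
updated_without_optimizer = not torch.equal(before, bn.running_mean)

# Both params and buffers end up in the checkpoint:
state_keys = set(bn.state_dict().keys())
```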
Ah thanks, yes, that makes a lot of sense; I read too fast and was thinking about the model params. OK, this is indeed not synced.
🚀 Feature

We should add support for the `broadcast_buffers` flag to OssDdp.

Motivation

Distributed training with BatchNorm requires it. We removed it from the fairseq implementation because it slows things down a bit, but for the generalized implementation here we should add it back (as a configurable option).
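For reference, the core of what the flag enables is small: before the forward pass, push rank 0's buffers to every rank. The sketch below is a hypothetical standalone helper (the name `broadcast_buffers` and the single-process group setup are illustrative, not fairscale's actual implementation), using a `world_size=1` gloo group only so it runs end to end.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn

# A single-process "gloo" group, just so this sketch runs end to end;
# in real training every rank would join the same group.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.BatchNorm1d(4)

def broadcast_buffers(module: nn.Module, src: int = 0) -> None:
    # The gist of DDP's broadcast_buffers option: before each forward,
    # replicate rank `src`'s buffers (e.g. BN running stats) to all ranks.
    for buf in module.buffers():
        dist.broadcast(buf, src=src)

broadcast_buffers(model)
dist.destroy_process_group()
```

Making this a configurable on/off option matters because the extra broadcast costs a round of communication per step, which is why it was dropped from the fairseq version.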
Additional context

See the documentation for `broadcast_buffers` in the main DDP module: https://pytorch.org/docs/master/generated/torch.nn.parallel.DistributedDataParallel.html