
[fix][ShardedDDP] Properly handle .eval() mode #587

Merged
Merged 3 commits into master from shardedddp_handle_training_switch on Apr 7, 2021

Conversation

@blefaudeux (Contributor) commented on Apr 7, 2021

Before submitting

  • Was this discussed/approved via a GitHub issue? (not needed for typos or doc improvements)
  • Did you read the contributor guideline?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Fixes eval mode not being properly handled by ShardedDDP: at best it kept all the grad hooks in place, at worst it could crash on a trainability change that should have been ignored. Adds a unit test to catch that (a rough sketch of the scenario follows below).

Closes #586
cc @SeanNaren @ananthsub
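
For context, here is a rough sketch of the scenario being fixed, not the actual unit test; the single-rank gloo group and the toy model are illustrative assumptions:

```python
# Rough sketch of the scenario this PR fixes (not the actual unit test).
# The single-rank gloo group and toy model are illustrative assumptions.
import os
import torch
import torch.distributed as dist
from fairscale.nn.data_parallel import ShardedDataParallel as ShardedDDP
from fairscale.optim import OSS

os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

model = torch.nn.Linear(8, 8)
optimizer = OSS(model.parameters(), optim=torch.optim.SGD, lr=0.1)
ddp = ShardedDDP(model, optimizer)

# Regular training step: grad hooks are attached and reductions run as usual.
ddp.train()
ddp(torch.randn(4, 8)).sum().backward()
optimizer.step()

# Switch to eval: before this fix the grad hooks stayed in place, and a
# requires_grad change while in eval() could crash instead of being ignored.
ddp.eval()
for p in model.parameters():
    p.requires_grad = False
with torch.no_grad():
    _ = ddp(torch.randn(4, 8))
```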

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Apr 7, 2021
@@ -490,6 +486,9 @@ def _setup_backward_hooks(self) -> None:
# Go through the parameters, attach the hook
self._grad_accs = []
self._manual_reduce = []
if not self.training:
Contributor:

Do we need to remove the existing hooks in eval mode? Just curious; otherwise we could move this to the top of the function.

Contributor Author (blefaudeux):

I thought it was better for correctness: if there's a .backward() call left somewhere, it still respects the eval() setting? The documentation is not super clear, to me at least: https://pytorch.org/docs/stable/generated/torch.nn.Module.html?highlight=eval#torch.nn.Module.train
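
For reference, the guard under discussion plausibly amounts to an early return in `_setup_backward_hooks`; this is a paraphrase of the truncated diff above, and the early return itself is an assumption rather than the verbatim merged code:

```python
def _setup_backward_hooks(self) -> None:
    # Go through the parameters, attach the hook
    self._grad_accs = []
    self._manual_reduce = []

    # In eval() mode, skip re-attaching the grad hooks altogether, so that a
    # stray .backward() call does not trigger any gradient reduction.
    if not self.training:
        return
    ...
```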

@@ -624,3 +623,19 @@ def _flush_reduce_calls(self) -> None:
bucket.sent = True

self._consume_work_handles()

def _detect_train_change(self) -> bool:
Contributor:

Nice!

trainability_changed = trainable_mask != self._reference_trainable_mask

# - the whole model is not trainable but we still have grad hooks
trainability_changed |= not self.training and len(self._grad_hooks) > 0
Contributor:

Does this mean that grad_hooks should be greater than 0 in eval mode? Not sure I understand why this should be the case.

Contributor Author (blefaudeux):

It was meant to detect that the trainability changed, i.e. we're in eval() mode but there are grad hooks in place, so we should refresh. It's tied to the question above; I'm not sure of the reference behavior here.

Contributor:

From my offline conversation with @blefaudeux to understand this better:

  • We can't detect when a module switches from train->eval unless we use the presence of hooks as an indicator.
  • We refresh the trainability state 1) at the beginning, 2) when params change their requires_grad property, 3) on a train<->eval switch (sketched below).

Thanks for the explanation @blefaudeux !
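
To summarize, the detection logic described above looks roughly like this; the `_all_params` attribute is an assumed name and the body is a paraphrase of the diff, not the exact merged code:

```python
def _detect_train_change(self) -> bool:
    # Which parameters currently require a gradient (assumed attribute name)
    trainable_mask = [p.requires_grad for p in self._all_params]

    # - one or more parameters flipped their requires_grad flag
    trainability_changed = trainable_mask != self._reference_trainable_mask

    # - the whole model is not trainable but we still have grad hooks,
    #   i.e. a train() -> eval() switch happened since the last refresh
    trainability_changed |= not self.training and len(self._grad_hooks) > 0

    return trainability_changed
```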

@SeanNaren

Fixes the issue upstream in Lightning, thanks so much for the quick fix @blefaudeux :)

@blefaudeux merged commit ce1f2ce into master on Apr 7, 2021
@blefaudeux deleted the shardedddp_handle_training_switch branch on April 7, 2021 at 22:12
Labels
CLA Signed
Development

Successfully merging this pull request may close these issues.

[ShardedDDP] Handle transition to eval + parameter change
4 participants