
[fix] FSDP: disable single rank process group for auto_wrap_bn and fixed mixed precision regnet test #556

Merged
17 commits merged into master from min/regnet on Mar 31, 2021

Conversation

@min-xu-ai (Contributor) commented Mar 29, 2021

  • Beefed up the unit test with a regnet-like model.
  • Found that the single-rank process group is causing problems.
  • Disabled it to enable convergence tests on the vissl side.
  • Used `raise e from None` to get better assertion output in testing.py.


```diff
@@ -1543,10 +1546,15 @@ def wrap_bn_only_policy(module: nn.Module, recurse: bool, unwrapped_params: int)
         else:
             return is_bn and not isinstance(module, tuple(default_auto_wrap_policy.EXCLUDE_WRAP_MODULES))  # type: ignore

     my_rank = dist.get_rank()
     pg = None
     if single_rank_pg:
```
min-xu-ai (Contributor Author):

This is the main change. For now, I kept the single-rank process group behind an option, since it may be needed later for Apex BN or for performance. The option seems to be buggy in FSDP and needs more debugging. Enabling it also affects state_dict saving/loading as well as the optimizer state.
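
For readers following along, here is a minimal, illustrative sketch (not the FairScale code) of how per-rank single-rank groups can be created with `torch.distributed`; the helper name is made up for this example. The key detail is that `dist.new_group` is a collective call, so every rank must create every group:

```python
import torch.distributed as dist


def make_single_rank_group():
    """Create one single-rank process group per rank; return this rank's group.

    dist.new_group() must be called by all ranks for every group, even for
    groups a rank does not belong to, so we loop over the whole world.
    """
    my_rank = dist.get_rank()
    my_group = None
    for rank in range(dist.get_world_size()):
        group = dist.new_group(ranks=[rank])
        if rank == my_rank:
            my_group = group
    return my_group
```

Wrapping a BN module in FSDP with such a group effectively keeps its parameters unsharded on the owning rank, which is the behavior this PR turns off by default.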

```diff
@@ -445,7 +445,7 @@ def objects_are_equal(a: Any, b: Any, raise_exception: bool = False, dict_key: O
             # Add dict key to the assertion error.
             msg = e.args[0]
             new_msg = f"For dict key '{dict_key}': {msg}"
-            raise AssertionError(new_msg)
+            raise AssertionError(new_msg) from None
```
min-xu-ai (Contributor Author):

This makes the assertion output cleaner: it no longer prints the first exception followed by a second traceback saying the new exception was raised while handling the first.
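
A standalone illustration of the difference (not the FairScale code itself):

```python
d = {"foo": 1}

try:
    assert d["foo"] == 2, "values are not equal"
except AssertionError as e:
    # Without "from None", Python prints the original assertion's traceback,
    # then "During handling of the above exception, another exception
    # occurred:", and finally this one.  "from None" suppresses that chaining,
    # so only the enriched message is shown.
    raise AssertionError(f"For dict key 'foo': {e.args[0]}") from None
```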

min-xu-ai marked this pull request as draft March 29, 2021 02:47
@min-xu-ai (Contributor Author):

Wow, this is passing locally on my machine but failing on CI. Converting to draft for more debugging.

@prigoyal:

This is such awesome work, @min-xu-ai.

In case it is helpful, just adding a little more context on how we use sync BN: in SyncBatchNorm (Apex or PyTorch), the process groups are indeed very important. People frequently change process groups (for example, for 128 GPUs, we can use process groups of size 64, which leads to 2 groups, or process groups of size 8, leading to 16 groups, etc.).

@min-xu-ai (Contributor Author):

> This is such awesome work, @min-xu-ai.
>
> In case it is helpful, just adding a little more context on how we use sync BN: in SyncBatchNorm (Apex or PyTorch), the process groups are indeed very important. People frequently change process groups (for example, for 128 GPUs, we can use process groups of size 64, which leads to 2 groups, or process groups of size 8, leading to 16 groups, etc.).

Thanks, @prigoyal. This is helpful info. I think we want to support Apex sync BN as well, but it currently has more issues than PyTorch BN.

The process group here is actually NOT the PyTorch sync BN group; it is the FSDP sharding group. I used to put each rank's BN in its own single-rank group, which avoids sharding them, but that seems to have bugs, so I am undoing it here.
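
To make the distinction concrete, here is a hedged sketch (model, group sizes, and names are illustrative): the group handed to SyncBatchNorm decides which ranks average BN statistics, while the group handed to FSDP decides which ranks shard parameters; the two are independent.

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group(...) has already been called on every rank.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.BatchNorm2d(8))

# Group over which BN statistics are averaged (e.g. the first 8 ranks).
# new_group() is collective: every rank must make this same call.
bn_sync_group = dist.new_group(ranks=list(range(8)))
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model, process_group=bn_sync_group)

# The FSDP sharding group is a separate concept; by default it is the whole
# world, i.e. parameters are sharded across all ranks.
sharding_group = dist.group.WORLD
```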

- need AMP context in FSDP
- work around a difference between DDP & FSDP when bias=True
- fixed a bug in input data generation that caused different ranks to have
  the same data with a wrong iteration count
- added a TODO about needing a better loss and grad_scaler, and reduced
  iters so there is no NaN
- added (disabled) debugging code
min-xu-ai changed the title from "[fix] disable single rank process group for auto_wrap_bn" to "[fix] FSDP: disable single rank process group for auto_wrap_bn and fixed mixed precision regnet test" Mar 30, 2021
min-xu-ai marked this pull request as ready for review March 30, 2021 06:57
@min-xu-ai (Contributor Author):

I think the test is going to pass, and this change should be ready to review after that. High-level updates (a config sketch follows the list):

  • Only on CUDA 11 and up can I make mixed precision DDP and FSDP produce the same results.
  • Even there, the Linear layer can't have bias=True (could be an FSDP bug to be fixed).
  • Have to use the PyTorch AMP context for both DDP and FSDP.
  • Have to use FP32 grad reduction.
  • Have to use a single-rank group for BNs in both full and mixed precision modes (this could be an FSDP bug to be fixed).
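
The sketch below shows roughly what that setup looks like, assuming FairScale FSDP's `mixed_precision`, `fp32_reduce_scatter`, and `flatten_parameters` flags and the import path as I recall them; the model and shapes are made up for illustration, and the real test code is in the diffs further down.

```python
import torch
import torch.nn as nn
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

# Assumes torch.distributed has already been initialized on every rank.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, bias=False),
    nn.BatchNorm2d(8),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 30 * 30, 10, bias=False),  # bias=False per the workaround above
).cuda()

fsdp_model = FSDP(
    model,
    mixed_precision=True,       # FP16 compute, like DDP + AMP
    fp32_reduce_scatter=True,   # reduce gradients in FP32
    flatten_parameters=True,
)

x = torch.randn(2, 3, 32, 32, device="cuda")
with torch.cuda.amp.autocast(enabled=True):  # same PyTorch AMP context as DDP
    out = fsdp_model(x)
```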

```python
_smi_ver = None


def torch_cuda_version(compiled: bool = False) -> Tuple[int, ...]:
```
min-xu-ai (Contributor Author):

Added a new function to get either the CUDA version torch was compiled with or the CUDA version reported by nvidia-smi.
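
Not the exact helper added in this PR, but a sketch of the idea, assuming `torch.version.cuda` for the compiled version and parsing `nvidia-smi` output for the driver-reported one (the function name here is hypothetical):

```python
import re
import subprocess
from typing import Tuple

import torch


def cuda_version_sketch(compiled: bool = False) -> Tuple[int, ...]:
    """Return the CUDA version as a tuple, e.g. (11, 0)."""
    if compiled:
        # Version torch was built against; may be None for CPU-only builds.
        return tuple(int(x) for x in torch.version.cuda.split("."))
    # Version reported by the driver via nvidia-smi's banner line.
    out = subprocess.check_output(["nvidia-smi"]).decode()
    match = re.search(r"CUDA Version:\s*([\d.]+)", out)
    assert match is not None, "could not parse nvidia-smi output"
    return tuple(int(x) for x in match.group(1).split("."))
```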

```diff
@@ -33,4 +33,8 @@ tests/nn/pipe/test_deferred_batch_norm.py
 tests/nn/pipe/test_dependency.py
 tests/nn/pipe/test_stream.py
 tests/experimental/nn/test_multiprocess_pipe.py
 tests/nn/moe/test_moe_layer.py
```
min-xu-ai (Contributor Author):

Moved a few tests around to balance the CI runtime.


```python
model = Model()
model.load_state_dict(state_before)
model = model.cuda()

class DummyScaler:
```
min-xu-ai (Contributor Author):

@anj-s I used this DummyScaler when scaling is not needed in full-precision mode. It could be a useful thing to do in general to unify the full- and mixed-precision training loops.
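
For reference, a minimal sketch of what such a no-op scaler can look like, mirroring the pieces of `torch.cuda.amp.GradScaler` that a training loop typically calls (the class in the test may differ):

```python
class DummyScaler:
    """No-op stand-in for torch.cuda.amp.GradScaler in full-precision runs,
    so the same training loop works with and without mixed precision."""

    def scale(self, loss):
        return loss  # nothing to scale in FP32

    def unscale_(self, optimizer):
        pass  # gradients are never scaled, so nothing to undo

    def step(self, optimizer):
        optimizer.step()

    def update(self):
        pass  # no scale factor to adjust
```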

```diff
@@ -215,7 +334,10 @@ def test1(temp_files, ddp_ref, precision, flatten):
     fsdp_config["mixed_precision"] = precision == "mixed"
     fsdp_config["flatten_parameters"] = flatten == "flatten"

     world_size = 2
     if fsdp_config["mixed_precision"] and torch_cuda_version() < (11, 0):
```
min-xu-ai (Contributor Author):

@msbaines It is great that CI has a CUDA 11 VM. Otherwise, this test would have no place to run. It seems that we should gradually move everything to CUDA 11.

@prigoyal:

Really awesome work @min-xu-ai and amazing insights!! :)

  • Only on CUDA 11 and up can I make mixed precision DDP and FSDP produce the same results.

I am curious why this is the case. Is there a particular layer that leads to a non-match when we use a different CUDA version? Note that if we mandate CUDA 11, we need to confirm that the machines in FBLearner will have CUDA 11, since we plan to train the model there. :)

  • Even there, the Linear layer can't have bias=True (could be an FSDP bug to be fixed).

This is indeed interesting. In general, not having a bias can introduce regularization issues and exploding training. I would encourage that we debug this.

  • Have to use the PyTorch AMP context for both DDP and FSDP.

Just double checking: we use the context for the autocasting only? I think this is awesome! PyTorch AMP is also mandated by SDP, so we can make a consistent switch.

  • Have to use FP32 grad reduction.

Interesting: it seems like this can slow things down. Curious question: why do we require FP32 grad reduction? Is this something we require for all layers, or simply because of BN (BN params should all be kept in FP32)?

  • Have to use a single-rank group for BNs in both full and mixed precision modes (this could be an FSDP bug to be fixed).

Interesting. This is essentially plain BN and not sync BN if we have to use single-rank BN. Definitely sounds like a bug somewhere.

Overall, it seems like BN is introducing a lot of weirdness.

```python
self.trunk.add_module("stem", stem)
self.trunk.add_module("any_stage_block1", Sequential(any_stage_block1_0))

# head
```
Contributor:

nit: maybe remove this comment since the variable is named head?

min-xu-ai (Contributor Author):

good point. will do.

```python
# head
self.head = Sequential(
    # TODO (Min): FSDP-mixed_precision doesn't compute the same ways as DDP AMP when bias=True.
    # so, we use bias=False for now in the projection_head.
```
Contributor:

does the second line need a tab?

min-xu-ai (Contributor Author):

Yeah, it is not a tab; I was aligning it, but I should use a 4-space indent. Will fix.

```python
self.head = Sequential(
    # TODO (Min): FSDP-mixed_precision doesn't compute the same ways as DDP AMP when bias=True.
    # so, we use bias=False for now in the projection_head.
    # The Conv2d layers above does not used bias in regnet, but even if they use
```
Contributor:

s/used/use

min-xu-ai (Contributor Author):

good catch!


```python
# head
self.head = Sequential(
    # TODO (Min): FSDP-mixed_precision doesn't compute the same ways as DDP AMP when bias=True.
```
Contributor:

Thank you for the context with the comments! very helpful to grok :)

```python
scaler = ShardedGradScaler()
# Print the model for verification.
if rank == 0:
    print(model)
```
Contributor:

Is this useful in tests?

min-xu-ai (Contributor Author):

Yes, it is useful for verifying how FSDP wrapped each module. The output is a bit long, but it is normally not shown unless there is a failure.


```python
for in_data in inputs[rank]:
    in_data = in_data.cuda()
    context = contextlib.suppress()
    if ddp and ddp_mixed_precision:
        in_data = in_data.half()
        context = torch.cuda.amp.autocast(enabled=True)
    if not ddp and fsdp_config["mixed_precision"]:
        context = torch.cuda.amp.autocast(enabled=True)
```
Contributor:

just curious, when will this be true?

min-xu-ai (Contributor Author):

This is when we use FSDP in mixed-precision mode, so we need to turn on PyTorch AMP.
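
For completeness, a sketch of how a context chosen this way is typically consumed later in the loop (the exact test body may differ; `scaler` here is either a real GradScaler or the DummyScaler discussed above, and the loss is a placeholder):

```python
with context:  # contextlib.suppress() is a no-op; autocast enables AMP
    out = model(in_data)
    loss = out.float().sum()  # placeholder loss for illustration

scaler.scale(loss).backward()
scaler.step(optim)
scaler.update()
optim.zero_grad()
```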

```diff
@@ -190,6 +310,15 @@ def _test_func(
     # Move tensors to CPU to compare numerics.
     for k, v in fsdp_state.items():
         fsdp_state[k] = v.cpu()
     # Enable for debugging.
```
Contributor:

nit: Should we remove this for long-running tests? I guess you are still debugging the issue, so it makes sense to have it. My only concern is that it adds to the execution time.

min-xu-ai (Contributor Author):

It is disabled with `if False`. I will clarify it.

min-xu-ai merged commit a0458b9 into master Mar 31, 2021
min-xu-ai deleted the min/regnet branch March 31, 2021 20:05