[fix] Decouple move_params_to_cpu from the mixed_precision. #822

Merged · 29 commits · Oct 27, 2021

Conversation

@anj-s anj-s (Contributor) commented Oct 21, 2021

What does this PR do?

This PR decouples mixed_precision from move_params_to_cpu. We can now support full FP16 or full FP32 workloads while offloading params and grads to CPU.

The main cut points that have been modified are where we create the fp16 shard, where we move params from the fp32 shard to the device-resident fp16 shard, and where we discard the fp16 shard.

One issue with the code is that the shards are named fp16 and fp32 rather than something more general such as compute and storage, so _fp16_shard may not actually be fp16. This is confusing to read; I will rename the shards in an upcoming PR.
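
For context (not part of the PR itself), here is a minimal usage sketch of the configuration this change enables: CPU offload with full-precision compute. It assumes fairscale's FullyShardedDataParallel constructor arguments move_params_to_cpu, move_grads_to_cpu, and mixed_precision, and that a torch.distributed process group has already been initialized.

import torch
from fairscale.nn import FullyShardedDataParallel as FSDP

# Assumes torch.distributed.init_process_group(...) has been called and a
# CUDA device is available.
model = torch.nn.Linear(1024, 1024).cuda()   # plain FP32 module

fsdp_model = FSDP(
    model,
    mixed_precision=False,     # keep FP32 compute, no FP16 casting
    move_params_to_cpu=True,   # params live on CPU between uses
    move_grads_to_cpu=True,    # grads are offloaded to CPU as well
)

x = torch.randn(8, 1024, device="cuda")
fsdp_model(x).sum().backward()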

Before submitting

  • Did you have fun?
    • Make sure you had fun coding 🙃
  • Did you read the contributor guideline?
  • Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
    • N/A
  • Did you make sure to update the docs?
    • N/A
  • Did you write any new necessary tests?
    • N/A
  • Did you update the changelog? (if needed)
    • N/A

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

@facebook-github-bot facebook-github-bot added the CLA Signed label Oct 21, 2021
@anj-s anj-s marked this pull request as draft October 21, 2021 12:23
@anj-s anj-s requested a review from min-xu-ai October 23, 2021 15:40
@anj-s anj-s marked this pull request as ready for review October 23, 2021 15:40
@anj-s anj-s changed the title [fix] Decouple CPU offload from the mixed precision parameter. [fix] Decouple move_params_to_cpu from the mixed_precision. Oct 23, 2021
@min-xu-ai min-xu-ai (Contributor) left a comment

Looks nice! @zhaojuanmao, do you want to take a look too?

Comment on lines +1638 to +1641
if (self.mixed_precision or self.move_params_to_cpu) and not force_full_precision:
    self._free_fp16_param_shard([p])

if self.move_params_to_cpu and (self.params[0].dtype == self.compute_dtype):
Contributor

The two if-conditions are guarding the same code? Why have two of them? Having line 1639 and line 1642 duplicated might not be good?

Contributor Author

It was more for the sake of readability that I split this up.

@@ -180,8 +180,7 @@ class FullyShardedDataParallel(nn.Module):
         if ``True``, flatten parameters into a single contiguous tensor,
         which improves training speed.
     move_params_to_cpu (bool, Optional):
-        if ``True``, offload FP32 params to CPU. This is only relevant when
-        *``mixed_precision``* is ``True``.
+        if ``True``, offload params to CPU.
Contributor

Is there a requirement here that params need to be FP32 and not FP16? If so, perhaps mention that here?

Contributor Author

There is no requirement for now. FP32 or FP16 params can be offloaded to CPU.
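
A short sketch of what the reply describes, under the same assumptions as above (fairscale FSDP constructor arguments, initialized process group); compute_dtype is an assumed argument used here only to keep the compute dtype aligned with the FP16 params.

import torch
from fairscale.nn import FullyShardedDataParallel as FSDP

# FP32 params offloaded to CPU.
fp32_model = FSDP(torch.nn.Linear(512, 512).cuda(),
                  move_params_to_cpu=True, mixed_precision=False)

# Full-FP16 workload: the module is cast to half before wrapping; its FP16
# params are offloaded to CPU in the same way.
fp16_model = FSDP(torch.nn.Linear(512, 512).half().cuda(),
                  move_params_to_cpu=True, mixed_precision=False,
                  compute_dtype=torch.float16)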

p._fp32_shard = p._fp32_shard.pin_memory()
p.data = p._fp32_shard

if self.move_params_to_cpu or self.mixed_precision:
Contributor

p._fp16_shard is only needed when self.mixed_precision=True, right?

Contributor Author

The shard is needed any time you offload params to CPU. The shard is named fp16, which is what causes the confusion; I am renaming it in a follow-up PR.
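
To make this concrete, here is a schematic sketch (not the fairscale implementation) of the pattern behind the misleadingly named _fp16_shard when offloading: the CPU-resident shard is copied into a device-side buffer in the compute dtype before use, and that buffer is freed afterwards. The helper names below are illustrative.

import torch

def bring_shard_to_device(cpu_shard: torch.Tensor, compute_dtype: torch.dtype) -> torch.Tensor:
    # Copy the CPU shard into a device buffer. With mixed precision this is an
    # FP32 -> FP16 cast; with pure FP16 or FP32 offload it is a same-dtype
    # host-to-device copy, but a device buffer is needed either way.
    # non_blocking=True is effective because the CPU shard is pinned (see the
    # pin_memory() call in the diff above).
    device_shard = torch.empty_like(cpu_shard, dtype=compute_dtype, device="cuda")
    device_shard.copy_(cpu_shard, non_blocking=True)
    return device_shard

def free_device_shard(device_shard: torch.Tensor) -> None:
    # Release device memory between passes by shrinking the underlying storage
    # to zero elements; the tensor object itself stays alive for reuse.
    device_shard.storage().resize_(0)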

@anj-s anj-s merged commit ed7ca76 into main Oct 27, 2021
@anj-s anj-s deleted the fp32-offload branch October 27, 2021 21:31
vtantia pushed a commit that referenced this pull request Oct 29, 2021
* remove offload dependency on fp16

* update python version for cpu tests

* run CPU tests with updated PyTorch version

* split changes

* revert tests config

* fix lint errors

* update nightly and test PyTorch versions

* skip failing multiprocess pipe test

* always skip test

* always skip test

* always skip test

* lint error

* skip unsupported versions

* improve skip message

* lint errors

* modify docs

* add tests

* fix test failures

* modify comments

* fix lint errors

* fix lint errors
Labels: CLA Signed, FSDP + SSD offload