
[FSDP] add no_broadcast_optim_state option #560

Merged: 10 commits merged into master from fsdp-optim-uncollectable on Apr 4, 2021

Conversation

sshleifer (Contributor) commented on Mar 31, 2021:

Problem

  • Some parameters are large and their optimizer state cannot be gathered, for example the experts in a mixture of experts (MoE).
  • We need a way to let callers control which keys in the optimizer state dict (OSD) are collected, to avoid OOM.
    • rebuild_full_params already implements this kind of filtering, but its approach is recursive.
  • Callers can now write custom logic to save the OSD state of certain submodules without gathering it.

Approach

  • We set m.no_broadcast_optim_state=True if a child instance m has world_size=1 and its own process_group.
  • We remove no_broadcast_optim_state parameters from each rank's OSD before collection (see the sketch after this list).
  • We add back rank 0's expert parameters to the OSD after collection.
  • This lets us produce an OSD with the correct number of entries and the correct shapes for expert parameters, just like state_dict.
  • We let callers overwrite the state for the expert on rank i with the correct contents.
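
A minimal, self-contained sketch (mine, not the PR's actual implementation) of that remove-then-restore flow; the helper names, the no_broadcast_ids set, and the toy state dicts are hypothetical stand-ins for FSDP's real bookkeeping:

import copy
from typing import Dict, Set

def strip_uncollectable(osd: Dict, no_broadcast_ids: Set[int]) -> Dict:
    # Drop entries whose optimizer state must stay local (e.g. MoE experts).
    kept = {pid: bufs for pid, bufs in osd["state"].items() if pid not in no_broadcast_ids}
    return {"state": kept, "param_groups": copy.deepcopy(osd["param_groups"])}

def restore_local_entries(gathered: Dict, local_osd: Dict, no_broadcast_ids: Set[int]) -> Dict:
    # Re-insert this rank's own (never-gathered) expert state into the collected OSD.
    out = copy.deepcopy(gathered)
    for pid in no_broadcast_ids:
        out["state"][pid] = local_osd["state"][pid]
    return out

# Toy usage: parameter 1 plays the role of an expert whose state stays local.
local_osd = {"state": {0: {"step": 10}, 1: {"step": 10}}, "param_groups": [{"lr": 1e-3}]}
collected = strip_uncollectable(local_osd, {1})  # only this part gets gathered
full_osd = restore_local_entries(collected, local_osd, {1})
assert full_osd["state"].keys() == local_osd["state"].keys()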

Determination in _set_is_root

# Set during _set_is_root: a child with a single-rank process group of its own keeps its optimizer state local.
m.no_broadcast_optim_state = m.no_broadcast_optim_state or (
    (m.world_size == 1) and (m.world_size < self.world_size) and (m.process_group != self.process_group)
)

The tests show that we can recognize MoE experts this way and avoid OOM without any change to callers.
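
For illustration, a hedged sketch of the wrapping pattern this check is meant to catch; it assumes torch.distributed is already initialized (e.g. via torchrun), and the layer sizes and module names are placeholders, not anything from the PR:

import torch.distributed as dist
from torch import nn
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

def wrap_moe_block(d_model: int = 16) -> FSDP:
    rank, world_size = dist.get_rank(), dist.get_world_size()
    # new_group must be called by every rank for every group, even groups the
    # rank does not belong to, so build all single-rank groups and pick ours.
    single_rank_groups = [dist.new_group(ranks=[r]) for r in range(world_size)]
    # The expert is private to this rank: world_size == 1 and a process group
    # different from the root's, which is exactly what the check above keys on.
    expert = FSDP(nn.Linear(d_model, d_model), process_group=single_rank_groups[rank])
    shared = nn.Linear(d_model, d_model)
    return FSDP(nn.Sequential(shared, expert))  # root wrap uses the default group

With wrapping like this, the nested expert instance ends up with world_size == 1 and a non-default process group, so no_broadcast_optim_state is inferred without the caller passing anything explicitly.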

TODO

  • Optimization: Do not copy osd['param_groups'] from each worker.

facebook-github-bot added the "CLA Signed" label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Mar 31, 2021.
sshleifer changed the title from "[FSDP] no_broadcast_optim_state option to avoid extra communication" to "[FSDP/optim state] pass no_broadcast_optim_state=True option to avoid OOM" on Mar 31, 2021.
sshleifer changed the title from "[FSDP/optim state] pass no_broadcast_optim_state=True option to avoid OOM" to "[FSDP] add no_broadcast_optim_state option" on Mar 31, 2021.
sshleifer added the "FSDP" label (FullyShardedDataParallel, zero-3) on Mar 31, 2021.
min-xu-ai (Contributor) commented:

@msbaines Mandeep, is the failure here on a pipeline test sporadic?

min-xu-ai (Contributor) left a review:

This is nice. I think I mostly followed, but I can't say I checked the logic of every line. Hopefully the test is sufficient to cover the new case this enables and to ensure there is no regression. Some comments below, none of them blocking. The only real issue is the commented-out assertion, which seems scary.

@@ -21,22 +21,22 @@ def flatten_optim_state_dict(sd: Dict) -> Dict:
     non_tensor_state = {}

     # Populate `new_state["state"]`. (Assuming sd is sorted)
-    for expanded_pid, buffers in sd["state"].items():
-        consolidated_pid = param_id_map[expanded_pid]
+    for global_id, buffers in sd["state"].items():

Reviewer comment (Contributor):

I like the rename of the variables!

-new_sd = {"state": new_state, "param_groups": sd["param_groups"]}
 new_state[local_id][buffer_name] = torch.cat(tensors)
 new_state[local_id].update(non_tensor_state)
+new_sd = {"state": new_state, "param_groups": copy.deepcopy(sd["param_groups"])}

Reviewer comment (Contributor):

Add a comment on the deep copy? Also, last time I checked, it seems deepcopy doesn't really copy tensors; you may want to double-check (with some asserts and testing) to verify the deep copy here is doing the right thing.

sshleifer (Author) replied:

Luckily there are no tensors in param_groups, but that's very useful to know!
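
As a quick illustrative check (mine, not part of the PR) of why the deepcopy is safe here: an optimizer's param_groups contain only Python scalars, tuples, and lists of parameter ids, never tensors.

import copy
import torch

opt = torch.optim.Adam(torch.nn.Linear(4, 4).parameters(), lr=1e-3)
param_groups = copy.deepcopy(opt.state_dict()["param_groups"])
for group in param_groups:
    for key, value in group.items():
        assert not torch.is_tensor(value), f"unexpected tensor under {key!r}"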

Reviewer comment (Contributor):

+1 to adding a comment about param groups not having tensors (thus deepcopy being okay).

myleott (Contributor) left a review:

LGTM, although I will need to spend some time stepping through the code at some point to fully understand the implementation 😄

Should we also test that no_broadcast_optim_state is inferred properly for the MixtureOfExperts model?

    name_func=rename_test,
)
def test_consolidate_optimizer(self, optim_fn, transformer):
    config = {"mixed_precision": True, "flatten_parameters": True}
    config["compute_dtype"] = torch.float32

Reviewer comment (Contributor):

I'm guessing this and the autocast change were required to get the tests to pass?

sshleifer (Author) replied:

Yes, I followed the logic of __test_identical_outputs.


msbaines (Contributor) commented on Apr 3, 2021:

> @msbaines Mandeep, is the failure here on a pipeline test sporadic?

Yep. Here is a fix: #575

sshleifer merged commit 1fcbd62 into master on Apr 4, 2021.
sshleifer deleted the fsdp-optim-uncollectable branch on Apr 4, 2021 at 19:08.