
[FSDP][feature] optimizer state dict save and load #537

Merged
merged 33 commits into master from fsdp-gather-optimizer on Mar 25, 2021

Conversation

sshleifer
Contributor

@sshleifer sshleifer commented Mar 19, 2021

Overview

# save
fsdp = FSDP(world_size=4)
optim = Adam(fsdp.parameters())
full_state_dict = fsdp.gather_full_optim_state_dict(optim, recipient_rank=0)
# this is None if you are not on a recipient rank
# recipient_rank=None  performs the same consolidation on all ranks.

# load with different world size
fsdp2 = FSDP(world_size=2)
optim2 = Adam(fsdp2.parameters())
state_shard = fsdp2.get_shard_from_optim_state_dict(full_state_dict)
optim2.load_state_dict(state_shard)
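
For completeness, a hedged sketch of how this could plug into a checkpointing loop (the file path and the torch.save/torch.load glue are illustrative, not part of this PR):

import torch

# save: only the recipient rank gets a non-None consolidated dict
full_state_dict = fsdp.gather_full_optim_state_dict(optim, recipient_rank=0)
if full_state_dict is not None:
    torch.save(full_state_dict, "optim_full.pt")  # illustrative path

# load: each rank reads the full dict, then keeps only its own shard
full_state_dict = torch.load("optim_full.pt", map_location="cpu")
state_shard = fsdp2.get_shard_from_optim_state_dict(full_state_dict)
optim2.load_state_dict(state_shard)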

Future Work

  • fairseq integration
  • support flatten_parameters=False
  • support param groups
  • test more nested setups. An FSDP with no params could certainly break this.

On the fairseq side, I tested running with 4 GPUs and loading with 2, and this worked.

Assumptions

(0) flatten_parameters=True

(1) If there is a tensor in the optimizer state, it has the same size as, and corresponds to, a tensor in the model state. If there are singleton tensors in the optimizer state, or tensors that correspond to the average update for a column of params (and are therefore shaped differently), things will break.
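
To make (1) concrete, here is a small sanity check one could run (a sketch only; it assumes a single param group, so optimizer state keys line up with parameter order):

import torch

def check_assumption_1(model, optim):
    params = list(model.parameters())
    for pid, pstate in optim.state_dict()["state"].items():
        for v in pstate.values():
            if torch.is_tensor(v) and v.dim() > 0:
                # per-parameter tensors (e.g. Adam's exp_avg) must match the param's numel;
                # state shaped like a single row/column of the param would fail here
                assert v.numel() == params[pid].numel()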

(2) We assume that these two lists are the same if we account for padding:

mlist = [sum(m._param_numels) for m in self.modules() if isinstance(m, FullyShardedDataParallel)]
params = [p.numel() for p in self.parameters()]

We use this assumption to call get_param_views(flat_param=params_unpadded[i]) on the i-th FSDP instance.
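
For intuition, a minimal sketch of the view-splitting this relies on (the shapes and the helper name are illustrative, not the FSDP internals):

import torch

def split_flat_param(flat_param, shapes):
    # split a 1-D unpadded flat tensor back into views with the original parameter shapes
    numels = [int(torch.Size(s).numel()) for s in shapes]
    assert flat_param.numel() == sum(numels)
    return [t.view(s) for t, s in zip(flat_param.split(numels), shapes)]

shapes = [(3, 4), (4,), (4, 2)]
flat = torch.arange(sum(torch.Size(s).numel() for s in shapes), dtype=torch.float32)
assert [tuple(v.shape) for v in split_flat_param(flat, shapes)] == shapes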

New overhead introduced

  • _get_shard now returns how many padding elements it introduced.
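As a rough illustration of that bookkeeping (a sketch under the assumption that shards are equal-sized chunks of a 1-D flat tensor; the real _get_shard lives inside FSDP):

import torch
import torch.nn.functional as F

def get_shard_with_padding(flat_tensor, rank, world_size):
    # pad so the flat tensor divides evenly across ranks, then return this rank's
    # chunk together with the number of padding elements that were added
    shard_numel = (flat_tensor.numel() + world_size - 1) // world_size
    num_padded = shard_numel * world_size - flat_tensor.numel()
    padded = F.pad(flat_tensor, (0, num_padded))
    return padded.view(world_size, shard_numel)[rank].clone(), num_padded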

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 19, 2021
@sshleifer sshleifer linked an issue Mar 19, 2021 that may be closed by this pull request
@sshleifer sshleifer changed the title [wip] [FSDP][feature] optimizer state dict save and load [FSDP][feature] optimizer state dict save and load Mar 20, 2021
@sshleifer sshleifer marked this pull request as ready for review March 21, 2021 21:40
@@ -1346,6 +1352,244 @@ def assert_state(self, state: Union[TrainingState, List[TrainingState]]) -> None
traceback.print_stack()
raise ValueError(msg)

# Optim State dict functions
Contributor Author

I considered moving these to a separate FSDPOptimizerMixin in fsdp_optimizer_utils.py, but decided it wasn't really a mixin since it depends heavily on FSDP.

Contributor

@myleott myleott left a comment

About half-way through, leaving initial comments and will post rest in second batch

Comment on lines 1374 to 1401
if rank == self.rank:
    sd = optim.state_dict()
    sd["num_padded"] = [m.num_padded for m in self.modules() if isinstance(m, FullyShardedDataParallel)]
    if should_collect_state:
        _all_optimizer_states.append(
            recursive_copy_to_device(sd, non_blocking=True, device=torch.device("cpu"))
        )

    # Sync with other replicas
    state_to_share = (
        sd if should_send_state else torch.tensor([0], dtype=torch.uint8, device=_default_device)
    )
    broadcast_object(
        state_to_share, src_rank=self.rank, group=self.process_group, dist_device=_default_device,
    )
else:
    # Fetch the optim state from the other replicas
    replica_state = broadcast_object(
        torch.tensor([0], dtype=torch.uint8, device=_default_device),
        src_rank=rank,
        group=self.process_group,
        dist_device=_default_device,
    )

    if should_collect_state:
        _all_optimizer_states.append(
            recursive_copy_to_device(replica_state, non_blocking=True, device=torch.device("cpu"))
        )
Contributor

can this be rearranged to remove some duplication? Something like:

for rank in range(self.world_size):
    if rank == self.rank:
        state = optim.state_dict()
        sd["num_padded"] = ...
        state = broadcast_object(state, src_rank=rank, ...)
    else:
        state = broadcast_object(None, src_rank=rank, ...)

    if should_collect_state:
        _all_optimizer_states.append(recursive_copy_to_device(state, device=torch.device("cpu")))

Contributor Author

@sshleifer sshleifer Mar 22, 2021

Just copy-pasted this function from OSS. I think the reason for the extra append is to avoid a useless broadcast from recipient_rank to itself.

Contributor Author

@sshleifer sshleifer Mar 22, 2021

I have the simplified implementation working with torch.distributed.broadcast_object_list.
I no longer need compute_device. Still calling lazy_init_ for safety.

@min-xu-ai
Contributor

Looks like really cool stuff. Some high level questions about the context:

  1. in the example, do users need to put the optimizer back into its original state after the consolidation? If not, perhaps make a comment about it in the example?
  2. this assumes there is enough GPU memory to hold the state at each (or one) rank? what's the solution for very large models?

It seems that this is only needed when we change the world size between save/restore? If the world size is not changed, is a normal save/restore with only the sharded data OK?

@sshleifer
Contributor Author

@min-xu-ai

  1. No resetting is needed by the user. This doesn't mutate the optimizer; it just combines optimizer.state_dict() across ranks. I'll add a comment and test to that effect.
  2. Consolidation happens in CPU memory (the cast is in the consolidate method). If we see use cases where there is not enough CPU memory to handle consolidation on one node, we will iterate :)
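
For reference, a toy version of the CPU cast being described (recursive_copy_to_device is the existing fairscale utility; this hypothetical helper just shows the idea):

import torch

def copy_state_to_cpu(obj):
    # recursively move any tensors inside a (possibly nested) optimizer state dict to CPU
    if torch.is_tensor(obj):
        return obj.detach().cpu()
    if isinstance(obj, dict):
        return {k: copy_state_to_cpu(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(copy_state_to_cpu(v) for v in obj)
    return obj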

Contributor

@myleott myleott left a comment

Do we envision a use case for calling consolidate_optim_state_dict without calling gather_full_optim_state_dict?

If not, perhaps simplify interface to:

fsdp = FSDP(world_size=4)
optim = Adam(fsdp.parameters())
full_state_dict = fsdp.gather_full_optim_state_dict(optim, recipient_rank=-1)

Comment on lines 1426 to 1434
# combined_state refers to tensor values in sd[state][param_id].
# Here we just aggregate them into a list inside the dictionary from a list of dictionaries.
combined_state = self._combine_tensor_optim_state(
    [x["state"] for x in self._all_optimizer_states], self.world_size
)

# constant_state refers to entries in sd[state][param_id] that are not tensors, like "step"
# we check that these are identical across workers and then take the first
constant_state = [self._extract_constant_state(combined_state, id) for id in combined_state]
Contributor

these comments/helper methods are very nice 😄


if next_global_param_id == 0:  # stateless optimizer
    num_params = sum([len(m._param_numels) for m in instance_list])  # type: ignore
    new_state_dict["param_groups"][pg_id]["params"] = list(range(num_params))
Contributor

this list could be quite large, right? I guess this only affects SGD w/o momentum, but I wonder if there's a more compact way. Let's not worry about it for now, but perhaps put a note or TODO to make it more efficient

Contributor Author

Are you talking about list(range(num_params))? If so, it affects both cases.
I'll leave a TODO

#
# This source code is licensed under the BSD license found in the
# LICENSE file in the root directory of this source tree.
"""These files are used by fsdp to help consolidate and shard optimizer states."""
Contributor

@myleott myleott Mar 23, 2021

❤️ this

@sshleifer sshleifer added the FSDP FullyShardedDataParallel (zero-3) label Mar 23, 2021
Contributor

@min-xu-ai min-xu-ai left a comment

Seems like super solid work. I added some minor comments. I didn't check the logic in detail mainly because I have two high level questions:

  1. should we consider some optimizer wrapper that works together with fsdp to get the full state? It seems right now everything is in fsdp. Will an optimizer wrapper help more? I haven't thought through this.
  2. I have been thinking that fsdp should support a "streaming" mode for the full state so that no single rank needs to hold all of the (non-sharded) state in memory. Should this PR try to do streaming to avoid an overly big state?

Both 1 and 2 above are kind of independent of this PR. Just wanted to put them out there in case they are helpful. If not, just let me know and I will dive deep into this version of the code and give it a more detailed review. Thanks!

@@ -19,9 +19,10 @@
from torch.nn import Parameter
import torch.nn.functional as F

import fairscale.nn.data_parallel.fsdp_optim_utils as ou
Contributor

relative import like `import .fsdp_optim_utils as ou" is more portable?

Contributor Author

SyntaxError: invalid syntax :(

Contributor

got it. perhaps `from . import fsdp_optim_utils as ou`?

Contributor Author

That works!

@@ -88,8 +89,8 @@ class FullyShardedDataParallel(nn.Module):
import torch
from fairscale.nn.auto_wrap import enable_wrap, auto_wrap
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP
fsdp_params = dict(mixed_precision=True, flatten_parameters=True)
with enable_wrap(wrapper_cls=FSDP, **fsdp_params):
fsdp_params = dict(wrapper_cls=FSDP, mixed_precision=True, flatten_parameters=True)
Contributor

Thanks for fixing the doc here!

Contributor

@min-xu-ai min-xu-ai left a comment

Reviewed the test file first. It looks very good. Minor comments.


from parameterized import parameterized
import torch
from torch.optim import SGD, Adadelta, Adam # type: ignore
Contributor

we usually don't have typing in test files. so "type: ignore" is not needed?

Contributor Author

I was getting "torch.optim has no Attribute Adadelta" from mypy without this, using

mypy --ignore-missing-imports --scripts-are-modules --pretty .

from fs_test.

Contributor

I see. magic mypy. I thought it would skip the whole file since there isn't any type annotation in it.

try:
    fsdp_optim = optim_fn(fsdp.parameters(), lr=0.01,)
    optim_unwrapped = optim_fn(unwrapped_model.parameters(), lr=0.01)
except TypeError:  # AdaScale
Contributor

do you actually mean "AdaScale" here? I don't see AdaScale being used here in this test.

Contributor Author

yes, nice catch

Comment on lines +88 to +89
# Switching from fairscale.optim.utils.broadcast_object to torch.broadcast_object_list will cause this to raise
assert duration < fsdp.world_size, f"gather optim state took {duration} seconds, suspect change in _consolidate"
Contributor

this is interesting. thanks for the comment. what's the usual value for duration? I am surprised that it is somehow connected with world_size, which is not even in units of seconds.

Contributor Author

It takes longer to gather from 8 nodes than from 4 than from 2.
This actually takes 4 ms, but I accidentally regressed it during development and caused it to take 8 seconds for world size 2 and 13 for world size 4.
Now that it's fixed I want to prevent it from happening again; agreed that the units are arbitrary.

Comment on lines 97 to 98
sum([first_tensor_shape(v) for k, v in sd["state"].items()]),
sum([first_tensor_shape(v) for k, v in unwrapped_sd["state"].items()]),
Contributor

perhaps norm will be slightly better than sum for comparison in case both tensors sum to the same values? same with line 110, 111.

Contributor Author

This just checks that we have the same num elements as the base model after unflattening.
I renamed first_tensor_shape -> first_tensor_numel to make it clearer.

Contributor

@min-xu-ai min-xu-ai left a comment

Reviewed fsdp changes. I am not sure if nested FSDP cases are well supported by this change.

  1. the APIs are really only intended for the root instance?
  2. root and all inner instances should have flatten == True?
  3. all instances need to have world_size == default world_size?

If so, can you assert those are the cases in the APIs so that we don't accidentally produce incorrect optim states or crash with non-obvious errors?

def _consolidate_optim_state_dict(
    self, optim: torch.optim.Optimizer, recipient_rank: Optional[int] = None
) -> List[Dict]:
    """Update the consolidated state_dict list, one per rank.
Contributor

should this be called only on the root FSDP instance?

Contributor Author

Yes, more specifically it should be called on the instance that was the argument to optimizer(model.parameters()). Are there other cases?
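
A usage sketch to make that concrete (MyModel is a placeholder):

model = FSDP(MyModel())                       # root FSDP instance
optim = torch.optim.Adam(model.parameters())  # optimizer built from the root's parameters

# call the optim-state APIs on `model`, not on any inner wrapped submodule
full_sd = model.gather_full_optim_state_dict(optim, recipient_rank=0)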

should_collect_state = recipient_rank is None or (self.rank == recipient_rank)
all_states: List[Dict[str, Any]] = []
dummy_tensor = torch.tensor([0], dtype=torch.uint8, device=self.compute_device)
for rank in range(self.world_size):
Contributor

there might be complications here when nested FSDP instances have different world_sizes, right? For example, if BN layers are in their own world_size == 1 process groups, then we collect duplicated states for them? add a TODO?

Contributor Author

added TODO in the caller

# Assert nesting is the same as it was at save time
instance_list = self._fsdp_instances
ou.check_param_counts_before_sharding(full_optim_state_dict, len(instance_list))
if self.flatten_parameters:
Contributor

does this assume all inner FSDP instances also have flatten == True?

Contributor Author

Yes, will assert

@@ -122,15 +122,15 @@ def _flatten_params(self, flat_param: Optional[nn.Parameter] = None) -> None:
# register the views as plain attributes
self._unflatten_params_as_views()

def _get_param_views(self, flat_param: Tensor) -> Generator:
def get_param_views(self, flat_param: Tensor) -> Generator:
Contributor

since this is becoming a public method, can you please:

  1. add docstring with proper doc
  2. assert flat_param is valid before using it?

Contributor Author

@sshleifer sshleifer left a comment

Thanks for the comments!


Contributor

@min-xu-ai min-xu-ai left a comment

Finished reviewing. Great step forward. I wish there were more comments in fsdp_optim_utils.py so I could follow along better. I tried my best and it seems to make sense. It might be possible to simplify and individually test it, but we can iterate on that later as we learn more.

return unflat_state, global_to_local_id


def build_unflat_state_dict(instance_list: List[torch.nn.Module], world_optim_states: List[Dict]) -> Dict:
Contributor

add a docstring?

sshleifer and others added 3 commits March 24, 2021 17:53
Co-authored-by: Min Xu <24926999+min-xu-ai@users.noreply.github.com>
@sshleifer
Contributor Author

I'm gonna merge this tomorrow AM unless there are further comments or a CI failure @myleott

@sshleifer sshleifer merged commit 9474d75 into master Mar 25, 2021
@sshleifer sshleifer deleted the fsdp-gather-optimizer branch March 25, 2021 15:03