[fix] OSS - enforce cuda parameters for state consolidation if NCCL backend #573

Merged
merged 3 commits on Apr 4, 2021
Changes from 2 commits
11 changes: 7 additions & 4 deletions fairscale/optim/oss.py
@@ -328,6 +328,9 @@ def consolidate_state_dict(self, recipient_rank: int = 0) -> None:
        should_collect_state = self.rank == recipient_rank or recipient_rank == -1
        should_send_state = (self.rank != recipient_rank and recipient_rank != -1) or recipient_rank == -1

+        # NCCL requires CUDA tensors for all communication primitives
+        dist_device = torch.device("cuda") if self.backend == dist.Backend.NCCL else self._default_device
Contributor Author: No choice with NCCL, it needs to be CUDA.

Contributor: If the model is moved back to the CPU and the optimizer state reflects that, why do we call broadcast? The optimizer state is not sharded anymore, right? Maybe I am missing something.

Contributor Author: The framework is the one calling .consolidate(), and it can do so at basically any time. We could add a skip mechanism for when it is called twice in a row (which would be even more foolproof, actually), but that would not cover the case of train -> move to cpu -> call .consolidate(), which can be legitimate, if unfortunate.

Contributor Author: (Complement) The issue was that if the model is moved to the CPU, some tensors in the optimizer dict are CPU tensors. When consolidating, the shards are exchanged towards a specific rank (or all ranks), which breaks with NCCL since it always expects CUDA tensors for its communication primitives.
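To make that failure mode concrete, here is a minimal, hypothetical sketch of the device-staging pattern the fix relies on; broadcast_state_tensor and its arguments are illustrative and are not fairscale's broadcast_object helper. With the NCCL backend, collectives only accept CUDA tensors, so CPU-resident state has to be staged on a CUDA device before it is exchanged.

import torch
import torch.distributed as dist

def broadcast_state_tensor(t: torch.Tensor, src_rank: int, backend: str) -> torch.Tensor:
    # NCCL only moves CUDA tensors; Gloo also accepts CPU tensors.
    dist_device = torch.device("cuda") if backend == dist.Backend.NCCL else t.device
    staged = t.to(dist_device)            # no-op if the tensor already lives on that device
    dist.broadcast(staged, src=src_rank)  # this is the call that fails under NCCL for CPU tensors
    return staged.to(t.device)            # hand the result back on the caller's original device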


        for rank in range(self.world_size):
            if rank == self.rank:
                if should_collect_state:
@@ -340,18 +343,18 @@ def consolidate_state_dict(self, recipient_rank: int = 0) -> None:
                state_to_share = (
                    self.optim.state_dict()
                    if should_send_state
-                    else torch.tensor([0], dtype=torch.uint8, device=self._default_device)
+                    else torch.tensor([0], dtype=torch.uint8, device=dist_device)
Contributor: This seems wasteful. Why not skip the broadcast in this case instead of sending a zero? In the else branch below you could check whether rank != recipient.

                )
                broadcast_object(
-                    state_to_share, src_rank=self.global_rank, group=self.group, dist_device=self._default_device,
+                    state_to_share, src_rank=self.global_rank, group=self.group, dist_device=dist_device,
                )
            else:
                # Fetch the optim state from the other replicas
                replica_state = broadcast_object(
-                    torch.tensor([0], dtype=torch.uint8, device=self._default_device),
+                    torch.tensor([0], dtype=torch.uint8, device=dist_device),
                    src_rank=self._local_to_global_rank[rank],
                    group=self.group,
-                    dist_device=self._default_device,
+                    dist_device=dist_device,
                )

                if should_collect_state:
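As a side note on the "skip the broadcast" suggestion from the review above, here is a standalone, hypothetical sketch of that idea. It uses torch.distributed.broadcast_object_list (available in recent PyTorch) instead of fairscale's broadcast_object helper, and the consolidate function and its arguments are illustrative rather than fairscale's implementation: every rank skips the exchange for the recipient's own slot, instead of moving a placeholder byte around.

import torch
import torch.distributed as dist

def consolidate(local_state: dict, recipient_rank: int, dist_device: torch.device) -> list:
    rank, world_size = dist.get_rank(), dist.get_world_size()
    all_states = []
    for src in range(world_size):
        if src == recipient_rank and recipient_rank != -1:
            # The recipient already holds its own shard locally: nothing to broadcast.
            if rank == recipient_rank:
                all_states.append(local_state)
            continue
        payload = [local_state if rank == src else None]
        # broadcast_object_list is a collective, so every rank that did not `continue` takes part;
        # `device` plays the same role as dist_device above (a CUDA device is required under NCCL).
        dist.broadcast_object_list(payload, src=src, device=dist_device)
        if rank == recipient_rank or recipient_rank == -1:
            all_states.append(payload[0])
    return all_states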
5 changes: 5 additions & 0 deletions tests/optim/test_oss.py
@@ -470,6 +470,11 @@ def closure():
    _ = optimizer.step(closure=closure)
    check_same_models_across_ranks(model, dist.group.WORLD, params_should_be_equal=True, check_broadcast_buffers=False)

+    # Check that if the model is moved to cpu, the optimizer consolidation still works
+    model.cpu()
Contributor Author: Without the fix, this unit test fails with the same error that the user reported.

+    optimizer = optim.OSS(model.parameters(), lr=0.1, momentum=0.99)
+    optimizer.consolidate_state_dict(recipient_rank=reference_rank)
+
    dist.destroy_process_group()


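For context, here is a hypothetical user-side sketch of the scenario the new test covers; save_consolidated_checkpoint and the checkpoint layout are illustrative, and it assumes a process group initialized with the NCCL backend and an OSS-wrapped optimizer. The calling framework may have moved the model, and with it parts of the optimizer state, to the CPU before consolidation.

import torch
import torch.distributed as dist
from fairscale.optim import OSS

def save_consolidated_checkpoint(model: torch.nn.Module, optimizer: OSS, path: str, recipient_rank: int = 0) -> None:
    # With the fix, this works even if `model` (and hence the optimizer state) now lives on the CPU,
    # because the shard exchange is staged on CUDA whenever the backend is NCCL.
    optimizer.consolidate_state_dict(recipient_rank=recipient_rank)
    if dist.get_rank() == recipient_rank:
        # On the recipient rank, state_dict() returns the full, consolidated optimizer state.
        torch.save({"model": model.state_dict(), "optim": optimizer.state_dict()}, path)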