Extend auto shard capabilities to work around torch.fx edge cases. #817
Conversation
sharded_model = shard_model(model, 3)
# TODO(ehotaj): There might be a bug in our split code because we shard the
# model into 10 shards even though we specify 3 shards above.
assert len(sharded_model) == 10
Can you print out the original model and sharded model?
Here is the full model:
BranchedNetwork(
(net): ModuleList(
(0): Branch(
(left): Linear(in_features=10, out_features=10, bias=True)
(right): Linear(in_features=10, out_features=10, bias=True)
)
(1): Branch(
(left): Linear(in_features=10, out_features=10, bias=True)
(right): Linear(in_features=10, out_features=10, bias=True)
)
(2): Branch(
(left): Linear(in_features=10, out_features=10, bias=True)
(right): Linear(in_features=10, out_features=10, bias=True)
)
(3): Branch(
(left): Linear(in_features=10, out_features=10, bias=True)
(right): Linear(in_features=10, out_features=10, bias=True)
)
(4): Branch(
(left): Linear(in_features=10, out_features=10, bias=True)
(right): Linear(in_features=10, out_features=10, bias=True)
)
(5): Branch(
(left): Linear(in_features=10, out_features=10, bias=True)
(right): Linear(in_features=10, out_features=10, bias=True)
)
(6): Branch(
(left): Linear(in_features=10, out_features=10, bias=True)
(right): Linear(in_features=10, out_features=10, bias=True)
)
(7): Branch(
(left): Linear(in_features=10, out_features=10, bias=True)
(right): Linear(in_features=10, out_features=10, bias=True)
)
(8): Branch(
(left): Linear(in_features=10, out_features=10, bias=True)
(right): Linear(in_features=10, out_features=10, bias=True)
)
(9): Branch(
(left): Linear(in_features=10, out_features=10, bias=True)
(right): Linear(in_features=10, out_features=10, bias=True)
)
)
)
And the sharded model:
[GraphModule(
(net): Module(
(0): Branch(
(left): Linear(in_features=10, out_features=10, bias=True)
(right): Linear(in_features=10, out_features=10, bias=True)
)
)
), GraphModule(
(net): Module(
(1): Branch(
(left): Linear(in_features=10, out_features=10, bias=True)
(right): Linear(in_features=10, out_features=10, bias=True)
)
)
), GraphModule(
(net): Module(
(2): Branch(
(left): Linear(in_features=10, out_features=10, bias=True)
(right): Linear(in_features=10, out_features=10, bias=True)
)
)
), GraphModule(
(net): Module(
(3): Branch(
(left): Linear(in_features=10, out_features=10, bias=True)
(right): Linear(in_features=10, out_features=10, bias=True)
)
)
), GraphModule(
(net): Module(
(4): Branch(
(left): Linear(in_features=10, out_features=10, bias=True)
(right): Linear(in_features=10, out_features=10, bias=True)
)
)
), GraphModule(
(net): Module(
(5): Branch(
(left): Linear(in_features=10, out_features=10, bias=True)
(right): Linear(in_features=10, out_features=10, bias=True)
)
)
), GraphModule(
(net): Module(
(6): Branch(
(left): Linear(in_features=10, out_features=10, bias=True)
(right): Linear(in_features=10, out_features=10, bias=True)
)
)
), GraphModule(
(net): Module(
(7): Branch(
(left): Linear(in_features=10, out_features=10, bias=True)
(right): Linear(in_features=10, out_features=10, bias=True)
)
)
), GraphModule(
(net): Module(
(8): Branch(
(left): Linear(in_features=10, out_features=10, bias=True)
(right): Linear(in_features=10, out_features=10, bias=True)
)
)
), GraphModule(
(net): Module(
(9): Branch(
(left): Linear(in_features=10, out_features=10, bias=True)
(right): Linear(in_features=10, out_features=10, bias=True)
)
)
)]
Looks reasonable to me, but I'm not sure why the shard_count is not being respected.
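For reference, here is a hypothetical reconstruction of a module definition consistent with the printout above (the actual test model may differ; the data-dependent conditional in Branch.forward is an assumption based on this PR's stated focus on dynamic conditionals):

from torch import nn

class Branch(nn.Module):
    # Two parallel Linear layers selected by data-dependent control flow.
    # torch.fx cannot trace through the conditional, which is why this PR's
    # preprocessing step would wrap Branch as an opaque unit.
    def __init__(self):
        super().__init__()
        self.left = nn.Linear(10, 10)
        self.right = nn.Linear(10, 10)

    def forward(self, x):
        if x.sum() > 0:
            return self.left(x)
        return self.right(x)

class BranchedNetwork(nn.Module):
    # Ten Branch modules applied in sequence, matching the printout.
    def __init__(self):
        super().__init__()
        self.net = nn.ModuleList([Branch() for _ in range(10)])

    def forward(self, x):
        for branch in self.net:
            x = branch(x)
        return x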
I'm not sure why the shard_count is not being respected. BTW, this also happens without the changes in this PR, i.e. if we just use torch.fx.symbolic_trace directly, so I think something might be going wrong in the _split_nodes logic. (I could try looking into it in a follow-up PR.) A possible starting point is sketched below.
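One way to probe this, reusing the hypothetical Branch/BranchedNetwork reconstruction above: mark Branch as a leaf while tracing (mirroring what the preprocessing in this PR does) and count the call_module nodes in the resulting graph. If _split_nodes places a shard boundary at every such node, that would explain the 10 shards regardless of the requested shard_count. LeafBranchTracer is an invented name for illustration.

import torch.fx as fx

class LeafBranchTracer(fx.Tracer):
    # Treat Branch as an opaque leaf so its conditional is never traced.
    def is_leaf_module(self, m, qualname):
        return isinstance(m, Branch) or super().is_leaf_module(m, qualname)

model = BranchedNetwork()
graph = LeafBranchTracer().trace(model)
calls = [n for n in graph.nodes if n.op == "call_module"]
print(len(calls))  # 10: one call_module node per Branch call site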
auto_shard.py currently uses torch.fx to create a symbolic DAG of
operations and linearizes that DAG into an nn.Sequential so it can later
be used for model offloading. This works in most cases but runs into
issues for certain eager-mode features, such as dynamic conditionals,
shape-dependent computation, etc.
This PR extends auto_shard.py to first run a preprocessing step that wraps
any nn.Module which cannot be traced through. It adds a test for dynamic
conditionals and updates existing failing test code.
There are some immediate extensions to this approach which are marked as
TODO in the code.
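The preprocessing step described above could be sketched as a custom fx.Tracer that probe-traces each submodule and treats anything untraceable as a leaf. This is a minimal sketch of the idea, not this PR's actual implementation; WrapUntraceableTracer and trace_with_wrapping are invented names.

import torch.fx as fx
from torch import nn

class WrapUntraceableTracer(fx.Tracer):
    # Treat any submodule that torch.fx cannot trace through (e.g. one with
    # data-dependent control flow) as a leaf, so the rest of the model can
    # still be traced and later linearized into shards.
    def is_leaf_module(self, m: nn.Module, qualname: str) -> bool:
        if super().is_leaf_module(m, qualname):
            return True
        try:
            # Probe-trace the submodule in isolation; if this fails, keep
            # the module opaque as a single call_module node.
            fx.symbolic_trace(m)
            return False
        except Exception:
            return True

def trace_with_wrapping(model: nn.Module) -> fx.GraphModule:
    tracer = WrapUntraceableTracer()
    return fx.GraphModule(model, tracer.trace(model))

Under this sketch, the BranchedNetwork above traces to ten opaque Branch nodes instead of failing on their conditionals.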