
feat(python, core): support process group in with_bagua, support hierarchical communication in bytegrad algorithm #300

Merged
merged 46 commits into master
Oct 21, 2021

Conversation

wangraying
Member

@wangraying wangraying commented Oct 15, 2021

BREAKING CHANGE:

  • AlgorithmImpl now requires a process group to be passed to its __init__ method
  • ByteGradAlgorithm now accepts a parameter to enable hierarchical communication (see the sketch after this list)
  • decentralized_synchronous_op_copy_back_peer_weight has been removed from BaguaBucket; call copy_back_peer_weight on the decentralized synchronous op instead
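
A minimal sketch of what the user-facing API could look like after these changes (not code from this PR); the parameter names process_group and hierarchical, and the setup calls, are assumptions based on the notes above.

import torch
import bagua.torch_api as bagua
from bagua.torch_api.algorithms.bytegrad import ByteGradAlgorithm

# Assumes the usual distributed launch (one process per GPU).
torch.cuda.set_device(bagua.get_local_rank())
bagua.init_process_group()

model = torch.nn.Linear(8, 8).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Assumption: ByteGradAlgorithm takes a flag enabling hierarchical
# (intra-node first, then inter-node) communication.
algorithm = ByteGradAlgorithm(hierarchical=True)

# Assumption: with_bagua accepts an optional process group; passing None
# falls back to the global (default) group.
model = model.with_bagua([optimizer], algorithm, process_group=None)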

@wangraying wangraying mentioned this pull request Oct 15, 2021
@wangraying wangraying changed the title fix: process group for with_bagua fix(python): process group for with_bagua Oct 15, 2021
@wangraying
Member Author

wangraying commented Oct 16, 2021

needs to be merged after #298

@wangraying wangraying changed the title fix(python): process group for with_bagua fix(python): enhancement for with_bagua Oct 16, 2021
@wangraying
Member Author

wangraying commented Oct 20, 2021

Support an arbitrary number of ranks in the process group for bytegrad and qadam:

        # TODO: suport odd number of ranks of process group for bytegrad and qadam
        nprocs = torch.cuda.device_count()
        self.run_algorithm(nprocs, nprocs // 2, run_model_wrapper, algorithm="bytegrad")

    @skip_if_cuda_not_available()
    def test_qadam(self):

This comment was generated by todo based on a TODO comment in 135a9c4 in #300. cc @BaguaSys.

Solved by passing process_group at algorithm initialization, so that the algorithm can align its buckets to the process_group's size.
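
A minimal sketch of the pattern described above, assuming the new AlgorithmImpl constructor and import path; this is illustrative, not code from the PR.

# Hedged sketch: a custom algorithm implementation now receives the process
# group at construction time, so bucketing can later be aligned to that
# group's size instead of the global world size. The class name and the
# hierarchical flag are illustrative assumptions.
from bagua.torch_api.algorithms import AlgorithmImpl

class MyAlgorithmImpl(AlgorithmImpl):
    def __init__(self, process_group, hierarchical=True):
        # Assumption: the base class now takes the process group explicitly.
        super().__init__(process_group)
        self.process_group = process_group
        self.hierarchical = hierarchical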

@wangraying wangraying changed the title fix(python): enhancement for with_bagua fix(python): support process group in with_bagua Oct 20, 2021
@wangraying wangraying changed the title fix(python): support process group in with_bagua fix(python, core): support process group in with_bagua Oct 20, 2021
@@ -42,7 +42,7 @@ def ensure_bagua_tensor(
assert (
self.bagua_tensor_name == name
), "assigning a different name to existing bagua tensor is forbidden"
return self
Contributor

why not return self if it is already a bagua_tensor?

Member Author

In accordance with #271, we still need to skip returning self.

Member Author

@wangraying wangraying Oct 21, 2021

Each call to with_bagua generates a new module name, which leads to a mismatch between module._bagua_backend and bucket._bagua_backend.

Contributor

@NOBLES5E NOBLES5E left a comment

see comments

@NOBLES5E NOBLES5E changed the title fix(python, core): support process group in with_bagua feat(python, core): support process group in with_bagua Oct 20, 2021
@BaguaSys BaguaSys deleted a comment from todo bot Oct 21, 2021
@NOBLES5E NOBLES5E changed the title feat(python, core): support process group in with_bagua feat(python, core): support process group in with_bagua, support hierarchical communication in bytegrad algorithm Oct 21, 2021
@NOBLES5E NOBLES5E merged commit 4e1adda into master Oct 21, 2021