
feat(python, core): support process group in with_bagua, support hierarchical communication in bytegrad algorithm #300

Merged
merged 46 commits into master
Oct 21, 2021

Conversation

wangraying
Member

@wangraying wangraying commented Oct 15, 2021

BREAKING CHANGE:

  • AlgorithmImpl now requires a process group to be passed to its __init__ method
  • ByteGradAlgorithm now accepts a parameter to enable hierarchical communication (see the sketch after this list)
  • decentralized_synchronous_op_copy_back_peer_weight has been removed from BaguaBucket; call copy_back_peer_weight on the decentralized synchronous op instead
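
A minimal sketch of what the user-facing API could look like after these changes (not code from this PR); the parameter names process_group and hierarchical, and the setup calls, are assumptions based on the notes above.

import torch
import bagua.torch_api as bagua
from bagua.torch_api.algorithms.bytegrad import ByteGradAlgorithm

# Assumes the usual distributed launch (one process per GPU).
torch.cuda.set_device(bagua.get_local_rank())
bagua.init_process_group()

model = torch.nn.Linear(8, 8).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Assumption: ByteGradAlgorithm takes a flag enabling hierarchical
# (intra-node first, then inter-node) communication.
algorithm = ByteGradAlgorithm(hierarchical=True)

# Assumption: with_bagua accepts an optional process group; passing None
# falls back to the global (default) group.
model = model.with_bagua([optimizer], algorithm, process_group=None)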

@wangraying wangraying mentioned this pull request Oct 15, 2021
@wangraying wangraying changed the title fix: process group for with_bagua fix(python): process group for with_bagua Oct 15, 2021
@wangraying
Member Author

wangraying commented Oct 16, 2021

needs to be merged after #298

@wangraying wangraying changed the title fix(python): process group for with_bagua fix(python): enhancement for with_bagua Oct 16, 2021
@wangraying
Member Author

wangraying commented Oct 20, 2021

Support an arbitrary number of ranks in the process group for bytegrad and qadam:

        # TODO: suport odd number of ranks of process group for bytegrad and qadam
        nprocs = torch.cuda.device_count()
        self.run_algorithm(nprocs, nprocs // 2, run_model_wrapper, algorithm="bytegrad")

    @skip_if_cuda_not_available()
    def test_qadam(self):

This comment was generated by todo based on a TODO comment in 135a9c4 in #300. cc @BaguaSys.

Solved by passing process_group at algorithm initialization, so that the algorithm can align its buckets to the process_group's size.
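
A minimal sketch of the pattern described above, assuming the new AlgorithmImpl constructor and import path; this is illustrative, not code from the PR.

# Hedged sketch: a custom algorithm implementation now receives the process
# group at construction time, so bucketing can later be aligned to that
# group's size instead of the global world size. The class name and the
# hierarchical flag are illustrative assumptions.
from bagua.torch_api.algorithms import AlgorithmImpl

class MyAlgorithmImpl(AlgorithmImpl):
    def __init__(self, process_group, hierarchical=True):
        # Assumption: the base class now takes the process group explicitly.
        super().__init__(process_group)
        self.process_group = process_group
        self.hierarchical = hierarchical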

@wangraying wangraying changed the title fix(python): enhancement for with_bagua fix(python): support process group in with_bagua Oct 20, 2021
@wangraying wangraying changed the title fix(python): support process group in with_bagua fix(python, core): support process group in with_bagua Oct 20, 2021
@@ -42,7 +42,7 @@ def ensure_bagua_tensor(
assert (
self.bagua_tensor_name == name
), "assigning a different name to existing bagua tensor is forbidden"
return self
Contributor

why not return self if it is already a bagua_tensor?

Member Author

In accordance with #271, we still need to skip returning self.

Member Author

@wangraying wangraying Oct 21, 2021

Each call to with_bagua generates a new module name, which leads to a mismatch between module._bagua_backend and bucket._bagua_backend.

Contributor

@NOBLES5E NOBLES5E left a comment

see comments

@NOBLES5E NOBLES5E changed the title fix(python, core): support process group in with_bagua feat(python, core): support process group in with_bagua Oct 20, 2021
@BaguaSys BaguaSys deleted a comment from todo bot Oct 21, 2021
@NOBLES5E NOBLES5E changed the title feat(python, core): support process group in with_bagua feat(python, core): support process group in with_bagua, support hierarchical communication in bytegrad algorithm Oct 21, 2021
@NOBLES5E NOBLES5E merged commit 4e1adda into master Oct 21, 2021