feat: add async model average algorithm #110

Merged on Aug 23, 2021 (34 commits)

NOBLES5E
Contributor

@NOBLES5E NOBLES5E commented Jul 7, 2021

No description provided.

@pr-triage pr-triage bot added the PR: draft label Jul 7, 2021
Review thread on bagua/torch_api/bucket.py (outdated, resolved)
wangraying and others added 2 commits August 6, 2021 17:03
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@wangraying wangraying requested a review from a team August 6, 2021 10:13
@wangraying wangraying marked this pull request as ready for review August 6, 2021 10:13
wangraying and others added 2 commits August 6, 2021 18:48
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
wangraying and others added 4 commits August 6, 2021 21:27
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
wangraying and others added 3 commits August 12, 2021 23:02
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@wangraying
Member

wangraying commented Aug 19, 2021

Update, @NOBLES5E:

It seems that using the LL128 protocol causes the program to hang. I have reproduced the problem with sample code backed by torch + cupy, which is consistent with the results on Bagua.

As far as I know, the other two protocols, LL and Simple, do not hang.

I have opened an issue on GitHub so that we can track this later.
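
A minimal sketch of the workaround implied above: restrict NCCL to the two protocols reported as not hanging. `NCCL_PROTO` is NCCL's protocol-selection environment variable; the helper name `use_only_ll_and_simple` is hypothetical and not part of this PR, and the sketch assumes it runs before anything in the process initializes NCCL.

```python
import os


def use_only_ll_and_simple(env=os.environ):
    """Hypothetical helper: pin NCCL to the LL and Simple protocols.

    Assumption: this is called before any NCCL communicator is created,
    because the variable is read when NCCL initializes.
    """
    env["NCCL_PROTO"] = "LL,Simple"
    return env["NCCL_PROTO"]


if __name__ == "__main__":
    # Apply to the real process environment before starting training.
    print(use_only_ll_and_simple())
```

An equivalent setting is `NCCL_PROTO=^LL128`, which excludes LL128 and leaves the other protocols available.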

@pr-triage pr-triage bot removed the PR: draft label Aug 20, 2021
@todo

todo bot commented Aug 20, 2021

remove nccl proto check

# TODO: remove nccl proto check
proto_str = os.environ.get("NCCL_PROTO", "")
if (
    proto_str == ""
    or ("^" not in proto_str and "LL128" in proto_str)
    or ("^" in proto_str and "LL128" not in proto_str)


This comment was generated by todo based on a TODO comment in f550bcd in #110. cc @BaguaSys.
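
The excerpt quoted by the bot is cut off before the `if` body, so what the PR does when the condition holds is not shown here. As a sketch only, the three quoted conditions can be wrapped in a predicate that answers "could LL128 be selected under this setting?". The function name `ll128_may_be_used` is hypothetical.

```python
import os


def ll128_may_be_used(proto_str: str) -> bool:
    """Return True if the NCCL_PROTO value leaves LL128 selectable.

    Mirrors the three conditions in the quoted excerpt:
    - unset/empty: NCCL picks its defaults, which may include LL128;
    - an inclusion list (no "^") that names LL128;
    - an exclusion list ("^...") that does not name LL128.
    """
    return (
        proto_str == ""
        or ("^" not in proto_str and "LL128" in proto_str)
        or ("^" in proto_str and "LL128" not in proto_str)
    )


if __name__ == "__main__":
    current = os.environ.get("NCCL_PROTO", "")
    print(f"NCCL_PROTO={current!r}: LL128 possible = {ll128_may_be_used(current)}")
```

Under this reading, `"LL,Simple"` and `"^LL128"` both pass the check, while an unset variable does not.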

@wangraying
Member

> @wangraying fix CI so that we can merge this.
> We also need a tutorial page on this algorithm.

The CI itself seems to be broken.

@wangraying
Member

wangraying commented Aug 20, 2021

Also, do you have any further comments on the usage, now that we have added a `barrier` function to end the async threads?

@NOBLES5E

@todo

todo bot commented Aug 23, 2021

; remove this after NVIDIA/nccl#549 gets solved

)  # TODO; remove this after https://github.com/NVIDIA/nccl/issues/549 gets solved
class AsyncModelAverageAlgorithm(Algorithm):
    def __init__(
        self, peer_selection_mode: str = "all", sync_interval_ms: int = 500,


This comment was generated by todo based on a TODO comment in 1c33cf9 in #110. cc @BaguaSys.
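
To illustrate the idea behind the two constructor parameters visible in the excerpt, here is a toy, single-process sketch: a background thread averages the parameters of all simulated workers (loosely analogous to `peer_selection_mode="all"`) every `sync_interval_ms` milliseconds, and an explicit `stop()` ends the thread. `ToyAsyncAverager` and all of its methods are hypothetical; this is not Bagua's implementation and uses no Bagua or NCCL calls.

```python
import threading


class ToyAsyncAverager:
    """Illustrative stand-in for asynchronous model averaging."""

    def __init__(self, workers, sync_interval_ms=500):
        # `workers` is a list of parameter lists, one per simulated worker.
        self.workers = workers
        self.sync_interval_s = sync_interval_ms / 1000.0
        self.rounds = 0
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._first_round = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _average_once(self):
        with self._lock:
            n = len(self.workers)
            # Element-wise mean across all workers.
            mean = [sum(column) / n for column in zip(*self.workers)]
            for params in self.workers:
                params[:] = mean
            self.rounds += 1
        self._first_round.set()

    def _run(self):
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self.sync_interval_s):
            self._average_once()

    def start(self):
        self._thread.start()

    def wait_for_round(self, timeout=None):
        """Block until at least one averaging round has completed."""
        return self._first_round.wait(timeout)

    def stop(self):
        """Signal the background thread to exit and wait for it."""
        self._stop.set()
        self._thread.join()


if __name__ == "__main__":
    workers = [[1.0, 2.0], [3.0, 6.0]]
    averager = ToyAsyncAverager(workers, sync_interval_ms=10)
    averager.start()
    averager.wait_for_round(timeout=5.0)
    averager.stop()
    print(workers)
```

The explicit `stop()` reflects the concern raised earlier in the thread: a background averaging thread needs a well-defined way to be ended before the program exits.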

Contributor Author

@NOBLES5E NOBLES5E left a comment

I refined the text and left some comments. @wangraying, please do a final check.

Review thread on bagua/torch_api/bucket.py (outdated, resolved)
NOBLES5E and others added 3 commits August 22, 2021 20:37
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@NOBLES5E NOBLES5E merged commit fcef9ef into master Aug 23, 2021
@NOBLES5E NOBLES5E deleted the feat/async-model-average branch August 23, 2021 04:53
@todo todo bot mentioned this pull request Aug 23, 2021
@pr-triage pr-triage bot added the PR: merged label Aug 23, 2021

2 participants