feat(python): associate PyTorch Process Group with Bagua Process Group using cache #402
bagua/torch_api/communication.py
Outdated
@@ -49,6 +49,8 @@
# Process group count for default naming
_group_count = 0

# Torch process group to bagua process group
_torch_to_bagua_pg_map = {}
What if the process group is destroyed? It seems the cached entries will never be released in the current implementation.
Destruction of the TorchProcessGroup is not handled yet, and I have no good way to deal with it.
I tried to patch the Bagua process group onto the TorchProcessGroup as an attribute, but that does not work when the class is NCCLProcessGroup, because a C object (NCCLProcessGroup) does not support adding attributes. That is why the final plan is what it is now. Any suggestions?
see comments
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>