
feat(python, core): support mutable bucket tensors #271

Merged: 38 commits merged into master from bucket-tensor on Oct 28, 2021

Conversation

@wangraying (Member) commented on Oct 9, 2021:

BREAKING CHANGE:

  • BaguaTensor::bagua_ensure_grad returns the tensor itself now
  • BaguaTensor::bagua_set_storage is renamed to BaguaTensor::bagua_set_registered_storage
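
For downstream code, the two breaking changes look roughly as follows. This is a minimal migration sketch, not taken from the PR diff; it assumes that importing `bagua.torch_api` applies the patched `bagua_*` methods to `torch.Tensor` and that `bagua_set_registered_storage` keeps the old argument order.

```python
# Hedged migration sketch; `storage` and `offset` are illustrative placeholders.
import torch
import bagua.torch_api  # assumed to patch torch.Tensor with the bagua_* methods

param = torch.nn.Parameter(torch.zeros(4))

# 1. bagua_ensure_grad now returns the tensor it is called on, so calls can be
#    chained; code that relied on the previous return value needs updating.
result = param.bagua_ensure_grad()
assert result is param and param.grad is not None

# 2. bagua_set_storage was renamed; calls such as
#        t.bagua_set_storage(storage, offset)
#    become
#        t.bagua_set_registered_storage(storage, offset)
```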

@wangraying changed the title from "fix: refactor bagua tensor" to "fix(python, core): refactor bagua tensor" on Oct 20, 2021.
Create a zero gradient tensor for the current parameter if it does not exist.

Returns:
    The original tensor.
Contributor commented:

This is a breaking change.

@NOBLES5E (Contributor) left a comment.

@wangraying changed the title from "fix(python, core): refactor bagua tensor" to "fix(python, core): support mutable bucket tensors" on Oct 21, 2021.
@wangraying (Member, PR author) commented:

close #287

@@ -10,29 +10,54 @@
 @gorilla.patches(torch.Tensor, filter=lambda name, obj: "bagua" in name)
 class BaguaTensor:
     """
-    This class patch `torch.Tensor <https://pytorch.org/docs/stable/tensors.html?highlight=tensor#torch.Tensor>`_ with additional methods.
+    This class patch `torch.Tensor <https://pytorch.org/docs/stable/tensors.html?highlight=tensor#torch.Tensor>`_
+    with additional methods.
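
For readers unfamiliar with `gorilla`, the decorator above only declares patches; they are applied separately, so that every `bagua`-named method of `BaguaTensor` ends up attached to `torch.Tensor`. A minimal, hedged sketch of the pattern (the `DemoTensor` class, the `demo` filter, and the apply loop are illustrative, not Bagua's actual patch-application code):

```python
# Hedged illustration of the gorilla patching pattern; DemoTensor and
# demo_say_shape are made-up names, not Bagua code.
import sys

import gorilla
import torch


@gorilla.patches(torch.Tensor, filter=lambda name, obj: "demo" in name)
class DemoTensor:
    def demo_say_shape(self) -> str:
        # Once the patch is applied, `self` is the plain torch.Tensor
        # on which the method is called.
        return f"shape={tuple(self.shape)}"


# The decorator only declares the patches; they still have to be applied.
for patch in gorilla.find_patches([sys.modules[__name__]]):
    gorilla.apply(patch)

t = torch.zeros(2, 3)
print(t.demo_say_shape())  # -> shape=(2, 3)
```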
@NOBLES5E (Contributor) commented on Oct 27, 2021:
Bagua Tensor features a proxy structure, where the actual tensor used by the backend is accessed via a "Proxy Tensor".
The proxy tensor is registered in Bagua. Whenever the Bagua backend needs a tensor (for example, to use it for
communication), it calls the `getter_closure` on the proxy tensor to get the tensor that is actually worked on.
We call this tensor the "Effective Tensor". Their relation is shown in the following diagram:


             ┌───────────────┐
             │ Bagua Backend │
             └──────▲────────┘
                    │
                  access
                    │
   ┌────────────────┼────────────────┐
   │Bagua Tensor    │                │
   │        ┌───────┴────────┐       │
   │        │  Proxy Tensor  │       │
   │        └───┬──────▲─────┘       │
   │            │      │             │
   │ setter_closure  getter_closure  │
   │            │      │             │
   │     ┌──────▼──────┴───────┐     │
   │     │  Effective Tensor   │     │
   │     └─────────────────────┘     │
   │                                 │
   └─────────────────────────────────┘

For example, in the gradient allreduce algorithm, the effective tensor that
needs to be exchanged between machines is the gradient. In this case, we
register the model parameters as proxy tensors and register `getter_closure`
to be `lambda proxy_tensor: proxy_tensor.grad`. In this way, even if the
gradient tensor is recreated or changed during runtime, Bagua can still use
the correct tensor for communication, since the `proxy_tensor` serves as the
root for access and is never replaced.
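
A hedged sketch of what such a registration might look like from user code; the `ensure_bagua_tensor` call and its keyword names follow my reading of this PR and may not match the final API, and the `name`/`module_name` values are made up:

```python
# Hedged sketch: register a parameter as the proxy tensor whose effective
# tensor is its (possibly recreated) gradient.
import torch
import bagua.torch_api  # assumed to patch torch.Tensor with bagua_* methods

param = torch.nn.Parameter(torch.zeros(16))
param.bagua_ensure_grad()  # create a zero .grad if it does not exist yet

param.ensure_bagua_tensor(
    name="layer0.weight",       # illustrative tensor name
    module_name="demo_module",  # illustrative owning-module name
    getter_closure=lambda proxy_tensor: proxy_tensor.grad,
)

# The backend now always reaches the current gradient through the proxy,
# even if param.grad is later replaced by a different tensor.
```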

The `setter_closure` is used to replace the effective tensor at runtime, so that a
customized workflow can swap in a new effective tensor without re-registering the
proxy tensor.

`getter_closure` takes the registered (proxy) tensor as input and returns a PyTorch tensor, the effective tensor.

`setter_closure` takes the registered (proxy) tensor and a new PyTorch tensor as input, installs the latter as the new effective tensor, and returns nothing.
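
A hedged sketch of the `setter_closure` side, continuing the registration example above (same caveat: the registration API and keyword names shown are my assumption based on this PR's description):

```python
import torch
import bagua.torch_api  # assumed to patch torch.Tensor with bagua_* methods

param = torch.nn.Parameter(torch.zeros(16))
param.bagua_ensure_grad()

def set_grad(proxy_tensor: torch.Tensor, new_effective: torch.Tensor) -> None:
    # Install a new effective tensor (here, a new gradient) on the proxy.
    proxy_tensor.grad = new_effective

param.ensure_bagua_tensor(
    name="layer0.weight",       # illustrative tensor name
    module_name="demo_module",  # illustrative owning-module name
    getter_closure=lambda proxy_tensor: proxy_tensor.grad,
    setter_closure=set_grad,
)
```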

"""
Sets the underlying storage using an existing `torch.Storage <https://pytorch.org/docs/stable/storage.html?highlight=storage>`_.
Sets the underlying storage for the tensor returned by :meth:`bagua_getter_closure` with an existing
Contributor commented:

Use the terminology we defined in the class documentation.

Contributor commented:

Similarly for other methods' docs.
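
To make the storage-setting semantics concrete, here is a hedged sketch, in plain PyTorch, of what pointing tensors at slices of a shared flat buffer looks like; the flat buffer and offsets are illustrative, and `Tensor.set_` below merely stands in for whatever `bagua_set_registered_storage` does to the effective tensor.

```python
# Illustrative only: several tensors backed by one contiguous "bucket" buffer,
# each viewing the buffer at a different storage offset.
import torch

a = torch.zeros(4)
b = torch.zeros(6)

flat = torch.empty(a.numel() + b.numel())  # the bucket's flat buffer

a.set_(flat.storage(), storage_offset=0, size=a.shape)
b.set_(flat.storage(), storage_offset=a.numel(), size=b.shape)

flat.zero_()
assert a.data_ptr() == flat.data_ptr()  # a (and b) now alias the flat buffer
```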

@NOBLES5E (Contributor) left a comment:

see comments

@NOBLES5E merged commit de7df6e into master on Oct 28, 2021.
@NOBLES5E deleted the bucket-tensor branch on Oct 28, 2021 at 06:18.
NOBLES5E pushed a commit that referenced this pull request on Oct 28, 2021:

BREAKING CHANGE: `BaguaTensor::bagua_ensure_grad` returns the tensor itself now