feat(distributed): RPC-based distributed training support and add distributed MAML example #83

Merged
XuehaiPan merged 8 commits into metaopt:main from distributed on Sep 28, 2022

Conversation

XuehaiPan (Member) commented Sep 21, 2022

Description

Describe your changes in detail.

Motivation and Context

Why is this change required? What problem does it solve?
If it fixes an open issue, please link to the issue here.
You can use the syntax `close #15213` if this solves issue #15213.

  • I have raised an issue to propose this change (required for new features and bug fixes)

Resolves #57

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds core functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)
  • Example (update in the examples folder)

Implemented Tasks

New APIs

  • torchopt.distributed.is_available
  • torchopt.distributed.backward

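A minimal sketch (not part of this PR) of the availability check, assuming it mirrors `torch.distributed.is_available()`; the distributed `backward` helper is exercised in the training-loop example further below:

import torchopt.distributed as todist

if todist.is_available():
    print('RPC-based distributed training helpers are available in this build')
else:
    print('falling back to single-process training')
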
World and Process Group:

  • torchopt.distributed.auto_init_rpc
  • torchopt.distributed.get_world_info
  • torchopt.distributed.get_world_rank (torchopt.distributed.get_rank)
  • torchopt.distributed.get_world_size
  • torchopt.distributed.get_local_rank
  • torchopt.distributed.get_local_world_size
  • torchopt.distributed.barrier

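A hedged sketch of the world-inspection helpers, assuming the getters take no arguments; the exact structure returned by `get_world_info` is not restated here, so it is only printed:

import torchopt.distributed as todist

def report_world():
    # Print where the current process sits in the world.
    print(
        f'rank={todist.get_rank()}/{todist.get_world_size()} '
        f'local_rank={todist.get_local_rank()}/{todist.get_local_world_size()} '
        f'world_info={todist.get_world_info()}'
    )
    todist.barrier()  # wait until every worker has reported
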
Wrappers:

  • torchopt.distributed.auto_init_rpc
  • torchopt.distributed.on_rank
  • torchopt.distributed.not_on_rank
  • torchopt.distributed.rank_zero_only
  • torchopt.distributed.rank_non_zero_only

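A hedged sketch of the rank-based wrappers; the assumption is that `on_rank`/`not_on_rank` accept explicit rank indices and that a wrapped function is simply skipped on non-matching ranks:

import torchopt.distributed as todist

@todist.rank_zero_only
def write_checkpoint(state):
    ...  # runs on rank 0 only

@todist.rank_non_zero_only
def prefetch_data():
    ...  # runs on every rank except 0

@todist.on_rank(0, 1)    # assumed to accept one or more rank indices
def run_on_first_two_ranks():
    ...

@todist.not_on_rank(0)   # assumed to mirror on_rank with inverted selection
def run_on_workers_only():
    ...
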
Remote call:

  • torchopt.distributed.remote_async_call
  • torchopt.distributed.remote_sync_call
  • torchopt.distributed.parallelize (torchopt.distributed.parallelize_sync)
  • torchopt.distributed.parallelize_async

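A hedged sketch of call parallelization, based on the decorator usage in the example at the end of this description; `evaluate` and `compute_metric` are placeholder names, and treating `remote_sync_call`/`remote_async_call` as the functional counterparts of `parallelize`/`parallelize_async` is an assumption:

import torchopt.distributed as todist

@todist.parallelize(partitioner=todist.batch_partitioner, reducer=todist.mean_reducer)
def evaluate(model_rref, batch):
    # Runs on the remote workers: each worker receives a slice of `batch`,
    # materializes the module from the RRef, and returns a partial metric
    # that the reducer averages back on the caller.
    model = model_rref.to_here()
    return compute_metric(model, batch)  # `compute_metric` is a placeholder
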
Misc:

  • torchopt.distributed.dim_partitioner
  • torchopt.distributed.batch_partitioner
  • torchopt.distributed.exclusive_batch_partitioner
  • torchopt.distributed.mean_reducer
  • torchopt.distributed.sum_reducer

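A hedged sketch of how a partitioner/reducer pair plugs into `parallelize`; `loss_fn` is a placeholder, and the exact semantics of `dim_partitioner` and `exclusive_batch_partitioner` are not restated here:

import torchopt.distributed as todist

# Same pattern as compute_loss in the example below, but summing rather
# than averaging the per-worker results.
@todist.parallelize(partitioner=todist.batch_partitioner, reducer=todist.sum_reducer)
def compute_total_loss(model_rref, batch):
    model = model_rref.to_here()
    return loss_fn(model, batch)  # `loss_fn` is a placeholder
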
Examples:

import torchopt.distributed as todist

def worker_init_fn():
    # Set the process title, seed the RNGs, etc.
    ...


@todist.auto_init_rpc(worker_init_fn)
def main():
    ...
    model = Model(...)
    train(model)        # executed on rank 0 only
    save_model(model)   # executed on rank 0 only


@todist.rank_zero_only
def save_model(model):
    ...


@todist.rank_zero_only
def train(model):
    model_rref = todist.rpc.RRef(model)
    dataloader = DataLoader(...)
    optimizer = Optimizer(...)

    for batch in dataloader:
        optimizer.zero_grad()
        with todist.autograd.context() as context_id:
            loss = compute_loss(model_rref, batch)
            todist.autograd.backward(context_id, loss)  # in a single process, we would use `loss.backward()`
            optimizer.step()


@todist.parallelize(partitioner=todist.batch_partitioner, reducer=todist.mean_reducer)
def compute_loss(model_rref, batch):
    model = model_rref.to_here()
    loss = ...
    return loss

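For completeness, a sketch of the script entry point; the assumption (not stated in this PR) is that `auto_init_rpc` picks up the standard `torch.distributed` environment variables, so the script can be started with a multi-process launcher such as `torchrun`:

if __name__ == '__main__':
    # Every spawned process calls the decorated `main`; the decorator is
    # expected to initialize the RPC framework and call `worker_init_fn`.
    main()
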
Checklist

Go over all the following points, and put an x in all the boxes that apply.
If you are unsure about any of these, don't hesitate to ask. We are here to help!

  • I have read the CONTRIBUTION guide (required)
  • My change requires a change to the documentation.
  • I have updated the tests accordingly (required for a bug fix or a new feature).
  • I have updated the documentation accordingly.
  • I have reformatted the code using make format (required)
  • I have checked the code using make lint (required)
  • I have ensured make test passes. (required)

XuehaiPan added the enhancement, pytorch, example / tutorial, and feature labels on Sep 21, 2022
codecov-commenter commented Sep 24, 2022

Codecov Report

Base: 73.20% // Head: 66.59% // Decreases project coverage by 6.61% ⚠️

Coverage data is based on head (d2a3657) compared to base (af6d24c).
Patch coverage: 37.53% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #83      +/-   ##
==========================================
- Coverage   73.20%   66.59%   -6.61%     
==========================================
  Files          33       37       +4     
  Lines        1515     1853     +338     
==========================================
+ Hits         1109     1234     +125     
- Misses        406      619     +213     
Flag        Coverage Δ
unittests   66.59% <37.53%> (-6.61%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files                     Coverage Δ
torchopt/utils.py                  23.77% <6.25%> (-5.06%) ⬇️
torchopt/distributed/api.py        27.81% <27.81%> (ø)
torchopt/distributed/autograd.py   30.61% <30.61%> (ø)
torchopt/distributed/world.py      53.76% <53.76%> (ø)
torchopt/pytree.py                 63.63% <55.55%> (-36.37%) ⬇️
torchopt/typing.py                 96.15% <83.33%> (-3.85%) ⬇️
torchopt/distributed/__init__.py   88.88% <88.88%> (ø)
torchopt/__init__.py               100.00% <100.00%> (ø)


☔ View full report at Codecov.

Resolved review threads:
  • torchopt/optim/sgd.py (outdated)
  • torchopt/linear_solve.py (outdated)
  • torchopt/distributed/autograd.py (outdated)
  • torchopt/distributed/autograd.py
XuehaiPan merged commit a6aba36 into metaopt:main on Sep 28, 2022
XuehaiPan deleted the distributed branch on Sep 28, 2022 at 09:07
XuehaiPan added the distributed label on Feb 16, 2023
Labels
distributed · enhancement · example / tutorial · feature · pytorch
Development

Successfully merging this pull request may close these issues.

[Feature Request] Task-level Optimization with Distributed Data Parallelization
3 participants