feat(distributed): RPC-based distributed training support and add distributed MAML example #83

Merged
XuehaiPan merged 8 commits into metaopt:main from distributed on Sep 28, 2022

Conversation

XuehaiPan (Member) commented Sep 21, 2022

Description

Describe your changes in detail.

Motivation and Context

Why is this change required? What problem does it solve?
If it fixes an open issue, please link to the issue here.
You can use the syntax `close #15213` if this solves issue #15213.

  • I have raised an issue to propose this change (required for new features and bug fixes)

Resolves #57

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds core functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)
  • Example (update in the examples folder)

Implemented Tasks

New APIs

  • torchopt.distributed.is_available
  • torchopt.distributed.backward

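A minimal sketch (not part of this PR) of the availability check, assuming it mirrors `torch.distributed.is_available()`; the distributed `backward` helper is exercised in the training-loop example further below:

import torchopt.distributed as todist

if todist.is_available():
    print('RPC-based distributed training helpers are available in this build')
else:
    print('falling back to single-process training')
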
World and Process Group:

  • torchopt.distributed.auto_init_rpc
  • torchopt.distributed.get_world_info
  • torchopt.distributed.get_world_rank (torchopt.distributed.get_rank)
  • torchopt.distributed.get_world_size
  • torchopt.distributed.get_local_rank
  • torchopt.distributed.get_local_world_size
  • torchopt.distributed.barrier

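A hedged sketch of the world-inspection helpers, assuming the getters take no arguments; the exact structure returned by `get_world_info` is not restated here, so it is only printed:

import torchopt.distributed as todist

def report_world():
    # Print where the current process sits in the world.
    print(
        f'rank={todist.get_rank()}/{todist.get_world_size()} '
        f'local_rank={todist.get_local_rank()}/{todist.get_local_world_size()} '
        f'world_info={todist.get_world_info()}'
    )
    todist.barrier()  # wait until every worker has reported
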
Wrappers:

  • torchopt.distributed.auto_init_rpc
  • torchopt.distributed.on_rank
  • torchopt.distributed.not_on_rank
  • torchopt.distributed.rank_zero_only
  • torchopt.distributed.rank_non_zero_only

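A hedged sketch of the rank-based wrappers; the assumption is that `on_rank`/`not_on_rank` accept explicit rank indices and that a wrapped function is simply skipped on non-matching ranks:

import torchopt.distributed as todist

@todist.rank_zero_only
def write_checkpoint(state):
    ...  # runs on rank 0 only

@todist.rank_non_zero_only
def prefetch_data():
    ...  # runs on every rank except 0

@todist.on_rank(0, 1)    # assumed to accept one or more rank indices
def run_on_first_two_ranks():
    ...

@todist.not_on_rank(0)   # assumed to mirror on_rank with inverted selection
def run_on_workers_only():
    ...
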
Remote call:

  • torchopt.distributed.remote_async_call
  • torchopt.distributed.remote_sync_call
  • torchopt.distributed.parallelize (torchopt.distributed.parallelize_sync)
  • torchopt.distributed.parallelize_async

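A hedged sketch of call parallelization, based on the decorator usage in the example at the end of this description; `evaluate` and `compute_metric` are placeholder names, and treating `remote_sync_call`/`remote_async_call` as the functional counterparts of `parallelize`/`parallelize_async` is an assumption:

import torchopt.distributed as todist

@todist.parallelize(partitioner=todist.batch_partitioner, reducer=todist.mean_reducer)
def evaluate(model_rref, batch):
    # Runs on the remote workers: each worker receives a slice of `batch`,
    # materializes the module from the RRef, and returns a partial metric
    # that the reducer averages back on the caller.
    model = model_rref.to_here()
    return compute_metric(model, batch)  # `compute_metric` is a placeholder
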
Misc:

  • torchopt.distributed.dim_partitioner
  • torchopt.distributed.batch_partitioner
  • torchopt.distributed.exclusive_batch_partitioner
  • torchopt.distributed.mean_reducer
  • torchopt.distributed.sum_reducer

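A hedged sketch of how a partitioner/reducer pair plugs into `parallelize`; `loss_fn` is a placeholder, and the exact semantics of `dim_partitioner` and `exclusive_batch_partitioner` are not restated here:

import torchopt.distributed as todist

# Same pattern as compute_loss in the example below, but summing rather
# than averaging the per-worker results.
@todist.parallelize(partitioner=todist.batch_partitioner, reducer=todist.sum_reducer)
def compute_total_loss(model_rref, batch):
    model = model_rref.to_here()
    return loss_fn(model, batch)  # `loss_fn` is a placeholder
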
Examples:

import torchopt.distributed as todist

def worker_init_fn():
    # Set the process title, seed the RNGs, etc.
    ...


@todist.auto_init_rpc(worker_init_fn)
def main():
    ...
    model = Model(...)
    train(model)        # executed on rank 0 only
    save_model(model)   # executed on rank 0 only


@todist.rank_zero_only
def save_model(model):
    ...


@todist.rank_zero_only
def train(model):
    model_rref = todist.rpc.RRef(model)
    dataloader = DataLoader(...)
    optimizer = Optimizer(...)

    for batch in dataloader:
        optimizer.zero_grad()
        with todist.autograd.context() as context_id:
            loss = compute_loss(model_rref, batch)
            todist.autograd.backward(context_id, loss)  # in a single process, we would use `loss.backward()`
            optimizer.step()


@todist.parallelize(partitioner=todist.batch_partitioner, reducer=todist.mean_reducer)
def compute_loss(model_rref, batch):
    model = model_rref.to_here()
    loss = ...
    return loss

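For completeness, a sketch of the script entry point; the assumption (not stated in this PR) is that `auto_init_rpc` picks up the standard `torch.distributed` environment variables, so the script can be started with a multi-process launcher such as `torchrun`:

if __name__ == '__main__':
    # Every spawned process calls the decorated `main`; the decorator is
    # expected to initialize the RPC framework and call `worker_init_fn`.
    main()
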
Checklist

Go over all the following points, and put an x in all the boxes that apply.
If you are unsure about any of these, don't hesitate to ask. We are here to help!

  • I have read the CONTRIBUTION guide (required)
  • My change requires a change to the documentation.
  • I have updated the tests accordingly (required for a bug fix or a new feature).
  • I have updated the documentation accordingly.
  • I have reformatted the code using make format (required)
  • I have checked the code using make lint (required)
  • I have ensured make test passes. (required)

XuehaiPan added the enhancement, pytorch, example / tutorial, and feature labels on Sep 21, 2022
codecov-commenter commented Sep 24, 2022

Codecov Report

Base: 73.20% // Head: 66.59% // Decreases project coverage by 6.61% ⚠️

Coverage data is based on head (d2a3657) compared to base (af6d24c).
Patch coverage: 37.53% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #83      +/-   ##
==========================================
- Coverage   73.20%   66.59%   -6.61%     
==========================================
  Files          33       37       +4     
  Lines        1515     1853     +338     
==========================================
+ Hits         1109     1234     +125     
- Misses        406      619     +213     
Flag        Coverage Δ
unittests   66.59% <37.53%> (-6.61%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files                     Coverage Δ
torchopt/utils.py                  23.77% <6.25%> (-5.06%) ⬇️
torchopt/distributed/api.py        27.81% <27.81%> (ø)
torchopt/distributed/autograd.py   30.61% <30.61%> (ø)
torchopt/distributed/world.py      53.76% <53.76%> (ø)
torchopt/pytree.py                 63.63% <55.55%> (-36.37%) ⬇️
torchopt/typing.py                 96.15% <83.33%> (-3.85%) ⬇️
torchopt/distributed/__init__.py   88.88% <88.88%> (ø)
torchopt/__init__.py               100.00% <100.00%> (ø)


☔ View full report at Codecov.

Resolved review threads:
  • torchopt/optim/sgd.py (outdated)
  • torchopt/linear_solve.py (outdated)
  • torchopt/distributed/autograd.py (outdated)
  • torchopt/distributed/autograd.py
XuehaiPan merged commit a6aba36 into metaopt:main on Sep 28, 2022
XuehaiPan deleted the distributed branch on Sep 28, 2022 at 09:07
XuehaiPan added the distributed label on Feb 16, 2023
Labels
distributed · enhancement · example / tutorial · feature · pytorch
Development

Successfully merging this pull request may close these issues.

[Feature Request] Task-level Optimization with Distributed Data Parallelization
3 participants