
Add ddp_cpu backend for testing ddp without GPUs #1158

Merged: 24 commits merged into Lightning-AI:master from the cpu_ddp branch on Apr 16, 2020

Conversation

@neggert (Contributor) commented Mar 15, 2020

New feature: DDP on CPU. This can be used for distributed training without GPUs, or to test/debug DDP on a single machine without GPUs. Since PyTorch already makes good use of multiple cores under the hood, this will not provide any speedup over normal CPU training if you're only using a single node.

API changes:

  • New distributed backend: distributed_backend="ddp_cpu"
  • New Trainer argument: num_processes. Controls the number of processes per node; it is currently only used by ddp_cpu. See the usage sketch below.
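For illustration, a minimal usage sketch of the two new options (`MyModel` is a placeholder for any LightningModule subclass, not something from this PR's diff):

```python
from pytorch_lightning import Trainer

# MyModel stands in for whatever LightningModule you already have.
model = MyModel()

# Spawn two processes on this node and run DDP over CPU; no GPUs required,
# which makes this handy for testing/debugging DDP logic locally.
trainer = Trainer(distributed_backend="ddp_cpu", num_processes=2, max_epochs=1)
trainer.fit(model)
```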

Still need to update the documentation, but I wanted to get some eyes on this before I got too far. Right now everything is functional as far as I can tell, and I've added a new test that covers this feature.

@pep8speaks commented Mar 15, 2020

Hello @neggert! Thanks for updating this PR.

Line 47:52: W504 line break after binary operator

Line 708:111: E501 line too long (118 > 110 characters)
Line 712:111: E501 line too long (118 > 110 characters)
Line 716:111: E501 line too long (118 > 110 characters)
Line 720:111: E501 line too long (118 > 110 characters)
Line 724:111: E501 line too long (117 > 110 characters)
Line 728:111: E501 line too long (117 > 110 characters)
Line 732:111: E501 line too long (117 > 110 characters)
Line 736:111: E501 line too long (118 > 110 characters)
Line 740:111: E501 line too long (117 > 110 characters)
Line 745:111: E501 line too long (116 > 110 characters)
Line 750:111: E501 line too long (116 > 110 characters)
Line 755:111: E501 line too long (118 > 110 characters)
Line 760:111: E501 line too long (117 > 110 characters)
Line 765:111: E501 line too long (117 > 110 characters)
Line 770:111: E501 line too long (117 > 110 characters)
Line 775:111: E501 line too long (117 > 110 characters)
Line 780:111: E501 line too long (117 > 110 characters)

Comment last updated at 2020-04-11 22:45:26 UTC

@codecov bot commented Mar 16, 2020

Codecov Report

Merging #1158 into master will increase coverage by 0%.
The diff coverage is 95%.

@@          Coverage Diff           @@
##           master   #1158   +/-   ##
======================================
  Coverage      91%     91%           
======================================
  Files          67      67           
  Lines        3742    3760   +18     
======================================
+ Hits         3400    3418   +18     
  Misses        342     342           

@neggert neggert force-pushed the cpu_ddp branch 2 times, most recently from cac041d to 3f1f0e1 on March 18, 2020 14:52
@tullie (Contributor) commented Mar 18, 2020

Have you considered just making ddp work with CPU when gpus=0 or gpus=None? I think if I were a user, that would be the intuitive thing for ddp to do without GPUs.

@@ -855,11 +855,15 @@ def init_ddp_connection(self):
         try:
             root_node = os.environ['SLURM_NODELIST'].split(' ')[0]
         except Exception:
-            root_node = '127.0.0.2'
+            root_node = '127.0.0.1'

Contributor:
why the IP change?

Contributor Author (@neggert):
I wish I knew. 127.0.0.1 works, 127.0.0.2 doesn't.


         root_node = self.trainer.resolve_root_node_address(root_node)
         os.environ['MASTER_ADDR'] = root_node
-        torch_distrib.init_process_group('nccl', rank=proc_rank, world_size=world_size)
+        if self.trainer.on_gpu:

Contributor:
why the different backends?

Contributor Author (@neggert):
NCCL only works for GPU. The PyTorch docs recommend using GLOO for CPU training: https://pytorch.org/docs/stable/distributed.html#backends
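Roughly, the selection amounts to something like this (a sketch only; `proc_rank`, `world_size`, and `self.trainer.on_gpu` are taken from the surrounding diff, and the merged code may arrange it differently):

```python
import torch.distributed as torch_distrib

# NCCL only supports CUDA tensors, so CPU-only runs fall back to Gloo,
# the backend the PyTorch docs recommend for CPU training.
backend = 'nccl' if self.trainer.on_gpu else 'gloo'
torch_distrib.init_process_group(backend, rank=proc_rank, world_size=world_size)
```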

-        output = self.module(*inputs, **kwargs)
+        # output = self.module(*inputs, **kwargs)
+        # lightning (ddp_cpu)
+        if self.module.training:

Contributor:
why this change?

Contributor Author (@neggert):
This is copying the changes you made above for the case where self.device_ids is None. I think previously, we would never hit this code.
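A hedged sketch of that branch, based on the snippet above (attribute names such as `self.module.testing` are assumptions, and the merged code may differ in detail):

```python
# Inside the Lightning DDP wrapper's forward(): with ddp_cpu there are no
# device_ids, so nothing is scattered to a GPU and the Lightning hooks are
# called directly on the wrapped module with the CPU tensors.
if self.device_ids is None:
    if self.module.training:
        output = self.module.training_step(*inputs, **kwargs)
    elif self.module.testing:
        output = self.module.test_step(*inputs, **kwargs)
    else:
        output = self.module.validation_step(*inputs, **kwargs)
```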

@neggert (Contributor Author) commented Mar 18, 2020

@tullie As discussed in Slack, I'll change things so that distributed_backend="ddp", gpus=None also gives DDP-on-CPU behavior. (gpus=0 runs on GPU 0 I believe. Don't think we should change that.)
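Assuming the fallback lands as described, the two configurations below should end up behaving the same (illustrative only, not code from this PR):

```python
from pytorch_lightning import Trainer

# Explicit CPU DDP backend added by this PR.
trainer_a = Trainer(distributed_backend="ddp_cpu", num_processes=2)

# Plain "ddp" with no GPUs requested falls back to the same CPU behavior.
trainer_b = Trainer(distributed_backend="ddp", gpus=None, num_processes=2)
```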

@neggert neggert force-pushed the cpu_ddp branch 2 times, most recently from e610a09 to 58f687a on March 20, 2020 20:17
@mergify bot commented Mar 24, 2020

This pull request is now in conflict... :(

@Borda Borda changed the title from "[WIP] Add ddp_cpu backend for testing ddp without CPUs" to "[WIP] Add ddp_cpu backend for testing ddp without GPUs" Mar 26, 2020
@Borda (Member) commented Mar 26, 2020

@neggert how is it going here? need any help? 🐰

@mergify mergify bot requested a review from a team March 30, 2020 22:33
@williamFalcon (Contributor) commented:
@neggert can we get this merged so it can go into 0.7.2?

@williamFalcon williamFalcon added this to the 0.7.3 milestone Apr 3, 2020
Comment on lines 195 to 196
if self.num_gpus == 0:
    pass
Member:
why these if ...: pass ?

Contributor Author (@neggert):
Just for clarity. I wanted to make clear to the future reader that we're considering this case and intentionally doing nothing.

Member:
I would skip it lol

Member:
Maybe a comment could achieve the same

Contributor Author (@neggert):
Replaced it with a comment

pytorch_lightning/trainer/distrib_data_parallel.py (outdated review thread, resolved)
tests/trainer/test_trainer.py (outdated review thread, resolved)
@neggert neggert changed the title from "[WIP] Add ddp_cpu backend for testing ddp without GPUs" to "Add ddp_cpu backend for testing ddp without GPUs" Apr 7, 2020
@Borda Borda requested review from tullie, williamFalcon and a team April 7, 2020 22:10
@williamFalcon williamFalcon changed the title from "Add ddp_cpu backend for testing ddp without GPUs" to "[WIP] Add ddp_cpu backend for testing ddp without GPUs" Apr 8, 2020
@williamFalcon (Contributor) commented:
Is this ready for 0.7.2 or push to 0.7.3?

@Borda (Member) commented Apr 8, 2020

I would leave it for the next release, 0.7.3. @neggert?

@neggert (Contributor Author) commented Apr 8, 2020

Yeah, let's not rush it into a release. 0.7.3 is fine.

@neggert neggert changed the title from "[WIP] Add ddp_cpu backend for testing ddp without GPUs" to "Add ddp_cpu backend for testing ddp without GPUs" Apr 11, 2020
@Borda Borda added the feature (Is an improvement or enhancement) and ci (Continuous Integration) labels Apr 11, 2020
pytorch_lightning/trainer/trainer.py (outdated review thread, resolved)
pytorch_lightning/core/lightning.py (outdated review thread, resolved)
@mergify mergify bot requested a review from a team April 11, 2020 22:41
@Borda Borda added the ready (PRs ready to be merged) label Apr 14, 2020
@Borda Borda requested a review from jeremyjordan April 14, 2020 09:12
@williamFalcon williamFalcon merged commit e3001a0 into Lightning-AI:master Apr 16, 2020
@neggert neggert deleted the cpu_ddp branch April 16, 2020 14:54
tullie pushed a commit to tullie/pytorch-lightning that referenced this pull request Jun 7, 2020
* Add tests for distributed backend config

* Refactor set_distributed_mode

* Use gloo backend on cpu

* Use 127.0.0.1 instead of 127.0.0.2

Not totally clear on why this is necessary, but it seemed to work

* Update LightningDDP so that it works with CPU

* Add ddp_cpu backend and num_processes Trainer arg

* PEP8

* Fix test skipping. Inequalities are hard :/

* Skip ddp_cpu test on Windows

* Make a few more cases fall back to ddp_cpu

* New function name

* Flake8

* Don't test distributed on MacOS with torch < 1.3

Support for distributed in MacOS was added in Torch 1.3.0

* Add ddp_cpu and num_processes to docs

* Parametrize trainer config tests

* Tweak warning

Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>

* Remove redundant test

* Replace pass branches with comments

* Add missing warnings import

* save_path -> root_dir

* Use new rank_zero_warn

* Whitespace

* Apply suggestions from code review

* formatting

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: J. Borovec <jirka.borovec@seznam.cz>
@Borda Borda modified the milestones: 0.7.4, v0.7.x Apr 18, 2021
Labels: ci (Continuous Integration), feature (Is an improvement or enhancement), ready (PRs ready to be merged)
Projects: None yet
6 participants