DDP and Profiler #1774

Closed
anthonytec2 opened this issue May 11, 2020 · 1 comment · Fixed by #2029
Labels
bug (Something isn't working), help wanted (Open to be worked on)

Comments

@anthonytec2
Contributor

🐛 Bug

This bug was mentioned on Slack to @jeremyjordan, and I have also seen it. When you enable DDP and set profiler=True in the trainer args, training fails with TypeError: can't pickle _thread.RLock objects. I am unsure whether this is the intended behavior given the parallel backend, but thought I would mention it.

To Reproduce

Set the distributed_backend to ddp and profiler to True; a minimal sketch of such a setup and the resulting traceback follow.
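A minimal sketch of such a setup (assuming pytorch-lightning 0.7.5 as listed in the Environment section; ToyModel and the random dataset are hypothetical stand-ins, not the script from the traceback):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    """Hypothetical stand-in; any LightningModule triggers the same failure."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return {"loss": torch.nn.functional.cross_entropy(self(x), y)}

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    td = DataLoader(
        TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,))),
        batch_size=8,
    )
    # profiler=True combined with the ddp backend triggers the pickling error
    trainer = pl.Trainer(distributed_backend="ddp", gpus=2, profiler=True, max_epochs=1)
    trainer.fit(ToyModel(), train_dataloader=td)
```

Running this fails while spawning the DDP processes with the following traceback: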

  File "/home/anthony/robotics/learning/cheap_robot/cli.py", line 89, in main
    trainer.fit(model, train_dataloader=td)
  File "/home/anthony/.cache/pypoetry/virtualenvs/robotics-zp-60jGk-py3.6/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 751, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_processes, args=(model,))
  File "/home/anthony/.cache/pypoetry/virtualenvs/robotics-zp-60jGk-py3.6/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 162, in spawn
    process.start()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle _thread.RLock objects
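The pickling failure itself is plain Python behavior: the ddp backend calls mp.spawn, which uses the spawn start method and therefore pickles the target (the bound self.ddp_train, and with it the trainer state, presumably including the profiler or something it references that holds a lock), and any object carrying a threading lock cannot be pickled. A standalone sketch of that underlying behavior (the Holder class is purely illustrative):

```python
import pickle
import threading


class Holder:
    """Illustrative object that, like the failing trainer state, keeps an RLock."""

    def __init__(self):
        self.lock = threading.RLock()


# Raises "TypeError: can't pickle _thread.RLock objects" on Python 3.6
# (the wording of the message differs slightly on newer Python versions).
pickle.dumps(Holder())
```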

Expected behavior

Either a warning stating that the profiler does not currently work with the ddp backend, or the profiled time report being produced with the ddp backend.

Environment

  • CUDA:
    - GPU:
      - GeForce RTX 2080 Ti
      - GeForce RTX 2080 Ti
      - GeForce RTX 2080 Ti
      - GeForce RTX 2080 Ti
    - available: True
    - version: 10.1
  • Packages:
    - numpy: 1.18.4
    - pyTorch_debug: False
    - pyTorch_version: 1.4.0
    - pytorch-lightning: 0.7.5
    - tensorboard: 2.1.1
    - tqdm: 4.46.0
  • System:
    - OS: Linux
    - architecture:
      - 64bit
      - ELF
    - processor: x86_64
    - python: 3.6.8
@anthonytec2 added the bug (Something isn't working) and help wanted (Open to be worked on) labels on May 11, 2020
Renthal commented Nov 5, 2020

Related: if you set profiler=AdvancedProfiler() you get TypeError: can't pickle Profile objects, whereas with profiler=True it works.
Additionally, the docs mention the possibility of calling trainer = Trainer(..., profiler="advanced"), but this does not seem to be possible (also because the argument is typed as profiler: Optional[Union[BaseProfiler, bool]] = None).
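For reference, the failing variant described in this comment looks roughly like the sketch below (assuming the pytorch_lightning.profiler import path and the still-accepted distributed_backend argument of the releases current at the time; the gpus value is arbitrary):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.profiler import AdvancedProfiler

# With the ddp backend this fails with "TypeError: can't pickle Profile objects",
# while profiler=True works.
trainer = Trainer(distributed_backend="ddp", gpus=2, profiler=AdvancedProfiler())
```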
