DDP and Profiler #1774

Closed
anthonytec2 opened this issue May 11, 2020 · 1 comment · Fixed by #2029
Labels
bug (Something isn't working), help wanted (Open to be worked on)

Comments

@anthonytec2
Contributor

🐛 Bug

This bug was mentioned on Slack to @jeremyjordan, and I have also seen it. When you enable DDP and set profiler=True in the trainer args, training fails with TypeError: can't pickle _thread.RLock objects. I am unsure whether this is the intended behavior given the parallel backend, but thought I would mention it.

To Reproduce

Set the distributed_backend to ddp and profiler to True; a minimal sketch of such a setup and the resulting traceback follow.
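A minimal sketch of such a setup (assuming pytorch-lightning 0.7.5 as listed in the Environment section; ToyModel and the random dataset are hypothetical stand-ins, not the script from the traceback):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    """Hypothetical stand-in; any LightningModule triggers the same failure."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return {"loss": torch.nn.functional.cross_entropy(self(x), y)}

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    td = DataLoader(
        TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,))),
        batch_size=8,
    )
    # profiler=True combined with the ddp backend triggers the pickling error
    trainer = pl.Trainer(distributed_backend="ddp", gpus=2, profiler=True, max_epochs=1)
    trainer.fit(ToyModel(), train_dataloader=td)
```

Running this fails while spawning the DDP processes with the following traceback: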

  File "/home/anthony/robotics/learning/cheap_robot/cli.py", line 89, in main
    trainer.fit(model, train_dataloader=td)
  File "/home/anthony/.cache/pypoetry/virtualenvs/robotics-zp-60jGk-py3.6/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 751, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_processes, args=(model,))
  File "/home/anthony/.cache/pypoetry/virtualenvs/robotics-zp-60jGk-py3.6/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 162, in spawn
    process.start()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle _thread.RLock objects
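The pickling failure itself is plain Python behavior: the ddp backend calls mp.spawn, which uses the spawn start method and therefore pickles the target (the bound self.ddp_train, and with it the trainer state, presumably including the profiler or something it references that holds a lock), and any object carrying a threading lock cannot be pickled. A standalone sketch of that underlying behavior (the Holder class is purely illustrative):

```python
import pickle
import threading


class Holder:
    """Illustrative object that, like the failing trainer state, keeps an RLock."""

    def __init__(self):
        self.lock = threading.RLock()


# Raises "TypeError: can't pickle _thread.RLock objects" on Python 3.6
# (the wording of the message differs slightly on newer Python versions).
pickle.dumps(Holder())
```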

Expected behavior

Either a warning stating that the profiler does not currently work with the ddp backend, or the profiled time report being produced with the ddp backend.

Environment

  • CUDA:
    - GPU:
      - GeForce RTX 2080 Ti
      - GeForce RTX 2080 Ti
      - GeForce RTX 2080 Ti
      - GeForce RTX 2080 Ti
    - available: True
    - version: 10.1
  • Packages:
    - numpy: 1.18.4
    - pyTorch_debug: False
    - pyTorch_version: 1.4.0
    - pytorch-lightning: 0.7.5
    - tensorboard: 2.1.1
    - tqdm: 4.46.0
  • System:
    - OS: Linux
    - architecture:
      - 64bit
      - ELF
    - processor: x86_64
    - python: 3.6.8
@anthonytec2 added the bug (Something isn't working) and help wanted (Open to be worked on) labels on May 11, 2020
Renthal commented Nov 5, 2020

Related: if you set profiler=AdvancedProfiler() you get TypeError: can't pickle Profile objects, whereas with profiler=True it works.
Additionally, the docs mention the possibility of calling trainer = Trainer(..., profiler="advanced"), but this does not seem to be possible (also because the argument is typed as profiler: Optional[Union[BaseProfiler, bool]] = None).
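For reference, the failing variant described in this comment looks roughly like the sketch below (assuming the pytorch_lightning.profiler import path and the still-accepted distributed_backend argument of the releases current at the time; the gpus value is arbitrary):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.profiler import AdvancedProfiler

# With the ddp backend this fails with "TypeError: can't pickle Profile objects",
# while profiler=True works.
trainer = Trainer(distributed_backend="ddp", gpus=2, profiler=AdvancedProfiler())
```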
