Bug in DDP, but not DP modes. #1628

Closed
ternaus opened this issue Apr 27, 2020 · 5 comments · Fixed by #1632
Labels
bug (Something isn't working), help wanted (Open to be worked on)

Comments

ternaus commented Apr 27, 2020

Pytorch 1.5

In [3]: pytorch_lightning.__version__
Out[3]: '0.7.5rc1'

In DP mode everything works.

In DDP mode it fails with:

  File "/home/vladimir/anaconda3/envs/solaris/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/home/vladimir/anaconda3/envs/solaris/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/home/vladimir/anaconda3/envs/solaris/lib/python3.7/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <class 'torch._C._VariableFunctions'>: it's not the same object as torch._C._VariableFunctions

ternaus added the bug and help wanted labels Apr 27, 2020

williamFalcon commented

Can you post a Colab? Does 0.7.4 work OK?

quinor commented Apr 27, 2020

I am getting the same issue.

  • The model is picklable
  • It worked fine on 0.7.2
  • It breaks after the update to 0.7.4

My stacktrace:

Traceback (most recent call last):
  File "./train_stage3.py", line 254, in <module>
    trainer.fit(model)
  File "/net/people/plgquinor/venv/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 744, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_processes, args=(model,))
  File "/net/people/plgquinor/venv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/net/people/plgquinor/venv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 149, in start_processes
    process.start()
  File "/net/software/local/python/3.6.5/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/net/software/local/python/3.6.5/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/net/software/local/python/3.6.5/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/net/software/local/python/3.6.5/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/net/software/local/python/3.6.5/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/net/software/local/python/3.6.5/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <class 'torch._C._VariableFunctions'>: it's not the same object as torch._C._VariableFunctions

quinor commented Apr 27, 2020

The model is picklable - the error appears when I try to pickle the Trainer instance.
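
One way to narrow this kind of failure down, sketched with the standard library only: try to pickle each attribute in turn with the same ForkingPickler that appears in the stack trace. `find_unpicklable` is a hypothetical helper, the `SimpleNamespace` objects are stand-ins for a trainer and its callback (not Lightning classes), and the lambda is an assumption standing in for an unpicklable op.

```python
from multiprocessing.reduction import ForkingPickler
from types import SimpleNamespace


def find_unpicklable(obj, path="obj", seen=None):
    """Return attribute paths under `obj` that fail to pickle (hypothetical helper)."""
    seen = set() if seen is None else seen
    if id(obj) in seen:  # guard against reference cycles
        return []
    seen.add(id(obj))
    try:
        ForkingPickler.dumps(obj)  # same pickler multiprocessing uses for spawn
        return []
    except Exception:
        pass
    attrs = getattr(obj, "__dict__", None)
    if not isinstance(attrs, dict) or not attrs:
        return [path]  # nothing to descend into: this object is the culprit
    bad = []
    for name, value in attrs.items():
        bad.extend(find_unpicklable(value, f"{path}.{name}", seen))
    return bad or [path]


# Stand-ins only: a lambda plays the role of an unpicklable comparison op.
callback = SimpleNamespace(monitor="val_loss", monitor_op=lambda a, b: a < b)
trainer = SimpleNamespace(max_epochs=3, checkpoint_callback=callback)
print(find_unpicklable(trainer, "trainer"))
```

The helper only descends through `__dict__`, so objects that keep state elsewhere (for example in `__slots__` or inside containers) are reported as a whole rather than by attribute.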

[Update] I found the culprit. ModelCheckpoint is not picklable because of this line:
https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/callbacks/model_checkpoint.py#L118
torch.lt, torch.gt, etc. cannot be pickled. It worked previously when those were np.* ops.
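
A common way to sidestep this class of problem, shown here as a minimal sketch and not as the fix that was actually merged: keep only the mode name on the instance and look the comparison function up when it is needed. `CheckpointLike` and `_MODE_OPS` are hypothetical names, and `operator.lt` / `operator.gt` stand in for the torch ops so the example runs without torch.

```python
import operator
import pickle

# Module-level lookup table: names are stored on the instance, functions are not.
# Assumption: in a torch setting the values would be the torch comparison ops.
_MODE_OPS = {"min": operator.lt, "max": operator.gt}


class CheckpointLike:
    """Hypothetical stand-in for a checkpoint callback; not the Lightning class."""

    def __init__(self, monitor="val_loss", mode="min"):
        if mode not in _MODE_OPS:
            raise ValueError(f"unknown mode: {mode!r}")
        self.monitor = monitor
        self.mode = mode  # plain string, always picklable

    @property
    def monitor_op(self):
        # Resolved on access, so no function object lives in the instance state.
        return _MODE_OPS[self.mode]

    def is_better(self, current, best):
        return self.monitor_op(current, best)


cb = CheckpointLike(mode="min")
# vars(cb) is the state that pickle serializes for a plain instance like this.
restored_state = pickle.loads(pickle.dumps(vars(cb)))
print(restored_state)          # {'monitor': 'val_loss', 'mode': 'min'}
print(cb.is_better(0.1, 0.2))  # True
```

Because the instance state contains only strings, it no longer matters whether the comparison function itself can be pickled.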

williamFalcon commented

@quinor @ternaus Thanks for bringing this up! It turned out to be a bigger deal, haha.
Released 0.7.5 to fix it. Please skip 0.7.4.
