'bad value(s) in fds_to_keep' error in DDP mode #1550

Closed
bobofzhang opened this issue Apr 21, 2020 · 5 comments
Labels
bug (Something isn't working) · help wanted (Open to be worked on)

Comments


bobofzhang commented Apr 21, 2020

🐛 Bug

To Reproduce

If I put spectral_norm in the model, it reports the error "bad value(s) in fds_to_keep". Even the example provided by pytorch-lightning has this issue.

Steps to reproduce the behavior:
Change the example model lightning_template.py to:

    self.c_d1 = nn.Linear(in_features=self.hparams.in_features,
                          out_features=self.hparams.hidden_dim)
    self.c_d1 = spectral_norm(self.c_d1)

    self.c_d1_bn = nn.BatchNorm1d(self.hparams.hidden_dim)
    self.c_d1_drop = nn.Dropout(self.hparams.drop_prob)

    self.c_d2 = nn.Linear(in_features=self.hparams.hidden_dim,
                          out_features=self.hparams.out_features)
    self.c_d2 = spectral_norm(self.c_d2)

Run the example with:
python3 gpu_template.py --gpus 2 --distributed_backend ddp

We get the following error message:

    Traceback (most recent call last):
      File "gpu_template.py", line 80, in <module>
        main(hyperparams)
      File "gpu_template.py", line 41, in main
        trainer.fit(model)
      File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 692, in fit
        mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model,))
      File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 162, in spawn
        process.start()
      File "/usr/lib/python3.6/multiprocessing/process.py", line 105, in start
        self._popen = self._Popen(self)
      File "/usr/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
        return Popen(process_obj)
      File "/usr/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
        super().__init__(process_obj)
      File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
        self._launch(process_obj)
      File "/usr/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 59, in _launch
        cmd, self._fds)
      File "/usr/lib/python3.6/multiprocessing/util.py", line 417, in spawnv_passfds
        False, False, None)
    ValueError: bad value(s) in fds_to_keep

Environment

  • CUDA:
    • GPU:
      • Tesla V100-SXM2-32GB
      • Tesla V100-SXM2-32GB
      • Tesla V100-SXM2-32GB
      • Tesla V100-SXM2-32GB
      • Tesla V100-SXM2-32GB
      • Tesla V100-SXM2-32GB
      • Tesla V100-SXM2-32GB
      • Tesla V100-SXM2-32GB
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.18.2
    • pyTorch_debug: False
    • pyTorch_version: 1.4.0
    • pytorch-lightning: 0.7.3
    • tensorboard: 2.2.0
    • tqdm: 4.45.0
  • System:
bobofzhang added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Apr 21, 2020
bobofzhang (Author) commented:

lightning_template.txt
gpu_template.txt

You can rename .txt to .py to verify the bug.


sneiman (Contributor) commented Apr 25, 2020

Check #538. The relevant solution is copied here for your convenience. I don't know exactly what spectral_norm() returns, but this should give you a lead to check out:

(from #538):
I have verified what causes the problem in my model and what will fix it. The problem is my naively assigning a parameter to another variable. This new reference to the parameter does not get moved to the correct gpu when pytorch-lightning copies it with model.cuda(gpu_idx) in ddp_train(). The reference is in another process space when ddp is used, and so creates the multiprocessing fault noted at the head of this issue.

This is NOT a ptl bug. This is the result of the naive assignment of the parameter to another variable:

    # nn.Parameter() ensures pytorch knows about this - and will move it new gpu when required
    self.class_p        = nn.Parameter(torch.Tensor(np.ones(self.data.num_classes) * np.log(1.0)), requires_grad=True)

    # causes a crash if using ddp: self.class_p_t refers to the original process space, not the one it has been moved to by ddp
    self.class_p_t      = self.class_p.data

self.class_p is known to pytorch as a parameter, and so is copied to the correct gpu. But the reference to it in self.class_p_t is not known to pytorch as a parameter, and so this reference is not updated when the model is copied. To fix this simply, do a deep copy instead of the naive assignment. The self.class_p_t is still not moved to the gpu, but it is now within the process space of each ddp model:

    # this now works
    self.class_p        = nn.Parameter(torch.Tensor(np.ones(self.data.num_classes) * np.log(1.0)), requires_grad=True)
    self.class_p_t      = copy.deepcopy(self.class_p.data)

Hope this helps ...
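As a quick way to spot this kind of stray reference in a larger model, here is a small diagnostic sketch (a hypothetical helper, not part of lightning or PyTorch) that lists plain tensor attributes which are neither parameters nor buffers:

    import torch
    import torch.nn as nn

    def find_unmanaged_tensors(module: nn.Module):
        """Return (module class, attribute name) pairs for plain tensor attributes.

        These attributes are neither parameters nor buffers, so .to()/.cuda()
        will not move them -- the kind of stray reference described above.
        """
        managed = set()
        for m in module.modules():
            managed.update(id(p) for p in m.parameters(recurse=False))
            managed.update(id(b) for b in m.buffers(recurse=False))

        hits = []
        for m in module.modules():
            for name, value in vars(m).items():
                if torch.is_tensor(value) and id(value) not in managed:
                    hits.append((type(m).__name__, name))
        return hits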

bobofzhang (Author) commented:

Thanks for your reply. spectral_norm is a standard module in PyTorch, and I can run it in a pure PyTorch implementation, but when I use pytorch_lightning it reports the error above, so I think this may be a bug in pytorch_lightning.


sneiman (Contributor) commented Apr 28, 2020

A very frustrating situation for you, I am sure. I am a little suspicious that this is actually a problem with spectral_norm(). It makes some internal cloning decisions that might be causing this problem. I posted a question on pytorch referring to this issue.
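For anyone digging into the spectral_norm() angle, here is roughly what it leaves on a wrapped layer. This is based on my reading of the torch.nn.utils.spectral_norm implementation around PyTorch 1.4, so worth double-checking against your version:

    import torch.nn as nn
    from torch.nn.utils import spectral_norm

    layer = spectral_norm(nn.Linear(16, 8))

    # The original weight Parameter is re-registered as `weight_orig`, the
    # power-iteration vectors become buffers, and `weight` itself is left as
    # a plain tensor attribute recomputed by a forward pre-hook.
    print([n for n, _ in layer.named_parameters()])  # expect something like ['bias', 'weight_orig']
    print([n for n, _ in layer.named_buffers()])     # expect something like ['weight_u', 'weight_v']
    print(type(layer.weight))                        # a plain Tensor, not an nn.Parameter

That plain weight attribute looks like the same kind of unregistered reference discussed in the comment above; whether it is what actually trips up mp.spawn here I can't say for certain.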


rmrao (Contributor) commented May 14, 2020

Out of curiosity, what happens if you don't assign the linear layer to the module before calling spectral norm? (i.e. self.c_d2 = spectral_norm(nn.Linear(...)))

I ran into a similar issue, and removing the assignment or doing a manual clone should fix it. It's possibly not a pytorch-lightning bug: lightning uses torch's spawn function by default rather than the launch utility, which creates child processes differently, and in this case that might cause the error.
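For what it's worth, here is a minimal sketch of the pattern rmrao suggests, written as a plain nn.Module with placeholder sizes rather than the template's hparams (untested against this exact setup):

    import torch
    import torch.nn as nn
    from torch.nn.utils import spectral_norm

    class TemplateNet(nn.Module):
        def __init__(self, in_features=784, hidden_dim=512, drop_prob=0.2, out_features=10):
            super().__init__()
            # wrap at construction time; the bare nn.Linear is never assigned
            # to self first and then re-assigned after wrapping
            self.c_d1 = spectral_norm(nn.Linear(in_features, hidden_dim))
            self.c_d1_bn = nn.BatchNorm1d(hidden_dim)
            self.c_d1_drop = nn.Dropout(drop_prob)
            self.c_d2 = spectral_norm(nn.Linear(hidden_dim, out_features))

        def forward(self, x):
            x = self.c_d1_drop(torch.relu(self.c_d1_bn(self.c_d1(x))))
            return self.c_d2(x)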
