'bad value(s) in fds_to_keep' error in DDP mode #1550

Closed
bobofzhang opened this issue Apr 21, 2020 · 5 comments
Labels
bug (Something isn't working) · help wanted (Open to be worked on)

Comments


bobofzhang commented Apr 21, 2020

🐛 Bug

To Reproduce

If I put spectral_norm in the model, it reports the error "bad value(s) in fds_to_keep". Even the example provided by pytorch-lightning has this issue.

Steps to reproduce the behavior:
Change the example model lightning_template.py to:

    self.c_d1 = nn.Linear(in_features=self.hparams.in_features,
                          out_features=self.hparams.hidden_dim)
    self.c_d1 = spectral_norm(self.c_d1)

    self.c_d1_bn = nn.BatchNorm1d(self.hparams.hidden_dim)
    self.c_d1_drop = nn.Dropout(self.hparams.drop_prob)

    self.c_d2 = nn.Linear(in_features=self.hparams.hidden_dim,
                          out_features=self.hparams.out_features)
    self.c_d2 = spectral_norm(self.c_d2)

Run the example with:
python3 gpu_template.py --gpus 2 --distributed_backend ddp

We get the following error message:

    Traceback (most recent call last):
      File "gpu_template.py", line 80, in <module>
        main(hyperparams)
      File "gpu_template.py", line 41, in main
        trainer.fit(model)
      File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 692, in fit
        mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model,))
      File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 162, in spawn
        process.start()
      File "/usr/lib/python3.6/multiprocessing/process.py", line 105, in start
        self._popen = self._Popen(self)
      File "/usr/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
        return Popen(process_obj)
      File "/usr/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
        super().__init__(process_obj)
      File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
        self._launch(process_obj)
      File "/usr/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 59, in _launch
        cmd, self._fds)
      File "/usr/lib/python3.6/multiprocessing/util.py", line 417, in spawnv_passfds
        False, False, None)
    ValueError: bad value(s) in fds_to_keep

Environment

  • CUDA:
    • GPU:
      • Tesla V100-SXM2-32GB
      • Tesla V100-SXM2-32GB
      • Tesla V100-SXM2-32GB
      • Tesla V100-SXM2-32GB
      • Tesla V100-SXM2-32GB
      • Tesla V100-SXM2-32GB
      • Tesla V100-SXM2-32GB
      • Tesla V100-SXM2-32GB
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.18.2
    • pyTorch_debug: False
    • pyTorch_version: 1.4.0
    • pytorch-lightning: 0.7.3
    • tensorboard: 2.2.0
    • tqdm: 4.45.0
  • System:
bobofzhang added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Apr 21, 2020
bobofzhang (Author) commented:

lightning_template.txt
gpu_template.txt

You can rename .txt to .py to verify the bug.


sneiman (Contributor) commented Apr 25, 2020

Check #538. The relevant solution is copied here for your convenience. I don't know exactly what spectral_norm() returns, but this should give you a lead to check out:

(from #538):
I have verified what causes the problem in my model and what will fix it. The problem is my naively assigning a parameter to another variable. This new reference to the parameter does not get moved to the correct gpu when pytorch-lightning copies it with model.cuda(gpu_idx) in ddp_train(). The reference is in another process space when ddp is used, and so creates the multiprocessing fault noted at the head of this issue.

This is NOT a ptl bug. This is the result of the naive assignment of the parameter to another variable:

    # nn.Parameter() ensures pytorch knows about this - and will move it new gpu when required
    self.class_p        = nn.Parameter(torch.Tensor(np.ones(self.data.num_classes) * np.log(1.0)), requires_grad=True)

    # causes a crash if using ddp: self.class_p_t refers to the original process space, not the one it has been moved to by ddp
    self.class_p_t      = self.class_p.data

self.class_p is known to pytorch as a parameter, and so is copied to the correct gpu. But the reference to it in self.class_p_t is not known to pytorch as a parameter, and so this reference is not updated when the model is copied. To fix this simply, do a deep copy instead of the naive assignment. The self.class_p_t is still not moved to the gpu, but it is now within the process space of each ddp model:

    # this now works
    self.class_p        = nn.Parameter(torch.Tensor(np.ones(self.data.num_classes) * np.log(1.0)), requires_grad=True)
    self.class_p_t      = copy.deepcopy(self.class_p.data)

Hope this helps ...
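As a quick way to spot this kind of stray reference in a larger model, here is a small diagnostic sketch (a hypothetical helper, not part of lightning or PyTorch) that lists plain tensor attributes which are neither parameters nor buffers:

    import torch
    import torch.nn as nn

    def find_unmanaged_tensors(module: nn.Module):
        """Return (module class, attribute name) pairs for plain tensor attributes.

        These attributes are neither parameters nor buffers, so .to()/.cuda()
        will not move them -- the kind of stray reference described above.
        """
        managed = set()
        for m in module.modules():
            managed.update(id(p) for p in m.parameters(recurse=False))
            managed.update(id(b) for b in m.buffers(recurse=False))

        hits = []
        for m in module.modules():
            for name, value in vars(m).items():
                if torch.is_tensor(value) and id(value) not in managed:
                    hits.append((type(m).__name__, name))
        return hits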

bobofzhang (Author) commented:

Thanks for your reply. spectral_norm is a standard module in PyTorch, and I can run it in a pure PyTorch implementation, but when I use pytorch_lightning it reports the error above, so I think this may be a bug in pytorch_lightning.


sneiman (Contributor) commented Apr 28, 2020

A very frustrating situation for you, I am sure. I am a little suspicious that this is actually a problem with spectral_norm(). It makes some internal cloning decisions that might be causing this problem. I posted a question on pytorch referring to this issue.
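For anyone digging into the spectral_norm() angle, here is roughly what it leaves on a wrapped layer. This is based on my reading of the torch.nn.utils.spectral_norm implementation around PyTorch 1.4, so worth double-checking against your version:

    import torch.nn as nn
    from torch.nn.utils import spectral_norm

    layer = spectral_norm(nn.Linear(16, 8))

    # The original weight Parameter is re-registered as `weight_orig`, the
    # power-iteration vectors become buffers, and `weight` itself is left as
    # a plain tensor attribute recomputed by a forward pre-hook.
    print([n for n, _ in layer.named_parameters()])  # expect something like ['bias', 'weight_orig']
    print([n for n, _ in layer.named_buffers()])     # expect something like ['weight_u', 'weight_v']
    print(type(layer.weight))                        # a plain Tensor, not an nn.Parameter

That plain weight attribute looks like the same kind of unregistered reference discussed in the comment above; whether it is what actually trips up mp.spawn here I can't say for certain.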


rmrao (Contributor) commented May 14, 2020

Out of curiosity, what happens if you don't assign the linear layer to the module before calling spectral norm? (i.e. self.c_d2 = spectral_norm(nn.Linear(...)))

I ran into a similar issue, and removing the assignment or doing a manual clone should fix it. It's possibly not a pytorch-lightning bug: lightning uses torch's spawn function by default rather than the launch utility, which creates child processes differently, and in this case that might cause the error.
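For what it's worth, here is a minimal sketch of the pattern rmrao suggests, written as a plain nn.Module with placeholder sizes rather than the template's hparams (untested against this exact setup):

    import torch
    import torch.nn as nn
    from torch.nn.utils import spectral_norm

    class TemplateNet(nn.Module):
        def __init__(self, in_features=784, hidden_dim=512, drop_prob=0.2, out_features=10):
            super().__init__()
            # wrap at construction time; the bare nn.Linear is never assigned
            # to self first and then re-assigned after wrapping
            self.c_d1 = spectral_norm(nn.Linear(in_features, hidden_dim))
            self.c_d1_bn = nn.BatchNorm1d(hidden_dim)
            self.c_d1_drop = nn.Dropout(drop_prob)
            self.c_d2 = spectral_norm(nn.Linear(hidden_dim, out_features))

        def forward(self, x):
            x = self.c_d1_drop(torch.relu(self.c_d1_bn(self.c_d1(x))))
            return self.c_d2(x)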
