ddp: trainer.test failure #2133

Closed
sshleifer opened this issue Jun 9, 2020 · 7 comments
Labels: bug (Something isn't working), help wanted (Open to be worked on), priority: 0 (High priority task)

@sshleifer (Contributor)

Versions:

torch==1.5
pytorch-lightning==0.8.0rc1
cuda=10.1

With or without fp16, trainer.test(model) fails with:

initializing ddp: LOCAL_RANK: 0/1 WORLD_SIZE:2
Traceback (most recent call last):
  File "finetune.py", line 791, in <module>
    main(args)
  File "finetune.py", line 738, in main
    trainer.test(model)
  File "/home/shleifer/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line
1096, in test
    self.fit(model)
  File "/home/shleifer/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line
876, in fit
    self.ddp_train(task, model)
  File "/home/shleifer/pytorch-lightning/pytorch_lightning/trainer/distrib_data_paral
lel.py", line 429, in ddp_train
    model.init_ddp_connection(self.proc_rank, self.world_size, self.is_slurm_managing
_tasks)
  File "/home/shleifer/pytorch-lightning/pytorch_lightning/core/lightning.py", line 9
60, in init_ddp_connection
    torch_distrib.init_process_group(torch_backend, rank=proc_rank, world_size=world_
size)
  File "/home/shleifer/.conda/envs/nb/lib/python3.7/site-packages/torch/distributed/$
istributed_c10d.py", line 364, in init_process_group
    raise RuntimeError("trying to initialize the default process group "
RuntimeError: trying to initialize the default process group twice!
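
For reference, the call pattern is roughly this (sketch only; the module and Trainer arguments are placeholders, not the actual finetune.py setup):

# Hypothetical minimal repro. trainer.test() falls back to self.fit(model),
# which runs ddp_train -> init_ddp_connection -> init_process_group again.
from pytorch_lightning import Trainer

model = MyLightningModule(hparams)                   # placeholder module
trainer = Trainer(gpus=2, distributed_backend="ddp")
trainer.fit(model)                                   # first init_process_group call
trainer.test(model)                                  # fit() runs again -> second init -> RuntimeError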
@sshleifer added the help wanted label on Jun 9, 2020
@Anjum48 commented Jun 9, 2020

I am also getting this issue in a CV loop where trainer.fit is called a second time.
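
The loop is roughly this (simplified sketch, placeholder names):

# Simplified sketch of the CV loop. Each fold builds a fresh Trainer, but the
# default process group created for fold 0 is still alive when fold 1 calls
# trainer.fit(), so init_process_group runs a second time and raises.
for fold in range(n_folds):
    model = LitModel(fold=fold)                          # placeholder module
    trainer = Trainer(gpus=2, distributed_backend="ddp")
    trainer.fit(model)                                   # fails for fold > 0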

@Anjum48 commented Jun 9, 2020

I think I have a working workaround. As the error message says, the problem is that the default process group gets initialized a second time, but a simple check avoids this error.

Changing this line: https://github.com/PyTorchLightning/pytorch-lightning/blob/7245e48153909d9de8458b1f5b8b2bc740d80104/pytorch_lightning/trainer/distrib_data_parallel.py#L429

To this:

if not torch.distributed.is_initialized():
    model.init_ddp_connection(self.proc_rank, self.world_size, self.is_slurm_managing_tasks)

seems to get my CV loop to work.

Happy to open a PR if the workaround looks ok.
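
For completeness, a user-side variant of the same idea (untested sketch; assumes the CV loop owns the teardown) would be to destroy the default process group between folds instead of patching Lightning:

import torch.distributed as dist

for fold in range(n_folds):
    model = LitModel(fold=fold)                          # placeholder module
    trainer = Trainer(gpus=2, distributed_backend="ddp")
    trainer.fit(model)
    # Tear down the default process group so the next fit() can re-initialize it.
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()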

@Borda added the bug and priority: 0 labels on Jun 9, 2020
@Borda added this to the 0.8.0 milestone on Jun 9, 2020
@Anjum48 commented Jun 10, 2020

Maybe this isn't as straightforward as I thought. After some time, one of my DataLoader processes aborts and gives this error:

Traceback (most recent call last):
  File "/home/anjum/PycharmProjects/kaggle/siim_isic_melanoma_classification/train.py", line 147, in <module>
    cross_validation(args)
  File "/home/anjum/PycharmProjects/kaggle/siim_isic_melanoma_classification/train.py", line 93, in cross_validation
    train_single_fold(
  File "/home/anjum/PycharmProjects/kaggle/siim_isic_melanoma_classification/train.py", line 78, in train_single_fold
    trainer.fit(model)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 876, in fit
    self.ddp_train(task, model)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 474, in ddp_train
    self.run_pretrain_routine(model)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1050, in run_pretrain_routine
    self.train()
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 363, in train
    self.run_training_epoch()
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 445, in run_training_epoch
    _outputs = self.run_training_batch(batch, batch_idx)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 628, in run_training_batch
    self.batch_loss_value.append(loss)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/supporters.py", line 44, in append
    x = x.to(self.memory)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3159278) is killed by signal: Aborted. 
Exception ignored in: <function tqdm.__del__ at 0x7f12b6606dc0>
Traceback (most recent call last):
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/tqdm/std.py", line 1077, in __del__
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/tqdm/std.py", line 1284, in close
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/tqdm/std.py", line 1461, in display
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/tqdm/std.py", line 1080, in __repr__
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/tqdm/std.py", line 1424, in format_dict
TypeError: cannot unpack non-iterable NoneType object

I'm not sure if this is related to the timeout here: https://github.com/pytorch/pytorch/blob/master/torch/distributed/constants.py
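
If it is the timeout, init_process_group does take a timeout kwarg, so in principle a longer limit could be passed wherever Lightning initializes DDP (sketch only, not verified against this failure; rank/world_size are placeholders for whatever Lightning passes in init_ddp_connection):

from datetime import timedelta
import torch.distributed as dist

# Sketch: raise the limit above the default in torch/distributed/constants.py
# (30 minutes, if I remember right). For the NCCL backend the timeout may only
# take effect with NCCL_BLOCKING_WAIT=1.
rank, world_size = 0, 2                                  # placeholder values
dist.init_process_group("nccl", rank=rank, world_size=world_size,
                        timeout=timedelta(hours=2))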

@zackcarson

I'm also having this same issue on the latest version!

@Borda modified the milestones: 0.8.0 → 0.8.x on Jun 18, 2020
@MRI000000

I am also affected by this issue.

@williamFalcon modified the milestones: 0.8.x → fix .test() on Jun 26, 2020
@Borda modified the milestones: fix .test() → 0.9.0 on Jul 7, 2020
@awaelchli (Member)

Maybe William fixed this in #2512.
Could you try the master branch?

@williamFalcon (Contributor)

Fixed in 0.8.5!
