Training with DDP could fail at startup due to FileExistsError #529

Closed
alumae opened this issue Nov 20, 2019 · 0 comments · Fixed by #530
Labels
bug Something isn't working

alumae commented Nov 20, 2019

In multi-GPU mode with DDP, starting the training can fail with the following error:

Traceback (most recent call last):
  File "train.py", line 109, in <module>
    main(hyperparams)
  File "train.py", line 62, in main
    trainer.fit(model)
  File "/home/tanel/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 343, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model,))
  File "/home/tanel/miniconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/tanel/miniconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/tanel/miniconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/tanel/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/ddp_mixin.py", line 181, in ddp_train
    self.run_pretrain_routine(model)
  File "/home/tanel/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 415, in run_pretrain_routine
    self.configure_checkpoint_callback()
  File "/home/tanel/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/callback_config_mixin.py", line 28, in configure_checkpoint_callback
    filepath=ckpt_path
  File "/home/tanel/miniconda3/lib/python3.7/site-packages/pytorch_lightning/callbacks/pt_callbacks.py", line 202, in __init__
    os.makedirs(filepath)
  File "/home/tanel/miniconda3/lib/python3.7/os.py", line 221, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: '/home/tanel/devel/torch-xvectors/lightning_logs/version_151/checkpoints'

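It happens quite rarely, which suggests a race condition: the traceback shows process 1 failing in os.makedirs because the checkpoint directory already existed by the time it tried to create it, presumably created by another DDP process a moment earlier.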

Environment:

  • OS: Linux
  • PyTorch Lightning version: 0.5.3.2 (git as of 2019-11-20)

The fix is easy; I'll create a PR.
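
For illustration, here is a minimal sketch of one way to make directory creation tolerant of several processes starting at once. It assumes the failure comes from more than one DDP process trying to create the same checkpoint directory. `ensure_dir` is a hypothetical helper name used only for this example; it is not taken from pytorch_lightning and is not necessarily what the PR does.

```python
import os
import tempfile


def ensure_dir(path):
    """Create `path` (and any missing parents) without failing if it already exists.

    Hypothetical helper for illustration only, not part of pytorch_lightning.
    """
    # With exist_ok=True, os.makedirs does not raise FileExistsError when the
    # directory is already there, so it does not matter which process wins.
    os.makedirs(path, exist_ok=True)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        ckpt_dir = os.path.join(root, "lightning_logs", "version_0", "checkpoints")
        ensure_dir(ckpt_dir)  # first process creates the directory
        ensure_dir(ckpt_dir)  # a later process finds it already present: no error
        print(os.path.isdir(ckpt_dir))
```

Note that `exist_ok=True` still raises FileExistsError if the path exists but is a regular file rather than a directory, so a genuinely wrong path is not silently ignored.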
