In multi-GPU mode with DDP, starting the training can fail with the following error:
Traceback (most recent call last):
  File "train.py", line 109, in <module>
    main(hyperparams)
  File "train.py", line 62, in main
    trainer.fit(model)
  File "/home/tanel/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 343, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model,))
  File "/home/tanel/miniconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/tanel/miniconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/tanel/miniconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/tanel/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/ddp_mixin.py", line 181, in ddp_train
    self.run_pretrain_routine(model)
  File "/home/tanel/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 415, in run_pretrain_routine
    self.configure_checkpoint_callback()
  File "/home/tanel/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/callback_config_mixin.py", line 28, in configure_checkpoint_callback
    filepath=ckpt_path
  File "/home/tanel/miniconda3/lib/python3.7/site-packages/pytorch_lightning/callbacks/pt_callbacks.py", line 202, in __init__
    os.makedirs(filepath)
  File "/home/tanel/miniconda3/lib/python3.7/os.py", line 221, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: '/home/tanel/devel/torch-xvectors/lightning_logs/version_151/checkpoints'
It happens quite rarely.
Desktop:
OS: Linux
PyTorch Lightning version: 0.5.3.2 (git as of 2019-11-20)
Fix is easy, I'll create a PR.
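The issue text does not show the fix itself. A minimal sketch of one way to avoid this error, assuming the failure is a race in which several DDP processes call os.makedirs on the same checkpoint path at nearly the same time: pass exist_ok=True so that a directory created by another process is not treated as an error. The helper names ensure_checkpoint_dir and create_concurrently are hypothetical and not part of PyTorch Lightning; threads stand in for the DDP processes only to keep the example short.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def ensure_checkpoint_dir(filepath):
    # Hypothetical helper, not part of PyTorch Lightning.
    # exist_ok=True makes the call succeed when another worker has
    # already created the directory, instead of raising FileExistsError.
    os.makedirs(filepath, exist_ok=True)
    return filepath


def create_concurrently(filepath, workers=4):
    # Simulate several workers racing to create the same directory.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ensure_checkpoint_dir, [filepath] * workers))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # Illustrative path that mirrors the layout seen in the traceback.
        target = os.path.join(root, "lightning_logs", "version_0", "checkpoints")
        create_concurrently(target)
        print(os.path.isdir(target))
```

Note that exist_ok=True still raises FileExistsError if the path exists but is a regular file, which is usually the desired behavior. Another option would be to create the directory only in the rank-0 process; that is also an assumption and not something the issue confirms.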