🐛 Bug
I'm seeing a problem seemingly similar to issues #1335 and #1637 on current master when using ddp_cpu on my university's SLURM cluster. Training fails at a certain epoch (not the same for every job, but in the same range) for all jobs in a job array, on all nodes. Unlike those issues, it's not happening when load_spawn_weights() is invoked, but when save_spawn_weights() is.
Error
slurmstepd-breitach: error: Unable to create TMPDIR [/tmp/user/30335]: Permission denied
slurmstepd-breitach: error: Setting TMPDIR to /tmp
psutil is not installed. You will not be able to abort this experiment from the UI.
psutil is not installed. Hardware metrics will not be collected.
NeptuneLogger will work in online mode
GPU available: True, used: False
No environment variable for node rank defined. Set as 0.
MASTER_ADDR environment variable is not defined. Set as localhost
initializing proc_rank 0 world 1
Set SLURM handle signals.
| Name | Type | Params
-----------------------------------------------------
0 | model | RecurrentModel | 5
1 | model.recurrent_model | RNN | 4
2 | model.fc | Linear | 1
/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:23: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` in the `DataLoader` init to improve performance.
warnings.warn(*args, **kwargs)
Traceback (most recent call last):
File "experiment_core.py", line 40, in <module>
main(hyperparams, parser)
File "experiment_core.py", line 25, in main
trainer.fit(model)
File "/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 856, in fit
mp.spawn(self.ddp_train, nprocs=self.num_processes, args=(model,))
File "/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 391, in ddp_train
self.save_spawn_weights(model)
File "/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 401, in save_spawn_weights
self.save_checkpoint(path)
File "/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/pytorch_lightning/trainer/training_io.py", line 265, in save_checkpoint
self._atomic_save(checkpoint, filepath)
File "/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/pytorch_lightning/trainer/training_io.py", line 256, in _atomic_save
torch.save(checkpoint, tmp_path)
File "/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/torch/serialization.py", line 369, in save
with _open_file_like(f, 'wb') as opened_file:
File "/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/torch/serialization.py", line 234, in _open_file_like
return _open_file(name_or_buffer, mode)
File "/home/sch/schillmann/anaconda3/envs/pytorch-bac/lib/python3.7/site-packages/torch/serialization.py", line 215, in __init__
super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/home/sch/schillmann/Bachelor-Thesis/Code/Memory-Network-Memory-Horizons/src/logs/__temp_weight_ddp_end.ckpt.part'
Expected behavior
All jobs to finish on all nodes without problems.
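For what it's worth, the failure seems to boil down to torch.save() being pointed at a .part file whose parent directory is missing or not writable at the moment of the save (the logs directory definitely existed when the jobs started). Below is a minimal sketch of that failure mode and the workaround I'm currently considering; the paths and the directory re-creation step are my own assumptions, not the actual Lightning internals.

import os
import torch

# Sketch of the failure seen in the traceback above: torch.save() raises
# FileNotFoundError when the parent directory of the target file is gone
# at save time.
ckpt_dir = "logs"                                    # placeholder checkpoint directory
tmp_path = os.path.join(ckpt_dir, "__temp_weight_ddp_end.ckpt.part")
checkpoint = {"state_dict": {}}                      # stand-in for the real checkpoint dict

# Possible workaround: re-create the directory right before saving.
os.makedirs(ckpt_dir, exist_ok=True)
torch.save(checkpoint, tmp_path)
os.replace(tmp_path, tmp_path[: -len(".part")])      # mimic Lightning's atomic rename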