
Checkpointing with SLURM #2278

Closed
Taha-Bahadori opened this issue Jun 19, 2020 · 9 comments · Fixed by #2339
Labels: question (Further information is requested)

Comments

@Taha-Bahadori

Taha-Bahadori commented Jun 19, 2020

What is your question?

I have PyTorch Lightning code with checkpointing that runs well on my desktop, but when I run it on our cluster with SLURM, the checkpoints do not get saved.

Code

    import pytorch_lightning as pl
    from pytorch_lightning.callbacks import ModelCheckpoint

    model = Predictor(args)
    # Keep only the best checkpoint by validation accuracy.
    check = ModelCheckpoint(save_top_k=1, verbose=True, monitor='val_acc', mode='max',
                            filepath='checks/{epoch}')
    trainer = pl.Trainer(checkpoint_callback=check, max_epochs=100, gpus=1)
    trainer.fit(model)

What have you tried?

I run it on the cluster with the following command:

salloc -G 1 srun python main.py
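
As a quick sanity check (an illustrative diagnostic added here, not part of the original report; the checks/ directory is the filepath used in the snippet above), one could list what, if anything, ModelCheckpoint wrote after training:

    import glob
    import os

    # List whatever ModelCheckpoint wrote under the 'checks/' directory.
    ckpt_dir = "checks"
    ckpts = sorted(glob.glob(os.path.join(ckpt_dir, "*.ckpt")))
    if ckpts:
        print(f"Found {len(ckpts)} checkpoint(s):", ckpts)
    else:
        print("No checkpoints found under", os.path.abspath(ckpt_dir))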

What's your environment?

  • OS: Linux
  • Packaging: conda
Taha-Bahadori added the question label Jun 19, 2020
@github-actions
Contributor

Hi! Thanks for your contribution, great first issue!

@williamFalcon
Contributor

is this with 0.8.1?

@Taha-Bahadori
Author

Taha-Bahadori commented Jun 19, 2020

It was with 0.8.0. I upgraded to 0.8.1 and the problem persists.

Also, my manual checkpointing with torch.save does work.
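
(For context, the "manual checkpointing" code is not shown in the thread; a minimal sketch of what it presumably looks like, saving the weights directly and bypassing ModelCheckpoint:)

    import torch

    # Save the model weights directly, independent of the ModelCheckpoint callback.
    torch.save(model.state_dict(), "manual_checkpoint.pt")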

@ExpectationMax
Contributor

Could be related to #2231; I experienced the same problems in my setup.

@ExpectationMax
Contributor

My hypothesis is that something is going wrong with determining the rank of the process when running on SLURM, so the logger calls are never executed and some outputs, for example

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]

are never shown; otherwise training executes as usual.

This could be due to an incorrect (non-zero) value in rank_zero_only.rank such that the wrapped commands are never executed.
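
One way to test this hypothesis (a minimal diagnostic sketch; it assumes rank_zero_only can be imported from pytorch_lightning.utilities.distributed, which may vary between versions) is to print the rank Lightning resolved next to what SLURM reports:

    import os
    # Import path assumed; some versions re-export this from pytorch_lightning.utilities.
    from pytorch_lightning.utilities.distributed import rank_zero_only

    # Lightning guards logging/checkpointing calls with this rank; on a
    # single-GPU SLURM job it should be 0.
    print("rank_zero_only.rank:", rank_zero_only.rank)
    for var in ("SLURM_PROCID", "SLURM_LOCALID", "SLURM_NTASKS", "SLURM_JOB_ID"):
        print(var, "=", os.environ.get(var))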

@awaelchli
Member

You might want to try again after #1504 is merged.

@jeremyjordan
Contributor

Can you check to see if the weights are being saved under trainer.weights_save_path?
https://github.com/PyTorchLightning/pytorch-lightning/blob/f278ac42c81fc344fe1cb673877b8a9dfca9c9b5/pytorch_lightning/trainer/training_io.py#L224

My guess is that they're being saved but for some reason hpc_save doesn't save to the normal checkpoint directory.
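
A small sketch of that check (purely illustrative; weights_save_path is the Trainer attribute referenced above, and listing the directory is just one way to inspect it):

    import os

    trainer.fit(model)

    # Where hpc_save may have written checkpoints, as opposed to the
    # ModelCheckpoint filepath configured in the issue.
    save_dir = trainer.weights_save_path
    print("weights_save_path:", save_dir)
    if os.path.isdir(save_dir):
        print("contents:", os.listdir(save_dir))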

@Borda
Member

Borda commented Jun 24, 2020

is this fixed by #2341?

Borda reopened this Jun 24, 2020
@williamFalcon
Contributor

yes
