
Checkpointing with SLURM #2278

Closed
Taha-Bahadori opened this issue Jun 19, 2020 · 9 comments · Fixed by #2339
Labels: question (Further information is requested)

Comments

@Taha-Bahadori

Taha-Bahadori commented Jun 19, 2020

What is your question?

I have PyTorch Lightning code with checkpointing that runs well on my desktop, but when I run it on our cluster with SLURM, the checkpoints do not get saved.

Code

    import pytorch_lightning as pl
    from pytorch_lightning.callbacks import ModelCheckpoint

    model = Predictor(args)
    # Keep only the best checkpoint by validation accuracy.
    check = ModelCheckpoint(save_top_k=1, verbose=True, monitor='val_acc', mode='max',
                            filepath='checks/{epoch}')
    trainer = pl.Trainer(checkpoint_callback=check, max_epochs=100, gpus=1)
    trainer.fit(model)

What have you tried?

I run it on the cluster with the following command:

salloc -G 1 srun python main.py
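
As a quick sanity check (an illustrative diagnostic added here, not part of the original report; the checks/ directory is the filepath used in the snippet above), one could list what, if anything, ModelCheckpoint wrote after training:

    import glob
    import os

    # List whatever ModelCheckpoint wrote under the 'checks/' directory.
    ckpt_dir = "checks"
    ckpts = sorted(glob.glob(os.path.join(ckpt_dir, "*.ckpt")))
    if ckpts:
        print(f"Found {len(ckpts)} checkpoint(s):", ckpts)
    else:
        print("No checkpoints found under", os.path.abspath(ckpt_dir))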

What's your environment?

  • OS: Linux
  • Packaging: conda
Taha-Bahadori added the question label Jun 19, 2020
@github-actions
Contributor

Hi! Thanks for your contribution, great first issue!

@williamFalcon
Contributor

is this with 0.8.1?

@Taha-Bahadori
Author

Taha-Bahadori commented Jun 19, 2020

It was with 0.8.0. I upgraded to 0.8.1 and the problem persists.

Also, my manual checkpointing with torch.save does work.
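
(For context, the "manual checkpointing" code is not shown in the thread; a minimal sketch of what it presumably looks like, saving the weights directly and bypassing ModelCheckpoint:)

    import torch

    # Save the model weights directly, independent of the ModelCheckpoint callback.
    torch.save(model.state_dict(), "manual_checkpoint.pt")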

@ExpectationMax
Contributor

Could be related to #2231; I experienced the same problems in my setup.

@ExpectationMax
Contributor

My hypothesis is that something is going wrong with determining the rank of the process when running on SLURM, so the logger calls are never executed and some outputs, for example

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]

are never shown; otherwise training executes as usual.

This could be due to an incorrect (non-zero) value in rank_zero_only.rank such that the wrapped commands are never executed.
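
One way to test this hypothesis (a minimal diagnostic sketch; it assumes rank_zero_only can be imported from pytorch_lightning.utilities.distributed, which may vary between versions) is to print the rank Lightning resolved next to what SLURM reports:

    import os
    # Import path assumed; some versions re-export this from pytorch_lightning.utilities.
    from pytorch_lightning.utilities.distributed import rank_zero_only

    # Lightning guards logging/checkpointing calls with this rank; on a
    # single-GPU SLURM job it should be 0.
    print("rank_zero_only.rank:", rank_zero_only.rank)
    for var in ("SLURM_PROCID", "SLURM_LOCALID", "SLURM_NTASKS", "SLURM_JOB_ID"):
        print(var, "=", os.environ.get(var))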

@awaelchli
Member

You might want to try again after #1504 is merged.

@jeremyjordan
Contributor

Can you check to see if the weights are being saved under trainer.weights_save_path?
https://github.com/PyTorchLightning/pytorch-lightning/blob/f278ac42c81fc344fe1cb673877b8a9dfca9c9b5/pytorch_lightning/trainer/training_io.py#L224

My guess is that they're being saved but for some reason hpc_save doesn't save to the normal checkpoint directory.
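
A small sketch of that check (purely illustrative; weights_save_path is the Trainer attribute referenced above, and listing the directory is just one way to inspect it):

    import os

    trainer.fit(model)

    # Where hpc_save may have written checkpoints, as opposed to the
    # ModelCheckpoint filepath configured in the issue.
    save_dir = trainer.weights_save_path
    print("weights_save_path:", save_dir)
    if os.path.isdir(save_dir):
        print("contents:", os.listdir(save_dir))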

@Borda
Member

Borda commented Jun 24, 2020

is this fixed by #2341?

Borda reopened this Jun 24, 2020
@williamFalcon
Contributor

yes
