Validation loss saved in filename by ModelCheckpoint is incorrect when using DDP with multiple GPUs #6138

Closed
dpieczynski opened this issue Feb 22, 2021 · 1 comment
Labels
bug (Something isn't working) · checkpointing (Related to checkpointing) · distributed (Generic distributed-related topic) · help wanted (Open to be worked on) · priority: 1 (Medium priority task)
Milestone
1.3
Comments

@dpieczynski

🐛 Bug

When using DDP with 2 GPUs and logging the validation loss in validation_step with self.log('val_loss', loss, sync_dist=True), the ModelCheckpoint callback embeds a validation loss in the filename that is multiplied by 2 (the number of GPUs?). This happens in Lightning 1.2.0.

This is the message printed by the ModelCheckpoint callback:

Epoch 0, global step 0: val_loss reached 2.20627 (best 2.20627), saving model to "some_path/epoch=0-val_loss=4.41254.ckpt" as top 1
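For reference, the value in the filename is exactly the reported value multiplied by the number of GPUs. A minimal sketch of that arithmetic (numbers taken from the message above; the 2-process world size is an assumption based on the setup described here):

logged_val_loss = 2.20627    # value printed by ModelCheckpoint and shown in the logger
world_size = 2               # number of DDP processes (2 GPUs in this run)
filename_val_loss = 4.41254  # value embedded in the checkpoint filename

# The filename value equals the logged value times the world size.
assert abs(logged_val_loss * world_size - filename_val_loss) < 1e-9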

To Reproduce

import os

import torch

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# BoringModel and RandomDataset are the helpers from Lightning's bug-report
# template; the import path may differ across Lightning versions.
from pl_examples.bug_report_model import BoringModel, RandomDataset


def test_run():

    class TestModel(BoringModel):

        def validation_step(self, batch, batch_idx) -> None:
            output = self.layer(batch)
            loss = self.loss(batch, output)
            # sync_dist=True reduces the logged value across DDP processes
            self.log('val_loss', loss, sync_dist=True)

        def validation_epoch_end(self, outputs) -> None:
            pass

    # fake data
    train_data = torch.utils.data.DataLoader(RandomDataset(32, 64))
    val_data = torch.utils.data.DataLoader(RandomDataset(32, 64))

    # model
    model = TestModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        max_epochs=1,
        weights_summary=None,
        accelerator='ddp',
        gpus=-1,
        callbacks=[ModelCheckpoint(dirpath=os.getcwd(), filename='{epoch}-{val_loss:.5f}',
                                   monitor='val_loss', verbose=True)],
    )

    trainer.fit(model, train_data, val_data)

Expected behavior

The loss embedded in the filename should be the same as the loss reported in the ModelCheckpoint message and in the logger.
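One way to verify this is to compare the metric that ends up in the filename against the logged value. The snippet below is only an illustrative sketch; ckpt_path and logged_val_loss are placeholder values taken from the message above, not part of any Lightning API:

import re

logged_val_loss = 2.20627
ckpt_path = "some_path/epoch=0-val_loss=4.41254.ckpt"

# Pull out the val_loss that ModelCheckpoint embedded in the filename.
match = re.search(r"val_loss=([0-9.]+)\.ckpt$", ckpt_path)
filename_val_loss = float(match.group(1))

# Expected: both values agree. On the affected versions this prints False,
# because the filename value is doubled.
print(abs(filename_val_loss - logged_val_loss) < 1e-5)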

Environment

  • PyTorch Version: 1.7.1
  • OS: Linux
  • How you installed PyTorch: pip
  • Python version: 3.8.6
  • CUDA/cuDNN version: 11.0
  • GPU models and configuration: 2 × GeForce RTX 2080 Ti
@dpieczynski added the bug and help wanted labels Feb 22, 2021
@awaelchli added the checkpointing and distributed labels Feb 22, 2021
@carmocca self-assigned this Feb 22, 2021
@carmocca added the priority: 1 label Feb 22, 2021
@carmocca added this to the 1.2.x milestone Feb 22, 2021
@Borda modified the milestones: 1.2.x, 1.3 Apr 18, 2021
@awaelchli
Contributor

awaelchli commented Apr 21, 2021

@Rivi I can't reproduce it with your instructions on the master branch, nor on 1.2.8. I get:

Epoch 0, global step 0: val_loss reached 0.77993 (best 0.77993), saving model to "/home/adrian/repositories/pytorch-lightning/epoch=0-val_loss=0.77993.ckpt" as top 1

EDIT:
I can confirm the problem exists up to 1.2.3 and is solved in 1.2.4 and above.
Upgrading to 1.2.8 (latest) will solve this problem for you!

EDIT: this was probably fixed by #6410
