Validation loss saved in filename by ModelCheckpoint is incorrect when using DDP with multiple GPUs #6138

Closed
dpieczynski opened this issue Feb 22, 2021 · 1 comment
Labels
bug (Something isn't working) · checkpointing (Related to checkpointing) · distributed (Generic distributed-related topic) · help wanted (Open to be worked on) · priority: 1 (Medium priority task)
Milestone
1.3
Comments

@dpieczynski

🐛 Bug

When using DDP with 2 GPUs and logging the validation loss in validation_step with self.log('val_loss', loss, sync_dist=True), the ModelCheckpoint callback embeds a validation loss in the filename that is multiplied by 2 (the number of GPUs?). This happens in Lightning 1.2.0.

This is the message printed by the ModelCheckpoint callback:

Epoch 0, global step 0: val_loss reached 2.20627 (best 2.20627), saving model to "some_path/epoch=0-val_loss=4.41254.ckpt" as top 1
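For reference, the value in the filename is exactly the reported value multiplied by the number of GPUs. A minimal sketch of that arithmetic (numbers taken from the message above; the 2-process world size is an assumption based on the setup described here):

logged_val_loss = 2.20627    # value printed by ModelCheckpoint and shown in the logger
world_size = 2               # number of DDP processes (2 GPUs in this run)
filename_val_loss = 4.41254  # value embedded in the checkpoint filename

# The filename value equals the logged value times the world size.
assert abs(logged_val_loss * world_size - filename_val_loss) < 1e-9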

To Reproduce

import os

import torch

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# BoringModel and RandomDataset are the helpers from Lightning's bug-report
# template; the import path may differ across Lightning versions.
from pl_examples.bug_report_model import BoringModel, RandomDataset


def test_run():

    class TestModel(BoringModel):

        def validation_step(self, batch, batch_idx) -> None:
            output = self.layer(batch)
            loss = self.loss(batch, output)
            # sync_dist=True reduces the logged value across DDP processes
            self.log('val_loss', loss, sync_dist=True)

        def validation_epoch_end(self, outputs) -> None:
            pass

    # fake data
    train_data = torch.utils.data.DataLoader(RandomDataset(32, 64))
    val_data = torch.utils.data.DataLoader(RandomDataset(32, 64))

    # model
    model = TestModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        max_epochs=1,
        weights_summary=None,
        accelerator='ddp',
        gpus=-1,
        callbacks=[ModelCheckpoint(dirpath=os.getcwd(), filename='{epoch}-{val_loss:.5f}',
                                   monitor='val_loss', verbose=True)],
    )

    trainer.fit(model, train_data, val_data)

Expected behavior

The loss embedded in the filename should be the same as the loss reported in the ModelCheckpoint message and in the logger.
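One way to verify this is to compare the metric that ends up in the filename against the logged value. The snippet below is only an illustrative sketch; ckpt_path and logged_val_loss are placeholder values taken from the message above, not part of any Lightning API:

import re

logged_val_loss = 2.20627
ckpt_path = "some_path/epoch=0-val_loss=4.41254.ckpt"

# Pull out the val_loss that ModelCheckpoint embedded in the filename.
match = re.search(r"val_loss=([0-9.]+)\.ckpt$", ckpt_path)
filename_val_loss = float(match.group(1))

# Expected: both values agree. On the affected versions this prints False,
# because the filename value is doubled.
print(abs(filename_val_loss - logged_val_loss) < 1e-5)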

Environment

  • PyTorch Version: 1.7.1
  • OS: Linux
  • How you installed PyTorch: pip
  • Python version: 3.8.6
  • CUDA/cuDNN version: 11.0
  • GPU models and configuration: 2 × GeForce RTX 2080 Ti
@dpieczynski added the bug and help wanted labels Feb 22, 2021
@awaelchli added the checkpointing and distributed labels Feb 22, 2021
@carmocca self-assigned this Feb 22, 2021
@carmocca added the priority: 1 label Feb 22, 2021
@carmocca added this to the 1.2.x milestone Feb 22, 2021
@Borda modified the milestones: 1.2.x, 1.3 Apr 18, 2021
@awaelchli
Contributor

awaelchli commented Apr 21, 2021

@Rivi I can't reproduce it with your instructions on the master branch, nor on 1.2.8. I get:

Epoch 0, global step 0: val_loss reached 0.77993 (best 0.77993), saving model to "/home/adrian/repositories/pytorch-lightning/epoch=0-val_loss=0.77993.ckpt" as top 1

EDIT:
I can confirm the problem exists up to 1.2.3 and is solved in 1.2.4 and above.
Upgrading to 1.2.8 (latest) will solve this problem for you!

EDIT: this was probably fixed by #6410
