Unrecognized `val_loss` metric #923

bsridatta · 2020-02-23T23:38:34Z

RuntimeWarnings due to unrecognized `val_loss` metric

pytorch_lightning callbacks are unable to recognize val_loss from validation_step()

To Reproduce

Run CoolModel from
Steps to reproduce the behavior:

Go to Minimal-Example
Run upto trainer.fit()
Scroll down to the end of an epoch
See error -

`/opt/conda/lib/python3.6/site-packages/pytorch_lightning/callbacks/pt_callbacks.py:314: RuntimeWarning: Can save best model only with val_loss available, skipping.
' skipping.', RuntimeWarning)

/opt/conda/lib/python3.6/site-packages/pytorch_lightning/callbacks/pt_callbacks.py:144: RuntimeWarning: Early stopping conditioned on metric val_loss which is not available. Available metrics are: loss,avg_val_loss
RuntimeWarning)`

Code sample

Just the Minimal Example


import os
import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
import torchvision.transforms as transforms

import pytorch_lightning as pl

class CoolModel(pl.LightningModule):

    def __init__(self):
        super(CoolModel, self).__init__()
        # not the best model...
        self.l1 = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_nb):
        # REQUIRED
        x, y = batch
        y_hat = self.forward(x)
        return {'loss': F.cross_entropy(y_hat, y)}

    def validation_step(self, batch, batch_nb):
        # OPTIONAL
        x, y = batch
        y_hat = self.forward(x)
        return {'val_loss': F.cross_entropy(y_hat, y)}

    def validation_end(self, outputs):
        # OPTIONAL
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        return {'avg_val_loss': avg_loss}

    def test_step(self, batch, batch_nb):
        # OPTIONAL
        x, y = batch
        y_hat = self.forward(x)
        return {'test_loss': F.cross_entropy(y_hat, y)}

    def test_end(self, outputs):
        # OPTIONAL
        avg_loss = torch.stack([x['test_loss'] for x in outputs]).mean()
        return {'avg_test_loss': avg_loss}

    def configure_optimizers(self):
        # REQUIRED
        return torch.optim.Adam(self.parameters(), lr=0.02)

    @pl.data_loader
    def train_dataloader(self):
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)

    @pl.data_loader
    def val_dataloader(self):
        # OPTIONAL
        # can also return a list of val dataloaders
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)

    @pl.data_loader
    def test_dataloader(self):
        # OPTIONAL
        # can also return a list of test dataloaders
        return DataLoader(MNIST(os.getcwd(), train=False, download=True, transform=transforms.ToTensor()), batch_size=32)

from pytorch_lightning import Trainer

model = CoolModel()
trainer = Trainer()
trainer.fit(model)

Output
Excluding the print from pip install and training tqdm prints

Epoch 1: 100%|██████████| 3750/3750 [00:22<00:00, 163.69batch/s, batch_idx=1874, loss=1.530, v_num=0]
/opt/conda/lib/python3.6/site-packages/pytorch_lightning/callbacks/pt_callbacks.py:314: RuntimeWarning: Can save best model only with val_loss available, skipping.
  ' skipping.', RuntimeWarning)
/opt/conda/lib/python3.6/site-packages/pytorch_lightning/callbacks/pt_callbacks.py:144: RuntimeWarning: Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: loss,avg_val_loss
  RuntimeWarning)
1

Expected behavior

The 'val_loss' metric should be recognized when a dic with key val_loss is returned by validation_step()

Environment

Collecting environment information...
PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: None

OS: Debian GNU/Linux 9 (stretch)
GCC version: (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
CMake version: version 3.7.2

Python version: 3.6
Is CUDA available: No
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA

Versions of relevant libraries:
[pip] msgpack-numpy==0.4.4.3
[pip] numpy==1.18.1
[pip] numpydoc==0.9.2
[pip] pytorch-ignite==0.3.0
[pip] pytorch-lightning==0.6.0
[pip] pytorch-pretrained-bert==0.6.2
[pip] pytorch-transformers==1.1.0
[pip] torch==1.4.0
[pip] torchaudio==0.4.0a0+719bcc7
[pip] torchtext==0.5.0
[pip] torchvision==0.4.2
[conda] blas 1.0 mkl
[conda] cpuonly 1.0 0 pytorch
[conda] mkl 2019.3 199
[conda] mkl-service 2.0.2 py36h7b6447c_0
[conda] mkl_fft 1.0.12 py36ha843d7b_0
[conda] mkl_random 1.0.2 py36hd81dba3_0
[conda] pytorch 1.4.0 py3.6_cpu_0 [cpuonly] pytorch
[conda] pytorch-ignite 0.3.0 pypi_0 pypi
[conda] pytorch-lightning 0.6.0 pypi_0 pypi
[conda] pytorch-pretrained-bert 0.6.2 pypi_0 pypi
[conda] pytorch-transformers 1.1.0 pypi_0 pypi
[conda] torchaudio 0.4.0 py36 pytorch
[conda] torchtext 0.5.0 pypi_0 pypi
[conda] torchvision 0.4.2 pypi_0 pypi

Additional context

Something interesting is you cannot reproduce this when you run CoolSystem.
You won't get any warning because validation_end returns tensorboard logs with val_loss as a key. If you remove this log you will able to catch the bug.

For what its worth, I was trying to make a PyTorch Lightning based Kernel in Kaggle and WandB as logger. After trying "a lot" to find a mistake in my code, I realized that this warning is not shown every time. Once you run and get the warning and re-run the block (I was using Kaggle Kernel Notebook) it disappears. I assume that it is reading from some cache? Since the val_loss is something for the first epoch(maybe something from last epoch or log) and stay 0 for the rest. I am not familiar with the internal working of PL but I suspect there is some mix up between metrics that are logged and metrics returned by lightning methods.

The text was updated successfully, but these errors were encountered:

github-actions · 2020-02-23T23:39:09Z

Hey, thanks for your contribution! Great first issue!

awaelchli · 2020-02-24T00:16:01Z

The EarlyStopping looks for the key val_loss in the validation_end, not validation_step. That's a correct behaviour.
Possible fix 1: Pass in a ModelCheckpoint to Trainer and change the monitor argument to "avg_val_loss" (as returned by validation_end).
Possible fix 2: Change "avg_val_loss" to "val_loss" in validation_end.

@williamFalcon This is a recurring issue that users have with the provided examples. They should probably get updated so that the default EarlyStopping just works from the beginning without producing warnings.

awaelchli · 2020-02-24T00:21:01Z

@bsridatta Actually, I just saw that you are linking an old documentation in your post. The examples are already updated. See here, the fix 2 I suggested is already done:
https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/basic_examples/lightning_module_template.py

awaelchli · 2020-02-24T00:30:28Z

It got fixed here #524

bsridatta · 2020-02-24T00:54:53Z

Thanks for the quick response and clarification.

The EarlyStopping looks for the key val_loss in the validation_end, not validation_step

I understood this wrong. I have used the example I mentioned earlier in November and did even doubt that it might have changed. It's clear now, thanks!

bsridatta added bug Something isn't working help wanted Open to be worked on labels Feb 23, 2020

bsridatta closed this as completed Feb 24, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unrecognized `val_loss` metric #923

Unrecognized `val_loss` metric #923

bsridatta commented Feb 23, 2020

github-actions bot commented Feb 23, 2020

awaelchli commented Feb 24, 2020

awaelchli commented Feb 24, 2020

awaelchli commented Feb 24, 2020

bsridatta commented Feb 24, 2020 •

edited

Loading

Unrecognized val_loss metric #923

Unrecognized val_loss metric #923

Comments

bsridatta commented Feb 23, 2020

RuntimeWarnings due to unrecognized val_loss metric

To Reproduce

Code sample

Expected behavior

Environment

Additional context

github-actions bot commented Feb 23, 2020

awaelchli commented Feb 24, 2020

awaelchli commented Feb 24, 2020

awaelchli commented Feb 24, 2020

bsridatta commented Feb 24, 2020 • edited Loading

Unrecognized `val_loss` metric #923

Unrecognized `val_loss` metric #923

RuntimeWarnings due to unrecognized `val_loss` metric

bsridatta commented Feb 24, 2020 •

edited

Loading