Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unrecognized val_loss metric #923

Closed
bsridatta opened this issue Feb 23, 2020 · 5 comments
Closed

Unrecognized val_loss metric #923

bsridatta opened this issue Feb 23, 2020 · 5 comments
Labels
bug Something isn't working help wanted Open to be worked on

Comments

@bsridatta
Copy link

RuntimeWarnings due to unrecognized val_loss metric

pytorch_lightning callbacks are unable to recognize val_loss from validation_step()

To Reproduce

Run CoolModel from
Steps to reproduce the behavior:

  1. Go to Minimal-Example
  2. Run upto trainer.fit()
  3. Scroll down to the end of an epoch
  4. See error -

`/opt/conda/lib/python3.6/site-packages/pytorch_lightning/callbacks/pt_callbacks.py:314: RuntimeWarning: Can save best model only with val_loss available, skipping.
' skipping.', RuntimeWarning)

/opt/conda/lib/python3.6/site-packages/pytorch_lightning/callbacks/pt_callbacks.py:144: RuntimeWarning: Early stopping conditioned on metric val_loss which is not available. Available metrics are: loss,avg_val_loss
RuntimeWarning)`

Code sample

Just the Minimal Example


import os
import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
import torchvision.transforms as transforms

import pytorch_lightning as pl

class CoolModel(pl.LightningModule):

    def __init__(self):
        super(CoolModel, self).__init__()
        # not the best model...
        self.l1 = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_nb):
        # REQUIRED
        x, y = batch
        y_hat = self.forward(x)
        return {'loss': F.cross_entropy(y_hat, y)}

    def validation_step(self, batch, batch_nb):
        # OPTIONAL
        x, y = batch
        y_hat = self.forward(x)
        return {'val_loss': F.cross_entropy(y_hat, y)}

    def validation_end(self, outputs):
        # OPTIONAL
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        return {'avg_val_loss': avg_loss}

    def test_step(self, batch, batch_nb):
        # OPTIONAL
        x, y = batch
        y_hat = self.forward(x)
        return {'test_loss': F.cross_entropy(y_hat, y)}

    def test_end(self, outputs):
        # OPTIONAL
        avg_loss = torch.stack([x['test_loss'] for x in outputs]).mean()
        return {'avg_test_loss': avg_loss}

    def configure_optimizers(self):
        # REQUIRED
        return torch.optim.Adam(self.parameters(), lr=0.02)

    @pl.data_loader
    def train_dataloader(self):
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)

    @pl.data_loader
    def val_dataloader(self):
        # OPTIONAL
        # can also return a list of val dataloaders
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=32)

    @pl.data_loader
    def test_dataloader(self):
        # OPTIONAL
        # can also return a list of test dataloaders
        return DataLoader(MNIST(os.getcwd(), train=False, download=True, transform=transforms.ToTensor()), batch_size=32)

from pytorch_lightning import Trainer

model = CoolModel()
trainer = Trainer()
trainer.fit(model)

Output
Excluding the print from pip install and training tqdm prints

Epoch 1: 100%|██████████| 3750/3750 [00:22<00:00, 163.69batch/s, batch_idx=1874, loss=1.530, v_num=0]
/opt/conda/lib/python3.6/site-packages/pytorch_lightning/callbacks/pt_callbacks.py:314: RuntimeWarning: Can save best model only with val_loss available, skipping.
  ' skipping.', RuntimeWarning)
/opt/conda/lib/python3.6/site-packages/pytorch_lightning/callbacks/pt_callbacks.py:144: RuntimeWarning: Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: loss,avg_val_loss
  RuntimeWarning)
1

Expected behavior

The 'val_loss' metric should be recognized when a dic with key val_loss is returned by validation_step()

Environment

Collecting environment information...
PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: None

OS: Debian GNU/Linux 9 (stretch)
GCC version: (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
CMake version: version 3.7.2

Python version: 3.6
Is CUDA available: No
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA

Versions of relevant libraries:
[pip] msgpack-numpy==0.4.4.3
[pip] numpy==1.18.1
[pip] numpydoc==0.9.2
[pip] pytorch-ignite==0.3.0
[pip] pytorch-lightning==0.6.0
[pip] pytorch-pretrained-bert==0.6.2
[pip] pytorch-transformers==1.1.0
[pip] torch==1.4.0
[pip] torchaudio==0.4.0a0+719bcc7
[pip] torchtext==0.5.0
[pip] torchvision==0.4.2
[conda] blas 1.0 mkl
[conda] cpuonly 1.0 0 pytorch
[conda] mkl 2019.3 199
[conda] mkl-service 2.0.2 py36h7b6447c_0
[conda] mkl_fft 1.0.12 py36ha843d7b_0
[conda] mkl_random 1.0.2 py36hd81dba3_0
[conda] pytorch 1.4.0 py3.6_cpu_0 [cpuonly] pytorch
[conda] pytorch-ignite 0.3.0 pypi_0 pypi
[conda] pytorch-lightning 0.6.0 pypi_0 pypi
[conda] pytorch-pretrained-bert 0.6.2 pypi_0 pypi
[conda] pytorch-transformers 1.1.0 pypi_0 pypi
[conda] torchaudio 0.4.0 py36 pytorch
[conda] torchtext 0.5.0 pypi_0 pypi
[conda] torchvision 0.4.2 pypi_0 pypi

Additional context

Something interesting is you cannot reproduce this when you run CoolSystem.
You won't get any warning because validation_end returns tensorboard logs with val_loss as a key. If you remove this log you will able to catch the bug.

For what its worth, I was trying to make a PyTorch Lightning based Kernel in Kaggle and WandB as logger. After trying "a lot" to find a mistake in my code, I realized that this warning is not shown every time. Once you run and get the warning and re-run the block (I was using Kaggle Kernel Notebook) it disappears. I assume that it is reading from some cache? Since the val_loss is something for the first epoch(maybe something from last epoch or log) and stay 0 for the rest. I am not familiar with the internal working of PL but I suspect there is some mix up between metrics that are logged and metrics returned by lightning methods.

@bsridatta bsridatta added bug Something isn't working help wanted Open to be worked on labels Feb 23, 2020
@github-actions
Copy link
Contributor

Hey, thanks for your contribution! Great first issue!

@awaelchli
Copy link
Contributor

The EarlyStopping looks for the key val_loss in the validation_end, not validation_step. That's a correct behaviour.
Possible fix 1: Pass in a ModelCheckpoint to Trainer and change the monitor argument to "avg_val_loss" (as returned by validation_end).
Possible fix 2: Change "avg_val_loss" to "val_loss" in validation_end.

@williamFalcon This is a recurring issue that users have with the provided examples. They should probably get updated so that the default EarlyStopping just works from the beginning without producing warnings.

@awaelchli
Copy link
Contributor

@bsridatta Actually, I just saw that you are linking an old documentation in your post. The examples are already updated. See here, the fix 2 I suggested is already done:
https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/basic_examples/lightning_module_template.py

@awaelchli
Copy link
Contributor

It got fixed here #524

@bsridatta
Copy link
Author

bsridatta commented Feb 24, 2020

Thanks for the quick response and clarification.

The EarlyStopping looks for the key val_loss in the validation_end, not validation_step

I understood this wrong. I have used the example I mentioned earlier in November and did even doubt that it might have changed. It's clear now, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Open to be worked on
Projects
None yet
Development

No branches or pull requests

2 participants