
Model validation code is not called #2351

Closed
Uroc327 opened this issue Jun 24, 2020 · 14 comments
Labels
bug (Something isn't working), help wanted (Open to be worked on)

Comments

@Uroc327

Uroc327 commented Jun 24, 2020

🐛 Bug

My validation_step and validation_epoch_end methods do not seem to get called.

To Reproduce

Run the code sample below. Python should raise the NotImplementedError from validation_step. Instead, training completes 'successfully'.

Code sample

import pytorch_lightning as pl
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader


class Dataset(torch.utils.data.IterableDataset):
    def __init__(self):
        super().__init__()

    def __iter__(self):
        def get_sample():
            for _ in range(5):
                yield torch.randn(20)
        return get_sample()

    def __len__(self):
        return 5

class Model(pl.LightningModule):
    def __init__(self):
        super().__init__()

        self.enc = nn.Linear(20, 10)
        self.dec = nn.Linear(10, 20)

    def forward(self, x):
        x = self.enc(x)
        x = F.relu(x)
        x = self.dec(x)
        return x

    def training_step(self, batch, batchIdx):
        x = self.forward(batch)
        return {'loss': torch.mean(x)}

    def validation_step(self, batch, batchIdx):
        raise NotImplementedError()
        x = self.forward(batch)
        return {'val_loss': torch.mean(x)}

    def validation_epoch_end(self, outputs):
        return {'val_loss': torch.mean(torch.stack([x['val_loss'] for x in outputs]))}

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters())

if __name__ == '__main__':
    trainer = pl.Trainer(num_sanity_val_steps=0)

    net = Model()
    dataset = Dataset()

    trainer.fit(
        net,
        train_dataloader=DataLoader(dataset, batch_size=8, num_workers=0),
        val_dataloaders=DataLoader(dataset, batch_size=8, num_workers=0),
    )

Expected behavior

I'd expect the code sample above to fail with the NotImplementedError raised in validation_step.

Environment

Collecting environment information...
PyTorch version: 1.5.1+cu101
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Gentoo Base System release 2.7
GCC version: (Gentoo 9.3.0 p1) 9.3.0
CMake version: version 3.17.3

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration: GPU 0: GeForce GT 730
Nvidia driver version: 440.82
cuDNN version: /opt/cuda/targets/x86_64-linux/lib/libcudnn.so.7.6.5

Versions of relevant libraries:
[pip3] numpy==1.19.0
[pip3] pytorch-lightning==0.8.1
[pip3] torch==1.5.1+cu101
[pip3] torchvision==0.6.1+cu101
[conda] Could not collect
@Uroc327 added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Jun 24, 2020
@Uroc327
Author

Uroc327 commented Jun 25, 2020

As a note: when I leave out the num_sanity_val_steps=0 parameter, the validation code is called twice during the sanity check. After those two sanity runs, it is never called again.

@williamFalcon
Contributor

Your dataloader configuration is wrong; that's causing the issue. Use:

DataLoader(dataset, batch_size=8, num_workers=0, drop_last=True)

since your batch size (8) and dataset length (5) don't divide evenly.
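The batch arithmetic behind this suggestion can be checked directly. A minimal standalone sketch, not using Lightning (batch_counts is a hypothetical helper, not part of any library):

```python
import math

def batch_counts(n_samples: int, batch_size: int):
    """Number of batches a DataLoader yields with and without drop_last."""
    keep_partial = math.ceil(n_samples / batch_size)   # drop_last=False
    drop_partial = n_samples // batch_size             # drop_last=True
    return keep_partial, drop_partial

# The original repro: 5 samples with batch_size=8 gives a single partial
# batch, and drop_last=True would discard it, leaving zero batches.
print(batch_counts(5, 8))    # (1, 0)

# The 128-sample variant divides evenly, so drop_last makes no difference.
print(batch_counts(128, 8))  # (16, 16)
```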

@Uroc327
Author

Uroc327 commented Jun 25, 2020

@williamFalcon Changing the dataset to produce 128 samples does not change this either:

import pytorch_lightning as pl
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader


class Dataset(torch.utils.data.IterableDataset):
    def __init__(self):
        super().__init__()

    def __iter__(self):
        def get_sample():
            for _ in range(128):
                yield torch.randn(20)
        return get_sample()

    def __len__(self):
        return 128

class Model(pl.LightningModule):
    def __init__(self):
        super().__init__()

        self.enc = nn.Linear(20, 10)
        self.dec = nn.Linear(10, 20)

    def forward(self, x):
        x = self.enc(x)
        x = F.relu(x)
        x = self.dec(x)
        return x

    def training_step(self, batch, batchIdx):
        x = self.forward(batch)
        return {'loss': torch.mean(x)}

    def validation_step(self, batch, batchIdx):
        raise NotImplementedError()
        x = self.forward(batch)
        return {'val_loss': torch.mean(x)}

    def validation_epoch_end(self, outputs):
        return {'val_loss': torch.mean(torch.stack([x['val_loss'] for x in outputs]))}

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters())

if __name__ == '__main__':
    trainer = pl.Trainer(num_sanity_val_steps=0)

    net = Model()
    dataset = Dataset()

    trainer.fit(
        net,
        train_dataloader=DataLoader(dataset, batch_size=8, num_workers=0),
        val_dataloaders=DataLoader(dataset, batch_size=8, num_workers=0),
    )
> python main.py
GPU available: True, used: False
TPU available: False, using: 0 TPU cores

  | Name | Type   | Params
--------------------------------
0 | enc  | Linear | 210   
1 | dec  | Linear | 220   
/home/constantin/.virtualenvs/tensor/lib64/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:25: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 8 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
/home/constantin/.virtualenvs/tensor/lib64/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:25: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 8 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
Epoch 1000:   6%|██▍       | 16/256 [00:00<00:00, 1252.80it/s, loss=-2894.485, v_num=7]
>

@Uroc327
Author

Uroc327 commented Jun 25, 2020

Changing the drop_last option does not make any difference either (with 128 samples).
Could we please reopen this, @williamFalcon?

@evanatyourservice

Same here. I have to limit validation to a minimal number of batches... I can't train on all the data and skip validation when training on TPU; it hangs after the first epoch. To narrow it down: this appears to be a TPU-specific problem, because it works fine on CPU and GPU.

@Steve-Tod

Same problem here. I use the val_dataloader method of pl.LightningModule, but it is never called. So I specify trainer.fit(model, val_dataloaders=model.val_dataloader()) and set breakpoints in my validataion_step function, but it's also never called...

@ihowell

ihowell commented Jul 15, 2021

@Steve-Tod Can you check the spelling of your validation_step function? If you copy-pasted it here, it's slightly off (validataion_step vs. validation_step). It is kind of annoying that misspelled hook names fail silently.
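A near-miss like this can be caught mechanically by fuzzy-matching a class's method names against the known hook names. A minimal sketch; KNOWN_HOOKS is a hand-maintained subset here, and find_suspect_hooks is a hypothetical helper, not part of Lightning's API:

```python
import difflib

# Hypothetical subset of Lightning hook names; the full list lives in
# pytorch_lightning itself.
KNOWN_HOOKS = {"training_step", "validation_step", "test_step",
               "validation_epoch_end", "configure_optimizers"}

def find_suspect_hooks(cls):
    """Flag methods whose names look like misspelled Lightning hooks."""
    suspects = {}
    for name in vars(cls):
        if name in KNOWN_HOOKS or name.startswith("_"):
            continue  # correct hooks and dunders are fine
        close = difflib.get_close_matches(name, KNOWN_HOOKS, n=1, cutoff=0.9)
        if close:
            suspects[name] = close[0]
    return suspects

class BuggyModel:
    def validataion_step(self, batch, batch_idx):  # the typo from this thread
        ...

print(find_suspect_hooks(BuggyModel))  # {'validataion_step': 'validation_step'}
```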

@SamPusegaonkar

Any update on this?

@akihironitta
Contributor

Any update on this?

@SamPusegaonkar What's your issue? I cannot reproduce the behaviour described in this issue. Could you create a new issue, since this one is quite outdated?

@SamPusegaonkar

Please check out #13726 (reply in thread).

@RylanSchaeffer

RylanSchaeffer commented Aug 6, 2022

Never mind; I fixed my error. I wasn't passing a validation dataloader.

That said, if a model defines validation_step but no validation dataloader is passed, surely an error should be thrown?
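A pre-fit guard along these lines could be sketched as follows. LightningModuleStub, overrides_validation_step, and check_val_config are hypothetical stand-ins for illustration, not Lightning API:

```python
class LightningModuleStub:
    """Stand-in for pl.LightningModule; assumes the real base class also
    provides a default (no-op) validation_step."""
    def validation_step(self, batch, batch_idx):
        pass

def overrides_validation_step(model) -> bool:
    """True if the model's class defines its own validation_step."""
    return type(model).validation_step is not LightningModuleStub.validation_step

def check_val_config(model, val_dataloaders):
    """Raise when validation code exists but could never run."""
    if overrides_validation_step(model) and val_dataloaders is None:
        raise ValueError(
            "validation_step is overridden but no val_dataloaders were "
            "passed; validation would be silently skipped"
        )

class MyModel(LightningModuleStub):
    def validation_step(self, batch, batch_idx):
        return batch

try:
    check_val_config(MyModel(), None)
except ValueError as exc:
    print("caught:", exc)
```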

@lee-junjie

lee-junjie commented Feb 19, 2024

In my case, Lightning skips validation when I set limit_train_batches larger than the number of available batches.

@HKAB

HKAB commented Jun 19, 2024

In my case, Lightning skips validation when I set limit_train_batches larger than the number of available batches.

Thanks @lee-junjie! This helped me debug my issue. It's annoying that Lightning doesn't throw any warning.
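One defensive pattern is to clamp an integer limit_train_batches to the actual dataloader length before building the Trainer. A minimal sketch; safe_limit is a hypothetical helper, not part of Lightning's API:

```python
def safe_limit(requested: int, dataloader_len: int) -> int:
    """Clamp a requested absolute batch limit to what the dataloader supplies.

    Lightning interprets an int limit_train_batches as an absolute batch
    count; per the comments above, asking for more batches than exist can
    silently skip validation, so clamp and warn first.
    """
    if requested > dataloader_len:
        print(f"warning: requested {requested} batches, "
              f"only {dataloader_len} available; clamping")
        return dataloader_len
    return requested

print(safe_limit(500, 16))  # 16, with a warning
print(safe_limit(10, 16))   # 10
```

The clamped value can then be passed straight to pl.Trainer(limit_train_batches=...).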

@zshuyinggg

In my case, Lightning skips validation when I set limit_train_batches larger than the number of available batches.

This inspired me to solve my issue! It turned out that the __len__ of my custom sampler was larger than the actual number of batches it yields. After I corrected the __len__ of my batch sampler, it works!
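The fix above boils down to keeping a sampler's __len__ consistent with what its __iter__ actually yields. An illustrative sketch; SimpleBatchSampler is hypothetical, not torch.utils.data.BatchSampler:

```python
import math
from typing import Iterator, List

class SimpleBatchSampler:
    """A minimal batch sampler whose __len__ matches what __iter__ yields.

    An overstated __len__ made Lightning expect more batches than ever
    arrived, which skipped validation; keeping the two in sync avoids that.
    """
    def __init__(self, n_samples: int, batch_size: int, drop_last: bool = False):
        self.n_samples = n_samples
        self.batch_size = batch_size
        self.drop_last = drop_last

    def __iter__(self) -> Iterator[List[int]]:
        batch = []
        for idx in range(self.n_samples):
            batch.append(idx)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if batch and not self.drop_last:  # trailing partial batch
            yield batch

    def __len__(self) -> int:
        if self.drop_last:
            return self.n_samples // self.batch_size
        return math.ceil(self.n_samples / self.batch_size)

s = SimpleBatchSampler(10, 4)
assert len(list(s)) == len(s) == 3  # __len__ agrees with __iter__
```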
