
Model validation code is not called #2351

Closed
Uroc327 opened this issue Jun 24, 2020 · 14 comments
Labels
bug (Something isn't working), help wanted (Open to be worked on)

Comments

@Uroc327

Uroc327 commented Jun 24, 2020

🐛 Bug

My validation_step and validation_epoch_end methods do not seem to get called.

To Reproduce

Run the code sample below. Python should raise the NotImplementedError from validation_step. Instead, training completes 'successfully'.

Code sample

import pytorch_lightning as pl
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader


class Dataset(torch.utils.data.IterableDataset):
    def __init__(self):
        super().__init__()

    def __iter__(self):
        def get_sample():
            for _ in range(5):
                yield torch.randn(20)
        return get_sample()

    def __len__(self):
        return 5

class Model(pl.LightningModule):
    def __init__(self):
        super().__init__()

        self.enc = nn.Linear(20, 10)
        self.dec = nn.Linear(10, 20)

    def forward(self, x):
        x = self.enc(x)
        x = F.relu(x)
        x = self.dec(x)
        return x

    def training_step(self, batch, batchIdx):
        x = self.forward(batch)
        return {'loss': torch.mean(x)}

    def validation_step(self, batch, batchIdx):
        raise NotImplementedError()
        x = self.forward(batch)
        return {'val_loss': torch.mean(x)}

    def validation_epoch_end(self, outputs):
        return {'val_loss': torch.mean(torch.stack([x['val_loss'] for x in outputs]))}

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters())

if __name__ == '__main__':
    trainer = pl.Trainer(num_sanity_val_steps=0)

    net = Model()
    dataset = Dataset()

    trainer.fit(
        net,
        train_dataloader=DataLoader(dataset, batch_size=8, num_workers=0),
        val_dataloaders=DataLoader(dataset, batch_size=8, num_workers=0),
    )

Expected behavior

I'd expect the code sample above to fail with the NotImplementedError raised in validation_step.

Environment

Collecting environment information...
PyTorch version: 1.5.1+cu101
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Gentoo Base System release 2.7
GCC version: (Gentoo 9.3.0 p1) 9.3.0
CMake version: version 3.17.3

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration: GPU 0: GeForce GT 730
Nvidia driver version: 440.82
cuDNN version: /opt/cuda/targets/x86_64-linux/lib/libcudnn.so.7.6.5

Versions of relevant libraries:
[pip3] numpy==1.19.0
[pip3] pytorch-lightning==0.8.1
[pip3] torch==1.5.1+cu101
[pip3] torchvision==0.6.1+cu101
[conda] Could not collect
@Uroc327 added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Jun 24, 2020
@Uroc327
Author

Uroc327 commented Jun 25, 2020

As a note: when I leave out the num_sanity_val_steps=0 parameter, the validation code is called twice during the sanity check. After those two sanity runs, it is never called again.

@williamFalcon
Contributor

Your dataloader configuration is wrong; that's causing the issue. Use:

DataLoader(dataset, batch_size=8, num_workers=0, drop_last=True)

since your batch size (8) and dataset length (5) don't divide evenly.
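The batch arithmetic behind this suggestion can be checked directly. A minimal standalone sketch, not using Lightning (batch_counts is a hypothetical helper, not part of any library):

```python
import math

def batch_counts(n_samples: int, batch_size: int):
    """Number of batches a DataLoader yields with and without drop_last."""
    keep_partial = math.ceil(n_samples / batch_size)   # drop_last=False
    drop_partial = n_samples // batch_size             # drop_last=True
    return keep_partial, drop_partial

# The original repro: 5 samples with batch_size=8 gives a single partial
# batch, and drop_last=True would discard it, leaving zero batches.
print(batch_counts(5, 8))    # (1, 0)

# The 128-sample variant divides evenly, so drop_last makes no difference.
print(batch_counts(128, 8))  # (16, 16)
```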

@Uroc327
Author

Uroc327 commented Jun 25, 2020

@williamFalcon Changing the dataset to produce 128 samples does not change this either:

import pytorch_lightning as pl
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader


class Dataset(torch.utils.data.IterableDataset):
    def __init__(self):
        super().__init__()

    def __iter__(self):
        def get_sample():
            for _ in range(128):
                yield torch.randn(20)
        return get_sample()

    def __len__(self):
        return 128

class Model(pl.LightningModule):
    def __init__(self):
        super().__init__()

        self.enc = nn.Linear(20, 10)
        self.dec = nn.Linear(10, 20)

    def forward(self, x):
        x = self.enc(x)
        x = F.relu(x)
        x = self.dec(x)
        return x

    def training_step(self, batch, batchIdx):
        x = self.forward(batch)
        return {'loss': torch.mean(x)}

    def validation_step(self, batch, batchIdx):
        raise NotImplementedError()
        x = self.forward(batch)
        return {'val_loss': torch.mean(x)}

    def validation_epoch_end(self, outputs):
        return {'val_loss': torch.mean(torch.stack([x['val_loss'] for x in outputs]))}

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters())

if __name__ == '__main__':
    trainer = pl.Trainer(num_sanity_val_steps=0)

    net = Model()
    dataset = Dataset()

    trainer.fit(
        net,
        train_dataloader=DataLoader(dataset, batch_size=8, num_workers=0),
        val_dataloaders=DataLoader(dataset, batch_size=8, num_workers=0),
    )
> python main.py
GPU available: True, used: False
TPU available: False, using: 0 TPU cores

  | Name | Type   | Params
--------------------------------
0 | enc  | Linear | 210   
1 | dec  | Linear | 220   
/home/constantin/.virtualenvs/tensor/lib64/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:25: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 8 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
/home/constantin/.virtualenvs/tensor/lib64/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:25: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 8 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
Epoch 1000:   6%|██▍       | 16/256 [00:00<00:00, 1252.80it/s, loss=-2894.485, v_num=7]
>

@Uroc327
Author

Uroc327 commented Jun 25, 2020

Changing the drop_last option does not make any difference either (with 128 samples).
Could we please reopen this, @williamFalcon?

@evanatyourservice

Same here. I have to limit validation to a minimal number of batches... I can't train on all the data and skip validation when training on TPU; it hangs after the first epoch. To narrow it down: this appears to be a TPU-specific problem, because it works fine on CPU and GPU.

@Steve-Tod

Same problem here. I use the val_dataloader method of pl.LightningModule, but it is never called. So I specify trainer.fit(model, val_dataloaders=model.val_dataloader()) and set breakpoints in my validataion_step function, but it's also never called...

@ihowell

ihowell commented Jul 15, 2021

@Steve-Tod Can you check the spelling of your validation_step function? If you copy-pasted it here, it's slightly off (validataion_step vs. validation_step). It is kind of annoying that misspelled hook names fail silently.
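A near-miss like this can be caught mechanically by fuzzy-matching a class's method names against the known hook names. A minimal sketch; KNOWN_HOOKS is a hand-maintained subset here, and find_suspect_hooks is a hypothetical helper, not part of Lightning's API:

```python
import difflib

# Hypothetical subset of Lightning hook names; the full list lives in
# pytorch_lightning itself.
KNOWN_HOOKS = {"training_step", "validation_step", "test_step",
               "validation_epoch_end", "configure_optimizers"}

def find_suspect_hooks(cls):
    """Flag methods whose names look like misspelled Lightning hooks."""
    suspects = {}
    for name in vars(cls):
        if name in KNOWN_HOOKS or name.startswith("_"):
            continue  # correct hooks and dunders are fine
        close = difflib.get_close_matches(name, KNOWN_HOOKS, n=1, cutoff=0.9)
        if close:
            suspects[name] = close[0]
    return suspects

class BuggyModel:
    def validataion_step(self, batch, batch_idx):  # the typo from this thread
        ...

print(find_suspect_hooks(BuggyModel))  # {'validataion_step': 'validation_step'}
```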

@SamPusegaonkar

Any update on this?

@akihironitta
Contributor

Any update on this?

@SamPusegaonkar What's your issue? I cannot reproduce the behaviour described in this issue. Could you create a new issue, since this one is quite outdated?

@SamPusegaonkar

Please check out #13726 (reply in thread).

@RylanSchaeffer

RylanSchaeffer commented Aug 6, 2022

Never mind; I fixed my error. I wasn't passing a validation dataloader.

That said, if a model defines validation_step but no validation dataloader is passed, surely an error should be thrown?
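A pre-fit guard along these lines could be sketched as follows. LightningModuleStub, overrides_validation_step, and check_val_config are hypothetical stand-ins for illustration, not Lightning API:

```python
class LightningModuleStub:
    """Stand-in for pl.LightningModule; assumes the real base class also
    provides a default (no-op) validation_step."""
    def validation_step(self, batch, batch_idx):
        pass

def overrides_validation_step(model) -> bool:
    """True if the model's class defines its own validation_step."""
    return type(model).validation_step is not LightningModuleStub.validation_step

def check_val_config(model, val_dataloaders):
    """Raise when validation code exists but could never run."""
    if overrides_validation_step(model) and val_dataloaders is None:
        raise ValueError(
            "validation_step is overridden but no val_dataloaders were "
            "passed; validation would be silently skipped"
        )

class MyModel(LightningModuleStub):
    def validation_step(self, batch, batch_idx):
        return batch

try:
    check_val_config(MyModel(), None)
except ValueError as exc:
    print("caught:", exc)
```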

@lee-junjie

lee-junjie commented Feb 19, 2024

In my case, Lightning skips validation when I set limit_train_batches larger than the number of available batches.

@HKAB

HKAB commented Jun 19, 2024

In my case, Lightning skips validation when I set limit_train_batches larger than the number of available batches.

Thanks @lee-junjie! This helped me debug my issue. It's annoying that Lightning doesn't throw any warning.
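One defensive pattern is to clamp an integer limit_train_batches to the actual dataloader length before building the Trainer. A minimal sketch; safe_limit is a hypothetical helper, not part of Lightning's API:

```python
def safe_limit(requested: int, dataloader_len: int) -> int:
    """Clamp a requested absolute batch limit to what the dataloader supplies.

    Lightning interprets an int limit_train_batches as an absolute batch
    count; per the comments above, asking for more batches than exist can
    silently skip validation, so clamp and warn first.
    """
    if requested > dataloader_len:
        print(f"warning: requested {requested} batches, "
              f"only {dataloader_len} available; clamping")
        return dataloader_len
    return requested

print(safe_limit(500, 16))  # 16, with a warning
print(safe_limit(10, 16))   # 10
```

The clamped value can then be passed straight to pl.Trainer(limit_train_batches=...).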

@zshuyinggg

In my case, Lightning skips validation when I set limit_train_batches larger than the number of available batches.

This inspired me to solve my issue! It turned out that the __len__ of my custom sampler was larger than the actual number of batches it yields. After I corrected the __len__ of my batch sampler, it works!
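The fix above boils down to keeping a sampler's __len__ consistent with what its __iter__ actually yields. An illustrative sketch; SimpleBatchSampler is hypothetical, not torch.utils.data.BatchSampler:

```python
import math
from typing import Iterator, List

class SimpleBatchSampler:
    """A minimal batch sampler whose __len__ matches what __iter__ yields.

    An overstated __len__ made Lightning expect more batches than ever
    arrived, which skipped validation; keeping the two in sync avoids that.
    """
    def __init__(self, n_samples: int, batch_size: int, drop_last: bool = False):
        self.n_samples = n_samples
        self.batch_size = batch_size
        self.drop_last = drop_last

    def __iter__(self) -> Iterator[List[int]]:
        batch = []
        for idx in range(self.n_samples):
            batch.append(idx)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if batch and not self.drop_last:  # trailing partial batch
            yield batch

    def __len__(self) -> int:
        if self.drop_last:
            return self.n_samples // self.batch_size
        return math.ceil(self.n_samples / self.batch_size)

s = SimpleBatchSampler(10, 4)
assert len(list(s)) == len(s) == 3  # __len__ agrees with __iter__
```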
