
The total number of batches shown by the sanity check progress bar is wrong #2891

Closed
manipopopo opened this issue Aug 9, 2020 · 8 comments · Fixed by #3751
Labels: bug Something isn't working · help wanted Open to be worked on · priority: 0 High priority task

manipopopo (Contributor) commented Aug 9, 2020

🐛 Bug

The total of the sanity check progress bar is set by
https://github.com/PyTorchLightning/pytorch-lightning/blob/4d0406ec8bf1c9147b34eb607411b78a9cd28243/pytorch_lightning/callbacks/progress.py#L296

The progress bar will always show trainer.num_sanity_val_steps as its total, even if the validation DataLoader has fewer batches than trainer.num_sanity_val_steps.

Maybe the total could be computed by

from pytorch_lightning.trainer import data_loading

# Fall back to an unbounded total for dataloaders without __len__
# (e.g. ones wrapping an IterableDataset).
num_full_val_dataloader_batches = [
    len(dataloader) if data_loading._has_len(dataloader) else float('inf')
    for dataloader in trainer.val_dataloaders
]
# Clamp each dataloader's contribution to num_sanity_val_steps before summing.
self.val_progress_bar.total = convert_inf(
    sum(min(num_batches, trainer.num_sanity_val_steps)
        for num_batches in num_full_val_dataloader_batches))
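(For reference, convert_inf is the small helper already defined in pytorch_lightning/callbacks/progress.py: it maps an inf total to None, since tqdm cannot render an infinite total, so the float('inf') fallback above degrades gracefully to an open-ended bar.)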

This uses the private function data_loading._has_len to check whether the dataloader defines __len__; maybe we could make data_loading._has_len public.
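For context, the check itself is tiny. A minimal sketch of what a public version might look like (the name has_len is hypothetical):

def has_len(dataloader) -> bool:
    """Return True if the dataloader exposes a working __len__."""
    try:
        # Map-style datasets report a length; iterable-style
        # datasets typically raise when asked for one.
        len(dataloader)
        return True
    except (TypeError, NotImplementedError):
        return False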

Alternatively, we could make num_full_val_dataloader_batches (and num_full_train_dataloader_batches) a member variable of Trainer and update its value in pytorch_lightning.trainer.data_loading.TrainerDataLoadingMixin.

To Reproduce

When running the following code (num_sanity_val_steps == 999 and len(val_data_loader) == 10), the sanity check progress bar shows

Validation sanity check:   1%|          | 9/999 [00:09<16:31,  1.00s/it]

Code sample

import time

import pytorch_lightning as pl
from torch.utils import data


class Dataset(data.Dataset):

  def __init__(self, length):
    self._elements = list(range(length))

  def __getitem__(self, item):
    return self._elements[item]

  def __len__(self):
    return len(self._elements)


class Model(pl.LightningModule):

  def forward(self, *args, **kwargs):
    pass

  def training_step(self, *args, **kwargs):
    pass

  def train_dataloader(self):
    pass

  def configure_optimizers(self):
    pass

  def validation_step(self, *args, **kwargs):
    time.sleep(1)  # slow each step so the progress bar is easy to observe
    return pl.EvalResult()


if __name__ == '__main__':
  model = Model()

  val_dataset_length = 10
  val_dataset = Dataset(val_dataset_length)
  val_data_loader = data.DataLoader(val_dataset)

  # num_sanity_val_steps (999) far exceeds the 10 available validation batches.
  trainer = pl.Trainer(num_sanity_val_steps=999, limit_val_batches=999,
                       max_epochs=0)
  trainer.fit(model, val_dataloaders=val_data_loader)

Expected behavior

The progress bar of the program above should show

Validation sanity check: 100%|██████████| 10/10 [00:10<00:00,  1.00s/it]

Environment

  • CUDA:
    • GPU:
    • available:
    • version:
  • Packages:
    • numpy: 1.18.5
    • pyTorch_debug: False
    • pyTorch_version: 1.6.0+cpu
    • pytorch-lightning: 0.9.0rc11
    • tensorboard: 1.15.0
    • tqdm: 4.48.2
  • System:
    • OS: Windows
    • architecture:
      • 64bit
      • WindowsPE
    • processor:
    • python: 3.7.3
    • version: 10.0.18362

Additional context

manipopopo added the "bug" and "help wanted" labels on Aug 9, 2020
ananyahjha93 self-assigned this on Aug 9, 2020
ananyahjha93 (Contributor) commented

@manipopopo I guess the first solution gets things done without changing anything else. Mind submitting a PR?

rohitgr7 (Contributor) commented Aug 9, 2020

I think once #2882 is resolved, this issue will be resolved automatically.

manipopopo (Contributor, Author) commented Aug 9, 2020

It seems that resolving #2882 will make the sanity check of Trainer(num_sanity_val_steps=5) run exactly 5 steps (or 5 * len(self.val_dataloaders)) if the validation DataLoader has at least 5 batches.

Resolving this issue will make the total of the sanity check progress bar equal the size of the validation DataLoader whenever it is less than num_sanity_val_steps (except when num_sanity_val_steps == -1).
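(num_sanity_val_steps == -1 means "run the full validation set", so in that case the dataloader length, when it is known, is already the right total.)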

rohitgr7 (Contributor) commented Aug 9, 2020

Then I would suggest fixing #2882 first: there we can make self.num_sanity_val_steps a list of the correct number of steps for each val_dataloader, and then we can simply assign sum(self.num_sanity_val_steps) to the progress bar total here.
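A rough sketch of that approach, assuming #2882 turns num_sanity_val_steps into a per-dataloader list (requested_steps stands in for the integer passed to Trainer.__init__, and has_len for the private _has_len helper):

# Inside the Trainer, once the validation dataloaders are attached:
trainer.num_sanity_val_steps = [
    min(requested_steps, len(dl)) if has_len(dl) else requested_steps
    for dl in trainer.val_dataloaders
]

# In ProgressBar.on_sanity_check_start the total then falls out directly:
self.val_progress_bar.total = convert_inf(sum(trainer.num_sanity_val_steps))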

manipopopo (Contributor, Author) commented Aug 9, 2020

It seems that we have several ways to tackle the problem.

  • Make self.num_sanity_val_steps a list of the correct numbers of steps, as @rohitgr7 suggests. ProgressBar can then update self.val_progress_bar.total from self.num_sanity_val_steps. A drawback: accessing the member num_sanity_val_steps may return values different from the one passed into Trainer.__init__.

  • Add a new member that saves a list of the correct numbers of steps, or a list of the sizes of the validation DataLoaders; ProgressBar can update self.val_progress_bar.total from the new member. Users of Trainer can still get the num_sanity_val_steps passed into Trainer.__init__ by accessing the member num_sanity_val_steps. (This supposes that #2882, "Int num_sanity_val_steps is always replaced by float limit_val_batches", is resolved and that the member num_sanity_val_steps is independent of limit_val_batches.) See the sketch after this list.

  • Compute the correct number of steps inside the ProgressBar with the help of pytorch_lightning.trainer.data_loading._has_len, as in the snippet at the top of this issue.
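A minimal sketch of the second option, with illustrative (hypothetical) member and helper names:

# Hypothetical Trainer member, refreshed by TrainerDataLoadingMixin
# whenever the validation dataloaders are (re)attached:
trainer.num_full_val_dataloader_batches = [
    len(dl) if has_len(dl) else float('inf')
    for dl in trainer.val_dataloaders
]

# ProgressBar derives the sanity-check total from it, leaving
# num_sanity_val_steps exactly as the user passed it:
self.val_progress_bar.total = convert_inf(
    sum(min(n, trainer.num_sanity_val_steps)
        for n in trainer.num_full_val_dataloader_batches))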

It seems that the values of the public member Trainer.num_sanity_val_steps in stable 0.8.5 and master are different.

manipopopo (Contributor, Author) commented

Fixed by #2917.

rohitgr7 reopened this on Aug 27, 2020
edenlightning added the "priority: 0 High priority task" label on Sep 23, 2020
edenlightning added this to the 0.9.x milestone on Sep 23, 2020
rohitgr7 self-assigned this on Sep 30, 2020
rohitgr7 (Contributor) commented

this issue is annoying. will fix this.
