
Validation loss in progress bar printed line by line #330

Closed
annemariet opened this issue Oct 8, 2019 · 15 comments · Fixed by #752
Labels: bug (Something isn't working), feature (Is an improvement or enhancement), help wanted (Open to be worked on)

annemariet commented Oct 8, 2019

Common bugs: checked.

Describe the bug
When adding a "progress_bar" key to the validation_end output, the progress bar doesn't behave as expected and prints one line per iteration, e.g.:

80%|8| 3014/3750 [00:23<00:01, 516.63it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_ 
82%|8| 3066/3750 [00:23<00:01, 517.40it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_ 
83%|8| 3118/3750 [00:23<00:01, 516.65it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_ 
85%|8| 3170/3750 [00:23<00:01, 517.42it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_ 
86%|8| 3222/3750 [00:23<00:01, 517.59it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_
87%|8| 3274/3750 [00:23<00:00, 518.00it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_ 
89%|8| 3326/3750 [00:23<00:00, 518.16it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_ 
90%|9| 3378/3750 [00:23<00:00, 518.45it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_
91%|9| 3430/3750 [00:23<00:00, 518.36it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_ 
93%|9| 3482/3750 [00:23<00:00, 518.02it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_ 
94%|9| 3534/3750 [00:24<00:00, 517.26it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_ 
96%|9| 3586/3750 [00:24<00:00, 517.68it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_ 
97%|9| 3638/3750 [00:24<00:00, 518.08it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_ 
98%|9| 3690/3750 [00:24<00:00, 518.18it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_
100%|9| 3742/3750 [00:24<00:00, 518.23it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_
100%|#| 3750/3750 [00:24<00:00, 518.23it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_loss=1.16]
save callback...
100%|#| 3750/3750 [00:24<00:00, 152.16it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_loss=1.16]

To Reproduce
Steps to reproduce the behavior:

  1. Take the minimal MNIST example (https://williamfalcon.github.io/pytorch-lightning/LightningModule/RequiredTrainerInterface/)
     and add some code to run it:
if __name__ == "__main__":
    model = CoolModel()

    # most basic trainer, uses good defaults
    default_save_path = '/tmp/checkpoints/'
    trainer = pl.Trainer(default_save_path=default_save_path,
                         show_progress_bar=True)
    trainer.fit(model)
  2. Change the validation_end method to:
    def validation_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        tqdm_dict = {'val_loss': avg_loss}

        return {
                'progress_bar': tqdm_dict,
                'log': {'val_loss': avg_loss},
        }
  3. Change training_step to:
    def training_step(self, batch, batch_nb):
        x, y = batch
        y_hat = self.forward(x)
        loss = F.cross_entropy(y_hat, y)
        output = {
            'loss': loss,  # required
            'progress_bar': {'training_loss': loss},  # optional (MUST ALL BE TENSORS)
        }
        return output
  4. Run the script and observe the issue at validation time.

Note that both steps 2 and 3 are necessary to reproduce the issue; each separately runs as expected.

Expected behavior
A progress bar on a single line.


Desktop:

  • OS: Linux
  • Versions:
    pytorch-lightning==0.5.1.3
    torch==1.2.0

Additional context
Actually, I ran into this issue after trying to add EarlyStopping, which asked for val_loss; I found out that it had to be added via the progress_bar metrics, which was quite unexpected to me (I would have expected it in "log" or as a direct key?).
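For reference, a minimal sketch of how early stopping is typically wired into the Trainer; the early_stop_callback argument name and the EarlyStopping signature are assumptions for the 0.5.x-era API and may differ across versions:

    import pytorch_lightning as pl
    from pytorch_lightning.callbacks import EarlyStopping

    # Hypothetical wiring: monitor the val_loss reported by validation_end.
    early_stop = EarlyStopping(monitor='val_loss', patience=3, mode='min')
    trainer = pl.Trainer(default_save_path='/tmp/checkpoints/',
                         show_progress_bar=True,
                         early_stop_callback=early_stop)
    trainer.fit(CoolModel())  # CoolModel from the minimal example above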

annemariet added the "bug" label Oct 8, 2019
annemariet (Author)

I just checked whether it was the tqdm version by upgrading from tqdm==4.35.0 to tqdm==4.36.1, to no avail.

williamFalcon (Contributor)

@annemariet thanks for finding this. Are you using this in a Jupyter notebook? That might be the issue.

But on a usability note, we'll move the early stopping to use keys not in progress_bar or log. Good point!

annemariet (Author)

Hi, thanks for your quick reply. I'm running this from the command line.

williamFalcon (Contributor)

@annemariet this can happen if you resize your terminal window during training. This is a tqdm bug, not a PL bug.

Re the early stopping, I just sent a fix yesterday where any keys NOT in "progress_bar" or "log" will be used for all callbacks. This is on master now.

I can reopen this if you are still having issues.
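Applied to the example above, that fix means val_loss can also be returned as a top-level key so callbacks such as early stopping can read it directly. The sketch below only illustrates the described behaviour, not the exact merged code:

    def validation_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        return {
            'val_loss': avg_loss,                    # top-level key, read by callbacks
            'progress_bar': {'val_loss': avg_loss},  # shown in the progress bar
            'log': {'val_loss': avg_loss},           # sent to the logger
        }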

ayberkydn

This actually still happens in Spyder, but works fine in the terminal.

pytorch-lightning==0.5.3.2
spyder==3.3.6
spyder-kernels==0.5.2
tqdm==4.38.0
ipython==7.9.0

ChristofHenkel

Sorry, although I searched for it, I had not seen that this was already discussed here. I think it's still an important open issue.

#721

sudarshan85

I am using a Jupyter notebook and this happens there. Is there a fix for it in Jupyter? I like to develop there before moving to the command line.

Borda (Member) commented Jan 23, 2020

@sudarshan85 it is an issue of tqdm, not Lightning; we cannot do much about it, try to upgrade...

ChristofHenkel

@Borda nevertheless, we could think of a solution to disable the validation progress bar individually, or otherwise give some flexibility.

sudarshan85 commented Jan 23, 2020

I'm curious whether something like fastprogress could be included in Lightning. There is also tqdm_notebook. I wonder whether this can be passed into Lightning for use as the progress bar.

wassname (Contributor)

One option is to use from tqdm.auto import tqdm; this way it will use IPython widgets when in a notebook.
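A quick way to check the behaviour outside Lightning (plain tqdm API, nothing Lightning-specific assumed):

    # tqdm.auto picks the ipywidgets-based bar inside Jupyter and the plain
    # terminal bar otherwise, so notebook output is not printed line by line.
    from tqdm.auto import tqdm
    import time

    for _ in tqdm(range(100), desc='demo'):
        time.sleep(0.01)  # stand-in for a validation batch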

Borda (Member) commented Jan 25, 2020

Is it just that, importing another tqdm class? Would you consider making a PR?

Borda added the "feature" and "help wanted" labels Jan 25, 2020
wassname (Contributor) commented Jan 26, 2020

Sure, it's PR #752. There will be some edge cases where someone is in a notebook environment but doesn't have their widgets set up. In that case they will get a warning message about what to do and their progress bar won't show.

So we could potentially add a Trainer parameter to override this, which may help those people.

FYI: this is how tqdm does the notebook detection.
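Roughly, the detection boils down to checking whether the code runs under an IPython kernel with ipywidgets available. The sketch below is an approximation of that idea, not tqdm's actual source:

    def running_in_notebook():
        """Best-effort Jupyter detection (illustrative only)."""
        try:
            from IPython import get_ipython
            shell = get_ipython()
            # Jupyter notebooks/labs run a ZMQInteractiveShell; terminals do not.
            if shell is None or shell.__class__.__name__ != 'ZMQInteractiveShell':
                return False
            import ipywidgets  # noqa: F401 -- the rich bar also needs widgets installed
            return True
        except ImportError:
            return False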

williamFalcon pushed a commit that referenced this issue Jan 26, 2020
* use tqdm.auto in trainer

This will import the ipywidgets version of tqdm if available. This works nicely in notebooks by not filling up the log.

In the terminal it will use the same old tqdm.

We might also want to consider passing in the tqdm we want as an argument, since there may be some edge cases where ipywidgets is available but the interface doesn't support it (e.g. vscode?) or isn't working. In that case people will get a warning message but may want to configure it themselves.

* use `from tqdm.auto` in eval loop

* indents
Borda (Member) commented Mar 2, 2020

I will close this in favour of #765, so please let's continue the discussion there... 🤖

ZeyuSun commented Aug 22, 2021

This may have introduced a bug. It seems to be caused by multiple workers in the dataloader. The error messages are messy, with a bunch of nested exceptions, but the origin seems to be:

AssertionError: can only test a child process
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7f7e5938cee0>
Traceback (most recent call last):
  File "/home/user/.conda/envs/base/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1324, in __del__
    self._shutdown_workers()
  File "/home/user/.conda/envs/base/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1279, in _shutdown_workers
    self._pin_memory_thread.join()
  File "/home/user/.conda/envs/base/lib/python3.8/threading.py", line 1008, in join
    raise RuntimeError("cannot join current thread")
RuntimeError: cannot join current thread

After finding the discussion here, I changed
https://github.com/PyTorchLightning/pytorch-lightning/blob/b1a859f312fb4ba7afa8861a316ba2e80e091680/pytorch_lightning/callbacks/progress.py#L32
to

from tqdm import tqdm as _tqdm

and it fixed my problem.
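An alternative to patching the installed package is to override the bar with a custom callback. The sketch below assumes the ProgressBar callback and its init_train_tqdm hook from Lightning around 1.4 (the version era of that traceback); hook and attribute names may differ in other releases:

    import sys
    from tqdm import tqdm  # the plain terminal bar, not tqdm.auto
    from pytorch_lightning.callbacks import ProgressBar

    class TerminalProgressBar(ProgressBar):
        """Build the main bar with plain tqdm instead of tqdm.auto."""

        def init_train_tqdm(self):
            # Mirror the stock bar's options as closely as the public attributes allow.
            return tqdm(
                desc='Training',
                initial=self.train_batch_idx,
                position=(2 * self.process_position),
                disable=self.is_disabled,
                leave=True,
                dynamic_ncols=True,
                file=sys.stdout,
            )

    # trainer = pl.Trainer(callbacks=[TerminalProgressBar()])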
