
Validation loss in progress bar printed line by line #330

Closed
annemariet opened this issue Oct 8, 2019 · 15 comments · Fixed by #752
Labels: bug (Something isn't working), feature (Is an improvement or enhancement), help wanted (Open to be worked on)

annemariet commented Oct 8, 2019

Common bugs: checked.

Describe the bug
When adding a "progress_bar" key to the validation_end output, the progress bar doesn't behave as expected and prints one line per iteration, e.g.:

80%|8| 3014/3750 [00:23<00:01, 516.63it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_ 
82%|8| 3066/3750 [00:23<00:01, 517.40it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_ 
83%|8| 3118/3750 [00:23<00:01, 516.65it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_ 
85%|8| 3170/3750 [00:23<00:01, 517.42it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_ 
86%|8| 3222/3750 [00:23<00:01, 517.59it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_
87%|8| 3274/3750 [00:23<00:00, 518.00it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_ 
89%|8| 3326/3750 [00:23<00:00, 518.16it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_ 
90%|9| 3378/3750 [00:23<00:00, 518.45it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_
91%|9| 3430/3750 [00:23<00:00, 518.36it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_ 
93%|9| 3482/3750 [00:23<00:00, 518.02it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_ 
94%|9| 3534/3750 [00:24<00:00, 517.26it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_ 
96%|9| 3586/3750 [00:24<00:00, 517.68it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_ 
97%|9| 3638/3750 [00:24<00:00, 518.08it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_ 
98%|9| 3690/3750 [00:24<00:00, 518.18it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_
100%|9| 3742/3750 [00:24<00:00, 518.23it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_
100%|#| 3750/3750 [00:24<00:00, 518.23it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_loss=1.16]
save callback...
100%|#| 3750/3750 [00:24<00:00, 152.16it/s, batch_nb=1874, epoch=14, gpu=0, loss=1.070, training_loss=0.792, val_loss=1.16]

To Reproduce
Steps to reproduce the behavior:

  1. Take the minimal MNIST example (https://williamfalcon.github.io/pytorch-lightning/LightningModule/RequiredTrainerInterface/)
     and add some code to run it:
if __name__ == "__main__":
    model = CoolModel()

    # most basic trainer, uses good defaults
    default_save_path = '/tmp/checkpoints/'
    trainer = pl.Trainer(default_save_path=default_save_path,
                         show_progress_bar=True)
    trainer.fit(model)
  2. Change the validation_end method to:
    def validation_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        tqdm_dict = {'val_loss': avg_loss}

        return {
                'progress_bar': tqdm_dict,
                'log': {'val_loss': avg_loss},
        }
  3. Change training_step to:
    def training_step(self, batch, batch_nb):
        x, y = batch
        y_hat = self.forward(x)
        loss = F.cross_entropy(y_hat, y)
        output = {
            'loss': loss,  # required
            'progress_bar': {'training_loss': loss},  # optional (MUST ALL BE TENSORS)
        }
        return output
  4. Run the script and observe the issue at validation time.

Note that both steps 2 and 3 are necessary to reproduce the issue; each separately runs as expected.

Expected behavior
A progress bar on a single line.


Desktop:

  • OS: Linux
  • Versions:
    pytorch-lightning==0.5.1.3
    torch==1.2.0

Additional context
Actually, I ran into this issue after trying to add EarlyStopping, which asked for val_loss; I found out that it had to be added via the progress_bar metrics, which was quite unexpected to me (I would have expected it in "log" or as a direct key?).
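For reference, a minimal sketch of how early stopping is typically wired into the Trainer; the early_stop_callback argument name and the EarlyStopping signature are assumptions for the 0.5.x-era API and may differ across versions:

    import pytorch_lightning as pl
    from pytorch_lightning.callbacks import EarlyStopping

    # Hypothetical wiring: monitor the val_loss reported by validation_end.
    early_stop = EarlyStopping(monitor='val_loss', patience=3, mode='min')
    trainer = pl.Trainer(default_save_path='/tmp/checkpoints/',
                         show_progress_bar=True,
                         early_stop_callback=early_stop)
    trainer.fit(CoolModel())  # CoolModel from the minimal example above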

annemariet added the "bug" label Oct 8, 2019
annemariet (Author)

I just checked whether it was the tqdm version by upgrading from tqdm==4.35.0 to tqdm==4.36.1, to no avail.

williamFalcon (Contributor)

@annemariet thanks for finding this. Are you using this in a Jupyter notebook? That might be the issue.

But on a usability note, we'll move the early stopping to use keys not in progress_bar or log. Good point!

annemariet (Author)

Hi, thanks for your quick reply. I'm running this from the command line.

williamFalcon (Contributor)

@annemariet this can happen if you resize your terminal window during training. This is a tqdm bug, not a PL bug.

Re the early stopping, I just sent a fix yesterday where any keys NOT in "progress_bar" or "log" will be used for all callbacks. This is on master now.

I can reopen this if you are still having issues.
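Applied to the example above, that fix means val_loss can also be returned as a top-level key so callbacks such as early stopping can read it directly. The sketch below only illustrates the described behaviour, not the exact merged code:

    def validation_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        return {
            'val_loss': avg_loss,                    # top-level key, read by callbacks
            'progress_bar': {'val_loss': avg_loss},  # shown in the progress bar
            'log': {'val_loss': avg_loss},           # sent to the logger
        }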

ayberkydn

This actually still happens in Spyder, but works fine in the terminal.

pytorch-lightning==0.5.3.2
spyder==3.3.6
spyder-kernels==0.5.2
tqdm==4.38.0
ipython==7.9.0

ChristofHenkel

Sorry, although I searched for it, I had not seen that this was already discussed here. I think it's still an important open issue.

#721

sudarshan85

I am using a Jupyter notebook and this happens there. Is there a fix for it in Jupyter? I like to develop there before moving to the command line.

Borda (Member) commented Jan 23, 2020

@sudarshan85 it is an issue of tqdm, not Lightning; we cannot do much about it, try to upgrade...

ChristofHenkel

@Borda nevertheless, we could think of a solution to disable the validation progress bar individually, or otherwise give some flexibility.

sudarshan85 commented Jan 23, 2020

I'm curious whether something like fastprogress could be included in Lightning. There is also tqdm_notebook. I wonder whether this can be passed into Lightning for use as the progress bar.

wassname (Contributor)

One option is to use from tqdm.auto import tqdm; this way it will use IPython widgets when in a notebook.
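A quick way to check the behaviour outside Lightning (plain tqdm API, nothing Lightning-specific assumed):

    # tqdm.auto picks the ipywidgets-based bar inside Jupyter and the plain
    # terminal bar otherwise, so notebook output is not printed line by line.
    from tqdm.auto import tqdm
    import time

    for _ in tqdm(range(100), desc='demo'):
        time.sleep(0.01)  # stand-in for a validation batch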

Borda (Member) commented Jan 25, 2020

Is it just that, importing another tqdm class? Would you consider making a PR?

Borda added the "feature" and "help wanted" labels Jan 25, 2020
wassname (Contributor) commented Jan 26, 2020

Sure, it's PR #752. There will be some edge cases where someone is in a notebook environment but doesn't have their widgets set up. In that case they will get a warning message about what to do and their progress bar won't show.

So we could potentially add a Trainer parameter to override this, which may help those people.

FYI: this is how tqdm does the notebook detection.
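Roughly, the detection boils down to checking whether the code runs under an IPython kernel with ipywidgets available. The sketch below is an approximation of that idea, not tqdm's actual source:

    def running_in_notebook():
        """Best-effort Jupyter detection (illustrative only)."""
        try:
            from IPython import get_ipython
            shell = get_ipython()
            # Jupyter notebooks/labs run a ZMQInteractiveShell; terminals do not.
            if shell is None or shell.__class__.__name__ != 'ZMQInteractiveShell':
                return False
            import ipywidgets  # noqa: F401 -- the rich bar also needs widgets installed
            return True
        except ImportError:
            return False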

williamFalcon pushed a commit that referenced this issue Jan 26, 2020
* use tqdm.auto in trainer

This will import the ipywidgets version of tqdm if available. This works nicely in notebooks by not filling up the log.

In the terminal it will use the same old tqdm.

We might also want to consider passing in the tqdm we want as an argument, since there may be some edge cases where ipywidgets is available but the interface doesn't support it (e.g. vscode?) or isn't working. In that case people will get a warning message but may want to configure it themselves.

* use `from tqdm.auto` in eval loop

* indents
Borda (Member) commented Mar 2, 2020

I will close this in favour of #765, so please let's continue the discussion there... 🤖

ZeyuSun commented Aug 22, 2021

This may have introduced a bug. It seems to be caused by multiple workers in the dataloader. The error messages are messy, with a bunch of nested exceptions, but the origin seems to be:

AssertionError: can only test a child process
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7f7e5938cee0>
Traceback (most recent call last):
  File "/home/user/.conda/envs/base/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1324, in __del__
    self._shutdown_workers()
  File "/home/user/.conda/envs/base/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1279, in _shutdown_workers
    self._pin_memory_thread.join()
  File "/home/user/.conda/envs/base/lib/python3.8/threading.py", line 1008, in join
    raise RuntimeError("cannot join current thread")
RuntimeError: cannot join current thread

After finding the discussion here, I changed
https://github.com/PyTorchLightning/pytorch-lightning/blob/b1a859f312fb4ba7afa8861a316ba2e80e091680/pytorch_lightning/callbacks/progress.py#L32
to

from tqdm import tqdm as _tqdm

and it fixed my problem.
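An alternative to patching the installed package is to override the bar with a custom callback. The sketch below assumes the ProgressBar callback and its init_train_tqdm hook from Lightning around 1.4 (the version era of that traceback); hook and attribute names may differ in other releases:

    import sys
    from tqdm import tqdm  # the plain terminal bar, not tqdm.auto
    from pytorch_lightning.callbacks import ProgressBar

    class TerminalProgressBar(ProgressBar):
        """Build the main bar with plain tqdm instead of tqdm.auto."""

        def init_train_tqdm(self):
            # Mirror the stock bar's options as closely as the public attributes allow.
            return tqdm(
                desc='Training',
                initial=self.train_batch_idx,
                position=(2 * self.process_position),
                disable=self.is_disabled,
                leave=True,
                dynamic_ncols=True,
                file=sys.stdout,
            )

    # trainer = pl.Trainer(callbacks=[TerminalProgressBar()])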
