
Improve tqdm progress bar #765

Closed · hadim opened this issue Jan 29, 2020 · 35 comments

Labels: feature (Is an improvement or enhancement), good first issue (Good for newcomers), help wanted (Open to be worked on)
@hadim (Contributor) commented Jan 29, 2020

At the moment the progress bar is initialized with the arg leave=False: https://github.com/PyTorchLightning/pytorch-lightning/blob/deffbaba7ffb16ff57b56fe65f62df761f25fbd6/pytorch_lightning/trainer/trainer.py#L861

Sometimes, it's nice to be able to see the previous progress bar to look at the evolution of the loss and metrics.

Would it be possible to add an argument to the trainer to override the default tqdm parameters?

Another point: tqdm progress bars can be nested (https://github.com/tqdm/tqdm#nested-progress-bars). Could we imagine having a global progress bar and then a nested progress bar for each epoch loop?
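
For illustration, a minimal sketch of that idea with plain tqdm (hypothetical loop sizes, not Lightning's implementation):

from tqdm.auto import tqdm

n_epochs, n_batches = 5, 100

# Outer bar tracks the whole run, inner bar tracks the current epoch.
with tqdm(total=n_epochs, desc="Training", position=0) as epoch_bar:
    for epoch in range(n_epochs):
        with tqdm(total=n_batches, desc=f"Epoch {epoch + 1}", position=1, leave=False) as batch_bar:
            for _ in range(n_batches):
                # ... training step would go here ...
                batch_bar.update(1)
        epoch_bar.update(1)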

@hadim hadim added feature Is an improvement or enhancement help wanted Open to be worked on labels Jan 29, 2020
@hadim (Contributor, Author) commented Feb 8, 2020

Another nice addition would be a global progress bar to have an ETA for the end of the whole training. Maybe a more general way to address this issue is to abstract the use of the progress bar in Trainer (with a callback system for example), so people can extend and tweak progress bar usage as they need.

@Borda (Member) commented Feb 10, 2020

@hadim sounds interesting, do you have any particular implementation in mind?
Would you mind making a PR? =)

@hadim (Contributor, Author) commented Feb 10, 2020

I think the progress bar should not be hardcoded in the trainer but abstracted in a callback. Once #776 is merged I could have a look if it's possible with the current API.

More generally the loggers should also be callbacks IMO. That being said it's easy to propose when you're not in charge :-)

I'll try to make a PR once #776 is merged.

@Borda Borda removed the help wanted Open to be worked on label Feb 11, 2020
@Borda (Member) commented Mar 2, 2020

@hadim are you still interested in implementing this progress bar?

@Borda Borda added this to the 0.7.1 milestone Mar 2, 2020
@hadim (Contributor, Author) commented Mar 3, 2020

I've made a custom progress bar as a callback and it works well for my needs. Not sure it will fit everyone's needs.

from tqdm.auto import tqdm

import torch
from pytorch_lightning.callbacks import Callback


class ProgressBar(Callback):
    """Global progress bar.
    TODO: add progress bar for training, validation and testing loop.
    """

    def __init__(self, global_progress: bool = True, leave_global_progress: bool = True):
        super().__init__()

        self.global_progress = global_progress
        self.global_desc = "Epoch: {epoch}/{max_epoch}"
        self.leave_global_progress = leave_global_progress
        self.global_pb = None

    def on_fit_start(self, trainer, pl_module):
        desc = self.global_desc.format(epoch=trainer.current_epoch + 1, max_epoch=trainer.max_epochs)

        self.global_pb = tqdm(
            desc=desc,
            total=trainer.max_epochs,
            initial=trainer.current_epoch,
            leave=self.leave_global_progress,
            disable=not self.global_progress,
        )

    def on_fit_end(self, trainer, pl_module):
        self.global_pb.close()
        self.global_pb = None

    def on_epoch_end(self, trainer, pl_module):

        # Set description
        desc = self.global_desc.format(epoch=trainer.current_epoch + 1, max_epoch=trainer.max_epochs)
        self.global_pb.set_description(desc)

        # Set logs and metrics
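        # NOTE: assumes the LightningModule defines a `logs` dict attribute (not part of the standard API)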
        logs = pl_module.logs
        for k, v in logs.items():
            if isinstance(v, torch.Tensor):
                logs[k] = v.squeeze().item()
        self.global_pb.set_postfix(logs)

        # Update progress
        self.global_pb.update(1)

Only a global progress bar is implemented at the moment.

I could make a PR but some people might prefer the original one so I don't know if it's worth it.

@Borda (Member) commented Mar 3, 2020

Yeah, using a callback-driven progress bar looks like a much cleaner way than checking the for loop wrapped by tqdm.

@danieltudosiu commented:

May I also add that I find the tqdm progress bar starts weirdly, showing a percentage of 6% just after a single batch. Also, the progress bar shows a total of 790, but if I calculate it by hand an epoch has either 528 or 1056 batches (either one pass, or one forward and one backward).

@williamFalcon (Contributor) commented:

the bar shows the sum of train + val

@danieltudosiu commented:

the bar shows the sum of train + val

Sorry, I do not follow. I was referring to the progress counter being off; after a single batch it shows:

Epoch 1: 6%|▋ | 50/790 [00:09<02:19, 5.29it/s, loss=3623526.000, training_loss=3.62e+6, v_num=0]0.0
The batch size is 4, and neither my training, validation, nor training+validation sets have 790 batches.

@williamFalcon (Contributor) commented:

50/790 = 6%.
The progress bar updates in intervals of 50 batches; at batch 51 it will say 12%.

You can change that argument from 50 to 1 (the bar refresh rate).
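
Assuming a Lightning version from that era where the Trainer still exposes this as the progress_bar_refresh_rate argument (it was later moved onto the progress bar callback), that would look roughly like:

import pytorch_lightning as pl

# Refresh the bar every batch instead of every 50 batches
# (argument name and default depend on the installed version).
trainer = pl.Trainer(progress_bar_refresh_rate=1)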

@williamFalcon (Contributor) commented:

@hadim I think abstracting the current progress bar into a callback would be cool. Then, as you said, the user can modify it however they want by overriding parts of the callback.

@danieltudosiu commented:

50/790 = 6%.
the progress bar updates in intervals of 50 batches. at batch 51 it will say 12%.

you can change that argument from 50 to 1 (bar refresh rate)

Yes, but that jump to 50 happens after only 1 batch. Shouldn't it stay at 0 until batch no. 50?

@hadim (Contributor, Author) commented Mar 10, 2020

@williamFalcon: I agree this should be done in a callback. Not sure I'll have time to do that in the short term but anyone is free to use my code above.

@elkotito (Contributor) commented:

A somewhat related question: should the progress bar look like the output below? It creates a "list of progress bars" when it switches to evaluation mode.

Epoch 9:  79%|███████▉  | 691/870 [00:07<00:01, 91.55batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Validating:   0%|          | 0/179 [00:00<?, ?batch/s]
Epoch 9:  81%|████████▏ | 708/870 [00:07<00:01, 105.89batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  83%|████████▎ | 724/870 [00:07<00:01, 117.67batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  85%|████████▌ | 741/870 [00:08<00:01, 128.86batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  87%|████████▋ | 758/870 [00:08<00:00, 137.11batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  89%|████████▉ | 775/870 [00:08<00:00, 145.04batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  91%|█████████ | 792/870 [00:08<00:00, 149.82batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  93%|█████████▎| 809/870 [00:08<00:00, 153.57batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  95%|█████████▍| 826/870 [00:08<00:00, 155.73batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9:  97%|█████████▋| 842/870 [00:08<00:00, 148.12batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]
Epoch 9: 100%|██████████| 870/870 [00:08<00:00, 152.45batch/s, batch_idx=690, gpu=0, loss=0.012, training_loss=0.0161, v_num=113]

@Borda (Member) commented Mar 25, 2020

I was observing something similar in other projects and it is hard to determine the cause; sometimes it is caused by debug mode (e.g. in PyCharm)... but this is a TQDM-related thing, I think we can't do anything about it... :[

@Borda (Member) commented Mar 26, 2020

@hadim are you still willing to implement #765 (comment)?
@danieltudosiu the default was changed in #1100
@mateuszpieniak it is a TQDM issue, we cannot do much about it...
also, the TQDM default was changed in #749

@Borda Borda unassigned hadim Mar 26, 2020
@Borda Borda added good first issue Good for newcomers help wanted Open to be worked on labels Mar 26, 2020
@hadim (Contributor, Author) commented Mar 26, 2020

Sorry @Borda but this is not a good moment for me to do that.

@Borda Borda modified the milestones: 0.7.2, 0.7.3 Apr 8, 2020
@Borda (Member) commented Apr 8, 2020

@awaelchli could you also self-assign this one, as they are almost the same...

@awaelchli (Member) commented:

@Borda yes, could you assign me (can't self-assign) :)

@awaelchli (Member) commented:

The progress bar is now a callback (#1450). What remains is the question of whether there should be an additional global progress bar (as suggested by @hadim) or whether it is left to the user to extend such a feature.

@Borda Borda modified the milestones: 0.7.4, 0.7.5 Apr 24, 2020
@Borda (Member) commented May 13, 2020

@awaelchli I would assume this to be closed by #1450, and if we find we need something else we will add it later... anyway, feel free to reopen if we are (I am) missing something 🐰

@achinta commented Oct 30, 2020

A bit related question, should a progress bar look like below? It creates a "list of progress bars" when it switches to evaluation mode.

[progress bar output quoted from @elkotito's comment above]

Any suggestions on how to resolve this?

@awaelchli (Member) commented:

In which terminal emulator are you running this?
I often see this tqdm behavior in PyCharm and as far as I know we can't do anything about it. It's a tqdm issue.

@achinta commented Oct 30, 2020

In which terminal emulator are you running this?
I often see this tqdm behavior in PyCharm and as far as I know we can't do anything about it. It's a tqdm issue.

I ran it on zsh and bash. tqdm==4.48.2, pytorch-lightning==1.0.0

@jzazo commented Nov 23, 2020

I am seeing this behavior in JupyterLab as well:

Epoch 1:  54%|█████▍    | 4271/7859 [04:08<03:33, 16.81it/s, loss=0.545, v_num=0]
Epoch 1:  55%|█████▍    | 4287/7859 [04:08<03:31, 16.87it/s, loss=0.545, v_num=0]
Epoch 1:  55%|█████▍    | 4303/7859 [04:09<03:30, 16.89it/s, loss=0.545, v_num=0]
Validating:   7%|▋         | 258/3809 [00:02<01:12, 49.27it/s]
Epoch 1:  55%|█████▍    | 4319/7859 [04:10<03:29, 16.90it/s, loss=0.545, v_num=0]
Validating:   7%|▋         | 274/3809 [00:03<02:08, 27.59it/s]
Validating:   7%|▋         | 280/3809 [00:03<02:06, 27.90it/s]
Epoch 1:  55%|█████▌    | 4335/7859 [04:10<03:28, 16.92it/s, loss=0.545, v_num=0]

The progress bar seems to work well when testing with trainer.test(model, dm), and the learning-rate tuner also shows a correct progress bar, but not when fitting. Any known fix for JupyterLab?

@awaelchli (Member) commented Nov 23, 2020

It's because of the stacking. Progress bar stacking has never worked well in Jupyter and Google Colab. As far as we know, it's a tqdm issue. Try running a stacked tqdm progress bar (without Lightning) in Jupyter and you will see the same.
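
For example, a stand-alone reproduction (no Lightning involved) that tends to show the same duplicated lines in notebook frontends:

import time
from tqdm.auto import tqdm

# Mimic Lightning's layout: a persistent main bar plus a temporary nested bar.
main_bar = tqdm(total=100, desc="Epoch 1", position=0)
for i in range(100):
    time.sleep(0.01)
    main_bar.update(1)
    if i == 49:  # start a temporary "validation" bar halfway through
        val_bar = tqdm(total=20, desc="Validating", position=1, leave=False)
        for _ in range(20):
            time.sleep(0.01)
            val_bar.update(1)
        val_bar.close()
main_bar.close()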

@jzazo commented Nov 23, 2020

In the method init_validation_tqdm, at line 289 of pytorch_lightning/callbacks/progress.py, there is leave=False. Shouldn't it be leave=True? It is True in the train and test init tqdm methods.

Got the idea from here.

@awaelchli (Member) commented:

If we set it to leave=True, it will stay and fill up the terminal. But we want it to go away once validation is over, because it's only a temporary bar that runs in parallel with the main bar. The main bar should always stay because it shows the epoch counter for the whole training.

Maybe I'm missing something. I appreciate you trying to find a fix.

@jzazo commented Nov 24, 2020

I ran the following code to test whether setting leave=True solved the problem (it didn't):

from tqdm.auto import tqdm
import sys

from pytorch_lightning.callbacks import ProgressBar


class LitProgressBar(ProgressBar):

    def init_validation_tqdm(self):
        """Override this to customize the tqdm bar for validation."""
        bar = tqdm(
            desc='Validating',
            position=(2 * self.process_position + 1),
            disable=self.is_disabled,
            leave=True,  # changed from the default leave=False
            dynamic_ncols=True,
            file=sys.stdout
        )
        return bar

I then ran my model with the custom callback, and after a few steps (about 50% through the epoch) the screen was packed again with multiple printed lines :(

As a temporary fix I will disable the validation progress bar with a custom callback, at least when running with Jupyter. Thanks for the help!
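
For reference, a sketch of that workaround against the callback API discussed above (the class name NoValBar is made up; assumes a pytorch-lightning ~1.0 ProgressBar with an init_validation_tqdm hook, as in the snippet above):

from tqdm.auto import tqdm
from pytorch_lightning.callbacks import ProgressBar


class NoValBar(ProgressBar):
    """Keep the main training bar but silence the nested validation bar."""

    def init_validation_tqdm(self):
        # A disabled tqdm instance still supports update()/close(),
        # so the rest of the callback keeps working; it just prints nothing.
        return tqdm(disable=True)


# Usage: trainer = pl.Trainer(callbacks=[NoValBar()])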

@laiming997 commented:

I just have a problem with rewriting the tqdm progress bar. I want to keep both the train and val progress bars, so I set leave=True for both of them. But when I print some information about the result in val_epoch_end, it rewrites the progress bar like this:
Training Epoch 0 / 1000: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 48/48 [00:08<00:00, 5.40it/s, loss=1.196, v_num=77 train_acc val_acc█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:02<00:00, 6.87it/s]
0 0.562356 0.389292
Validating: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:38<00:00, 3.18s/it]
Validating: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:38<00:00, 3.19s/it]
Validating: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:38<00:00, 3.19s/it]
Validating: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:38<00:00, 3.21s/it]
Validating: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:39<00:00, 3.27s/it]
train_acc val_acc
0 0.562356 0.389292████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:02<00:00, 6.78it/s]
1 0.716502 0.570754
Training Epoch 2 / 1000: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 48/48 [00:08<00:00, 5.55it/s, loss=1.047, v_num=77 train_acc val_acc█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:02<00:00, 7.15it/s]
0 0.562356 0.389292
1 0.716502 0.570754
2 0.733585 0.594477
Training Epoch 3 / 1000: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 48/48 [00:08<00:00, 5.51it/s, loss=1.075, v_num=77 train_acc val_acc
0 0.562356 0.389292
1 0.716502 0.570754████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:02<00:00, 7.16it/s]
2 0.733585 0.594477
3 0.728909 0.285985
Training Epoch 4 / 1000: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 48/48 [00:08<00:00, 5.46it/s, loss=1.101, v_num=77 train_acc val_acc
0 0.562356 0.389292
1 0.716502 0.570754
2 0.733585 0.594477████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:02<00:00, 7.16it/s]
3 0.728909 0.285985
4 0.725164 0.556583

@awaelchli (Member) commented:

I don't understand exactly what you are trying to achieve.

  1. Setting leave=True will do what it says: it leaves the progress bar there after it is completed and will not restart it. That's why you see the next epoch print a new bar.

  2. You are printing other data to stdout, so it is normal that the progress bar gets repeated (see the note on tqdm.write below).
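
As an aside, tqdm provides tqdm.write() for this case; printing through it instead of print() generally avoids breaking an active bar. A minimal sketch:

from tqdm.auto import tqdm

for i in tqdm(range(100), desc="Training"):
    if i % 10 == 0:
        # Printed above the bar instead of interleaving with it.
        tqdm.write(f"step {i}: some metric")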

@ce52supi commented:

[quotes @hadim's earlier comment and global ProgressBar callback code in full]

Where can we use this ProgressBar class? Is it passed to pl.Trainer()?

@awaelchli (Member) commented:

It's a callback, so you can add it to the callback list in the Trainer: Trainer(callbacks=[ProgressBar()])
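
Spelled out (assuming the ProgressBar class from @hadim's comment above is defined in your script, and model is your LightningModule):

import pytorch_lightning as pl

trainer = pl.Trainer(
    max_epochs=10,
    callbacks=[ProgressBar()],  # the custom global progress bar defined above
)
trainer.fit(model)

Depending on the Lightning version, you may also want to turn off the built-in bar (as in the later comment that uses enable_progress_bar=False) so the two bars don't interleave.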

@Jadiker commented Jan 29, 2023

Another nice addition would be a global progress bar to have an ETA for the end of the whole training. Maybe a more general way to address this issue is to abstract the use of the progress bar in Trainer (with a callback system for example), so people can extend and tweak progress bar usage as they need.

I'm new to PyTorch Lightning and still would like the global ETA functionality. I've read through this thread and this one and it's still unclear to me how to get an ETA for how long training will take.

I've tried copying the code above for a global ETA, but right now I'm getting the error AttributeError: 'LitModel' object has no attribute 'logs', where LitModel is my PyTorch Lightning model.

What do I need to do in order to get the global ETA functionality?

EDIT: Never mind, I just removed the # Set logs and metrics part of the code and now it works fine. Thank you!
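
If the postfix metrics are still wanted, one possible substitute for the removed block is to read the trainer's callback_metrics inside the same hook (a suggested alternative, not part of the original snippet):

import torch

# Inside the callback's epoch-end hook, replacing the `pl_module.logs` block;
# trainer.callback_metrics holds whatever the module logged via self.log(...).
logs = {
    k: (v.item() if isinstance(v, torch.Tensor) else v)
    for k, v in trainer.callback_metrics.items()
}
self.global_pb.set_postfix(logs)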

@Jadiker commented Jan 30, 2023

Here's an updated version of the code that should work for the newer callback functions. It also includes a lower-level training progress bar in addition to the global progress bar:

from tqdm.auto import tqdm
from pytorch_lightning.callbacks import Callback

class GlobalProgressBar(Callback):
    """Global progress bar.
    Originally from: https://github.com/Lightning-AI/lightning/issues/765
    """

    def __init__(self, global_progress: bool = True, leave_global_progress: bool = True):
        super().__init__()

        self.global_progress = global_progress
        self.global_desc = "Epoch: {epoch}/{max_epoch}"
        self.leave_global_progress = leave_global_progress
        self.global_pb = None
        self.step_pb = None

    def on_fit_start(self, trainer, pl_module):
        desc = self.global_desc.format(epoch=trainer.current_epoch + 1, max_epoch=trainer.max_epochs)

        self.global_pb = tqdm(
            desc=desc,
            total=trainer.max_epochs,
            initial=trainer.current_epoch,
            leave=self.leave_global_progress,
            disable=not self.global_progress,
        )

    def on_train_epoch_start(self, trainer, pl_module):
        self.step_pb = tqdm(
            desc="Training",
            total=len(trainer.train_dataloader),
            leave=False,
        )
        
    def on_train_epoch_end(self, trainer, pl_module):
        self.step_pb.close()
        self.step_pb = None

        # Set description
        desc = self.global_desc.format(epoch=trainer.current_epoch + 1, max_epoch=trainer.max_epochs)
        self.global_pb.set_description(desc)

        # # Set logs and metrics
        # logs = pl_module.logs
        # for k, v in logs.items():
        #     if isinstance(v, torch.Tensor):
        #         logs[k] = v.squeeze().item()
        # self.global_pb.set_postfix(logs)

        # Update progress
        self.global_pb.update(1)
        
    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        self.step_pb.update(1)
        
    def on_fit_end(self, trainer, pl_module):
        self.global_pb.close()
        self.global_pb = None

To use it, do:

import pytorch_lightning as pl

trainer = pl.Trainer(
    enable_progress_bar=False,  # disable the built-in bar so the two don't clash
    callbacks=[GlobalProgressBar()]
)
