[Improvement] max_batches support to training log and tqdm progress bar. #1554

Merged

merged 3 commits into Deci-AI:master on Oct 23, 2023

Conversation

hakuryuu96 (Contributor)

Issue description

As @BloodAxe described, if training_hyperparams.max_train/valid_batches are redefined in the CLI, tqdm and the training log do not take the change into account and keep showing the full length of the dataloader. E.g., when the user executes something like this:

# Assumed imports for the metrics used below (per super_gradients' public API):
from super_gradients.training.metrics import Accuracy, Top5

train_params = {
    "max_epochs": 300,
    "phase_callbacks": phase_callbacks,
    "initial_lr": lr,
    "loss": loss_fn,
    "optimizer": optimizer,
    "train_metrics_list": [Accuracy(), Top5()],
    "valid_metrics_list": [Accuracy(), Top5()],
    "metric_to_watch": "Accuracy",
    "greater_metric_to_watch_is_better": True,
    "lr_scheduler_step_type": "epoch",
    # Use only the first 24 batches of each dataloader per epoch:
    "max_train_batches": 24,
    "max_valid_batches": 24,
}

trainer.train(model=net, training_params=train_params, train_loader=train_loader, valid_loader=valid_loader)

the resulting logs and progress bar look like this:

[2023-10-19 13:30:41] INFO - sg_trainer_utils.py - TRAINING PARAMETERS:
    - Mode:                         Single GPU
    - Number of GPUs:               1          (2 available on the machine)
    - Full dataset size:            50000      (len(train_set))
    - Batch size per GPU:           16         (batch_size)
    - Batch Accumulate:             1          (batch_accumulate)
    - Total batch size:             16         (num_gpus * batch_size)
    - Effective Batch size:         16         (num_gpus * batch_size * batch_accumulate)
    - Iterations per epoch:         3125         (len(train_loader))
    - Gradient updates per epoch:   3125         (len(train_loader) / batch_accumulate)

[2023-10-19 13:30:41] WARNING - sg_trainer_utils.py - max_train_batch is set to 24. This limits the number of iterations per epoch and gradient updates per epoch.
[2023-10-19 13:30:41] INFO - sg_trainer.py - Started training for 300 epochs (0/299)

Train epoch 0: 1%|█                        | 24/3125 [00:02<00:00,  9.22it/s, Accuracy=0.0729, CrossEntropyLoss=2.53, Top5=0.508, gpu_mem=0.231]
Validating: 1%|█                      | 24/3125 [00:00<00:00, 126.12it/s]
[2023-10-19 13:30:44] INFO - base_sg_logger.py - Checkpoint saved in /home/phil/deci/super-gradients/checkpoints/Cifar10_external_objects_example/RUN_20231019_133041_486426/ckpt_best.pth
[2023-10-19 13:30:44] INFO - sg_trainer.py - Best checkpoint overriden: validation Accuracy: 0.1002604141831398
===========================================================
SUMMARY OF EPOCH 0
├── Train
│   ├── Crossentropyloss = 2.5337
│   ├── Accuracy = 0.0729
│   └── Top5 = 0.5078
└── Validation
    ├── Crossentropyloss = 2.3486
    ├── Accuracy = 0.1003
    └── Top5 = 0.4596

===========================================================

PR description

This PR addresses the issue above and proposes some improvements. Briefly:

  1. The logs show the actual number of dataloader elements used. Additionally, a warning tells the user that the max_batches parameter was set.
  2. The progress bar shows the actual number of steps it will take to finish the epoch (a minimal sketch of this logic follows the examples below).

E.g. if the max_train/valid_batches parameter is specified:

[2023-10-19 13:30:41] INFO - sg_trainer_utils.py - TRAINING PARAMETERS:
    - Mode:                         Single GPU
    - Number of GPUs:               1          (2 available on the machine)
    - Full dataset size:            50000      (len(train_set))
    - Batch size per GPU:           16         (batch_size)
    - Batch Accumulate:             1          (batch_accumulate)
    - Total batch size:             16         (num_gpus * batch_size)
    - Effective Batch size:         16         (num_gpus * batch_size * batch_accumulate)
    - Iterations per epoch:         24         (len(train_loader) OR max_train_batches)
    - Gradient updates per epoch:   24         (len(train_loader) OR max_train_batches / batch_accumulate)

[2023-10-19 13:30:41] WARNING - sg_trainer_utils.py - max_train_batch is set to 24. This limits the number of iterations per epoch and gradient updates per epoch.
[2023-10-19 13:30:41] INFO - sg_trainer.py - Started training for 300 epochs (0/299)

Train epoch 0: 100%|██████████| 24/24 [00:02<00:00,  9.22it/s, Accuracy=0.0729, CrossEntropyLoss=2.53, Top5=0.508, gpu_mem=0.231]
Validating: 100%|██████████| 24/24 [00:00<00:00, 126.12it/s]
[2023-10-19 13:30:44] INFO - base_sg_logger.py - Checkpoint saved in /home/phil/deci/super-gradients/checkpoints/Cifar10_external_objects_example/RUN_20231019_133041_486426/ckpt_best.pth
[2023-10-19 13:30:44] INFO - sg_trainer.py - Best checkpoint overriden: validation Accuracy: 0.1002604141831398

If not, the logs behave the same as in previous versions.
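
For reference, here is a minimal sketch of the intended behavior (run_epoch and step_fn are illustrative names, not the actual SG internals):

from tqdm import tqdm

def run_epoch(loader, step_fn, max_batches=None):
    # The progress-bar total (and the logged "Iterations per epoch") reflects
    # min(len(loader), max_batches) whenever max_batches is set.
    total = len(loader) if max_batches is None else min(len(loader), max_batches)
    for batch_idx, batch in enumerate(tqdm(loader, total=total)):
        # Stop early once max_batches batches have been processed.
        if max_batches is not None and batch_idx >= max_batches:
            break
        step_fn(batch)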

Some ideas

IMO it would be nice to consider logging the whole set of training parameters before the run. As a user, I'd like to double-check all the settings made elsewhere in the project (e.g. when using Hydra and wiring the SG Trainer class into my own pipeline) and be sure things go smoothly :)
For example, the user could see the following info (a rough sketch follows the list):

  • dataset (number of classes, class names, etc)
  • dataloader (batch_size, num_workers, etc)
  • model (model name, number of trainable parameters, etc)
  • optimization info (optimizer, lr, wd, losses, etc)
  • training info (number of epochs, number of gradient updates, EMA, SyncBN usage, etc)
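
A rough sketch of what such a pre-run summary could look like (log_run_summary is a hypothetical helper, and the dataset is assumed to expose a torchvision-style .classes attribute):

import logging

logger = logging.getLogger(__name__)

def log_run_summary(dataset, loader, model, train_params):
    # Hypothetical helper: dump the key settings once, before training starts.
    logger.info("Dataset:    %d samples, %d classes", len(dataset), len(dataset.classes))
    logger.info("Dataloader: batch_size=%s, num_workers=%s", loader.batch_size, loader.num_workers)
    n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    logger.info("Model:      %s, %d trainable parameters", type(model).__name__, n_trainable)
    logger.info("Optimizer:  %s, initial_lr=%s", train_params["optimizer"], train_params["initial_lr"])
    logger.info("Training:   %s epochs, EMA=%s", train_params["max_epochs"], train_params.get("ema", False))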

@BloodAxe (Collaborator) left a comment:

LGTM

@Louis-Dupont (Contributor) left a comment:

LGTM

@BloodAxe merged commit 749a9c7 into Deci-AI:master on Oct 23, 2023
7 checks passed
BloodAxe pushed a commit that referenced this pull request Oct 26, 2023
[Improvement] max_batches support to training log and tqdm progress bar. (#1554)

* Added max_batches support to training log and tqdm progress bar.

* Added changing string in accordance with which parameter is used (len(loader) or max_batches)

* Replaced stopping condition for the epoch with a smaller one

(cherry picked from commit 749a9c7)
BloodAxe added a commit that referenced this pull request Oct 26, 2023
* [Improvement] max_batches support to training log and tqdm progress bar. (#1554)

* Added max_batches support to training log and tqdm progress bar.

* Added changing string in accordance with which parameter is used (len(loader) or max_batches)

* Replaced stopping condition for the epoch with a smaller one

(cherry picked from commit 749a9c7)

* fix (#1558)

Co-authored-by: Eugene Khvedchenya <ekhvedchenya@gmail.com>
(cherry picked from commit 8a1d255)

* fix (#1564)

(cherry picked from commit 24798b0)

* Bugfix of model.export() to work correct with bs>1 (#1551)

(cherry picked from commit 0515496)

* Fixed incorrect automatic variable used (#1565)

$@ is the name of the target being generated, and $^ are the dependencies

Co-authored-by: Louis-Dupont <35190946+Louis-Dupont@users.noreply.github.com>
(cherry picked from commit 43f8bea)

* fix typo in class documentation (#1548)

Co-authored-by: Eugene Khvedchenya <ekhvedchenya@gmail.com>
Co-authored-by: Louis-Dupont <35190946+Louis-Dupont@users.noreply.github.com>
(cherry picked from commit ec21383)

* Feature/sg 1198 mixed precision automatically changed with warning (#1567)

* fix

* work with tmpdir

* minor change of comment

* improve device_config

(cherry picked from commit 34fda6c)

* Fixed issue with torch 1.12 where _scale_fn_ref is missing in CyclicLR (#1575)

(cherry picked from commit 23b4f7a)

* Fixed issue with torch 1.12 issue with arange not supporting fp16 for CPU device. (#1574)

(cherry picked from commit 1f15c76)

---------

Co-authored-by: hakuryuu96 <marchenkophilip@gmail.com>
Co-authored-by: Louis-Dupont <35190946+Louis-Dupont@users.noreply.github.com>
Co-authored-by: Alessandro Ros <aler9.dev@gmail.com>