Feature/sg 708 time units #1181

Merged: 69 commits from feature/SG-708-time-units into master on Jun 18, 2023

Conversation

BloodAxe
Collaborator

This PR adds the possibility to log scalar values with an explicit time unit associated with them.
It enables us, for instance, to log the loss value per step. The motivation for this feature is three-fold:

  1. It allows us to log losses per step (which can be quite useful for spotting instabilities and high loss spikes during training).
  2. It also allows us to use a TimerCallback that measures and logs batch time (including forward & backward passes).
  3. Some loggers (W&B in particular) have a technical limitation that forces you to provide monotonically increasing time steps when logging. E.g. you cannot log losses for batches 0, 1, 2, 3, 4, ..., 90 and then log the averaged loss for epoch 0. With this PR it is now possible (see the sketch below).
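
Below is a minimal sketch of what per-step and per-epoch logging with explicit time units could look like. The module path, the `EpochNumber`/`GlobalBatchStepNumber` names, and the `add_scalar` signature are assumptions made for illustration, not taken verbatim from this PR.

```python
# Hedged sketch: logging the same metric under two different time units so
# that per-step and per-epoch values do not conflict in loggers such as W&B.
# Import path and class names are assumptions, not verified against the PR.
from super_gradients.common.sg_loggers.time_units import EpochNumber, GlobalBatchStepNumber


def log_losses(sg_logger, per_step_losses, epoch_index, steps_per_epoch):
    # Per-step losses, indexed by a global batch step counter.
    for step_in_epoch, loss in enumerate(per_step_losses):
        global_step = epoch_index * steps_per_epoch + step_in_epoch
        sg_logger.add_scalar(
            tag="train/loss_per_step",
            scalar_value=loss,
            global_step=GlobalBatchStepNumber(global_step),
        )

    # Averaged loss, logged once per epoch under its own time unit.
    epoch_loss = sum(per_step_losses) / max(len(per_step_losses), 1)
    sg_logger.add_scalar(
        tag="train/loss_per_epoch",
        scalar_value=epoch_loss,
        global_step=EpochNumber(epoch_index),
    )
```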

The torch.compile PR also depends on this PR (we use the TimerCallback to measure the training speedup).
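
As a usage note, the TimerCallback would typically be attached through the training params. The import path and the `phase_callbacks` key below are assumptions made for illustration, not confirmed by this PR.

```python
# Hedged sketch: attaching the TimerCallback so per-batch time (forward +
# backward) gets logged against the global batch step. The import path and
# the "phase_callbacks" key are assumptions, not verified against this PR.
from super_gradients.training.utils.callbacks import TimerCallback

training_params = {
    "max_epochs": 10,
    "phase_callbacks": [TimerCallback()],
    # ... remaining training hyperparameters ...
}
```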

@BloodAxe BloodAxe requested a review from shaydeci as a code owner June 16, 2023 09:17

@shaydeci shaydeci left a comment

LGTM

@BloodAxe BloodAxe merged commit 42e3ecf into master Jun 18, 2023
1 check passed
@BloodAxe BloodAxe deleted the feature/SG-708-time-units branch June 18, 2023 20:32
LHBuilder pushed a commit to LHBuilder/YOLO-NAS that referenced this pull request on Jun 25, 2023:
* Added torch.compile support

* Timer

* Timer

* Targets for torch compile

* Disable DDP

* train_dataloader_params:
  drop_last: True

* Fix Timer callback to log events per global step

* load_backbone: False

* Detection models

* Lower LR

* Added notes

* Added per-epoch timers

* Fix wrong nesting of drop_last

* Fixes to logging

* Log values per step/epoch explicitly

* Fixes to logging

* Fixes to logging

* Increase num epochs

* Update numbers

* Added epoch_total_time_sec

* cityscapes_stdc_seg50

* load_backbone: False

* imagenet_regnetY

* imagenet_regnetY

* cityscapes_stdc_seg50 with different compilation modes

* cityscapes_stdc_seg50

* cityscapes_ddrnet

* Update makefile targets

* Ensure we log only on master

* Add sync point to ensure we've compiled model on all nodes before going further

* Reduce bs

* Adding makefile targets

* Yolo Nas configs

* Yolo Nas configs

* Add timer

* Add timer

* segmentation_compile_tests

* segmentation_compile_tests

* segmentation_compile_tests

* Call to torch.compile after we set up DDP

* cityscapes_ddrnet_test

* Omit to(device) after converting model to syncbn

* Change default torch_compile_mode to reduce-overhead

* Update makefile

* segmentation_compile_tests

* Update makefile

* Filling table

* Update makefile

* Filling table

* Filling table

* Update makefile

* Update makefile

* Update makefile

* Adding time units

* Yolo NAS numbers

* Yolo NAS numbers

* Add import of TimerCallback

* Fixed the potential crash if TimerCallback used for evaluate_from_recipe

* Fix missing inheritance for GlobalBatchStepNumber