Feature/sg 708 time units #1181

Merged: 69 commits from feature/SG-708-time-units into master on Jun 18, 2023

Conversation

BloodAxe
Collaborator

This PR adds the possibility to log scalar values with an explicit time unit associated with them.
It enables us, for instance, to log the loss value per step. The motivation for this feature is three-fold:

  1. It allows us to log losses per step (which can be quite useful for spotting instabilities and high loss spikes during training).
  2. It also allows us to use a TimerCallback that measures and logs batch time (including forward & backward passes).
  3. Some loggers (W&B in particular) have a technical limitation that forces you to provide monotonically increasing time steps when logging. E.g. you cannot log losses for batches 0, 1, 2, 3, 4, ..., 90 and then log the averaged loss for epoch 0. With this PR it is now possible (see the sketch below).
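
Below is a minimal sketch of what per-step and per-epoch logging with explicit time units could look like. The module path, the `EpochNumber`/`GlobalBatchStepNumber` names, and the `add_scalar` signature are assumptions made for illustration, not taken verbatim from this PR.

```python
# Hedged sketch: logging the same metric under two different time units so
# that per-step and per-epoch values do not conflict in loggers such as W&B.
# Import path and class names are assumptions, not verified against the PR.
from super_gradients.common.sg_loggers.time_units import EpochNumber, GlobalBatchStepNumber


def log_losses(sg_logger, per_step_losses, epoch_index, steps_per_epoch):
    # Per-step losses, indexed by a global batch step counter.
    for step_in_epoch, loss in enumerate(per_step_losses):
        global_step = epoch_index * steps_per_epoch + step_in_epoch
        sg_logger.add_scalar(
            tag="train/loss_per_step",
            scalar_value=loss,
            global_step=GlobalBatchStepNumber(global_step),
        )

    # Averaged loss, logged once per epoch under its own time unit.
    epoch_loss = sum(per_step_losses) / max(len(per_step_losses), 1)
    sg_logger.add_scalar(
        tag="train/loss_per_epoch",
        scalar_value=epoch_loss,
        global_step=EpochNumber(epoch_index),
    )
```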

The torch.compile PR also depends on this PR (we use the TimerCallback to measure the training speedup).
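
As a usage note, the TimerCallback would typically be attached through the training params. The import path and the `phase_callbacks` key below are assumptions made for illustration, not confirmed by this PR.

```python
# Hedged sketch: attaching the TimerCallback so per-batch time (forward +
# backward) gets logged against the global batch step. The import path and
# the "phase_callbacks" key are assumptions, not verified against this PR.
from super_gradients.training.utils.callbacks import TimerCallback

training_params = {
    "max_epochs": 10,
    "phase_callbacks": [TimerCallback()],
    # ... remaining training hyperparameters ...
}
```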

@BloodAxe BloodAxe requested a review from shaydeci as a code owner June 16, 2023 09:17

@shaydeci shaydeci left a comment

LGTM

@BloodAxe BloodAxe merged commit 42e3ecf into master Jun 18, 2023
1 check passed
@BloodAxe BloodAxe deleted the feature/SG-708-time-units branch June 18, 2023 20:32
LHBuilder pushed a commit to LHBuilder/YOLO-NAS that referenced this pull request on Jun 25, 2023:
* Added torch.compile support

* Timer

* Timer

* Targets for torch compile

* Disable DDP

* train_dataloader_params:
  drop_last: True

* Fix Timer callback to log events per global step

* load_backbone: False

* Detection models

* Lower LR

* Added notes

* Added per-epoch timers

* Fix wrong nesting of drop_last

* Fixes to logging

* Log values per step/epoch explicitly

* Fixes to logging

* Fixes to logging

* Increase num epochs

* Update numbers

* Added epoch_total_time_sec

* cityscapes_stdc_seg50

* load_backbone: False

* imagenet_regnetY

* imagenet_regnetY

* cityscapes_stdc_seg50 with different compilation modes

* cityscapes_stdc_seg50

* cityscapes_ddrnet

* Update makefile targets

* Ensure we log only on master

* Add sync point to ensure we've compiled model on all nodes before going further

* Reduce bs

* Adding makefile targets

* Yolo Nas configs

* Yolo Nas configs

* Add timer

* Add timer

* segmentation_compile_tests

* segmentation_compile_tests

* segmentation_compile_tests

* Call to torch.compile after we set up DDP

* cityscapes_ddrnet_test

* Omit to(device) after converting model to syncbn

* Change default torch_compile_mode to reduce-overhead

* Update makefile

* segmentation_compile_tests

* Update makefile

* Filling table

* Update makefile

* Filling table

* Filling table

* Update makefile

* Update makefile

* Update makefile

* Adding time units

* Yolo NAS numbers

* Yolo NAS numbers

* Add import of TimerCallback

* Fixed the potential crash if TimerCallback used for evaluate_from_recipe

* Fix missing inheritance for GlobalBatchStepNumber