
Added metrics logging to checkpoint and separate yaml file #1562

Merged
merged 24 commits into Deci-AI:master from feature/saving_metrics_to_yaml
Oct 31, 2023

Conversation

@hakuryuu96 (Contributor) commented Oct 23, 2023

Issue description

@BloodAxe noted that SG does not support saving metrics, the environment, and the train config to the checkpoint.

After getting more familiar with the codebase, I realized the task should be split into several parts / PRs.

PR description

This PR proposes a way to add metrics to a checkpoint.

  1. Metrics from the train and validation steps are saved in all .pth files.
  2. Additionally, the metrics are stored in several .yml files (best, latest, and per-epoch if specified in the config). The file structure is the following:
metrics:
  train:
    <metricClass1>: <float>
    ...
    <metricClassN>: <float>
  valid:
    <metricClass1>: <float>
    ...
    <metricClassN>: <float>
tracked_metric_name: <metricClassX>

Primarily, this is done for easier access to experiment metrics, so the user doesn't need to run TensorBoard or parse the checkpoint to read them.
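For illustration, here is a minimal sketch of reading such a file back with PyYAML. The filename metrics_best.yml is an assumption made for the sake of the example; the actual name depends on which checkpoint the file accompanies:

import yaml

# Load the metrics file written next to the checkpoint.
# "metrics_best.yml" is a hypothetical filename used for illustration.
with open("metrics_best.yml") as f:
    data = yaml.safe_load(f)

tracked = data["tracked_metric_name"]
print(f"tracked metric: {tracked}")
print(f"valid {tracked}: {data['metrics']['valid'][tracked]}")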

Some ideas and notes

  1. The metric setup could be done in a more flexible way.
    Suppose the user needs to train a custom model with several classification heads and wants to track different metrics for each. For now, all the metrics are gathered in a single MetricCollection, and every metric consumes the same model output. Moreover, if the user wants to track the accuracy of only some of the outputs, they can't: metrics are keyed in the results dictionary by their class name, so two instances of the same metric class overwrite each other and only one remains.

It would be great to be able to specify which model output should be used for each metric (and maybe for each loss). It could be configured along these lines (see also the sketch after this list):

metrics:
  __target__: CustomMetricDict
  metric_head1:
    __target__: MetricClass1
    param1: ...
    paramN: ...
    model_prediction_key: <str>
    target_key: <str>
  ...
  metric_headN: ...
  2. Mixing metrics and losses in a single dict (sg_trainer.py, the get_train_loop_description_dict method, whose result is returned from evaluate).
    When someone wants to access metric and loss values separately after calling evaluate(), they have to do some routine work to split losses and metrics out of the single joined dict. In this PR, saving metrics to the dict would have been much easier if the two were separated (for now, the metrics are selected from the dictionary by their names and form a new dictionary, see sg_trainer.py/_save_checkpoint/lines 671, 679). Also, treating losses and metrics as separate structures would make the code easier to understand, as they are logically different ;)
    It looks like changing this would affect a bunch of places in the project, so if the code owners agree this is a problem, it should become a separate PR :)
  3. isinstance(metric, torch.Tensor)
    There are some places in the _save_checkpoint function where the type of a metric value has to be checked (the values returned in validation/train_results_dict sometimes arrive as a torch.Tensor of dtype float and sometimes as a plain Python float). It would be much more convenient to expect a single type there; checks of this kind would then disappear and make the code look cleaner 🪄
    For now, this routine has been moved to a separate function (a sketch of what such a function might look like follows below).
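For illustration, here is a minimal Python sketch of the per-head metric routing proposed in item 1. CustomMetricDict, the head names, and the model_prediction_key/target_key parameters are all hypothetical and not part of the SG API:

import torch
from torchmetrics import Metric

class CustomMetricDict(torch.nn.Module):
    """Hypothetical wrapper that feeds each metric its own model output
    and target, instead of giving every metric the same output."""

    def __init__(self, metrics: dict[str, Metric], prediction_keys: dict[str, str], target_keys: dict[str, str]):
        super().__init__()
        self.metrics = torch.nn.ModuleDict(metrics)  # metric name -> Metric instance
        self.prediction_keys = prediction_keys       # metric name -> key in the model-output dict
        self.target_keys = target_keys               # metric name -> key in the target dict

    def update(self, outputs: dict, targets: dict) -> None:
        for name, metric in self.metrics.items():
            metric.update(outputs[self.prediction_keys[name]], targets[self.target_keys[name]])

    def compute(self) -> dict:
        return {name: metric.compute() for name, metric in self.metrics.items()}

With this shape, two Accuracy instances attached to different heads would live under distinct keys (metric_head1, metric_head2) instead of colliding on the class name.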

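And a minimal sketch of the value-normalization routine from item 3. The function name and exact behavior are assumptions for illustration, not the PR's actual implementation:

import torch

def metric_value_to_float(value) -> float:
    # validation/train results sometimes hold 0-dim float tensors and
    # sometimes plain Python floats; normalize to float so the
    # checkpoint-saving code does not need to branch on the type.
    if isinstance(value, torch.Tensor):
        return value.item()
    return float(value)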
@hakuryuu96 force-pushed the feature/saving_metrics_to_yaml branch from 6b3999c to fb8ae7d on October 23, 2023 11:01
@hakuryuu96 marked this pull request as ready for review on October 24, 2023 17:54
@hakuryuu96 (Contributor, Author) commented Oct 26, 2023

After a discussion with @BloodAxe, we decided to keep only the saving of metrics to a checkpoint in this pull request. Another pull request will follow with proposals for refactoring sg_loggers and for saving some trainer internals to yaml.

BloodAxe previously approved these changes Oct 28, 2023

@BloodAxe (Collaborator) left a comment:

LGTM

@BloodAxe (Collaborator) left a comment:

LGTM

@Louis-Dupont (Contributor) left a comment:

LGTM

@BloodAxe merged commit e91be8f into Deci-AI:master on Oct 31, 2023
7 checks passed