From 08e5af76aaba590ed79eb50b760aa29f2f958896 Mon Sep 17 00:00:00 2001
From: Louis Dupont
Date: Sun, 22 Oct 2023 14:12:47 +0200
Subject: [PATCH 1/2] add docs

---
 documentation/source/Checkpoints.md           |  3 +
 documentation/source/experiment_management.md | 76 +++++++++++++++++++
 2 files changed, 79 insertions(+)
 create mode 100644 documentation/source/experiment_management.md

diff --git a/documentation/source/Checkpoints.md b/documentation/source/Checkpoints.md
index 4d1eb149e8..89a8153c4b 100644
--- a/documentation/source/Checkpoints.md
+++ b/documentation/source/Checkpoints.md
@@ -1,5 +1,8 @@
 # Model Checkpoints
 
+*If you are not familiar with how experiments are managed, you can check [this tutorial](experiment_management.md)*
+
+
 The first question that arises is: what is a checkpoint?
 
 From the [Pytorch Lightning](https://pytorch-lightning.readthedocs.io/en/stable/) documentation:
diff --git a/documentation/source/experiment_management.md b/documentation/source/experiment_management.md
new file mode 100644
index 0000000000..5aff2c4900
--- /dev/null
+++ b/documentation/source/experiment_management.md
@@ -0,0 +1,76 @@
+# Experiment Management
+
+## Outline
+1. [Core Concepts](#core-concepts)
+    - [Checkpoint Root Directory](#checkpoint-root-directory-ckpt_root_dir)
+    - [Experiments](#experiments-experiment_name)
+    - [Runs](#runs-run_id)
+2. [File Structure of Experiments](#file-structure-of-experiments)
+3. [Utilities for Experiment Management](#utilities)
+    - [Get the Absolute Path of a Run Directory](#a-get-the-absolute-path-of-a-run-directory)
+    - [Retrieve the Latest Run ID](#b-get-the-latest-run-id)
+
+## Core Concepts
+
+### Checkpoint Root Directory (`ckpt_root_dir`)
+- The main directory where all experiment outputs are stored.
+
+### Experiments (`experiment_name`)
+- Represents a distinct training recipe or configuration.
+- Change the `experiment_name` whenever you update your training recipe, so that results from different recipes stay clearly separated.
+- Each training under the same `experiment_name` gets its own `run` directory, so nothing is overwritten.
+
+### Runs (`run_id`)
+- Every individual training session is called a `run`.
+- A unique `run_id` is generated for every training, even when the parameters are identical.
+- Different trainings under the same `experiment_name` keep distinct logs and checkpoints, because each one has its own run directory.
+
+## File Structure of Experiments
+
+```
+
+│
+├──
+│   │
+│   ├───
+│   │     ├─ ckpt_best.pth              # Best performance during validation
+│   │     ├─ ckpt_latest.pth            # End of the most recent epoch
+│   │     ├─ average_model.pth          # Averaged over specified epochs
+│   │     ├─ ckpt_epoch_*.pth           # Checkpoints from certain epochs (e.g., epoch 10, 15)
+│   │     ├─ events.out.tfevents.*      # TensorBoard event files
+│   │     └─ log_.txt                   # Trainer logs of that particular run
+│   │
+│   └───
+│         └─ ...
+│
+└───
+    │
+    ├───
+    │     └─ ...
+    │
+    └───
+          └─ ...
+```
+
+## Utilities
+
+#### A. Get the absolute path of a run directory
+Navigate manually using `//`, or use the following programmatic approach:
+```python
+from super_gradients.common.environment.checkpoints_dir_utils import get_checkpoints_dir_path
+
+checkpoints_dir_path = get_checkpoints_dir_path(experiment_name="", run_id="")
+```
+
+#### B. Get the latest run id
+
+```python
+from super_gradients.common.environment.checkpoints_dir_utils import get_latest_run_id
+
+run_id = get_latest_run_id(experiment_name="")
+```
+Combine this with the utility above to get the path of the latest run directory.
+
+**Next Steps**:
+- Read the [checkpoints tutorial](Checkpoints.md) to learn what checkpoints contain and how to resume a training or load checkpoints from previous runs.
+- The [logs tutorial](logs.md) covers the log files stored in your run directories, which show how a training progressed.
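
Review note on the new page: the three-level layout it documents (checkpoint root, then experiment, then run) can be illustrated with a small standard-library sketch. This is not the SuperGradients implementation. The helper names `run_dir` and `latest_run_id`, the experiment name `my_experiment`, and the rule that "latest" means the most recently modified run directory are all assumptions made for illustration only.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def run_dir(ckpt_root_dir: str, experiment_name: str, run_id: str) -> Path:
    """Hypothetical helper: join the three levels described in the docs."""
    return Path(ckpt_root_dir) / experiment_name / run_id


def latest_run_id(ckpt_root_dir: str, experiment_name: str) -> Optional[str]:
    """Hypothetical helper: name of the most recently modified run directory.

    Assumption: "latest" means newest modification time. The library may
    apply a different rule.
    """
    experiment_dir = Path(ckpt_root_dir) / experiment_name
    if not experiment_dir.is_dir():
        return None
    runs = [p for p in experiment_dir.iterdir() if p.is_dir()]
    if not runs:
        return None
    return max(runs, key=lambda p: p.stat().st_mtime).name


def demo() -> Optional[str]:
    """Create two run directories with fixed timestamps and pick the newest."""
    with tempfile.TemporaryDirectory() as root:
        for offset, name in enumerate(["run_a", "run_b"]):
            directory = run_dir(root, "my_experiment", name)
            directory.mkdir(parents=True)
            # Set explicit timestamps so the ordering is deterministic.
            os.utime(directory, (1000 + offset, 1000 + offset))
        return latest_run_id(root, "my_experiment")
```

In the library itself, the equivalent step would presumably be to pass the result of `get_latest_run_id` as the `run_id` argument of `get_checkpoints_dir_path`, as the page's closing sentence of the Utilities section suggests.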
From e55b60a23d41ee64925ac137932cdb194579331c Mon Sep 17 00:00:00 2001
From: Louis Dupont
Date: Sun, 22 Oct 2023 14:13:49 +0200
Subject: [PATCH 2/2] add mkdocs

---
 mkdocs.yml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mkdocs.yml b/mkdocs.yml
index 01d4c98894..5333c9cf5a 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -30,6 +30,7 @@ nav:
  - Training: ./documentation/source/Recipes_Training.md
  - Factories: ./documentation/source/Recipes_Factories.md
  - Custom Recipes: ./documentation/source/Recipes_Custom.md
+ - Experiment Management: ./documentation/source/experiment_management.md
  - Checkpoints: ./documentation/source/Checkpoints.md
  - Docker: ./documentation/source/SGDocker.md
  - Output Adapter: ./documentation/source/DetectionOutputAdapter.md