Skip to content

Commit

Permalink
Feature/sg 1053 add explanations on experiment managing (#1559)
Browse files Browse the repository at this point in the history
* add docs

* add mkdocs

---------

Co-authored-by: Eugene Khvedchenya <ekhvedchenya@gmail.com>
  • Loading branch information
Louis-Dupont and BloodAxe committed Oct 26, 2023
1 parent e5d5ecd commit 4d20655
Show file tree
Hide file tree
Showing 3 changed files with 80 additions and 0 deletions.
3 changes: 3 additions & 0 deletions documentation/source/Checkpoints.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,8 @@
# Model Checkpoints

*If you are not familiar on how experiments are managed, you can check [this tutorial](experiment_management.md)*


The first question that arises is: what is a checkpoint?

From the [Pytorch Lightning](https://pytorch-lightning.readthedocs.io/en/stable/) documentation:
Expand Down
76 changes: 76 additions & 0 deletions documentation/source/experiment_management.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,76 @@
# Experiment Management

## Outline
1. [Core Concepts](#core-concepts)
- [Checkpoint Root Directory](#checkpoint-root-directory-ckpt_root_dir)
- [Experiments](#experiments-experiment_name)
- [Runs](#runs-run_id)
2. [File Structure of Experiments](#file-structure-of-experiments)
3. [Utilities for Experiment Management](#utilities)
- [Get the Absolute Path of a Run Directory](#a-get-the-absolute-path-of-a-run-directory)
- [Retrieve the Latest Run ID](#b-get-the-latest-run-id)

## Core Concepts

### Checkpoint Root Directory (`ckpt_root_dir`)
- The main directory where all experiment outputs are housed.

### Experiments (`experiment_name`)
- Symbolizes a distinct training recipe or configuration.
- Alter the `experiment_name` for transparency when updating your training recipe.
- Each training under the same `experiment_name` has its individual `run` directory, ensuring no overwrites.

### Runs (`run_id`)
- Every individual training session is termed as a `run`.
- A unique `run_id` is generated for every training, regardless of identical parameters.
- Different trainings under the same `experiment_name` maintain distinct logs and checkpoints, courtesy of their separate run directories.

## File Structure of Experiments

```
<ckpt_root_dir>
├── <experiment_name>
│ │
│ ├─── <run_dir>
│ │ ├─ ckpt_best.pth # Best performance during validation
│ │ ├─ ckpt_latest.pth # End of the most recent epoch
│ │ ├─ average_model.pth # Averaged over specified epochs
│ │ ├─ ckpt_epoch_*.pth # Checkpoints from certain epochs (e.g., epoch 10, 15)
│ │ ├─ events.out.tfevents.* # Tensorflow run artifacts
│ │ └─ log_<timestamp>.txt # Trainer logs of that particular run
│ │
│ └─── <other_run_dir>
│ └─ ...
└─── <other_experiment_name>
├─── <run_dir>
│ └─ ...
└─── <another_run_dir>
└─ ...
```

## Utilities

#### A. Get the absolute path of a run directory
Manually navigate using `<ckpt_root_dir>/<experiment_name>/<run_dir>` or utilize the following programmatic approach:
```python
from super_gradients.common.environment.checkpoints_dir_utils import get_checkpoints_dir_path

checkpoints_dir_path = get_checkpoints_dir_path(experiment_name="<experiment_name>", run_id="<run_id>")
```

#### B. Get the latest run id

```python
from super_gradients.common.environment.checkpoints_dir_utils import get_latest_run_id

run_id = get_latest_run_id(experiment_name="<experiment_name>")
```
Combine with the above utility to fetch the path of the latest run directory.

**Next Steps**:
- Dive into the [checkpoints tutorial](Checkpoints.md) to grasp the essence of checkpoints, enabling you to resume trainings or access checkpoints from prior runs.
- The [logs tutorial](logs.md) focuses on the log files stored in your run directories, offering insights into the training progression.
1 change: 1 addition & 0 deletions mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,7 @@ nav:
- Training: ./documentation/source/Recipes_Training.md
- Factories: ./documentation/source/Recipes_Factories.md
- Custom Recipes: ./documentation/source/Recipes_Custom.md
- Experiment Management: ./documentation/source/experiment_management.md
- Checkpoints: ./documentation/source/Checkpoints.md
- Docker: ./documentation/source/SGDocker.md
- Output Adapter: ./documentation/source/DetectionOutputAdapter.md
Expand Down

0 comments on commit 4d20655

Please sign in to comment.