Fix checkpoint doc (#1445)
* wip

* remove
Louis-Dupont committed Sep 4, 2023
1 parent 4c3fea4 commit 838398d
Showing 4 changed files with 142 additions and 79 deletions.
33 changes: 20 additions & 13 deletions documentation/source/Checkpoints.md
@@ -90,22 +90,28 @@ trainer.train(model=model, training_params=train_params, train_loader=train_data
Then at the end of the training, our `ckpt_root_dir` contents will look similar to the following:

```
<ckpt_root_dir>
├── <experiment_name>
│   │
│   ├─── <run_dir>
│   │     ├─ ckpt_best.pth                  # Best performance during validation
│   │     ├─ ckpt_latest.pth                # End of the most recent epoch
│   │     ├─ average_model.pth              # Averaged over specified epochs
│   │     ├─ ckpt_epoch_*.pth               # Checkpoints from specific epochs (like epoch 10, 15, etc.)
│   │     ├─ events.out.tfevents.*          # TensorBoard run artifacts
│   │     └─ log_<timestamp>.txt            # Trainer logs of the specific run
│   │
│   └─── <other_run_dir>
│         └─ ...
└─── <other_experiment_name>
     ├─── <run_dir>
     │     └─ ...
     └─── <another_run_dir>
           └─ ...
```

Suppose we wish to load the weights from `ckpt_best.pth`. We can simply pass its path to the `checkpoint_path` argument in `models.get(...)`:
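
For instance, a minimal sketch (the model name, number of classes, and run directory follow the example above; adjust them to your own experiment):

```python
from super_gradients.training import models
from super_gradients.common.object_names import Models

# Build the architecture and load the checkpoint weights in one call.
model = models.get(
    model_name=Models.RESNET18,
    num_classes=10,
    checkpoint_path="my_checkpoints_folder/my_resnet18_training_experiment/RUN_20230802_131052_651906/ckpt_best.pth",
)
```
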
@@ -129,6 +135,7 @@ from super_gradients.training.utils.checkpoint_utils import load_checkpoint_to_model
```python
from super_gradients.training.utils.checkpoint_utils import load_checkpoint_to_model

model = models.get(model_name=Models.RESNET18, num_classes=10)
load_checkpoint_to_model(net=model, ckpt_local_path="/path/to/my_checkpoints_folder/my_resnet18_training_experiment/RUN_20230802_131052_651906/ckpt_best.pth")
```

### Extending the Functionality of PyTorch's `strict` Parameter in `load_state_dict()`

When not familiar with PyTorch's `strict` parameter in `load_state_dict()`, please see [PyTorch's docs on this matter](https://pytorch.org/tutorials/beginner/saving_loading_models.html#id4) first.
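
As a quick refresher, this is the vanilla PyTorch behavior that the extension builds on (a standalone sketch using a plain torchvision model and a plain state-dict file, not SuperGradients-specific):

```python
import torch
from torchvision.models import resnet18

model = resnet18(num_classes=10)
state_dict = torch.load("/path/to/plain_state_dict.pth", map_location="cpu")

# strict=True (the default) raises a RuntimeError on any missing or unexpected key;
# strict=False loads whatever matches and returns the mismatched keys instead.
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("missing:", missing, "unexpected:", unexpected)
```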
72 changes: 47 additions & 25 deletions documentation/source/Example_Classification.md
@@ -1,4 +1,4 @@
# Training a Classification Model and Transfer Learning

In this example we will use SuperGradients to train from scratch a ResNet18 model on the CIFAR10 image classification
dataset. We will also fine-tune the same model via transfer learning with weights pre-trained on the ImageNet dataset.
@@ -14,41 +14,63 @@ pip install super-gradients

## 1. Experiment setup

First, we will initialize the `Trainer`. It handles:
- Model training
- Evaluating test data
- Making predictions
- Saving and managing checkpoints

To initialize it, you need:

- **Experiment Name:** A unique identifier for your training experiment.
- **Checkpoint Root Directory (`ckpt_root_dir`):** The directory where checkpoints, logs, and tensorboards are saved. While optional, if unspecified, it assumes the presence of a 'checkpoints' directory in your project's root.

```python
from super_gradients import Trainer

experiment_name = "resnet18_cifar10_example"
CHECKPOINT_DIR = '/path/to/checkpoints/root/dir'

trainer = Trainer(experiment_name=experiment_name, ckpt_root_dir=CHECKPOINT_DIR)
```

### Understanding the Checkpoint Structure

Checkpoints are crucial for progressive training, debugging, and model deployment. SuperGradients organizes them in a structured manner. Here's what the directory hierarchy looks like under your specified `ckpt_root_dir`:

```
<ckpt_root_dir>
├── <experiment_name>
│   │
│   ├─── <run_dir>
│   │     ├─ ckpt_best.pth                  # Best performance during validation
│   │     ├─ ckpt_latest.pth                # End of the most recent epoch
│   │     ├─ average_model.pth              # Averaged over specified epochs
│   │     ├─ ckpt_epoch_*.pth               # Checkpoints from specific epochs (like epoch 10, 15, etc.)
│   │     ├─ events.out.tfevents.*          # TensorBoard run artifacts
│   │     └─ log_<timestamp>.txt            # Trainer logs of the specific run
│   │
│   └─── <other_run_dir>
│         └─ ...
└─── <other_experiment_name>
     ├─── <run_dir>
     │     └─ ...
     └─── <another_run_dir>
           └─ ...
```

In this structure:

- `ckpt_best.pth`: Saved whenever there's an improvement in the specified validation metric.
- `ckpt_latest.pth`: Updated at the end of every epoch.
- `average_model.pth`: Averaged checkpoint, created if `average_best_models` parameter is set to `True`.
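
As a rough sketch, these files map to a handful of `training_params` entries (the values here are illustrative, and required entries such as the loss and optimizer are omitted; they are covered later in this tutorial):

```python
train_params = {
    # ... loss, optimizer, learning rate, etc. omitted for brevity
    "max_epochs": 20,
    "metric_to_watch": "Accuracy",              # drives when ckpt_best.pth is refreshed
    "greater_metric_to_watch_is_better": True,
    "average_best_models": True,                # produces average_model.pth
    "save_ckpt_epoch_list": [10, 15],           # produces ckpt_epoch_10.pth / ckpt_epoch_15.pth
}
```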

> For more information, check out the [dedicated page](Checkpoints.md).

## 2. Dataset and dataloaders

83 changes: 58 additions & 25 deletions documentation/source/Example_Training-an-external-model.md
@@ -1,7 +1,9 @@
# Training an external model

In this example we will use SuperGradients to train a deep learning segmentation model to extract human portraits from
images, i.e., to remove the background from the image.

We will show how SuperGradients allows seamless integration of
an external model, dataset, loss function, and metric into the training pipeline.

## Quick installation
@@ -561,28 +563,18 @@ Our custom metric is now ready to use with our training pipeline.

## 5. Experiment configuration

We now have the implementation of all external components we wish to incorporate into our training
pipeline. Let's put it all together.
### Trainer
First, we will initialize the `Trainer`. It handles:
- Model training
- Evaluating test data
- Making predictions
- Saving and managing checkpoints


To initialize it, you need:

- **Experiment Name:** A unique identifier for your training experiment.
- **Checkpoint Root Directory (`ckpt_root_dir`):** The directory where checkpoints, logs, and tensorboards are saved. While optional, if unspecified, it assumes the presence of a 'checkpoints' directory in your project's root.

```python
from super_gradients import Trainer
@@ -593,6 +585,45 @@ CHECKPOINT_DIR = '/path/to/checkpoints/root/dir'
trainer = Trainer(experiment_name=experiment_name, ckpt_root_dir=CHECKPOINT_DIR)
```

### Understanding the Checkpoint Structure

Checkpoints are crucial for progressive training, debugging, and model deployment. SuperGradients organizes them in a structured manner. Here's what the directory hierarchy looks like under your specified `ckpt_root_dir`:

```
<ckpt_root_dir>
├── <experiment_name>
│   │
│   ├─── <run_dir>
│   │     ├─ ckpt_best.pth                  # Best performance during validation
│   │     ├─ ckpt_latest.pth                # End of the most recent epoch
│   │     ├─ average_model.pth              # Averaged over specified epochs
│   │     ├─ ckpt_epoch_*.pth               # Checkpoints from specific epochs (like epoch 10, 15, etc.)
│   │     ├─ events.out.tfevents.*          # TensorBoard run artifacts
│   │     └─ log_<timestamp>.txt            # Trainer logs of the specific run
│   │
│   └─── <other_run_dir>
│         └─ ...
└─── <other_experiment_name>
     ├─── <run_dir>
     │     └─ ...
     └─── <another_run_dir>
           └─ ...
```

In this structure:

- `ckpt_best.pth`: Saved whenever there's an improvement in the specified validation metric.
- `ckpt_latest.pth`: Updated at the end of every epoch.
- `average_model.pth`: Averaged checkpoint, created if `average_best_models` parameter is set to `True`.

> For more information, check out the [dedicated page](Checkpoints.md).

### Dataloaders

Next, we initialize the PyTorch dataloaders for our datasets:

```python
@@ -602,6 +633,8 @@ train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True, num_wo
val_dataloader = DataLoader(val_dataset, batch_size=16, shuffle=False, num_workers=2)
```

### Training Hyperparameters

And lastly, we need to define the training hyperparameters:

```python
# ... (training hyperparameters dictionary; collapsed in this diff, see the sketch below)
```

@@ -635,9 +668,9 @@ The above code shows the simplicity of integrating external, user-defined components into the training
pipeline. We simply plugged instantiations of our custom loss and metric into the hyperparameters dictionary,
and we are ready to go.
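
For orientation, a hedged sketch of what such a dictionary can look like. Here `PortraitLoss` is a hypothetical stand-in for the custom `torch.nn.Module` loss built earlier, the `torchmetrics` `JaccardIndex` stands in for the custom IoU metric, and the watched-metric name is an assumption; substitute the names your own components actually report:

```python
import torch.nn as nn
from torchmetrics import JaccardIndex


class PortraitLoss(nn.BCEWithLogitsLoss):
    """Hypothetical stand-in for the custom loss defined earlier in this example."""


train_params = {
    "max_epochs": 5,
    "initial_lr": 1e-3,
    "optimizer": "Adam",
    "loss": PortraitLoss(),                                # the custom loss instance goes straight into the dict
    "train_metrics_list": [JaccardIndex(task="binary")],   # the custom metric instances go here
    "valid_metrics_list": [JaccardIndex(task="binary")],
    "metric_to_watch": "JaccardIndex",                     # assumption: the name under which the metric is logged
    "greater_metric_to_watch_is_better": True,
}
```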

## 6. Training

### 6.A. Training the model

We are all set to start training our model. Simply plug in the model, training and validation dataloaders,
and training parameters into the trainer's `train()` function:
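
A minimal sketch of that call, reusing the objects defined above (the argument names follow the `trainer.train(...)` snippets shown in the other tutorials; treat it as illustrative):

```python
trainer.train(
    model=model,
    training_params=train_params,
    train_loader=train_dataloader,
    valid_loader=val_dataloader,
)
```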
@@ -696,7 +729,7 @@ SUMMARY OF EPOCH 5
At the end of each epoch, the different logs and checkpoints are saved in the path defined by `ckpt_root_dir` and
`experiment_name`. Let's see how we can use Tensorboard to track training process.

### 6.B. Tensorboard logs

To view the experiment's tensorboard logs, type the following command in the terminal from the
experiment's path:
@@ -719,7 +752,7 @@ We can also check the validation set's IoU metric's value:



## 7. Predictions with the trained model

Now that we have a trained model we can use it to make predictions on the test set. First, let's instantiate a test
dataset:
33 changes: 17 additions & 16 deletions documentation/source/logs.md
@@ -1,29 +1,31 @@
# Local Logging

SuperGradients automatically logs multiple files locally that can help you explore your experiments' results.
This includes one TensorBoard file and three `.txt` files.

### Directory Structure Overview:
- **ckpt_root_dir**: The root directory where all experiments are stored.
- **experiment_name**: The specific folder dedicated to your current experiment.
- **run_dir**: Unique identifier for each training run; contains all associated checkpoints and logs.

> For a deeper dive into checkpoints, visit our [detailed guide](Checkpoints.md).

## I. Tensorboard logging
To easily keep track of your experiments, SuperGradients saves your results in the `events.out.tfevents` format, which can be read by TensorBoard.

**What does it include?** The TensorBoard file includes all of your training and validation metrics, as well as other information such as the learning rate, system metrics (CPU, GPU, ...), and more.

**Where is it saved?** `<ckpt_root_dir>/<experiment_name>/<run_dir>/events.out.tfevents.<unique_id>`

**How to launch?** `tensorboard --logdir <ckpt_root_dir>/<experiment_name>/<run_dir>`

## II. Experiment logging
In case you cannot launch a tensorboard instance, you can still find a summary of your experiment saved in a readable .txt format.

**What does it include?** The experiment configuration and training/validation metrics.

**Where is it saved?** `<ckpt_root_dir>/<experiment_name>/<run_dir>/experiment_logs_<date>.txt`

## III. Console logging
For better debugging and understanding of past runs, SuperGradients gathers all the print statements and logs into a
@@ -33,7 +35,7 @@ local file, providing you the convenience to review console outputs of any exper

**Where is it saved?**
- Upon importing SuperGradients, console outputs and logs will be stored in `~/sg_logs/console.log`.
- When instantiating the `super_gradients.Trainer`, all console outputs and logs will be redirected to the experiment folder `<ckpt_root_dir>/<experiment_name>/<run_dir>/console_<date>.txt`.

**How to set log level?** You can filter the logs displayed on the console by setting the environment variable `CONSOLE_LOG_LEVEL=<LOG-LEVEL> # DEBUG/INFO/WARNING/ERROR`
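
For example, a minimal way to set it from Python (exporting the variable in your shell before running the script works just as well); it should be in place before SuperGradients is first imported, since the console logger is set up on import:

```python
import os

os.environ["CONSOLE_LOG_LEVEL"] = "DEBUG"  # set before the first super_gradients import

from super_gradients import Trainer  # noqa: E402  (imported after setting the env var on purpose)
```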

@@ -45,7 +47,7 @@ This means that it includes any log that was under the logging level (`logging.D

**What does it include?** Anything logged with a logger (`logger.log`, `logger.info`, ...), even the filtered logs.

**Where is it saved?** `<ckpt_root_dir>/<experiment_name>/<run_dir>/logs_<date>.txt`

**How to set log level?** You can filter the logs saved in the file by setting the environment variable `FILE_LOG_LEVEL=<LOG-LEVEL> # DEBUG/INFO/WARNING/ERROR`

@@ -54,9 +56,8 @@ This means that it includes any log that was under the logging level (`logging.D
Only when training using a Hydra recipe.

**What does it include?**
```
<ckpt_root_dir>/<experiment_name>/<run_dir>/
└─ .hydra
├─config.yaml # A single config file that regroups the config files used to run the experiment
├─hydra.yaml # Some Hydra metadata
```

@@ -65,8 +66,8 @@ Only when training using hydra recipe.


## SUMMARY
```
<ckpt_root_dir>/<experiment_name>/<run_dir>/
├─ ... (all the model checkpoints)
├─ events.out.tfevents.<unique_id> # Tensorboard artifact
├─ experiment_logs_<date>.txt # Config and metrics related to experiment
└─ ...
```
