Feature/sg 1027 checkpoint directory refacto (#1401)
* wip-S

* wip

* fix bug where resume would crash if latest run doesn't include latest_ckpt

* remove unwanted change + copy hydra

* minor changes

* fix test

* add tests

* fix test

* New Export API (#1318)

* Designing export API

* Export WIP

* ONNX NMS

* Export WIP

* Refactor test and move benchmark API to function

* Export WIP

* Make the top_k a constant and not variable since TRT export does not work with dynamic top_k

* Refactor test and move benchmark API to function

* Added option to change the output format

* Refactor test and move benchmark API to function

* Added option to change the output format

* Refactor test and move benchmark API to function

* Fixing export to make it TRT friendly

* Fixing export to make it TRT friendly

* Fixing export to make it TRT friendly

* Fixing export to make it TRT friendly

* Remove unused classes

* Remove unused classes

* Remove unused classes

* Remove unused classes

* Fixing export to FP16

* Fixing export to FP16

* Improve output of the benchmark result

* Improve device handling when exporting NMS

* Improve device handling when exporting NMS

* Fix nms format conversion modules export

* Revert unit test

* Improve model device handling

* Adding docs

* Adding docs

* Adding docs

* Adding docs

* Address TODO's after code review

* Added check whether model is already quantized

* Install pytorch quantization package

* Added printing of user-friendly description on how to use the exported model

* Update docs

* Update docs

* Uninstall SG

* Added onnx_graphsurgeon

* Added onnx_graphsurgeon

* Put extra index url at the top

* Put extra index url before the package that requires it

* Fix --index-url to --extra-index-url

* get_requirements to handle --extra-index-url correctly

* Made method draw_box_title public

* Fix tests

* Fix missing HasPredict for BaseClassifier model

* Make quantization parameters overridable

* Feature/sg 000 fix predict in pose estimation (#1358)

* Update readme

* Fix small bug in __repr__ implementation of KeypointsImageToTensor

* Test

* Test

* Test

* Test

* Test

* Test

* Make graphsurgeon an optional

* Make graphsurgeon an optional

* Properly handle imports of optional packages

* Added empty __init__.py files

* Do imports of gs inside the export call

* Do imports of gs inside the export call

* Fix DEKR's missing HasPredict interface

* Update notebook & example doc to reflect changes in imports & function names

* Update readme

* Put back images

* add model export (#1362)

* fix (#1367)

* fix

* add spacing

* Feature/sg 000 propagate imagenet dataset params (#1368)

* Propagate default dataset processing params for other classification models

* Fix bug in predict pipeline (Softmax was computed along batch dimension AFTER taking max along classes dimension)

* Added more classification models to test

* Doc changes (#1253)

* num classes specified was wrong

* wrong num_classes specified

---------

Co-authored-by: Ofri Masad <ofrimasad@users.noreply.github.com>
Co-authored-by: Eugene Khvedchenya <ekhvedchenya@gmail.com>

* Summarize models, losses & metrics for segmentation (#1354)

* Summarize models, losses & metrics

* Added troubleshooting section

* Feature/sg 000 fix import of onnx graphsurgeon (#1359)

* Update readme

* Fix small bug in __repr__ implementation of KeypointsImageToTensor

* Test

* Test

* Test

* Test

* Test

* Test

* Make graphsurgeon an optional

* Make graphsurgeon an optional

* Properly handle imports of optional packages

* Added empty __init__.py files

* Do imports of gs inside the export call

* Do imports of gs inside the export call

* Fix DEKR's missing HasPredict interface

* Update notebook & example doc to reflect changes in imports & function names

* Update readme

* Put back images

* Install onnx_graphsurgeon in CI

* Install onnx_graphsurgeon in CI

* Fix version of ONNX-GS installed in CI and installed on-demand

* Fix arange_cpu not implemented for Half

* Fix arange_cpu not implemented for Half

* Fix graph merging for old pytorch (1.12) that crashed because of nodes with duplicate names

* Feature/sg 1047 predict od with labels (#1365)

* cleanup start

* added docs

* added tests

* added tests + fix yolox

* fixed ppyoloe

* fixed ppyoloe

* small ppyoloe prep model for conversion fix

* small ppyoloe prep model for conversion fix

* fixed image_i_object_count ref docs

* aligned box thickness

* renamed vars in example

* changed statement and added len verification

* fixed predictions docs

* fixed pipelines docs

* removed gt text from plots

* removed gt text from plots

* refactored predict with labels to use show/save

* Feature/sg 1033 fix yolox anchors (#1369)

* Update readme

* Fix small bug in __repr__ implementation of KeypointsImageToTensor

* Test

* Test

* Test

* Test

* Test

* Test

* Make graphsurgeon an optional

* Make graphsurgeon an optional

* Properly handle imports of optional packages

* Added empty __init__.py files

* Do imports of gs inside the export call

* Do imports of gs inside the export call

* Fix DEKR's missing HasPredict interface

* Update notebook & example doc to reflect changes in imports & function names

* Update readme

* Put back images

* Install onnx_graphsurgeon in CI

* Install onnx_graphsurgeon in CI

* Working prototype of YoloX fix of Anchors that can load model weights as well

* Added more tests for detection predict() and yolox checkpoint loading

* Fix version of ONNX-GS installed in CI and installed on-demand

* Added docs

* Added docs

* Added docs

* Remove leftover

* Set ignore_errors=True to trainer test and declare why

* Fix bug in maybe_remove_module_prefix

* version bumped (#1374)

Co-authored-by: Eugene Khvedchenya <ekhvedchenya@gmail.com>

* Adding pose estimation to the readme (#1375)

* Update readme

* Fix small bug in __repr__ implementation of KeypointsImageToTensor

* Update pose estimation image

* Fix link

* Remove broken link that we can't recover where it was pointing to (#1376)

* test on first run

* improve doc with new checkpoint run logic

* add doc for Trainer

* update doc

* test sweeper

* test sweeper

* test sweeper

* test sweeper

* test sweeper

* test sweeper

* test direct call

* fix indentation

* fix indentation

* move test to integration test

* fix

* add makefile test

* update

* remove test from base ci test

---------

Co-authored-by: Eugene Khvedchenya <ekhvedchenya@gmail.com>
Co-authored-by: Pranoy Radhakrishnan <pranoyalkr@gmail.com>
Co-authored-by: Ofri Masad <ofrimasad@users.noreply.github.com>
Co-authored-by: Shay Aharon <80472096+shaydeci@users.noreply.github.com>
5 people committed Aug 22, 2023
1 parent 052af61 commit 0b4710a
Showing 13 changed files with 471 additions and 118 deletions.
46 changes: 45 additions & 1 deletion .circleci/config.yml
@@ -205,7 +205,6 @@ jobs:
. venv/bin/activate
python3 -m pip install pytorch-quantization==2.1.2 --extra-index-url https://pypi.ngc.nvidia.com
python3 -m pip install onnx_graphsurgeon==0.3.27 --extra-index-url https://pypi.ngc.nvidia.com
- run:
name: run tests with coverage
no_output_timeout: 30m
@@ -214,6 +213,7 @@
coverage run --source=super_gradients -m unittest tests/deci_core_unit_test_suite_runner.py
coverage report
coverage html # open htmlcov/index.html in a browser
- store_artifacts:
path: htmlcov

@@ -448,6 +448,42 @@ jobs:
tag: $CIRCLE_TAG
notes: "This GitHub Release was done automatically by CircleCI"

hydra_sweeper_test:
docker:
- image: 307629990626.dkr.ecr.us-east-1.amazonaws.com/deci/infra/circleci/runner/sg-gpu:<< pipeline.parameters.sg_docker_version >>
resource_class: deci-ai/sg-gpu-on-premise
parameters:
sg_existing_env_path:
type: string
default: "/env/persistent_env"
sg_new_env_name:
type: string
default: "${CIRCLE_BUILD_NUM}"
sg_new_env_python_version:
type: string
default: "python3.8"
steps:
- checkout
- run:
name: install requirements and run recipe tests
command: |
<< parameters.sg_new_env_python_version >> -m venv << parameters.sg_new_env_name >>
source << parameters.sg_new_env_name >>/bin/activate
python3.8 -m pip install --upgrade setuptools pip wheel
python3.8 -m pip install -r requirements.txt
python3.8 -m pip install .
python3.8 -m pip install torch torchvision torchaudio
make sweeper_test
- run:
name: Remove new environment when failed
command: "rm -r << parameters.sg_new_env_name >>"
when: on_fail

- slack/notify:
channel: "sg-integration-tests"
event: fail
template: basic_fail_1 # see https://github.com/CircleCI-Public/slack-orb/wiki#templates.

recipe_accuracy_tests:
docker:
@@ -758,6 +794,9 @@ workflows:
sanity_tests:
when: << pipeline.parameters.run_sanity_tests_flow >>
jobs:
- hydra_sweeper_test:
context:
- slack
- recipe_sanity_tests_classification_pt1:
context:
- slack
@@ -798,6 +837,10 @@ workflows:
- deci-common/persist_version_info
- login_to_codeartifact_release
<<: *release_tag_filter
- hydra_sweeper_test:
context:
- slack
<<: *release_tag_filter
- recipe_sanity_tests_classification_pt1:
context:
- slack
@@ -826,6 +869,7 @@
py_version: "3.8"
requires:
- "build3.8"
- hydra_sweeper_test
- recipe_accuracy_tests
- recipe_sanity_tests_classification_pt1
- recipe_sanity_tests_classification_pt2
15 changes: 15 additions & 0 deletions Makefile
@@ -16,5 +16,20 @@ recipe_accuracy_tests:
coverage run --source=super_gradients -m unittest tests/deci_core_recipe_test_suite_runner.py


sweeper_test:
python -m super_gradients.train_from_recipe -m --config-name=cifar10_resnet \
ckpt_root_dir=$$PWD \
experiment_name=sweep_cifar10 \
training_hyperparams.max_epochs=1 \
training_hyperparams.initial_lr=0.001,0.01

# Make sure that experiment_dir includes $$expected_num_dir subdirectories. If not, fail
subdir_count=$$(find "$$PWD/sweep_cifar10" -mindepth 1 -maxdepth 1 -type d | wc -l); \
if [ "$$subdir_count" -ne 2 ]; then \
echo "Error: $$PWD/sweep_cifar10 should include 2 subdirectories but includes $$subdir_count."; \
exit 1; \
fi


examples_to_docs:
jupyter nbconvert --to markdown --output-dir="documentation/source/" --execute src/super_gradients/examples/model_export/models_export.ipynb
96 changes: 72 additions & 24 deletions documentation/source/Checkpoints.md
@@ -23,9 +23,13 @@ That's why in SG, multiple checkpoints are saved throughout training:

#### Where are the checkpoint files saved?

The checkpoint files will be saved at <PATH_TO_CKPT_ROOT_DIR>/experiment_name/.
The checkpoint files will be saved at `<ckpt_root_dir>/<experiment_name>/<run_dir>`.

The user controls the checkpoint root directory, which can be passed to the `Trainer` constructor through the `ckpt_root_dir` argument.
- `ckpt_root_dir` and `experiment_name` can be set by the user when instantiating the `Trainer`.
```python
Trainer(ckpt_root_dir='path/to/ckpt_root_dir', experiment_name="my_experiment")
```
- `run_dir` is unique and automatically generated each time you start a new training with `trainer.train(...)`.

When working with a cloned version of SG, one can leave out the `ckpt_root_dir` arg, and checkpoints will be saved under the `super_gradients/checkpoints` directory.

@@ -87,14 +91,19 @@ Then at the end of the training, our `ckpt_root_dir` contents will look similar

```
my_checkpoints_folder
|─── my_resnet18_training_experiment
│ ckpt_best.pth # Model checkpoint on best epoch
│ ckpt_latest.pth # Model checkpoint on last epoch
│ average_model.pth # Model checkpoint averaged over epochs
| ckpt_epoch_10.pth # Model checkpoint of epoch 10
| ckpt_epoch_15.pth # Model checkpoint of epoch 15
│ events.out.tfevents.1659878383... # Tensorflow artifacts of a specific run
│ log_Aug07_11_52_48.txt # Trainer logs of a specific run
├─── my_resnet18_training_experiment
│ ├── RUN_20230802_131052_651906
│ │ ├─ ckpt_best.pth # Model checkpoint on best epoch
│ │ ├─ ckpt_latest.pth # Model checkpoint on last epoch
│ │ ├─ average_model.pth # Model checkpoint averaged over epochs
│ │ ├─ ckpt_epoch_10.pth # Model checkpoint of epoch 10
│ │ ├─ ckpt_epoch_15.pth # Model checkpoint of epoch 15
│ │ ├─ events.out.tfevents.1659878383... # Tensorflow artifacts of a specific run
│ │ └─ log_Aug02_13_10_52.txt # Trainer logs of a specific run
│ │
│ └─ RUN_20230803_121652_243212
│ └─ ...
└─── some_other_training_experiment_name
...
```
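As a side note, the `RUN_<timestamp>` naming makes it easy to locate a particular run programmatically. A minimal sketch using only the Python standard library, with the hypothetical folder names from the tree above:

```python
from pathlib import Path

# Paths taken from the example tree above (adjust to your own ckpt_root_dir / experiment_name).
experiment_dir = Path("my_checkpoints_folder") / "my_resnet18_training_experiment"

# RUN_YYYYMMDD_HHMMSS_* names sort lexicographically by timestamp,
# so the lexicographic maximum is the most recent run.
run_dirs = [d for d in experiment_dir.iterdir() if d.is_dir() and d.name.startswith("RUN_")]
latest_run_dir = max(run_dirs, key=lambda d: d.name)  # raises ValueError if no run exists yet

print(latest_run_dir / "ckpt_best.pth")
```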
@@ -105,7 +114,7 @@ Suppose we wish to load the weights from `ckpt_best.pth`. We can simply pass its path to
from super_gradients.training import models
from super_gradients.common.object_names import Models

model = models.get(model_name=Models.RESNET18, num_classes=10, checkpoint_path="/path/to/my_checkpoints_folder/my_resnet18_training_experiment/ckpt_best.pth")
model = models.get(model_name=Models.RESNET18, num_classes=10, checkpoint_path="/path/to/my_checkpoints_folder/my_resnet18_training_experiment/RUN_20230802_131052_651906/ckpt_best.pth")
```

> Important: when loading SG-trained checkpoints using models.get(...), if the network was trained with EMA, the EMA weights will be the ones loaded.
Expand All @@ -118,7 +127,7 @@ from super_gradients.common.object_names import Models
from super_gradients.training.utils.checkpoint_utils import load_checkpoint_to_model

model = models.get(model_name=Models.RESNET18, num_classes=10)
load_checkpoint_to_model(net=model, ckpt_local_path="/path/to/my_checkpoints_folder/my_resnet18_training_experiment/ckpt_best.pth")
load_checkpoint_to_model(net=model, ckpt_local_path="/path/to/my_checkpoints_folder/my_resnet18_training_experiment/RUN_20230802_131052_651906/ckpt_best.pth")
```
### Extending the Functionality of PyTorch's `strict` Parameter in `load_state_dict()`

@@ -237,32 +246,71 @@ def train_from_config(cls, cfg: Union[DictConfig, dict]) -> Tuple[nn.Module, Tup

## Resuming Training

In SG, we separate the logic of resuming training from loading model weights. Therefore, continuing training is controlled by two arguments, passed through `training_params`: `resume` and `resume_path`:
Resuming training in SG is controlled by three primary parameters that let you either continue a previous run
or branch off from a specific training checkpoint.
These parameters are passed through `training_params`: `resume`, `run_id`, and `resume_path`.

```yaml
...
resume: False # whether to continue training from ckpt with the same experiment name.
resume_path: # Explicit checkpoint path (.pth file) to resume training.
resume: False # Option to continue training from the latest checkpoint.
run_id: # ID to resume from a specific run within the same experiment.
resume_path: # Direct path to a specific checkpoint file (.pth) to resume training.

...
```

Setting `resume=True` will take the training related state_dicts from `/PATH/TO/MY_CKPT_ROOT_DIR/MY_EXPERIMENT_NAME/ckpt_latest.pth`.
Stating explicitly a `resume_path` will continue training from an explicit checkpoint.

In both cases, SG allows flexibility of the other training-related parameters. For example, we can resume a training experiment and run it for more epochs:
#### 1. Resuming the Latest Run

By setting `resume=True`, SuperGradients will resume training from the last checkpoint within the same experiment.

Example:

```shell
python -m super_gradients.train_from_recipe --config-name=cifar10_resnet experiment_name=cifar_experiment training_hyperparams.resume=True training_hyperparams.max_epochs=300
# Continues from the latest run in the cifar_experiment.
python -m super_gradients.train_from_recipe --config-name=cifar10_resnet experiment_name=cifar_experiment training_hyperparams.resume=True
```

#### 2. Resuming a Specific Run

Using `run_id`, you can resume training from a specific run within the same experiment, identified by the run ID.

Example:

```shell
python -m super_gradients.train_from_recipe --config-name=cifar10_resnet experiment_name=cifar_experiment training_hyperparams.resume=True training_hyperparams.max_epochs=400
# Continues from a specific run identified by the ID within cifar_experiment.
python -m super_gradients.train_from_recipe --config-name=cifar10_resnet experiment_name=cifar_experiment run_id=RUN_20230802_131052_651906
```

However, this flexibility comes with a price: we must be aware of any change in parameters (by command line overrides or hard-coded changes inside the yaml file configurations) if we wish to resume training.
#### 3. Branching off from a specific checkpoint

By specifying a `resume_path`, SuperGradients will create a new run directory,
allowing training to resume from that specific checkpoint,
and subsequently save the new checkpoints in this new directory.

Example:

```shell
# Branches from a specific checkpoint, creating a new run.
python -m super_gradients.train_from_recipe --config-name=cifar10_resnet experiment_name=cifar_experiment training_hyperparams.resume_path=/path/to/checkpoint.pth
```

#### 4. Resuming with original recipe
Resuming is parameter-dependent: you cannot resume the training of a model if there is a mismatch between the model architecture defined in your recipe and the one in your checkpoint.

Therefore, if you trained a model a while ago and have since changed the model architecture definition,
you won't be able to resume its training; loading the model will simply raise an exception.


To avoid this issue, SuperGradients provides an option to resume training based on the recipe that was originally used to train the model.

```python
Trainer.resume_experiment(ckpt_root_dir=..., experiment_name=..., run_id=...)
```

- `run_id` is optional. You can use it to choose which run you want to resume. By default, it will resume the latest run of your experiment.

Note that `Trainer.resume_experiment` can only resume training runs that were launched with `Trainer.train_from_config`.

For this reason, SG also offers a safer option for resuming interrupted training - the `Trainer.resume_experiment(...)` method. It takes two arguments: `experiment_name` - the experiment's name to continue, and `ckpt_root_dir` - the directory including the checkpoints. It will resume training with the same settings the training was launched with.
Note that resuming training this way requires the interrupted training to be launched with configuration files (i.e., `Trainer.train_from_config`), which outputs the Hydra final config to the `.hydra` directory inside the checkpoints directory.
See usage in our [resume_experiment_example](https://github.com/Deci-AI/super-gradients/blob/master/src/super_gradients/examples/resume_experiment_example/resume_experiment.py).

## Resuming Training from SG Logger's Remote Storage (WandB only)
5 changes: 3 additions & 2 deletions src/super_gradients/common/environment/cfg_utils.py
@@ -52,7 +52,7 @@ def load_recipe(config_name: str, recipes_dir_path: Optional[str] = None, overri
return cfg


def load_experiment_cfg(experiment_name: str, ckpt_root_dir: str = None) -> DictConfig:
def load_experiment_cfg(experiment_name: str, ckpt_root_dir: Optional[str] = None, run_id: Optional[str] = None) -> DictConfig:
"""
Load the hydra config associated to a specific experiment.
@@ -65,12 +65,13 @@ def load_experiment_cfg(experiment_name: str, ckpt_root_dir: str = None) -> Dict
:param experiment_name: Name of the experiment to resume
:param ckpt_root_dir: Directory including the checkpoints
:param run_id: Optional. Run id of the experiment. If None, the most recent run will be loaded.
:return: The config that was used for that experiment
"""
if not experiment_name:
raise ValueError(f"experiment_name should be non empty string but got :{experiment_name}")

checkpoints_dir_path = Path(get_checkpoints_dir_path(experiment_name, ckpt_root_dir))
checkpoints_dir_path = Path(get_checkpoints_dir_path(ckpt_root_dir=ckpt_root_dir, experiment_name=experiment_name, run_id=run_id))
if not checkpoints_dir_path.exists():
raise FileNotFoundError(f"Impossible to find checkpoint dir ({checkpoints_dir_path})")
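
For illustration, a usage sketch of the updated helper. The signature matches the diff above; the import path is inferred from the file location, and the experiment/run names are the hypothetical ones used earlier on this page:

```python
from super_gradients.common.environment.cfg_utils import load_experiment_cfg

# Load the Hydra config that was used for a specific run.
# Omitting run_id falls back to the most recent run of the experiment.
cfg = load_experiment_cfg(
    experiment_name="my_resnet18_training_experiment",
    ckpt_root_dir="my_checkpoints_folder",
    run_id="RUN_20230802_131052_651906",
)
print(list(cfg.keys()))  # top-level keys depend on the recipe that was used
```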

