Feature/sg 1027 checkpoint directory refacto (#1401)
* wip-S

* wip

* fix bug where resume would crash if latest run doesn't include latest_ckpt

* remove unwanted change + copy hydra

* minor changes

* fix test

* add tests

* fix test

* New Export API (#1318)

* Designing export API

* Export WIP

* ONNX NMS

* Export WIP

* Refactor test and move benchmark API to function

* Export WIP

* Make the top_k a constant and not variable since TRT export does not work with dynamic top_k

* Refactor test and move benchmark API to function

* Added option to change the output format

* Refactor test and move benchmark API to function

* Added option to change the output format

* Refactor test and move benchmark API to function

* Fixing export to make it TRT friendly

* Fixing export to make it TRT friendly

* Fixing export to make it TRT friendly

* Fixing export to make it TRT friendly

* Remove unused classes

* Remove unused classes

* Remove unused classes

* Remove unused classes

* Fixing export to FP16

* Fixing export to FP16

* Improve output of the benchmark result

* Improve device handling when exporting NMS

* Improve device handling when exporting NMS

* Fix nms format conversion modules export

* Revert unit test

* Improve model device handling

* Adding docs

* Adding docs

* Adding docs

* Adding docs

* Address TODO's after code review

* Added check whether model is already quantized

* Install pytorch quantization package

* Added printing of user-friendly description on how to use the exported model

* Update docs

* Update docs

* Uninstall SG

* Added onnx_graphsurgeon

* Added onnx_graphsurgeon

* Put extra index url at the top

* Put extra index url before the package that requires it

* Fix --index-url to --extra-index-url

* get_requirements to handle --extra-index-url correctly

* Made method draw_box_title public

* Fix tests

* Fix missing HasPredict for BaseClassifier model

* Make quantization parameters overridable

* Feature/sg 000 fix predict in pose estimation (#1358)

* Update readme

* Fix small bug in __repr__ implementation of KeypointsImageToTensor

* Test

* Test

* Test

* Test

* Test

* Test

* Make graphsurgeon an optional

* Make graphsurgeon an optional

* Properly handle imports of optional packages

* Added empty __init__.py files

* Do imports of gs inside the export call

* Do imports of gs inside the export call

* Fix DEKR's missing HasPredict interface

* Update notebook & example doc to reflect changes in imports & function names

* Update readme

* Put back images

* add model export (#1362)

* fix (#1367)

* fix

* add spacing

* Feature/sg 000 propagate imagenet dataset params (#1368)

* Propagate default dataset processing params for other classification models

* Fix bug in predict pipeline (Softmax was computed along batch dimension AFTER taking max along classes dimension)

* Added more classification models to test

* Doc changes (#1253)

* num classes specified was wrong

* wrong num_classes specified

---------

Co-authored-by: Ofri Masad <ofrimasad@users.noreply.github.com>
Co-authored-by: Eugene Khvedchenya <ekhvedchenya@gmail.com>

* Summarize models, losses & metrics for segmentation (#1354)

* Summarize models, losses & metrics

* Added troubleshooting section

* Feature/sg 000 fix import of onnx graphsurgeon (#1359)

* Update readme

* Fix small bug in __repr__ implementation of KeypointsImageToTensor

* Test

* Test

* Test

* Test

* Test

* Test

* Make graphsurgeon an optional

* Make graphsurgeon an optional

* Properly handle imports of optional packages

* Added empty __init__.py files

* Do imports of gs inside the export call

* Do imports of gs inside the export call

* Fix DEKR's missing HasPredict interface

* Update notebook & example doc to reflect changes in imports & function names

* Update readme

* Put back images

* Install onnx_graphsurgeon in CI

* Install onnx_graphsurgeon in CI

* Fix version of ONNX-GS installed in CI and installed on-demand

* Fix arange_cpu not implemented for Half

* Fix arange_cpu not implemented for Half

* Fix graph merging for old pytorch (1.12) that crashed because of nodes with duplicate names

* Feature/sg 1047 predict od with labels (#1365)

* cleanup start

* added docs

* added tests

* added tests + fix yolox

* fixed ppyoloe

* fixed ppyoloe

* small ppyoloe prep model for conversion fix

* small ppyoloe prep model for conversion fix

* fixed image_i_object_count ref docs

* aligned box thickness

* renamed vars in example

* changed statement and added len verification

* fixed predictions docs

* fixed pipelines docs

* removed gt text from plots

* removed gt text from plots

* refactored predict with labels to use show/save

* Feature/sg 1033 fix yolox anchors (#1369)

* Update readme

* Fix small bug in __repr__ implementation of KeypointsImageToTensor

* Test

* Test

* Test

* Test

* Test

* Test

* Make graphsurgeon an optional

* Make graphsurgeon an optional

* Properly handle imports of optional packages

* Added empty __init__.py files

* Do imports of gs inside the export call

* Do imports of gs inside the export call

* Fix DEKR's missing HasPredict interface

* Update notebook & example doc to reflect changes in imports & function names

* Update readme

* Put back images

* Install onnx_graphsurgeon in CI

* Install onnx_graphsurgeon in CI

* Working prototype of YoloX fix of Anchors that can load model weights as well

* Added more tests for detection predict() and yolox checkpoint loading

* Fix version of ONNX-GS installed in CI and installed on-demand

* Added docs

* Added docs

* Added docs

* Remove leftover

* Set ignore_errors=True to trainer test and declare why

* Fix bug in maybe_remove_module_prefix

* version bumped (#1374)

Co-authored-by: Eugene Khvedchenya <ekhvedchenya@gmail.com>

* Adding pose estimation to the readme (#1375)

* Update readme

* Fix small bug in __repr__ implementation of KeypointsImageToTensor

* Update pose estimation image

* Fix link

* Remove broken link that we can't recover where it was pointing to (#1376)

* test on first run

* improve doc with new checkpoint run logic

* add doc for Trainer

* update doc

* test sweeper

* test sweeper

* test sweeper

* test sweeper

* test sweeper

* test sweeper

* test direct call

* fix indentation

* fix indentation

* move test to integration test

* fix

* add makefile test

* update

* remove test from base ci test

---------

Co-authored-by: Eugene Khvedchenya <ekhvedchenya@gmail.com>
Co-authored-by: Pranoy Radhakrishnan <pranoyalkr@gmail.com>
Co-authored-by: Ofri Masad <ofrimasad@users.noreply.github.com>
Co-authored-by: Shay Aharon <80472096+shaydeci@users.noreply.github.com>
5 people committed Aug 22, 2023
1 parent 052af61 commit 0b4710a
Showing 13 changed files with 471 additions and 118 deletions.
46 changes: 45 additions & 1 deletion .circleci/config.yml
@@ -205,7 +205,6 @@ jobs:
. venv/bin/activate
python3 -m pip install pytorch-quantization==2.1.2 --extra-index-url https://pypi.ngc.nvidia.com
python3 -m pip install onnx_graphsurgeon==0.3.27 --extra-index-url https://pypi.ngc.nvidia.com
- run:
name: run tests with coverage
no_output_timeout: 30m
@@ -214,6 +213,7 @@
coverage run --source=super_gradients -m unittest tests/deci_core_unit_test_suite_runner.py
coverage report
coverage html # open htmlcov/index.html in a browser
- store_artifacts:
path: htmlcov

@@ -448,6 +448,42 @@ jobs:
tag: $CIRCLE_TAG
notes: "This GitHub Release was done automatically by CircleCI"

hydra_sweeper_test:
docker:
- image: 307629990626.dkr.ecr.us-east-1.amazonaws.com/deci/infra/circleci/runner/sg-gpu:<< pipeline.parameters.sg_docker_version >>
resource_class: deci-ai/sg-gpu-on-premise
parameters:
sg_existing_env_path:
type: string
default: "/env/persistent_env"
sg_new_env_name:
type: string
default: "${CIRCLE_BUILD_NUM}"
sg_new_env_python_version:
type: string
default: "python3.8"
steps:
- checkout
- run:
name: install requirements and run recipe tests
command: |
<< parameters.sg_new_env_python_version >> -m venv << parameters.sg_new_env_name >>
source << parameters.sg_new_env_name >>/bin/activate
python3.8 -m pip install --upgrade setuptools pip wheel
python3.8 -m pip install -r requirements.txt
python3.8 -m pip install .
python3.8 -m pip install torch torchvision torchaudio
make sweeper_test
- run:
name: Remove new environment when failed
command: "rm -r << parameters.sg_new_env_name >>"
when: on_fail

- slack/notify:
channel: "sg-integration-tests"
event: fail
template: basic_fail_1 # see https://github.com/CircleCI-Public/slack-orb/wiki#templates.

recipe_accuracy_tests:
docker:
@@ -758,6 +794,9 @@ workflows:
sanity_tests:
when: << pipeline.parameters.run_sanity_tests_flow >>
jobs:
- hydra_sweeper_test:
context:
- slack
- recipe_sanity_tests_classification_pt1:
context:
- slack
@@ -798,6 +837,10 @@ workflows:
- deci-common/persist_version_info
- login_to_codeartifact_release
<<: *release_tag_filter
- hydra_sweeper_test:
context:
- slack
<<: *release_tag_filter
- recipe_sanity_tests_classification_pt1:
context:
- slack
@@ -826,6 +869,7 @@
py_version: "3.8"
requires:
- "build3.8"
- hydra_sweeper_test
- recipe_accuracy_tests
- recipe_sanity_tests_classification_pt1
- recipe_sanity_tests_classification_pt2
15 changes: 15 additions & 0 deletions Makefile
@@ -16,5 +16,20 @@ recipe_accuracy_tests:
coverage run --source=super_gradients -m unittest tests/deci_core_recipe_test_suite_runner.py


sweeper_test:
python -m super_gradients.train_from_recipe -m --config-name=cifar10_resnet \
ckpt_root_dir=$$PWD \
experiment_name=sweep_cifar10 \
training_hyperparams.max_epochs=1 \
training_hyperparams.initial_lr=0.001,0.01

# Make sure that experiment_dir includes $$expected_num_dir subdirectories. If not, fail
subdir_count=$$(find "$$PWD/sweep_cifar10" -mindepth 1 -maxdepth 1 -type d | wc -l); \
if [ "$$subdir_count" -ne 2 ]; then \
echo "Error: $$PWD/sweep_cifar10 should include 2 subdirectories but includes $$subdir_count."; \
exit 1; \
fi


examples_to_docs:
jupyter nbconvert --to markdown --output-dir="documentation/source/" --execute src/super_gradients/examples/model_export/models_export.ipynb
96 changes: 72 additions & 24 deletions documentation/source/Checkpoints.md
@@ -23,9 +23,13 @@ That's why in SG, multiple checkpoints are saved throughout training:

#### Where are the checkpoint files saved?

The checkpoint files will be saved at <PATH_TO_CKPT_ROOT_DIR>/experiment_name/.
The checkpoint files will be saved at `<ckpt_root_dir>/<experiment_name>/<run_dir>`.

The user controls the checkpoint root directory, which can be passed to the `Trainer` constructor through the `ckpt_root_dir` argument.
- `ckpt_root_dir` and `experiment_name` can be set by the user when instantiating the `Trainer`.
```python
Trainer(ckpt_root_dir='path/to/ckpt_root_dir', experiment_name="my_experiment")
```
- `run_dir` is unique and automatically generated each time you start a new training with `trainer.train(...)`.

When working with a cloned version of SG, one can leave out the `ckpt_root_dir` arg, and checkpoints will be saved under the `super_gradients/checkpoints` directory.

@@ -87,14 +91,19 @@ Then at the end of the training, our `ckpt_root_dir` contents will look similar

```
my_checkpoints_folder
|─── my_resnet18_training_experiment
│ ckpt_best.pth # Model checkpoint on best epoch
│ ckpt_latest.pth # Model checkpoint on last epoch
│ average_model.pth # Model checkpoint averaged over epochs
| ckpt_epoch_10.pth # Model checkpoint of epoch 10
| ckpt_epoch_15.pth # Model checkpoint of epoch 15
│ events.out.tfevents.1659878383... # Tensorflow artifacts of a specific run
│ log_Aug07_11_52_48.txt # Trainer logs of a specific run
├─── my_resnet18_training_experiment
│ ├── RUN_20230802_131052_651906
│ │ ├─ ckpt_best.pth # Model checkpoint on best epoch
│ │ ├─ ckpt_latest.pth # Model checkpoint on last epoch
│ │ ├─ average_model.pth # Model checkpoint averaged over epochs
│ │ ├─ ckpt_epoch_10.pth # Model checkpoint of epoch 10
│ │ ├─ ckpt_epoch_15.pth # Model checkpoint of epoch 15
│ │ ├─ events.out.tfevents.1659878383... # Tensorflow artifacts of a specific run
│ │ └─ log_Aug02_13_10_52.txt # Trainer logs of a specific run
│ │
│ └─ RUN_20230803_121652_243212
│ └─ ...
└─── some_other_training_experiment_name
...
```
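As a side note, the `RUN_<timestamp>` naming makes it easy to locate a particular run programmatically. A minimal sketch using only the Python standard library, with the hypothetical folder names from the tree above:

```python
from pathlib import Path

# Paths taken from the example tree above (adjust to your own ckpt_root_dir / experiment_name).
experiment_dir = Path("my_checkpoints_folder") / "my_resnet18_training_experiment"

# RUN_YYYYMMDD_HHMMSS_* names sort lexicographically by timestamp,
# so the lexicographic maximum is the most recent run.
run_dirs = [d for d in experiment_dir.iterdir() if d.is_dir() and d.name.startswith("RUN_")]
latest_run_dir = max(run_dirs, key=lambda d: d.name)  # raises ValueError if no run exists yet

print(latest_run_dir / "ckpt_best.pth")
```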
@@ -105,7 +114,7 @@ Suppose we wish to load the weights from `ckpt_best.pth`. We can simply pass its path to
from super_gradients.training import models
from super_gradients.common.object_names import Models

model = models.get(model_name=Models.RESNET18, num_classes=10, checkpoint_path="/path/to/my_checkpoints_folder/my_resnet18_training_experiment/ckpt_best.pth")
model = models.get(model_name=Models.RESNET18, num_classes=10, checkpoint_path="/path/to/my_checkpoints_folder/my_resnet18_training_experiment/RUN_20230802_131052_651906/ckpt_best.pth")
```

> Important: when loading SG-trained checkpoints using models.get(...), if the network was trained with EMA, the EMA weights will be the ones loaded.
Expand All @@ -118,7 +127,7 @@ from super_gradients.common.object_names import Models
from super_gradients.training.utils.checkpoint_utils import load_checkpoint_to_model

model = models.get(model_name=Models.RESNET18, num_classes=10)
load_checkpoint_to_model(net=model, ckpt_local_path="/path/to/my_checkpoints_folder/my_resnet18_training_experiment/ckpt_best.pth")
load_checkpoint_to_model(net=model, ckpt_local_path="/path/to/my_checkpoints_folder/my_resnet18_training_experiment/RUN_20230802_131052_651906/ckpt_best.pth")
```
### Extending the Functionality of PyTorch's `strict` Parameter in `load_state_dict()`

@@ -237,32 +246,71 @@ def train_from_config(cls, cfg: Union[DictConfig, dict]) -> Tuple[nn.Module, Tup

## Resuming Training

In SG, we separate the logic of resuming training from loading model weights. Therefore, continuing training is controlled by two arguments, passed through `training_params`: `resume` and `resume_path`:
Resuming training in SG is controlled by three primary parameters that let you either continue a previous run
or branch off from a specific training checkpoint.
These parameters are passed through `training_params`: `resume`, `run_id`, and `resume_path`.

```yaml
...
resume: False # whether to continue training from ckpt with the same experiment name.
resume_path: # Explicit checkpoint path (.pth file) to resume training.
resume: False # Option to continue training from the latest checkpoint.
run_id: # ID to resume from a specific run within the same experiment.
resume_path: # Direct path to a specific checkpoint file (.pth) to resume training.

...
```

Setting `resume=True` will take the training related state_dicts from `/PATH/TO/MY_CKPT_ROOT_DIR/MY_EXPERIMENT_NAME/ckpt_latest.pth`.
Stating explicitly a `resume_path` will continue training from an explicit checkpoint.

In both cases, SG allows flexibility of the other training-related parameters. For example, we can resume a training experiment and run it for more epochs:
#### 1. Resuming the Latest Run

By setting `resume=True`, SuperGradients will resume training from the last checkpoint within the same experiment.

Example:

```shell
python -m super_gradients.train_from_recipe --config-name=cifar10_resnet experiment_name=cifar_experiment training_hyperparams.resume=True training_hyperparams.max_epochs=300
# Continues from the latest run in the cifar_experiment.
python -m super_gradients.train_from_recipe --config-name=cifar10_resnet experiment_name=cifar_experiment training_hyperparams.resume=True
```

#### 2. Resuming a Specific Run

Using `run_id`, you can resume training from a specific run within the same experiment, identified by the run ID.

Example:

```shell
python -m super_gradients.train_from_recipe --config-name=cifar10_resnet experiment_name=cifar_experiment training_hyperparams.resume=True training_hyperparams.max_epochs=400
# Continues from a specific run identified by the ID within cifar_experiment.
python -m super_gradients.train_from_recipe --config-name=cifar10_resnet experiment_name=cifar_experiment run_id=RUN_20230802_131052_651906
```

However, this flexibility comes with a price: we must be aware of any change in parameters (by command line overrides or hard-coded changes inside the yaml file configurations) if we wish to resume training.
#### 3. Branching off from a specific checkpoint

By specifying a `resume_path`, SuperGradients will create a new run directory,
allowing training to resume from that specific checkpoint,
and subsequently save the new checkpoints in this new directory.

Example:

```shell
# Branches from a specific checkpoint, creating a new run.
python -m super_gradients.train_from_recipe --config-name=cifar10_resnet experiment_name=cifar_experiment training_hyperparams.resume_path=/path/to/checkpoint.pth
```

#### 4. Resuming with original recipe
Resuming is parameter-dependent: you cannot resume the training of a model if there is a mismatch between the model architecture defined in your recipe and the one in your checkpoint.

Therefore, if you trained a model a while ago and have since changed the model architecture definition,
you won't be able to resume its training; loading the model will simply raise an exception.


To avoid this issue, SuperGradients provides an option to resume training based on the recipe that was originally used to train the model.

```python
Trainer.resume_experiment(ckpt_root_dir=..., experiment_name=..., run_id=...)
```

- `run_id` is optional. You can use it to choose which run you want to resume. By default, it will resume the latest run of your experiment.

Note that `Trainer.resume_experiment` can only resume training runs that were launched with `Trainer.train_from_config`.

For this reason, SG also offers a safer option for resuming interrupted training - the `Trainer.resume_experiment(...)` method. It takes two arguments: `experiment_name` - the experiment's name to continue, and `ckpt_root_dir` - the directory including the checkpoints. It will resume training with the same settings the training was launched with.
Note that resuming training this way requires the interrupted training to be launched with configuration files (i.e., `Trainer.train_from_config`), which outputs the Hydra final config to the `.hydra` directory inside the checkpoints directory.
See usage in our [resume_experiment_example](https://github.com/Deci-AI/super-gradients/blob/master/src/super_gradients/examples/resume_experiment_example/resume_experiment.py).

## Resuming Training from SG Logger's Remote Storage (WandB only)
5 changes: 3 additions & 2 deletions src/super_gradients/common/environment/cfg_utils.py
@@ -52,7 +52,7 @@ def load_recipe(config_name: str, recipes_dir_path: Optional[str] = None, overri
return cfg


def load_experiment_cfg(experiment_name: str, ckpt_root_dir: str = None) -> DictConfig:
def load_experiment_cfg(experiment_name: str, ckpt_root_dir: Optional[str] = None, run_id: Optional[str] = None) -> DictConfig:
"""
Load the hydra config associated to a specific experiment.
@@ -65,12 +65,13 @@ def load_experiment_cfg(experiment_name: str, ckpt_root_dir: str = None) -> Dict
:param experiment_name: Name of the experiment to resume
:param ckpt_root_dir: Directory including the checkpoints
:param run_id: Optional. Run id of the experiment. If None, the most recent run will be loaded.
:return: The config that was used for that experiment
"""
if not experiment_name:
raise ValueError(f"experiment_name should be non empty string but got :{experiment_name}")

checkpoints_dir_path = Path(get_checkpoints_dir_path(experiment_name, ckpt_root_dir))
checkpoints_dir_path = Path(get_checkpoints_dir_path(ckpt_root_dir=ckpt_root_dir, experiment_name=experiment_name, run_id=run_id))
if not checkpoints_dir_path.exists():
raise FileNotFoundError(f"Impossible to find checkpoint dir ({checkpoints_dir_path})")
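
For illustration, a usage sketch of the updated helper. The signature matches the diff above; the import path is inferred from the file location, and the experiment/run names are the hypothetical ones used earlier on this page:

```python
from super_gradients.common.environment.cfg_utils import load_experiment_cfg

# Load the Hydra config that was used for a specific run.
# Omitting run_id falls back to the most recent run of the experiment.
cfg = load_experiment_cfg(
    experiment_name="my_resnet18_training_experiment",
    ckpt_root_dir="my_checkpoints_folder",
    run_id="RUN_20230802_131052_651906",
)
print(list(cfg.keys()))  # top-level keys depend on the recipe that was used
```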

