OneCycle LR, LR finder, custom Tensorboard, etc. #70

Merged · 25 commits · Jul 23, 2021

Conversation

@erwulff (Collaborator) commented Jun 29, 2021

This pull request includes several new features.

OneCycleScheduler

The OneCycleScheduler is a tf.keras.optimizers.schedules.LearningRateSchedule that schedules the learning rate according to the 1cycle policy from Leslie Smith's paper (https://arxiv.org/pdf/1803.09820.pdf). The implementation adopts the additional improvements made in the fastai library (https://docs.fast.ai/callbacks.one_cycle.html), where only two phases are used and the annealing follows a cosine curve.

In my experience, the OneCycle policy improves generalization and speeds up training.
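
For orientation, here is a minimal sketch of such a two-phase cosine-annealed schedule written as a tf.keras.optimizers.schedules.LearningRateSchedule. The class name and the pct_start/div_factor parameters and defaults are assumptions for illustration, not the PR's actual implementation:

```python
import math
import tensorflow as tf

# Sketch only: a two-phase 1cycle schedule with cosine annealing.
# pct_start, div_factor and final_div_factor are assumed names/defaults.
class OneCycleSketch(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, max_lr, total_steps, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
        self.max_lr = max_lr
        self.total_steps = total_steps
        self.warmup_steps = int(total_steps * pct_start)
        self.initial_lr = max_lr / div_factor
        self.final_lr = max_lr / final_div_factor

    def _cosine(self, start, end, frac):
        # Cosine interpolation from start to end as frac goes 0 -> 1.
        return end + 0.5 * (start - end) * (1.0 + tf.cos(math.pi * frac))

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        warmup = tf.cast(self.warmup_steps, tf.float32)
        total = tf.cast(self.total_steps, tf.float32)
        # Phase 1: anneal up to max_lr; phase 2: anneal down to final_lr.
        return tf.cond(
            step < warmup,
            lambda: self._cosine(self.initial_lr, self.max_lr, step / warmup),
            lambda: self._cosine(self.max_lr, self.final_lr,
                                 (step - warmup) / (total - warmup)),
        )
```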

Learning Rate Finder

The learning rate finder implements a technique to easily estimate a range of learning rates that should perform well given the current model setup. When the model architecture or other hyperparameters are changed, the learning rate finder can be run in order to find a new suitable LR range.

The learning rate finder starts training the model at a very low LR, increasing it every batch. The batch loss is plotted vs LR (or, equivalently, training steps) and a figure is created from which a suitable LR range can be determined.

This technique was first introduced by Leslie Smith in https://arxiv.org/abs/1506.01186.
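
As a rough sketch, such a range test can be written as a Keras callback along these lines; the class name, the exponential ramp, and the start/end defaults below are assumptions, not the code in this PR:

```python
import numpy as np
import tensorflow as tf

# Sketch only: ramp the LR up exponentially, one step per batch,
# and record the batch loss for plotting afterwards.
class LRFinderSketch(tf.keras.callbacks.Callback):
    def __init__(self, start_lr=1e-7, end_lr=1.0, num_steps=1000):
        super().__init__()
        self.start_lr, self.end_lr, self.num_steps = start_lr, end_lr, num_steps
        self.lrs, self.losses = [], []

    def on_train_batch_begin(self, batch, logs=None):
        step = len(self.lrs)
        lr = self.start_lr * (self.end_lr / self.start_lr) ** (step / self.num_steps)
        tf.keras.backend.set_value(self.model.optimizer.lr, lr)
        self.lrs.append(lr)

    def on_train_batch_end(self, batch, logs=None):
        self.losses.append(logs["loss"])
        # Stop once the range is swept or the loss diverges.
        if len(self.lrs) >= self.num_steps or not np.isfinite(logs["loss"]):
            self.model.stop_training = True
```

The recorded (lrs, losses) pairs can then be plotted to inspect the loss curve.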

Simply run

`python mlpf/pipeline.py find-lr -c parameters/<config-filename>.yaml`

and a plot of loss vs learning rate like the one below will be created. A suitable LR range lies somewhere in the negative slope of the curve, where the loss is steadily decreasing at a relatively high rate.
[Figure: lr_finder, a plot of batch loss vs. learning rate produced by the LR finder]

pipeline.py

mlpf/pipeline.py is my beginning of a click-based alternative to mlpf/launcher.py. I created it partly so as not to change mlpf/launcher.py too much in a single pull request. If mlpf/pipeline.py is well received, it may replace mlpf/launcher.py at some point in the future. For now, it is still a work in progress.
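
For readers unfamiliar with click, a minimal sketch of such an entry point follows; it mirrors the find-lr invocation shown above, but the body is purely illustrative:

```python
import click

@click.group()
def main():
    """MLPF pipeline commands."""

@main.command("find-lr")
@click.option("-c", "--config", type=click.Path(exists=True), required=True,
              help="Path to the YAML configuration file.")
def find_lr(config):
    # Illustrative placeholder: the real command would build the model
    # from the config and run the LR range test here.
    click.echo(f"Running the LR finder with config {config}")

if __name__ == "__main__":
    main()
```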

Other notes

The learning rate is no longer scaled by the batch size. Instead, the learning rate used is now exactly the one defined in the configuration file. When using the exponential decay schedule or the OneCycle schedule, the LR specified in the config is the maximum LR used by the schedule.
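
A short sketch of the new convention, assuming a hypothetical config layout (the setup/lr key is illustrative, not necessarily the real config structure):

```python
import tensorflow as tf

config = {"setup": {"lr": 3e-4}}  # hypothetical config structure
max_lr = config["setup"]["lr"]    # used verbatim, regardless of batch size

# With exponential decay, the config LR is the schedule's starting (maximum) value.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=max_lr, decay_steps=10000, decay_rate=0.99
)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)
```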

The structure of the training directory has been reorganized. Instead of writing many files directly in the training directory they have been organized in different subfolders:

  • history: contains the history_{}.json, event_{}.pdf and cm_normed.pdf files
  • weights: contains all checkpoints of model weights
  • evaluation: contains the pred.npz file(s)
  • tensorboard_logs: contains the tensorboard logs

erwulff and others added 9 commits June 24, 2021 22:23
This commit also includes
 - Custom tensorboard callback logging learning rate & momentum
 - A utils.py file collecting utilities used in more than one file
 - Clean-up of how output files are organized
 - Configuration files using the OneCycle scheduler
`mlpf/pipeline.py` is the beginning of a `click` based alternative to the
`mlpf/launcher.py`.
Also add option to give a prefix to the name of the training
directory
Also add lr_schedule parameter to configuration files
The previous commit still scaled the LR; this one fixes it.
- create get_train_val_datasets() function to get datasets for training
- move targets_multi_output() from model_setup.py to utils.py for more
  flexible access (solving import loop issue)
@jpata (Owner) commented Jun 29, 2021

I think the new pipeline and the reorganization are great!
Perhaps there is an opportunity to reduce code duplication further in the pipeline functions (e.g. dataset, loss & model setup)?
It also seems like tqdm should be added to the GitHub Actions dependencies (some help in modernizing them would be welcome, too!)

@erwulff (Collaborator, Author) commented Jun 29, 2021

> I think the new pipeline and the reorganization are great!
> Perhaps there is an opportunity to reduce code duplication further in the pipeline functions (e.g. dataset, loss & model setup)?
> It also seems like tqdm should be added to the GitHub Actions dependencies (some help in modernizing them would be welcome, too!)

Thanks!
Yes, there is definitely an opportunity for further reduction of code duplication. It is on my to-do list.

When running `python mlpf/pipeline.py evaluate -t <train_dir>` without explicitly specifying which weights to use, the pipeline will load the weights with the smallest loss it can find in <train_dir>/weights/.
This can be useful when many large checkpoint files take up too much storage space.
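
A sketch of how such a lowest-loss lookup could work, assuming the loss value is encoded in the checkpoint filename (the naming pattern and helper below are hypothetical, not the PR's actual code):

```python
import re
from pathlib import Path

def best_weights(train_dir):
    # Pick the checkpoint in <train_dir>/weights/ with the smallest
    # loss value embedded in its filename (pattern is hypothetical).
    candidates = []
    for path in Path(train_dir, "weights").glob("*.hdf5"):
        match = re.search(r"loss-(\d+\.\d+)", path.name)
        if match:
            candidates.append((float(match.group(1)), path))
    return min(candidates, key=lambda c: c[0])[1] if candidates else None
```
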
The default parameters for expdecay added to the config files
in this commit are the same as those used on the
jpata/particleflow master branch at the time of writing.
@erwulff marked this pull request as ready for review July 8, 2021 10:35
erwulff and others added 2 commits July 9, 2021 11:33
Also:
- Add missing parameters to config files.
- Move make_weights_function to utils.py
@jpata (Owner) left a comment


Looks good, a small comment inline.

I think we could go ahead with this, and later follow up with a PR that completely moves all functionality to the new pipeline (I didn't try the new one yet, just made sure the old pipeline works as before).

Thanks a lot for the effort!

mlpf/tfmodel/model_setup.py (inline review comment, resolved)
@jpata merged commit 1e4c581 into jpata:master Jul 23, 2021
jpata added a commit that referenced this pull request Sep 15, 2023
OneCycle LR, LR finder, custom Tensorboard, etc.

Former-commit-id: 1e4c581