
Ray Tune checkpointing fix, allow LR schedules for non-PCGrad opt, and more. #142

Merged 34 commits into jpata:main on Sep 20, 2022

Conversation

@erwulff (Collaborator) commented Sep 15, 2022

  • Fix automatic saving and loading of model and optimizer checkpoints in Ray Tune runs
  • Allow LR schedules when using optimizers other than PCGrad (see the sketch after this list)
  • Add JURECA sbatch script for multi-node Horovod training
  • Use Comet offline logging in Ray Tune experiments
  • Other miscellaneous fixes and improvements
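As a rough illustration of the LR-schedule item above, here is a minimal sketch of passing a Keras learning-rate schedule to a plain (non-PCGrad) optimizer; the schedule type and all hyperparameter values are placeholders, not code from this PR:

import tensorflow as tf

# Placeholder schedule and values, purely illustrative.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3,
    decay_steps=10000,
    decay_rate=0.96,
)
# A schedule can be passed directly to the optimizer when PCGrad is not wrapping it.
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)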

erwulff and others added 30 commits July 9, 2021 11:11
Merge new commits from jpata:master
Merge from jpata/particleflow master
Merge jpata/master into master
Merge latest developments
merge jpata/particleflow master
@erwulff changed the title from "Erwulff 220915 dev" to "Ray Tune checkpointing fix, allow LR schedules for non-PCGrad opt, and more." on Sep 15, 2022
@erwulff marked this pull request as ready for review September 16, 2022 08:41
@erwulff requested a review from jpata September 16, 2022 09:14
@erwulff (Collaborator, Author) left a comment:
Ready to merge from my side.

@@ -194,7 +204,7 @@ def train(config, weights, ntrain, ntest, nepochs, recreate, prefix, plot_freq,
 model.fit(
     ds_train.repeat(),
     validation_data=ds_test.repeat(),
-    epochs=initial_epoch + config["setup"]["num_epochs"],
+    epochs=config["setup"]["num_epochs"],
@jpata (Owner) commented:

Why did you change this? I think it was correct.

@erwulff (Collaborator, Author) replied:
My thinking is the following. Let's say config["setup"]["num_epochs"] is 100 and we resume an interrupted training from epoch 20. Then initial_epoch will be 20 and initial_epoch + config["setup"]["num_epochs"] will be 120, right? I think it's more intuitive that config["setup"]["num_epochs"] should be the total number of epochs to run before completing the training, rather than the additional number of epochs to run from the resumed point. This is a matter of taste I suppose. What do you think?

@jpata (Owner) commented Sep 20, 2022

Thanks for this, looks good.

So I agree to change the meaning of epochs/nepochs from "train X more epochs" to "train up to X epochs".

@jpata merged commit deb05ea into jpata:main on Sep 20, 2022
jpata pushed a commit that referenced this pull request Sep 15, 2023
Ray Tune checkpointing fix, allow LR schedules for non-PCGrad opt, and more. (#142)

* feat: add option to include SLURM jobid name in training dir

* feat: add command-line option to enable horovod

* feat: Use comet offline logging in Ray Tune runs

* fix: bug in raytune command

* fix: handle TF version-dependent names of the legacy optimizer

* feat: add event and met losses to raytune search space

* feat: added sbatch script for Horovod training on JURECA

* fix: Ray Tune checkpoint saving and loading

* feat: allow lr schedules when not using PCGrad

* chore: add print of loaded opt weights

* fix: handle TF version-dependent names of the legacy optimizer

Former-commit-id: deb05ea