Ray Tune checkpointing fix, allow LR schedules for non-PCGrad opt, and more. #142
Conversation
erwulff commented Sep 15, 2022 (edited)
- Fix automatic saving and loading of model and optimizer checkpoints in Ray Tune runs
- Allow using LR schedules with optimizers other than PCGrad
- Add JURECA sbatch script for multi-node Horovod training
- Use Comet offline logging in Ray Tune experiments
- Etc.
Merge new commits from jpata:master
Merge from jpata/particleflow master
Merge jpata/master into master
Merge latest developments
merge jpata/particleflow master
Ready to merge from my side.
@@ -194,7 +204,7 @@ def train(config, weights, ntrain, ntest, nepochs, recreate, prefix, plot_freq,
     model.fit(
         ds_train.repeat(),
         validation_data=ds_test.repeat(),
-        epochs=initial_epoch + config["setup"]["num_epochs"],
+        epochs=config["setup"]["num_epochs"],
why did you change this? I think it was correct.
My thinking is the following. Let's say `config["setup"]["num_epochs"]` is 100 and we resume an interrupted training from epoch 20. Then `initial_epoch` will be 20 and `initial_epoch + config["setup"]["num_epochs"]` will be 120, right? I think it's more intuitive that `config["setup"]["num_epochs"]` should be the total number of epochs to run before completing the training, rather than the additional number of epochs to run from the resumed point. This is a matter of taste, I suppose. What do you think?
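The two semantics can be sketched in plain Python. This is a hedged illustration, not project code: `epochs_run` is a hypothetical helper that mirrors how Keras `model.fit` interprets `epochs` as the final epoch index when `initial_epoch` is set, i.e. it trains over epochs `[initial_epoch, epochs)`.

```python
def epochs_run(initial_epoch, epochs):
    """Number of epochs actually executed, Keras-style:
    fit trains over the half-open range [initial_epoch, epochs)."""
    return max(0, epochs - initial_epoch)

num_epochs = 100    # config["setup"]["num_epochs"]
initial_epoch = 20  # resuming from a checkpoint saved at epoch 20

# "train up to X epochs" (this PR): pass num_epochs directly
up_to = epochs_run(initial_epoch, num_epochs)

# "train X more epochs" (previous behavior): add initial_epoch
x_more = epochs_run(initial_epoch, initial_epoch + num_epochs)

print(up_to, x_more)  # 80 100
```

With the PR's change, a resumed run finishes at epoch 100 total; with the old code it would have run 100 additional epochs, ending at epoch 120.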
Thanks for this, looks good. So I agree to change the meaning of epochs/nepochs from "train X more epochs" to "train up to X epochs".
…d more. (#142)
* feat: add option to include SLURM jobid name in training dir
* feat: add command-line option to enable horovod
* feat: Use comet offline logging in Ray Tune runs
* fix: bug in raytune command
* fix: handle TF version-dependent names of the legacy optimizer
* feat: add event and met losses to raytune search space
* feat: added sbatch script for Horovod training on JURECA
* fix: Ray Tune checkpoint saving and loading
* feat: allow lr schedules when not using PCGrad
* chore: add print of loaded opt weights
* fix: handle TF version-dependent names of the legacy optimizer

Former-commit-id: deb05ea