
Ray Tune checkpointing fix, allow LR schedules for non-PCGrad opt, and more. #142

Merged 34 commits into jpata:main on Sep 20, 2022

Conversation

@erwulff (Collaborator) commented Sep 15, 2022

  • Fix automatic saving and loading of model and optimizer checkpoints in Ray Tune runs
  • Allow LR schedules when using optimizers other than PCGrad (see the sketch after this list)
  • Add JURECA sbatch script for multi-node Horovod training
  • Use Comet offline logging in Ray Tune experiments
  • Other miscellaneous fixes and improvements
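As a rough illustration of the LR-schedule item above, here is a minimal sketch of passing a Keras learning-rate schedule to a plain (non-PCGrad) optimizer; the schedule type and all hyperparameter values are placeholders, not code from this PR:

import tensorflow as tf

# Placeholder schedule and values, purely illustrative.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3,
    decay_steps=10000,
    decay_rate=0.96,
)
# A schedule can be passed directly to the optimizer when PCGrad is not wrapping it.
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)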

erwulff and others added 30 commits July 9, 2021 11:11
Merge new commits from jpata:master
Merge from jpata/particleflow master
Merge jpata/master into master
Merge latest developments
merge jpata/particleflow master
@erwulff changed the title from "Erwulff 220915 dev" to "Ray Tune checkpointing fix, allow LR schedules for non-PCGrad opt, and more." on Sep 15, 2022
@erwulff marked this pull request as ready for review September 16, 2022 08:41
@erwulff requested a review from jpata September 16, 2022 09:14
@erwulff (Collaborator, Author) left a comment:
Ready to merge from my side.

@@ -194,7 +204,7 @@ def train(config, weights, ntrain, ntest, nepochs, recreate, prefix, plot_freq,
 model.fit(
     ds_train.repeat(),
     validation_data=ds_test.repeat(),
-    epochs=initial_epoch + config["setup"]["num_epochs"],
+    epochs=config["setup"]["num_epochs"],
@jpata (Owner) commented:

Why did you change this? I think it was correct.

@erwulff (Collaborator, Author) replied:
My thinking is the following. Let's say config["setup"]["num_epochs"] is 100 and we resume an interrupted training from epoch 20. Then initial_epoch will be 20 and initial_epoch + config["setup"]["num_epochs"] will be 120, right? I think it's more intuitive that config["setup"]["num_epochs"] should be the total number of epochs to run before completing the training, rather than the additional number of epochs to run from the resumed point. This is a matter of taste I suppose. What do you think?

@jpata (Owner) commented Sep 20, 2022

Thanks for this, looks good.

So I agree to change the meaning of epochs/nepochs from "train X more epochs" to "train up to X epochs".

@jpata merged commit deb05ea into jpata:main on Sep 20, 2022
jpata pushed a commit that referenced this pull request Sep 15, 2023
Ray Tune checkpointing fix, allow LR schedules for non-PCGrad opt, and more. (#142)

* feat: add option to include SLURM jobid name in training dir

* feat: add command-line option to enable horovod

* feat: Use comet offline logging in Ray Tune runs

* fix: bug in raytune command

* fix: handle TF version-dependent names of the legacy optimizer

* feat: add event and met losses to raytune search space

* feat: added sbatch script for Horovod training on JURECA

* fix: Ray Tune checkpoint saving and loading

* feat: allow lr schedules when not using PCGrad

* chore: add print of loaded opt weights

* fix: handle TF version-dependent names of the legacy optimizer

Former-commit-id: deb05ea