Ray Tune checkpointing fix, allow LR schedules for non-PCGrad opt, and more. #142
Merged
34 commits
1207b06 Merge pull request #1 from jpata/master (erwulff)
12fa88d Merge pull request #2 from jpata/master (erwulff)
827c110 Merge branch 'jpata:master' into master (erwulff)
e8a44d3 Merge pull request #3 from jpata/master (erwulff)
f87b3ad Merge branch 'jpata:master' into master (erwulff)
9c8502a Merge pull request #4 from jpata/master (erwulff)
2126c38 Merge branch 'jpata:master' into master (erwulff)
ab9a4b9 Merge branch 'jpata:master' into master (erwulff)
c86a23f Merge branch 'jpata:master' into master (erwulff)
c944d87 Merge branch 'jpata:master' into master (erwulff)
1c6189f Merge pull request #6 from jpata/master (erwulff)
43cca5d Merge branch 'jpata:master' into master (erwulff)
85ef3ba Merge branch 'jpata:master' into master (erwulff)
d8e0aac Merge branch 'jpata:master' into master (erwulff)
f132bbc Merge branch 'jpata:master' into master (erwulff)
65179c6 Merge branch 'jpata:master' into master (erwulff)
914a709 Merge branch 'jpata:master' into master (erwulff)
830c012 Merge branch 'jpata:master' into master (erwulff)
59df47a Merge branch 'master' of github.com:erwulff/particleflow (erwulff)
9adbe9d Merge branch 'master' of github.com:erwulff/particleflow (erwulff)
0820179 Merge branch 'master' of github.com:erwulff/particleflow (erwulff)
0957c5d Merge branch 'master' of github.com:erwulff/particleflow (erwulff)
738c094 Merge branch 'master' of github.com:erwulff/particleflow (erwulff)
3e77186 feat: add option to include SLURM jobid name in training dir (erwulff)
c63e8e7 feat: add command-line option to enable horovod (erwulff)
afa0c8f feat: Use comet offline logging in Ray Tune runs (erwulff)
392954b fix: bug in raytune command (erwulff)
d961a15 fix: handle TF version-dependent names of the legacy optimizer (erwulff)
f2effd5 feat: add event and met losses to raytune search space (erwulff)
db774dc feat: added sbatch script for Horovod training on JURECA (erwulff)
3ac1094 fix: Ray Tune checkpoint saving and loading (erwulff)
b42c462 feat: allow lr schedules when not using PCGrad (erwulff)
c74ecc2 chore: add print of loaded opt weights (erwulff)
9efab4a fix: handle TF version-dependent names of the legacy optimizer (erwulff)
@@ -0,0 +1,62 @@

    #!/bin/sh

    #SBATCH --account=raise-ctp2
    #SBATCH --partition=dc-gpu
    #SBATCH --time 24:00:00
    #SBATCH --nodes 1
    #SBATCH --tasks-per-node=1
    #SBATCH --gres=gpu:4

    # Job name
    #SBATCH -J pipehorovod

    # Output and error logs
    #SBATCH -o logs_slurm/log_%x_%j.out
    #SBATCH -e logs_slurm/log_%x_%j.err

    # Add jobscript to job output
    echo "#################### Job submission script. #############################"
    cat $0
    echo "################# End of job submission script. #########################"


    module --force purge
    module load Stages/2022
    module load GCC GCCcore/.11.2.0 CMake NCCL CUDA cuDNN OpenMPI

    export CUDA_VISIBLE_DEVICES=0,1,2,3
    jutil env activate -p raise-ctp2

    sleep 1
    nvidia-smi

    source /p/project/raise-ctp2/cern/miniconda3/bin/activate tf2
    echo "Python used:"
    which python3
    python3 --version


    echo "DEBUG: SLURM_JOB_ID: $SLURM_JOB_ID"
    echo "DEBUG: SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST"
    echo "DEBUG: SLURM_NNODES: $SLURM_NNODES"
    echo "DEBUG: SLURM_NTASKS: $SLURM_NTASKS"
    echo "DEBUG: SLURM_TASKS_PER_NODE: $SLURM_TASKS_PER_NODE"
    echo "DEBUG: SLURM_SUBMIT_HOST: $SLURM_SUBMIT_HOST"
    echo "DEBUG: SLURMD_NODENAME: $SLURMD_NODENAME"
    echo "DEBUG: SLURM_NODEID: $SLURM_NODEID"
    echo "DEBUG: SLURM_LOCALID: $SLURM_LOCALID"
    echo "DEBUG: SLURM_PROCID: $SLURM_PROCID"
    echo "DEBUG: CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"


    export NCCL_DEBUG=INFO
    export OMP_NUM_THREADS=1
    if [ "${SLURM_CPUS_PER_TASK:-0}" -gt 0 ] ; then
    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    fi
    echo $OMP_NUM_THREADS


    echo 'Starting training.'
    srun --cpu-bind=none python mlpf/pipeline.py train -c $1 -p $2 --comet-offline -j $SLURM_JOBID -m
    echo 'Training done.'
why did you change this? I think it was correct.
My thinking is the following. Let's say `config["setup"]["num_epochs"]` is 100 and we resume an interrupted training from epoch 20. Then `initial_epoch` will be 20 and `initial_epoch + config["setup"]["num_epochs"]` will be 120, right? I think it's more intuitive that `config["setup"]["num_epochs"]` should be the total number of epochs to run before completing the training, rather than the additional number of epochs to run from the resumed point. This is a matter of taste I suppose. What do you think?
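
To make the two readings concrete, here is a minimal sketch of the arithmetic from the comment above. The helper `final_epoch` and its `num_epochs_is_total` flag are hypothetical names introduced only for illustration; they are not taken from this PR. The sketch also assumes the returned value is what would be passed as the `epochs` argument of Keras `Model.fit`, which treats `epochs` as the index of the final epoch when `initial_epoch` is set.

```python
def final_epoch(initial_epoch: int, num_epochs: int, num_epochs_is_total: bool) -> int:
    """Return the epoch index at which training should stop.

    Hypothetical helper, not part of the PR.
    num_epochs_is_total=True  -> num_epochs is the overall training length.
    num_epochs_is_total=False -> num_epochs is counted from the resume point.
    """
    if num_epochs_is_total:
        return num_epochs
    return initial_epoch + num_epochs


# Values from the example in the comment: 100 configured epochs, resumed at epoch 20.
config = {"setup": {"num_epochs": 100}}
initial_epoch = 20

total = final_epoch(initial_epoch, config["setup"]["num_epochs"], True)
additional = final_epoch(initial_epoch, config["setup"]["num_epochs"], False)

# Stop index and number of epochs actually run after resuming, for each reading.
print(total, total - initial_epoch)            # 100 80
print(additional, additional - initial_epoch)  # 120 100
```

Under the "total" reading a resumed run finishes at epoch 100 regardless of where it was interrupted, whereas the "additional" reading makes the end point depend on the resume point.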