
pytorch backend major update #240

Merged
merged 105 commits into from
Oct 20, 2023

Conversation

farakiko
Collaborator

@farakiko farakiko commented Oct 16, 2023

Checks and updates to the PyTorch code:

  • Fix the jet reconstruction and inference code
  • Run multiple validation passes within a single epoch, to avoid waiting for a full epoch before validating
  • Avoid moving the loss values to the CPU (.cpu()) after every training step
  • Make --num-workers configurable from the args
  • Other formatting fixes
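
The loss-accumulation point above can be sketched as follows. This is a hypothetical minimal example (the function and model names are illustrative, not from this PR): the running loss stays on the training device and is transferred to the CPU only once per epoch, instead of forcing a GPU-to-CPU sync with .cpu() or .item() at every step.

```python
# Sketch: accumulate the loss on-device; one CPU transfer per epoch.
import torch

def train_one_epoch(model, loader, optimizer, device="cpu"):
    epoch_loss = torch.zeros(1, device=device)  # lives on the device
    for x, y in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x.to(device)), y.to(device))
        loss.backward()
        optimizer.step()
        epoch_loss += loss.detach()  # no .cpu()/.item() per step
    # single device-to-host transfer at the end of the epoch
    return (epoch_loss / max(len(loader), 1)).item()
```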

The PR was tested by:

  1. Training on 100Top and 100QCD events for 300 epochs on 4 NVIDIA GeForce GTX 1080 Ti GPUs using the following command (ETA ~90 min):
    python mlpf/pyg_pipeline.py --dataset cms --config parameters/pyg-cms-small.yaml --gpus "0,1,2,3" --prefix /pfvol/experiments/MLPF_cms_small_PR_ --conv-type gravnet --num-epochs 300 --train --ntrain 100 --nvalid 100 --gpu-batch-multiplier 5

  2. Testing on 500 high-pT QCD events on 4 NVIDIA GeForce GTX 1080 Ti GPUs using the following command (ETA ~18 min):
    python mlpf/pyg_pipeline.py --dataset cms --config parameters/pyg-cms-test-qcdhighpt.yaml --gpus "0,1,2,3" --load "" --test --ntest 500 --make-plots --gpu-batch-multiplier 5

Results are shown below:
mlpf_loss_Total.pdf
jet_res.pdf
met_res.pdf

A few notes for the next update regarding multi-GPU training:

  • Must fix num-workers>0 for more than one GPU (at the moment this runs into an error)
  • Broadcast stale_epochs to all GPUs to stop the training cleanly (at the moment the code hangs when stale_epochs>patience)
  • Take a careful look at the dist.barrier() invocations during training to avoid bottlenecks
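
The early-stopping hang mentioned above happens when only rank 0 decides to stop while the other ranks keep training. One possible fix (a hypothetical sketch, not code from this PR; the function name is illustrative) is to broadcast the stop decision from rank 0 so all ranks leave the loop together:

```python
# Sketch: rank 0 decides whether to stop; the decision is broadcast
# so every rank exits the training loop on the same epoch.
import torch
import torch.distributed as dist

def should_stop(stale_epochs, patience, world_size=1):
    stop = torch.tensor([int(stale_epochs > patience)])
    if world_size > 1 and dist.is_initialized():
        # all ranks receive rank 0's decision
        dist.broadcast(stop, src=0)
    return bool(stop.item())
```

In single-process mode the broadcast is skipped, so the same helper works with and without DDP.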

@farakiko farakiko changed the title from "Pyg" to "Pyg final major update" Oct 18, 2023
@jpata jpata requested a review from erwulff October 19, 2023 15:29
@jpata
Owner

jpata commented Oct 19, 2023

@erwulff could you take a look at the PR also?

@jpata jpata changed the title from "Pyg final major update" to "pytorch backend major update" Oct 20, 2023
@jpata jpata merged commit 36d8584 into jpata:main Oct 20, 2023
10 checks passed
farakiko added a commit to farakiko/particleflow that referenced this pull request Jan 23, 2024
* best configs

* fix jetdef

* fix val loss high values

* fix loss on cpu bottleneck

* add num-workers and prefetch factors to args