
pytorch backend major update #240

Merged
merged 105 commits into from
Oct 20, 2023

Conversation

farakiko
Collaborator

@farakiko farakiko commented Oct 16, 2023

Checks and updates to the PyTorch code:

  • Fix the jet reconstruction and inference code
  • Run multiple validation passes within a single epoch, to avoid waiting for a full epoch before validating
  • Avoid moving the loss values to the CPU (.cpu()) after every training step
  • Make --num-workers configurable from the args
  • Other formatting fixes
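
The loss-accumulation point above can be sketched as follows. This is a hypothetical minimal example (the function and model names are illustrative, not from this PR): the running loss stays on the training device and is transferred to the CPU only once per epoch, instead of forcing a GPU-to-CPU sync with .cpu() or .item() at every step.

```python
# Sketch: accumulate the loss on-device; one CPU transfer per epoch.
import torch

def train_one_epoch(model, loader, optimizer, device="cpu"):
    epoch_loss = torch.zeros(1, device=device)  # lives on the device
    for x, y in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x.to(device)), y.to(device))
        loss.backward()
        optimizer.step()
        epoch_loss += loss.detach()  # no .cpu()/.item() per step
    # single device-to-host transfer at the end of the epoch
    return (epoch_loss / max(len(loader), 1)).item()
```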

The PR was tested by:

  1. Training on 100Top and 100QCD events for 300 epochs on 4 NVIDIA GeForce GTX 1080 Ti GPUs using the following command (ETA ~90 min):
    python mlpf/pyg_pipeline.py --dataset cms --config parameters/pyg-cms-small.yaml --gpus "0,1,2,3" --prefix /pfvol/experiments/MLPF_cms_small_PR_ --conv-type gravnet --num-epochs 300 --train --ntrain 100 --nvalid 100 --gpu-batch-multiplier 5

  2. Testing on 500 high-pT QCD events on 4 NVIDIA GeForce GTX 1080 Ti GPUs using the following command (ETA ~18 min):
    python mlpf/pyg_pipeline.py --dataset cms --config parameters/pyg-cms-test-qcdhighpt.yaml --gpus "0,1,2,3" --load "" --test --ntest 500 --make-plots --gpu-batch-multiplier 5

Results are shown below:
mlpf_loss_Total.pdf
jet_res.pdf
met_res.pdf

A few notes for the next update regarding multi-GPU training:

  • Must fix num-workers>0 for more than one GPU (at the moment this runs into an error)
  • Broadcast stale_epochs to all GPUs to stop the training cleanly (at the moment the code hangs when stale_epochs>patience)
  • Take a careful look at the dist.barrier() invocations during training to avoid bottlenecks
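
The early-stopping hang mentioned above happens when only rank 0 decides to stop while the other ranks keep training. One possible fix (a hypothetical sketch, not code from this PR; the function name is illustrative) is to broadcast the stop decision from rank 0 so all ranks leave the loop together:

```python
# Sketch: rank 0 decides whether to stop; the decision is broadcast
# so every rank exits the training loop on the same epoch.
import torch
import torch.distributed as dist

def should_stop(stale_epochs, patience, world_size=1):
    stop = torch.tensor([int(stale_epochs > patience)])
    if world_size > 1 and dist.is_initialized():
        # all ranks receive rank 0's decision
        dist.broadcast(stop, src=0)
    return bool(stop.item())
```

In single-process mode the broadcast is skipped, so the same helper works with and without DDP.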

@farakiko farakiko changed the title from "Pyg" to "Pyg final major update" Oct 18, 2023
@jpata jpata requested a review from erwulff October 19, 2023 15:29
@jpata
Owner

jpata commented Oct 19, 2023

@erwulff could you take a look at the PR also?

@jpata jpata changed the title from "Pyg final major update" to "pytorch backend major update" Oct 20, 2023
@jpata jpata merged commit 36d8584 into jpata:main Oct 20, 2023
10 checks passed
farakiko added a commit to farakiko/particleflow that referenced this pull request Jan 23, 2024
* best configs

* fix jetdef

* fix val loss high values

* fix loss on cpu bottleneck

* add num-workers and prefetch factors to args