
Provide an efficient inference implementation using sparsification/quantization #206

Closed
jpata opened this issue Sep 14, 2023 · 3 comments
Labels: enhancement, hard


jpata commented Sep 14, 2023

Goal: reduce inference time of the model using quantization

We made some CPU inference performance results for CMS public in 2021 (https://cds.cern.ch/record/2792320/files/DP2021_030.pdf, slide 16): “For context, on a single CPU thread (Intel i7-10700 @ 2.9GHz), the baseline PF requires approximately (9 ± 5) ms, the MLPF model approximately (320 ± 50) ms for Run 3 ttbar MC events.”

Now is a good time to make the inference as fast as possible, while minimizing any impact on the physics performance.
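For context, here is a minimal sketch of how such a single-thread CPU latency number could be measured in PyTorch; the toy model and input shape are placeholders, not the actual MLPF model:

```python
import time
import torch
import torch.nn as nn

torch.set_num_threads(1)  # restrict inference to a single CPU thread

# Stand-in model and batch; replace with the real model and event inputs.
model = nn.Sequential(nn.Linear(17, 256), nn.ReLU(), nn.Linear(256, 8)).eval()
batch = torch.randn(5000, 17)  # hypothetical event with ~5000 input elements

with torch.no_grad():
    for _ in range(3):  # warm-up iterations, not timed
        model(batch)
    times = []
    for _ in range(20):
        t0 = time.perf_counter()
        model(batch)
        times.append(time.perf_counter() - t0)

print(f"median latency: {1e3 * sorted(times)[len(times) // 2]:.1f} ms")
```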

Resources:

jpata changed the title from "Provide an efficient inference implementation using sparsification/quantization" to "Provide an efficient GNN inference implementation using sparsification/quantization" Sep 14, 2023
jpata changed the title from "Provide an efficient GNN inference implementation using sparsification/quantization" to "Provide an efficient GNN inference implementation using sparsification/quantization with ONNX" Sep 29, 2023
jpata commented Sep 29, 2023

adding @raj2022

jpata added the hard and enhancement labels Oct 12, 2023
jpata changed the title from "Provide an efficient GNN inference implementation using sparsification/quantization with ONNX" to "Provide an efficient inference implementation using sparsification/quantization" Apr 11, 2024
jpata commented Apr 30, 2024

Also related: #315

jpata commented May 27, 2024

Basically, to summarize:

  • with @raj2022 we saw that it is possible to quantize the model to int8 in pytorch using post-training static quantization, following the recipe in https://github.com/jpata/particleflow/blob/main/notebooks/clic/mlpf-pytorch-transformer-standalone.ipynb (a minimal sketch is included after this list)
  • the important ingredients were a custom attention layer (in the notebook) and per-feature quantization stubs
  • we also showed that using only ReLU activations it is possible to train a very performant model, which by itself reduces the compute budget
  • however, the exported int8 model was not faster on either CPU or GPU
  • this most likely requires a more informed approach to make sure the int8 attention is actually computed using efficient ops on the hardware
  • the summary notebook was added in #297 (normalize loss, reparametrize network)
  • ONNX may be a better path for performant quantization in the end, but this requires more study
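For reference, a minimal sketch of eager-mode post-training static quantization in PyTorch, in the spirit of the notebook recipe above. The toy model, dimensions, and backend choice are illustrative placeholders, not the actual MLPF model or its custom attention layer, and a single input stub is shown instead of per-feature stubs to keep the example short:

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self, in_dim=17, hidden=64):
        super().__init__()
        # Quantization stubs mark where tensors enter and leave the int8 domain.
        self.quant = torch.ao.quantization.QuantStub()
        self.dequant = torch.ao.quantization.DeQuantStub()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),  # ReLU-only activations quantize cleanly to int8
            nn.Linear(hidden, hidden),
            nn.ReLU(),
        )

    def forward(self, x):
        x = self.quant(x)       # fp32 -> int8 at the model boundary
        x = self.net(x)
        return self.dequant(x)  # back to fp32 for downstream consumers

model = TinyModel().eval()
model.qconfig = torch.ao.quantization.get_default_qconfig("fbgemm")
prepared = torch.ao.quantization.prepare(model)

# Calibration pass with representative inputs so the observers can pick
# activation scales and zero points.
with torch.no_grad():
    for _ in range(16):
        prepared(torch.randn(256, 17))

quantized = torch.ao.quantization.convert(prepared)  # int8 weights/activations
print(quantized)
```

This mirrors the generic eager-mode flow; the hard part in practice is making the attention block map onto efficient quantized kernels, which is why the notebook replaces it with a quantization-friendly implementation.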

I'm closing this issue, and putting it on the roadmap to study ONNX post-training static quantization separately.
Many thanks to @raj2022 for your contributions!
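For the roadmap item, a hedged sketch of what ONNX Runtime post-training static quantization could look like; the file names and the random calibration reader are hypothetical placeholders, not part of this repository:

```python
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader,
    QuantType,
    quantize_static,
)

class RandomCalibrationReader(CalibrationDataReader):
    """Feeds a few random batches as calibration data (placeholder only)."""

    def __init__(self, input_name="input", n_batches=8):
        self.data = iter(
            [{input_name: np.random.randn(1, 5000, 17).astype(np.float32)}
             for _ in range(n_batches)]
        )

    def get_next(self):
        return next(self.data, None)

quantize_static(
    model_input="mlpf_fp32.onnx",   # hypothetical exported fp32 model
    model_output="mlpf_int8.onnx",
    calibration_data_reader=RandomCalibrationReader(),
    weight_type=QuantType.QInt8,
    activation_type=QuantType.QUInt8,
)
```

In a real study the calibration reader would iterate over representative MLPF events rather than random tensors, and the resulting int8 model would be benchmarked against the fp32 export.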

jpata closed this as completed May 27, 2024
jpata assigned jpata and unassigned jpata May 27, 2024