Add multiclass_nms3 GPU kernel #52401

Tom-Zheng · 2023-03-31T06:49:11Z

PR types

New features

PR changes

OPs

Description

This PR adds a GPU kernel for the multiclass_nms3 op, which greatly speeds up evaluation of detection models.

We benchmarked PP-YOLOE+ evaluation with the ppyoloe_plus_crn_l_80e_coco.yml config.
Setting: A100-PCIE-80GB; batch_size=32; evaluation size = 640 x 640.
Problem size of the NMS op: bbox shape [32, 8400, 4]; scores shape [32, 80, 8400]
Other parameters:

'nms_top_k': 1000,
'keep_top_k': 300,
'score_threshold': 0.01,
'nms_threshold': 0.7

Benchmark result:
NMS op time: 2295 ms (CPU) -> 0.267 ms (GPU); speedup: about 8595.5x (2295 / 0.267)
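For readers unfamiliar with how the four parameters above interact, here is a minimal pure-Python sketch of per-class greedy NMS for a single image. It is not the kernel added in this PR; the function names are hypothetical, and details such as background-label handling, box normalization, and adaptive thresholds are deliberately left out.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def multiclass_nms_reference(boxes, scores, score_threshold=0.01,
                             nms_top_k=1000, keep_top_k=300, nms_threshold=0.7):
    """boxes: list of M boxes; scores: list of C lists of M scores (one image).

    Returns (class_id, score, box_index) tuples, highest score first.
    """
    kept = []
    for cls, cls_scores in enumerate(scores):
        # 1. Drop low-confidence candidates, then keep the nms_top_k best.
        cand = [i for i, s in enumerate(cls_scores) if s > score_threshold]
        cand.sort(key=lambda i: cls_scores[i], reverse=True)
        cand = cand[:nms_top_k]
        # 2. Greedy suppression within this class.
        selected = []
        for i in cand:
            if all(iou(boxes[i], boxes[j]) <= nms_threshold for j in selected):
                selected.append(i)
        kept.extend((cls, cls_scores[i], i) for i in selected)
    # 3. Across all classes, keep the keep_top_k highest-scoring detections.
    kept.sort(key=lambda d: d[1], reverse=True)
    return kept[:keep_top_k]


# Two heavily overlapping boxes (IoU = 81/119) and one separate box, one class.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [[0.9, 0.8, 0.7]]
strict = multiclass_nms_reference(boxes, scores, nms_threshold=0.5)
# strict == [(0, 0.9, 0), (0, 0.7, 2)]: box 1 is suppressed by box 0.
```

With the benchmark's shapes, this loop would run over 80 classes and 8400 boxes for each of the 32 images, which is the work the GPU kernel parallelizes.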

paddle-bot · 2023-03-31T06:49:15Z

Your PR has been submitted. Thanks for your contribution to this open-source project!
Please wait for the CI results first. See the Paddle CI Manual for details.

paddle-ci-bot · 2023-04-24T03:25:27Z

Sorry to inform you that more than 7 days have passed since the CIs for 2c8891f passed. To prevent PR conflicts, you need to re-run all CIs manually.

paddle-ci-bot · 2023-05-05T03:26:50Z

Sorry to inform you that more than 7 days have passed since the CIs for 82779c7 passed. To prevent PR conflicts, you need to re-run all CIs manually.

shaojiewang

Could you share some test cases with performance numbers?

shaojiewang · 2023-05-05T05:36:38Z

paddle/phi/kernels/gpu/multiclass_nms3_kernel.cu

+      nmsedBoxes[i * 4 + 3] = clipBoxes ? saturate(yMax) : yMax;
+      nmsedIndices[i] = bboxId >> 2;
+      nmsedValidMask[i] = 1;
+      atomicAdd(&numDetections[i / keepTopK], 1);


TODO: we may also need a deterministic version.

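The atomicAdd is performed on an integer counter, so there is no determinism issue: integer addition is associative, so the final count is the same regardless of the order in which threads execute.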
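To illustrate the reasoning, here is a small Python sketch (not CUDA, and not code from this PR) that accumulates the same values in every possible order. Integer increments always give one total, while floating-point values can give different totals depending on order, which is the usual source of non-determinism with atomic adds on floats. The helper name `accumulate` is made up for this example.

```python
import itertools


def accumulate(values, start):
    """Add values one at a time, in the given order."""
    total = start
    for v in values:
        total += v
    return total


# Integer increments, like atomicAdd(&counter, 1): every order gives 4.
int_increments = [1, 1, 1, 1]
int_totals = {accumulate(p, 0) for p in itertools.permutations(int_increments)}

# Floating-point values: rounding makes the result depend on the order.
float_values = [1e16, 1.0, -1e16]
float_totals = {accumulate(p, 0.0) for p in itertools.permutations(float_values)}

print(int_totals)    # a single total
print(float_totals)  # more than one distinct total
```

Note that this only covers the final value of the counter. If the value returned by an atomic add were used to pick an output slot, the slot assignment would still depend on thread order; the diff lines above do not use the return value.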

We benchmarked PP-YOLOE+ evaluation with the ppyoloe_plus_crn_l_80e_coco.yml config.
Setting: A100-PCIE-80GB; batch_size=32; evaluation size = 640 x 640.
Problem size of the NMS op: bbox shape [32, 8400, 4]; scores shape [32, 80, 8400]
Other parameters:

'nms_top_k': 1000,
'keep_top_k': 300,
'score_threshold': 0.01,
'nms_threshold': 0.7

Benchmark result:
NMS op time: 2295 ms (CPU) -> 0.267 ms (GPU); speedup: about 8595.5x (2295 / 0.267)

The atomicAdd is performed on integer, so there is no determinism issue.

sure, got it.

Could you put these benchmark results into the PR description?

shaojiewang

lgtm

python/paddle/fluid/tests/unittests/test_multiclass_nms_op.py

Xreki · 2023-05-06T03:29:47Z

paddle/fluid/operators/detection/multiclass_nms_op.cc

@@ -614,6 +614,13 @@ class MultiClassNMS3Op : public MultiClassNMS2Op {
                   const framework::VariableNameMap& outputs,
                   const framework::AttributeMap& attrs)
      : MultiClassNMS2Op(type, inputs, outputs, attrs) {}
+
+ protected:
+  phi::KernelKey GetExpectedKernelType(


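Under the phi framework, for how to specify the data type used for kernel selection, see https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/phi/api/yaml/legacy_ops.yaml#L129

This is because

Paddle/paddle/fluid/operators/detection/multiclass_nms_op.cc

Line 119 in 82779c7

platform::CPUPlace());

hard-codes the CPU kernel as the return value, so the override here lets the op support the GPU kernel. This seems unrelated to https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/phi/api/yaml/legacy_ops.yaml#L129?

The kernel under the phi directory is multiclass_nms3. Overriding GetExpectedKernelType for multiclass_nms3 here also serves to specify which input's data type is used to select the kernel.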
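The exchange above can be pictured with a toy Python model of kernel selection. This is only an illustration: the classes, the `KernelKey` record, the registry, and the choice of `Scores` as the deciding input are all assumptions for the sketch, not Paddle's C++ API.

```python
# Toy model only: every name here is hypothetical, not Paddle's API.
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelKey:
    place: str  # "cpu" or "gpu"
    dtype: str  # e.g. "float32"


class NMS2Op:
    def expected_kernel_key(self, input_dtypes, ctx_place):
        # Mimics a hard-coded CPU place: the requested place is ignored,
        # so a GPU kernel can never be selected.
        return KernelKey("cpu", input_dtypes["Scores"])


class NMS3Op(NMS2Op):
    def expected_kernel_key(self, input_dtypes, ctx_place):
        # Override: honor the execution place, and state explicitly which
        # input's dtype drives kernel selection.
        return KernelKey(ctx_place, input_dtypes["Scores"])


REGISTERED = {
    KernelKey("cpu", "float32"): "multiclass_nms3 CPU kernel",
    KernelKey("gpu", "float32"): "multiclass_nms3 GPU kernel",
}

dtypes = {"BBoxes": "float32", "Scores": "float32"}
base_choice = REGISTERED[NMS2Op().expected_kernel_key(dtypes, "gpu")]
override_choice = REGISTERED[NMS3Op().expected_kernel_key(dtypes, "gpu")]
print(base_choice)      # CPU kernel, even though "gpu" was requested
print(override_choice)  # GPU kernel
```

In this model both points raised in the thread show up in the same method: the override decides the place, and it also decides which input's dtype is used.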

Xreki · 2023-05-06T05:05:28Z

paddle/phi/kernels/gpu/multiclass_nms3_kernel.cu

+  index->Resize({valid_samples, 1});
+  ctx.template Alloc<int>(index);
+  phi::funcs::GPUGatherNd<int, int64_t>(
+      ctx, nmsed_indices, valid_indices, index);


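This function is 197 lines long, which hurts readability. Please consider factoring it into smaller helpers.

Most of it is actually input preparation and output post-processing; it looks long because there are many parameters, so I don't think it can easily be encapsulated further. I added some comments. Please check whether that is acceptable.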
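As a rough picture of the output post-processing mentioned here, the following Python sketch compacts padded per-image results into a dense index list, in the spirit of the valid mask, per-image counter, and gather visible in the diff excerpts. It is an assumption-laden illustration: the helper names and the -1 padding value are invented, and the real kernel operates on GPU tensors.

```python
def compact_detections(nmsed_indices, valid_mask):
    """Gather entries whose valid flag is 1, preserving order.

    Returns a list of one-element rows, i.e. shape [valid_samples, 1].
    """
    valid_positions = [pos for pos, flag in enumerate(valid_mask) if flag == 1]
    return [[nmsed_indices[pos]] for pos in valid_positions]


def detections_per_image(valid_mask, keep_top_k):
    """Count valid slots per image; slot i belongs to image i // keep_top_k."""
    counts = [0] * (len(valid_mask) // keep_top_k)
    for pos, flag in enumerate(valid_mask):
        if flag == 1:
            counts[pos // keep_top_k] += 1
    return counts


# Two images, keep_top_k = 3; -1 marks unused (padded) slots in this sketch.
nmsed_indices = [7, 2, -1, 5, -1, -1]
valid_mask = [1, 1, 0, 1, 0, 0]
index = compact_detections(nmsed_indices, valid_mask)    # [[7], [2], [5]]
counts = detections_per_image(valid_mask, keep_top_k=3)  # [2, 1]
```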

paddle-bot · 2023-05-12T06:53:38Z

Sorry to inform you that, after our discussion, your PR does not yet meet the merging standard (reference: Paddle Custom Operator Design Doc). We are closing this PR for now; you can submit a new one. Thank you for your contribution.

lyuwenyu

lgtm

Xreki

LGTM

qili93

LGTM for @unittest.skipIf

jerrywgz

LGTM

Tom-Zheng · 2023-05-22T02:16:00Z

@XiaoguangHu01 Would you please review this PR?

XiaoguangHu01

LGTM

Tom-Zheng · 2023-05-22T02:57:49Z

@Xreki I think we are OK to merge here.

* Add GPU kernel for multiclass_nms3 op
* Make multiclass_nms3 gpu kernel output consistent with cpu kernel
* Fix API incompatibility
* Fix unittests on builds without CUDA
* Fix ROCM build
* Remove fluid headers; Use default atol for unittest
* Change function and variable naming
* Add comments; Reduce redundant code
* Use paddle test framework

paddle-bot bot added contributor External developers status: proposed labels Mar 31, 2023

Tom-Zheng added NVIDIA and removed contributor External developers status: proposed labels Mar 31, 2023

Tom-Zheng requested a review from lyuwenyu March 31, 2023 06:49

paddle-bot bot added the contributor External developers label Mar 31, 2023

onecatcn assigned lyuwenyu Apr 3, 2023

lyuwenyu previously approved these changes Apr 5, 2023

onecatcn assigned Xreki and unassigned lyuwenyu Apr 6, 2023

Tom-Zheng dismissed lyuwenyu’s stale review via ccf873c April 11, 2023 03:41

Tom-Zheng force-pushed the tizheng/add_nms3_gpu_kernel_pr branch from 8ffd9d3 to 78255fa Compare April 13, 2023 09:52

Tom-Zheng added 8 commits April 27, 2023 03:40

Add GPU kernel for multiclass_nms3 op

29c50c5

Make multiclass_nms3 gpu kernel output consistent with cpu kernel

24b34d8

Fix API incompatibility

ca3bed4

Fix unittests on builds without CUDA

edff3ab

Fix ROCM build

4d36648

Fix ROCM build: cmake

1a809bb

Fix CI build error

6281bc0

Fix ROCM build

82779c7

Tom-Zheng force-pushed the tizheng/add_nms3_gpu_kernel_pr branch from 2c8891f to 82779c7 Compare April 27, 2023 03:40

shaojiewang reviewed May 5, 2023

Tom-Zheng added the status: proposed label May 5, 2023

shaojiewang previously approved these changes May 5, 2023

Xreki reviewed May 6, 2023

Remove fluid headers; Use default atol for unittest

ce3ac88

Tom-Zheng dismissed shaojiewang’s stale review via ce3ac88 May 6, 2023 06:40

Tom-Zheng added 2 commits May 6, 2023 08:05

Change function and variable naming

2299b13

Add comments; Reduce redundant code

fb2a9fa

Tom-Zheng closed this May 12, 2023

Tom-Zheng reopened this May 12, 2023

paddle-bot bot added status: not progressed and removed status: proposed labels May 12, 2023

Tom-Zheng added 2 commits May 16, 2023 14:19

Undo removing CPU kernel to impl.h

9049282

Use paddle test framework

2b0d8b8

lyuwenyu approved these changes May 18, 2023

Xreki approved these changes May 19, 2023

qili93 approved these changes May 19, 2023

jerrywgz approved these changes May 22, 2023

XiaoguangHu01 approved these changes May 22, 2023

Xreki merged commit f71c805 into PaddlePaddle:develop May 22, 2023

This was referenced May 22, 2023

fix infer CE PaddlePaddle/continuous_integration#277

Closed

fix infer ce PaddlePaddle/continuous_integration#278

Merged

jeng1220 mentioned this pull request May 25, 2023

Several important fixes need to be cherry-picked to the release/2.5 branch #54100

Closed

8 tasks
