Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add multiclass_nms3 GPU kernel #52401

Merged
merged 13 commits into from
May 22, 2023

Conversation

Tom-Zheng
Copy link
Contributor

@Tom-Zheng Tom-Zheng commented Mar 31, 2023

PR types

New features

PR changes

OPs

Description

In this PR, we add a GPU kernel for multiclass_nms3 op, which could greatly speed up model evaluation for detection models.

We benchmarked in PP-YOLOE+ evaluation, with ppyoloe_plus_crn_l_80e_coco.yml config.
Setting: A100-PCIE-80GB; batch_size=32; evaluate size = 640 x 640.
Problem size of NMS OP: shape of bbox: [32, 8400, 4]; shape of scores: [32, 80, 8400]
Other parameters:

  • 'nms_top_k': 1000,
  • 'keep_top_k': 300,
  • 'score_threshold': 0.01,
  • 'nms_threshold': 0.7

Benchmark result:
NMS OP time: 2295 ms (CPU) -> 0.267 ms (GPU) ; speedup: 8595.5x

@paddle-bot
Copy link

paddle-bot bot commented Mar 31, 2023

你的PR提交成功,感谢你对开源项目的贡献!
请关注后续CI自动化测试结果,详情请参考Paddle-CI手册
Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@paddle-bot paddle-bot bot added contributor External developers status: proposed labels Mar 31, 2023
@Tom-Zheng Tom-Zheng added NVIDIA and removed contributor External developers status: proposed labels Mar 31, 2023
@Tom-Zheng Tom-Zheng requested a review from lyuwenyu March 31, 2023 06:49
@paddle-bot paddle-bot bot added the contributor External developers label Mar 31, 2023
lyuwenyu
lyuwenyu previously approved these changes Apr 5, 2023
@paddle-ci-bot
Copy link

paddle-ci-bot bot commented Apr 24, 2023

Sorry to inform you that 2c8891f's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

@Tom-Zheng Tom-Zheng force-pushed the tizheng/add_nms3_gpu_kernel_pr branch from 2c8891f to 82779c7 Compare April 27, 2023 03:40
@paddle-ci-bot
Copy link

paddle-ci-bot bot commented May 5, 2023

Sorry to inform you that 82779c7's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

Copy link
Contributor

@shaojiewang shaojiewang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you demonstrate some test cases with perf number?

nmsedBoxes[i * 4 + 3] = clipBoxes ? saturate(yMax) : yMax;
nmsedIndices[i] = bboxId >> 2;
nmsedValidMask[i] = 1;
atomicAdd(&numDetections[i / keepTopK], 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: we may also need a deterministic version.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The atomicAdd is performed on integer, so there is no determinism issue.

Copy link
Contributor Author

@Tom-Zheng Tom-Zheng May 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We benchmarked in PP-YOLOE+ evaluation, with ppyoloe_plus_crn_l_80e_coco.yml config.
Setting: A100-PCIE-80GB; batch_size=32; evaluate size = 640 x 640.
Problem size of NMS OP: shape of bbox: [32, 8400, 4]; shape of scores: [32, 80, 8400]
Other parameters:

  • 'nms_top_k': 1000,
  • 'keep_top_k': 300,
  • 'score_threshold': 0.01,
  • 'nms_threshold': 0.7

Benchmark result:
NMS OP time: 2295 ms (CPU) -> 0.267 ms (GPU) ; speedup: 8595.5x

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The atomicAdd is performed on integer, so there is no determinism issue.

sure, got it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We benchmarked in PP-YOLOE+ evaluation, with ppyoloe_plus_crn_l_80e_coco.yml config. Setting: A100-PCIE-80GB; batch_size=32; evaluate size = 640 x 640. Problem size of NMS OP: shape of bbox: [32, 8400, 4]; shape of scores: [32, 80, 8400] Other parameters:

  • 'nms_top_k': 1000,
  • 'keep_top_k': 300,
  • 'score_threshold': 0.01,
  • 'nms_threshold': 0.7

Benchmark result: NMS OP time: 2295 ms (CPU) -> 0.267 ms (GPU) ; speedup: 8595.5x

could you put it into PR description?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

shaojiewang
shaojiewang previously approved these changes May 5, 2023
Copy link
Contributor

@shaojiewang shaojiewang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@@ -614,6 +614,13 @@ class MultiClassNMS3Op : public MultiClassNMS2Op {
const framework::VariableNameMap& outputs,
const framework::AttributeMap& attrs)
: MultiClassNMS2Op(type, inputs, outputs, attrs) {}

protected:
phi::KernelKey GetExpectedKernelType(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

phi体系下,指定Kernel选择的数据类型方式,可参考https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/phi/api/yaml/legacy_ops.yaml#L129

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里是因为

写死了返回CPU kernel, 所以这里overload让它支持GPU kernel. 跟https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/phi/api/yaml/legacy_ops.yaml#L129 似乎无关?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

phi目录下的kernel是multiclass_nms3,这里重写multiclass_nms3的GetExpectedKernelType,也是为了指定依据哪个输入的数据类型来选Kernel。

paddle/phi/kernels/gpu/multiclass_nms3_kernel.cu Outdated Show resolved Hide resolved
paddle/phi/kernels/gpu/multiclass_nms3_kernel.cu Outdated Show resolved Hide resolved
paddle/phi/kernels/gpu/multiclass_nms3_kernel.cu Outdated Show resolved Hide resolved
paddle/phi/kernels/gpu/multiclass_nms3_kernel.cu Outdated Show resolved Hide resolved
paddle/phi/kernels/gpu/multiclass_nms3_kernel.cu Outdated Show resolved Hide resolved
paddle/phi/kernels/gpu/multiclass_nms3_kernel.cu Outdated Show resolved Hide resolved
index->Resize({valid_samples, 1});
ctx.template Alloc<int>(index);
phi::funcs::GPUGatherNd<int, int64_t>(
ctx, nmsed_indices, valid_indices, index);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个函数有197行代码,影响阅读,请考虑下进一步封装

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

其实大多数都是在做输入的准备以及输出的后处理,参数比较多所以显得长,我觉得不太好再封装了。我加了一些注释,请看是否可以。

@paddle-bot
Copy link

paddle-bot bot commented May 12, 2023

很抱歉,经过我们的反复讨论,你的PR暂未达到合入标准,请阅读飞桨原生算子开发规范,你可以重新提交新的PR,我们先将此PR关闭,感谢你的贡献。
Sorry to inform you that through our discussion, your PR fails to meet the merging standard (Reference: Paddle Custom Operator Design Doc). You can also submit an new one. Thank you.

Copy link
Contributor

@lyuwenyu lyuwenyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Contributor

@Xreki Xreki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@qili93 qili93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for @unittest.skipIf

Copy link
Contributor

@jerrywgz jerrywgz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Tom-Zheng
Copy link
Contributor Author

@XiaoguangHu01 Would you please review this PR?

Copy link
Contributor

@XiaoguangHu01 XiaoguangHu01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Tom-Zheng
Copy link
Contributor Author

@Xreki I think we are OK to merge here.

@Xreki Xreki merged commit f71c805 into PaddlePaddle:develop May 22, 2023
bukejiyu pushed a commit to bukejiyu/Paddle that referenced this pull request May 22, 2023
* Add GPU kernel for multiclass_nms3 op

* Make multiclass_nms3 gpu kernel output consistent with cpu kernel

* Fix API incompatibility

* Fix unittests on builds without CUDA

* Fix ROCM build

* Remove fluid headers; Use default atol for unittest

* Change function and variable naming

* Add comments; Reduce redundant code

* Use paddle test framework
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
contributor External developers NVIDIA
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants