
Optimize where_index_op(prefix sum) #30601

Merged
merged 14 commits into PaddlePaddle:develop from thisjiang:optimize-whereindexop-prefix on Apr 26, 2021

Conversation

thisjiang (Contributor) commented Jan 20, 2021

PR types

Performance optimization

PR changes

OPs

Describe

Optimization of where_index_op using a prefix sum.

Optimization results:

| Change | ips |
| --- | --- |
| Original version | 5.55 steps/s |
| Optimization 1 | 5.67 steps/s |
| Optimization 2 | 5.68 steps/s |
| Optimization 3 | 5.71 steps/s |

Comparison with other frameworks (test data size [100, 100, 100], 100 iterations on a V100-SXM2-32GB machine):

| | tf.where | torch.nonzero | paddle.where (before optimization) | Optimization 0 (PR30556) | Optimization 1 | Optimization 2 | Optimization 3 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| cost | 2.704 s | 0.016 s | 1.959 s | 0.343 s | 0.032 s | 0.030 s | 0.025 s |

Readability improvements (d7fd2da)

Improved variable names and added several comments.

Optimization 3 (45442eb)

Optimization method:

  1. Use cub::DeviceScan::InclusiveSum from the cub library to replace the hand-written, imperfect prefix sum kernel.

Optimization points:

  1. Move all temporary-buffer alloc operations ahead of the kernel launches.
  2. Replace the hand-written KeScanPrefixSum kernel with cub::DeviceScan::InclusiveSum from the cub library, improving reliability (see the sketch after this list).
  3. Call out->Resize only once; one reason for choosing InclusiveSum over ExclusiveSum is that the last element of the inclusive scan directly gives true_num.
  4. Both KeGetTrueNum and KeSetTrueIndex now achieve coalesced memory access.
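
A minimal sketch of the two-phase cub call described in point 2, written from scratch for illustration; the buffer names and the raw cudaMalloc are placeholders, while the actual PR allocates the workspace through memory::Alloc and scans its true_num_array buffer:

```cuda
#include <cub/cub.cuh>

// In-place inclusive prefix sum over d_data (length numel) on `stream`.
// The first call with a null workspace pointer only reports the required
// temporary-storage size; the second call performs the scan. Afterwards,
// d_data[numel - 1] holds true_num, which is why InclusiveSum is used
// instead of ExclusiveSum.
void InclusiveSumInPlace(int64_t *d_data, int numel, cudaStream_t stream) {
  size_t temp_bytes = 0;
  cub::DeviceScan::InclusiveSum(nullptr, temp_bytes, d_data, d_data, numel,
                                stream);
  void *d_temp = nullptr;
  cudaMalloc(&d_temp, temp_bytes);  // the PR uses memory::Alloc(dev_ctx, ...)
  cub::DeviceScan::InclusiveSum(d_temp, temp_bytes, d_data, d_data, numel,
                                stream);
  cudaFree(d_temp);
}
```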

Optimization 2 (cde9c19)

Optimization method:

  1. Replace thrust::inclusive_scan with a hand-written prefix sum kernel.

Optimization points:

  1. Following the approach of the paper "Parallel Prefix Sum (Scan) with CUDA", wrote a custom prefix-sum kernel, avoiding the instability caused by running thrust::inclusive_scan on the default stream.
  2. Split the computation into 4 kernels (a simplified sketch of the per-block scan follows this list):
    • The first kernel sets each position to 0 or 1 depending on whether the condition is true;
    • The second kernel computes the prefix sum within each block, which already gives the position of each true_index within its block;
    • The third kernel computes the base_index to be added to each block's prefix sums;
    • The last kernel adds base_index to true_index and writes the result to the corresponding position in out.
      Although 4 kernels are launched here, each of them runs with many blocks, so this is much faster than a single kernel limited to 1 block.
  3. Use static shared memory to process BLOCKDIM * 2 elements per block, improving computational efficiency.
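
A simplified sketch of the per-block scan (the second kernel above). The actual commit follows the work-efficient scan from the paper and handles BLOCKDIM * 2 elements per block; this illustration uses the simpler Hillis-Steele scan with one element per thread, and all names are hypothetical:

```cuda
// Inclusive scan of `in` within each block; block_sums[b] receives the total
// of block b so a later kernel can turn it into the per-block base_index and
// add it back to these partial results.
template <int BLOCKDIM>
__global__ void BlockInclusiveScan(const int64_t *in, int64_t *out,
                                   int64_t *block_sums, int64_t numel) {
  __shared__ int64_t cache[BLOCKDIM];
  const int tid = threadIdx.x;
  const int64_t gid = static_cast<int64_t>(blockIdx.x) * BLOCKDIM + tid;

  cache[tid] = (gid < numel) ? in[gid] : 0;
  __syncthreads();

  // Hillis-Steele scan: each round adds the value `offset` slots to the left.
  for (int offset = 1; offset < BLOCKDIM; offset <<= 1) {
    int64_t val = (tid >= offset) ? cache[tid - offset] : 0;
    __syncthreads();
    cache[tid] += val;
    __syncthreads();
  }

  if (gid < numel) out[gid] = cache[tid];
  if (tid == BLOCKDIM - 1) block_sums[blockIdx.x] = cache[tid];
}
```

The third and fourth kernels described above would then scan block_sums and add the resulting base_index of each block to every partial result in that block.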

Optimization 1

Optimization method:

  1. Move the computation from the CPU to the GPU and avoid memory copies.

Optimization points:

  1. Split the original true_index computation into three steps, each step launching one kernel (a minimal sketch follows this list).
  2. The first kernel, KeGetTrueNum, checks whether each element of cond_data is true and sets the corresponding entry of true_num_array to 1 if so, otherwise 0.
  3. Use thrust::inclusive_scan to compute the prefix sum of true_num_array (the key optimization point), which yields the output index corresponding to each true element.
  4. The last kernel, KeSetTrueIndex, writes the corresponding index values into out_ptr.
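
A minimal sketch of this three-step pipeline, assuming pre-allocated device buffers and writing only flat indices to out_ptr (the actual kernel also converts each flat index into per-dimension coordinates via the strides); the kernel names mirror the description above, but the code is written from scratch for illustration:

```cuda
#include <thrust/execution_policy.h>
#include <thrust/scan.h>

// Step 1: mark each element of cond_data as 1 if true, 0 otherwise.
template <typename T>
__global__ void KeGetTrueNum(const T *cond_data, int64_t numel,
                             int64_t *true_num_array) {
  const int64_t tid = blockIdx.x * blockDim.x + threadIdx.x;
  for (int64_t i = tid; i < numel; i += gridDim.x * blockDim.x) {
    true_num_array[i] = static_cast<int64_t>(static_cast<bool>(cond_data[i]));
  }
}

// Step 3: for every true element, write its flat index into the output slot
// given by the (inclusive) prefix sum.
template <typename T>
__global__ void KeSetTrueIndex(const T *cond_data, int64_t numel,
                               const int64_t *true_index, int64_t *out_ptr) {
  const int64_t tid = blockIdx.x * blockDim.x + threadIdx.x;
  for (int64_t i = tid; i < numel; i += gridDim.x * blockDim.x) {
    if (static_cast<bool>(cond_data[i])) {
      out_ptr[true_index[i] - 1] = i;  // -1 because the scan is inclusive
    }
  }
}

template <typename T>
void WhereIndexSketch(const T *cond_data, int64_t numel,
                      int64_t *true_num_array, int64_t *out_ptr,
                      cudaStream_t stream) {
  const int threads = 256;
  const int blocks = static_cast<int>((numel + threads - 1) / threads);
  KeGetTrueNum<<<blocks, threads, 0, stream>>>(cond_data, numel,
                                               true_num_array);
  // Wait for the marking kernel (the dev_ctx.Wait() mentioned below), since
  // Step 2 with thrust::device runs on the default stream -- exactly the
  // dependency called out under "Points still to optimize".
  cudaStreamSynchronize(stream);
  thrust::inclusive_scan(thrust::device, true_num_array,
                         true_num_array + numel, true_num_array);
  KeSetTrueIndex<<<blocks, threads, 0, stream>>>(cond_data, numel,
                                                 true_num_array, out_ptr);
}
```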

Points still to optimize:

  1. The current version depends heavily on the behavior of thrust::inclusive_scan on the default stream; a hand-written prefix sum kernel should replace thrust::inclusive_scan on the default stream.
  2. Remove the dev_ctx.Wait() step as far as possible.

Previous optimized version (PR30556)

Optimization method:

Fuse everything into a single kernel.

@paddle-bot-old

Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@thisjiang thisjiang changed the title Optimize where_index_op by move to gpu Optimize where_index_op(prefix sum) Jan 20, 2021
paddle-bot-old bot commented Feb 1, 2021

Sorry to inform you that cde9c19's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

@paddle-bot-old

Sorry to inform you that 86f20e9's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

__host__ __device__ bool operator()(const T &val) {
  return static_cast<bool>(val);
}
};
Contributor:

Add a blank line between functions.

Contributor Author:

Done! Blank lines have been added between all functions.

for (int64_t i = 0; i < numel; i++) {
  if (static_cast<bool>(cond_data[i])) {
    h_true_index.push_back(i);
struct CheckTrue {
Contributor:

Wrapping this functionality in a struct doesn't seem very meaningful. Operations are usually defined as Functors when you want to pass the Functor as a function argument, so that several different operations of the same kind can be plugged in.

Also, struct names should not use a verb-object construction.

Contributor Author:

Deleted.

memory::Alloc(platform::CPUPlace(), (rank + 1) * sizeof(int64_t));
int64_t *ptr_stride = reinterpret_cast<int64_t *>(d_tmp_mem->ptr());
int64_t *ptr_true_num = ptr_stride + rank;
int64_t *h_stride = reinterpret_cast<int64_t *>(h_tmp_mem->ptr());
Contributor:

What does stride mean here? The variable name doesn't convey its meaning clearly.

Contributor Author (@thisjiang, Mar 23, 2021):

Renamed. ptr_stride is an array in which each element stores the stride of one dimension; it has been renamed to stride_array.

Comment on lines 71 to 72
auto d_tmp_mem = memory::Alloc(dev_ctx, (numel + rank) * sizeof(int64_t));
auto h_tmp_mem =
Contributor:

Avoid using tmp in names as much as possible; variable names should be as self-explanatory as possible.

Contributor Author:

All variable names have been changed, and several comments added.

cub::DeviceScan::InclusiveSum(nullptr, cub_tmp_size, d_true_num, d_true_num,
                              numel, dev_ctx.stream());
auto cub_tmp = memory::Alloc(dev_ctx, cub_tmp_size * sizeof(int64_t));
void *ptr_mem = cub_tmp->ptr();
Contributor:

ptr_mem is not self-explanatory.

Contributor Author:

Changed to cub_data to indicate that this is the data used by the cub library function.

const int64_t tid = blockIdx.x * blockDim.x + threadIdx.x;

for (int64_t idx = tid; idx < numel; idx += gridDim.x * blockDim.x) {
  true_num_array[idx] = static_cast<bool>(cond_data[idx]) ? 1 : 0;
Contributor:

Improve performance: true_num_array[idx] = static_cast<bool>(cond_data[idx]) ? 1 : 0; --> true_num_array[idx] = static_cast<int64_t>(static_cast<bool>(cond_data[idx]));

Contributor Author:

Done

for (int64_t i = 0; i < numel; i++) {
  if (static_cast<bool>(cond_data[i])) {
    h_true_index.push_back(i);
__global__ void KeGetTrueNum(const T *cond_data, const int64_t numel,
Contributor:

KeGetTrueNum: What is Ke? If it means kernel, just use GetTrueNum.

Contributor Author:

Done

}

template <typename T>
__global__ void KeSetTrueIndex(int64_t *out_ptr, const T *cond_data,
Contributor:

The same as above.

Contributor Author:

Done

@wzzju (Contributor) left a comment:

LGTM.

@Xreki (Contributor) left a comment:

LGTM for op benchmark ci

@wzzju wzzju merged commit 6ec4e64 into PaddlePaddle:develop Apr 26, 2021
@thisjiang thisjiang deleted the optimize-whereindexop-prefix branch April 26, 2021 11:43