
Optimize where_index_op(prefix sum) #30601

Merged
merged 14 commits into PaddlePaddle:develop from thisjiang:optimize-whereindexop-prefix on Apr 26, 2021

Conversation

thisjiang (Contributor) commented Jan 20, 2021

PR types

Performance optimization

PR changes

OPs

Describe

Optimization of where_index_op using a prefix sum.

Optimization results:

| Change | ips |
| --- | --- |
| Original version | 5.55 steps/s |
| Optimization 1 | 5.67 steps/s |
| Optimization 2 | 5.68 steps/s |
| Optimization 3 | 5.71 steps/s |

Comparison with other frameworks (test data size [100, 100, 100], 100 iterations on a V100-SXM2-32GB machine):

| | tf.where | torch.nonzero | paddle.where (before optimization) | Optimization 0 (PR30556) | Optimization 1 | Optimization 2 | Optimization 3 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| cost | 2.704 s | 0.016 s | 1.959 s | 0.343 s | 0.032 s | 0.030 s | 0.025 s |

Readability improvements (d7fd2da)

Improved variable names and added several comments.

Optimization 3 (45442eb)

Optimization method:

  1. Use cub::DeviceScan::InclusiveSum from the cub library to replace the hand-written, imperfect prefix sum kernel.

Optimization points:

  1. Move all temporary-buffer alloc operations ahead of the kernel launches.
  2. Replace the hand-written KeScanPrefixSum kernel with cub::DeviceScan::InclusiveSum from the cub library, improving reliability (see the sketch after this list).
  3. Call out->Resize only once; one reason for choosing InclusiveSum over ExclusiveSum is that the last element of the inclusive scan directly gives true_num.
  4. Both KeGetTrueNum and KeSetTrueIndex now achieve coalesced memory access.
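
A minimal sketch of the two-phase cub call described in point 2, written from scratch for illustration; the buffer names and the raw cudaMalloc are placeholders, while the actual PR allocates the workspace through memory::Alloc and scans its true_num_array buffer:

```cuda
#include <cub/cub.cuh>

// In-place inclusive prefix sum over d_data (length numel) on `stream`.
// The first call with a null workspace pointer only reports the required
// temporary-storage size; the second call performs the scan. Afterwards,
// d_data[numel - 1] holds true_num, which is why InclusiveSum is used
// instead of ExclusiveSum.
void InclusiveSumInPlace(int64_t *d_data, int numel, cudaStream_t stream) {
  size_t temp_bytes = 0;
  cub::DeviceScan::InclusiveSum(nullptr, temp_bytes, d_data, d_data, numel,
                                stream);
  void *d_temp = nullptr;
  cudaMalloc(&d_temp, temp_bytes);  // the PR uses memory::Alloc(dev_ctx, ...)
  cub::DeviceScan::InclusiveSum(d_temp, temp_bytes, d_data, d_data, numel,
                                stream);
  cudaFree(d_temp);
}
```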

Optimization 2 (cde9c19)

Optimization method:

  1. Replace thrust::inclusive_scan with a hand-written prefix sum kernel.

Optimization points:

  1. Following the approach of the paper "Parallel Prefix Sum (Scan) with CUDA", wrote a custom prefix-sum kernel, avoiding the instability caused by running thrust::inclusive_scan on the default stream.
  2. Split the computation into 4 kernels (a simplified sketch of the per-block scan follows this list):
    • The first kernel sets each position to 0 or 1 depending on whether the condition is true;
    • The second kernel computes the prefix sum within each block, which already gives the position of each true_index within its block;
    • The third kernel computes the base_index to be added to each block's prefix sums;
    • The last kernel adds base_index to true_index and writes the result to the corresponding position in out.
      Although 4 kernels are launched here, each of them runs with many blocks, so this is much faster than a single kernel limited to 1 block.
  3. Use static shared memory to process BLOCKDIM * 2 elements per block, improving computational efficiency.
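
A simplified sketch of the per-block scan (the second kernel above). The actual commit follows the work-efficient scan from the paper and handles BLOCKDIM * 2 elements per block; this illustration uses the simpler Hillis-Steele scan with one element per thread, and all names are hypothetical:

```cuda
// Inclusive scan of `in` within each block; block_sums[b] receives the total
// of block b so a later kernel can turn it into the per-block base_index and
// add it back to these partial results.
template <int BLOCKDIM>
__global__ void BlockInclusiveScan(const int64_t *in, int64_t *out,
                                   int64_t *block_sums, int64_t numel) {
  __shared__ int64_t cache[BLOCKDIM];
  const int tid = threadIdx.x;
  const int64_t gid = static_cast<int64_t>(blockIdx.x) * BLOCKDIM + tid;

  cache[tid] = (gid < numel) ? in[gid] : 0;
  __syncthreads();

  // Hillis-Steele scan: each round adds the value `offset` slots to the left.
  for (int offset = 1; offset < BLOCKDIM; offset <<= 1) {
    int64_t val = (tid >= offset) ? cache[tid - offset] : 0;
    __syncthreads();
    cache[tid] += val;
    __syncthreads();
  }

  if (gid < numel) out[gid] = cache[tid];
  if (tid == BLOCKDIM - 1) block_sums[blockIdx.x] = cache[tid];
}
```

The third and fourth kernels described above would then scan block_sums and add the resulting base_index of each block to every partial result in that block.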

Optimization 1

Optimization method:

  1. Move the computation from the CPU to the GPU and avoid memory copies.

Optimization points:

  1. Split the original true_index computation into three steps, each step launching one kernel (a minimal sketch follows this list).
  2. The first kernel, KeGetTrueNum, checks whether each element of cond_data is true and sets the corresponding entry of true_num_array to 1 if so, otherwise 0.
  3. Use thrust::inclusive_scan to compute the prefix sum of true_num_array (the key optimization point), which yields the output index corresponding to each true element.
  4. The last kernel, KeSetTrueIndex, writes the corresponding index values into out_ptr.
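
A minimal sketch of this three-step pipeline, assuming pre-allocated device buffers and writing only flat indices to out_ptr (the actual kernel also converts each flat index into per-dimension coordinates via the strides); the kernel names mirror the description above, but the code is written from scratch for illustration:

```cuda
#include <thrust/execution_policy.h>
#include <thrust/scan.h>

// Step 1: mark each element of cond_data as 1 if true, 0 otherwise.
template <typename T>
__global__ void KeGetTrueNum(const T *cond_data, int64_t numel,
                             int64_t *true_num_array) {
  const int64_t tid = blockIdx.x * blockDim.x + threadIdx.x;
  for (int64_t i = tid; i < numel; i += gridDim.x * blockDim.x) {
    true_num_array[i] = static_cast<int64_t>(static_cast<bool>(cond_data[i]));
  }
}

// Step 3: for every true element, write its flat index into the output slot
// given by the (inclusive) prefix sum.
template <typename T>
__global__ void KeSetTrueIndex(const T *cond_data, int64_t numel,
                               const int64_t *true_index, int64_t *out_ptr) {
  const int64_t tid = blockIdx.x * blockDim.x + threadIdx.x;
  for (int64_t i = tid; i < numel; i += gridDim.x * blockDim.x) {
    if (static_cast<bool>(cond_data[i])) {
      out_ptr[true_index[i] - 1] = i;  // -1 because the scan is inclusive
    }
  }
}

template <typename T>
void WhereIndexSketch(const T *cond_data, int64_t numel,
                      int64_t *true_num_array, int64_t *out_ptr,
                      cudaStream_t stream) {
  const int threads = 256;
  const int blocks = static_cast<int>((numel + threads - 1) / threads);
  KeGetTrueNum<<<blocks, threads, 0, stream>>>(cond_data, numel,
                                               true_num_array);
  // Wait for the marking kernel (the dev_ctx.Wait() mentioned below), since
  // Step 2 with thrust::device runs on the default stream -- exactly the
  // dependency called out under "Points still to optimize".
  cudaStreamSynchronize(stream);
  thrust::inclusive_scan(thrust::device, true_num_array,
                         true_num_array + numel, true_num_array);
  KeSetTrueIndex<<<blocks, threads, 0, stream>>>(cond_data, numel,
                                                 true_num_array, out_ptr);
}
```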

Points still to optimize:

  1. The current version depends heavily on the behavior of thrust::inclusive_scan on the default stream; a hand-written prefix sum kernel should replace thrust::inclusive_scan on the default stream.
  2. Remove the dev_ctx.Wait() step as far as possible.

Previous optimized version (PR30556)

Optimization method:

Fuse everything into a single kernel.

@paddle-bot-old

Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@thisjiang thisjiang changed the title Optimize where_index_op by move to gpu Optimize where_index_op(prefix sum) Jan 20, 2021
paddle-bot-old bot commented Feb 1, 2021

Sorry to inform you that cde9c19's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

@paddle-bot-old

Sorry to inform you that 86f20e9's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

__host__ __device__ bool operator()(const T &val) {
  return static_cast<bool>(val);
}
};
Contributor:

Add a blank line between functions.

Contributor Author:

Done! Blank lines have been added between all functions.

for (int64_t i = 0; i < numel; i++) {
  if (static_cast<bool>(cond_data[i])) {
    h_true_index.push_back(i);
struct CheckTrue {
Contributor:

Wrapping this functionality in a struct doesn't seem very meaningful. Operations are usually defined as Functors when you want to pass the Functor as a function argument, so that several different operations of the same kind can be plugged in.

Also, struct names should not use a verb-object construction.

Contributor Author:

Deleted.

memory::Alloc(platform::CPUPlace(), (rank + 1) * sizeof(int64_t));
int64_t *ptr_stride = reinterpret_cast<int64_t *>(d_tmp_mem->ptr());
int64_t *ptr_true_num = ptr_stride + rank;
int64_t *h_stride = reinterpret_cast<int64_t *>(h_tmp_mem->ptr());
Contributor:

What does stride mean here? The variable name doesn't convey its meaning clearly.

Contributor Author (@thisjiang, Mar 23, 2021):

Renamed. ptr_stride is an array in which each element stores the stride of one dimension; it has been renamed to stride_array.

Comment on lines 71 to 72
auto d_tmp_mem = memory::Alloc(dev_ctx, (numel + rank) * sizeof(int64_t));
auto h_tmp_mem =
Contributor:

Avoid using tmp in names as much as possible; variable names should be as self-explanatory as possible.

Contributor Author:

All variable names have been changed, and several comments added.

cub::DeviceScan::InclusiveSum(nullptr, cub_tmp_size, d_true_num, d_true_num,
                              numel, dev_ctx.stream());
auto cub_tmp = memory::Alloc(dev_ctx, cub_tmp_size * sizeof(int64_t));
void *ptr_mem = cub_tmp->ptr();
Contributor:

ptr_mem is not self-explanatory.

Contributor Author:

Changed to cub_data to indicate that this is the data used by the cub library function.

const int64_t tid = blockIdx.x * blockDim.x + threadIdx.x;

for (int64_t idx = tid; idx < numel; idx += gridDim.x * blockDim.x) {
  true_num_array[idx] = static_cast<bool>(cond_data[idx]) ? 1 : 0;
Contributor:

Improve performance: true_num_array[idx] = static_cast<bool>(cond_data[idx]) ? 1 : 0; --> true_num_array[idx] = static_cast<int64_t>(static_cast<bool>(cond_data[idx]));

Contributor Author:

Done

for (int64_t i = 0; i < numel; i++) {
  if (static_cast<bool>(cond_data[i])) {
    h_true_index.push_back(i);
__global__ void KeGetTrueNum(const T *cond_data, const int64_t numel,
Contributor:

KeGetTrueNum: What is Ke? If it means kernel, just use GetTrueNum.

Contributor Author:

Done

}

template <typename T>
__global__ void KeSetTrueIndex(int64_t *out_ptr, const T *cond_data,
Contributor:

The same as above.

Contributor Author:

Done

@wzzju (Contributor) left a comment:

LGTM.

@Xreki (Contributor) left a comment:

LGTM for op benchmark ci

@wzzju wzzju merged commit 6ec4e64 into PaddlePaddle:develop Apr 26, 2021
@thisjiang thisjiang deleted the optimize-whereindexop-prefix branch April 26, 2021 11:43