
Log_softmax forward case#1: axis=-1 #31630

Merged (26 commits) · Apr 10, 2021
Conversation

@AshburnLee (Contributor) commented Mar 15, 2021

PR types

Performance optimization

PR changes

OPs

Describe

Feature

Implements the CUDA version of log_softmax. The forward computation splits into the three cases below; this PR implements case #1.

if (inner_size == 1) {
  if (dim_size <= 1024 && dim_size * sizeof(T) <= 4096) {
    // case #1
  } else {
    // case #2
  }
} else {
  // case #3
}
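For reference, case #1 parallelizes the numerically stable per-row log-softmax, out[i] = x[i] - max(x) - log(sum_j exp(x[j] - max(x))). A minimal single-row sketch (plain C++, illustrative only; not the kernel code in this PR):

#include <algorithm>
#include <cmath>
#include <vector>

// Numerically stable log-softmax over a single row:
// subtracting the row maximum before exp() avoids overflow.
std::vector<float> LogSoftmaxRow(const std::vector<float>& x) {
  float max_value = *std::max_element(x.begin(), x.end());
  float sum = 0.0f;
  for (float v : x) sum += std::exp(v - max_value);
  const float log_sum = std::log(sum);
  std::vector<float> out(x.size());
  for (size_t i = 0; i < x.size(); ++i) {
    out[i] = x[i] - max_value - log_sum;
  }
  return out;
}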

Notes

The CUDA implementation supports float16; the original Eigen implementation did not.

@paddle-bot-old:
Thanks for your contribution! Please wait for the CI results first. See the Paddle CI Manual for details.

@AshburnLee AshburnLee changed the title Log softmax temorary PR Log_softmax forward case#1: axis=-1 Mar 16, 2021
@@ -1,4 +1,4 @@
// Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
Contributor:
This file wasn't added this year, so the copyright line shouldn't need to change, right?

Contributor Author:
Done.

ops::LogSoftmaxKernel<plat::CUDADeviceContext, plat::float16>);
REGISTER_OP_CUDA_KERNEL(log_softmax,
ops::LogSoftmaxKernel<plat::CUDADeviceContext, float>,
ops::LogSoftmaxKernel<plat::CUDADeviceContext, double>);
Contributor:
Why was the float16 type removed?

Contributor Author:
Done, float16 is now supported.

break;

template <typename T, int WARP_BATCH, int WARP_SIZE_SOFTMAX>
__device__ __forceinline__ void warp_reduce_sum(T* sum) {
Contributor:
The function name should accurately express what the function does, and the naming should follow the Google C++ style guide: warp_reduce_sum -> BatchWarpReduceSum.

Same for the ones below.

Contributor Author:
Done.
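For context, a minimal sketch of the warp-level sum reduction under discussion (illustrative; the template parameter name and the value-returning form are assumptions that mirror the later review rounds, not the PR's exact code):

// Butterfly reduction with shuffle intrinsics: after the loop every
// lane in the (sub-)warp of KernelWarpSize threads holds the full sum.
template <typename T, int KernelWarpSize>
__device__ __forceinline__ T WarpReduceSum(T sum) {
#pragma unroll
  for (int offset = KernelWarpSize / 2; offset > 0; offset /= 2) {
    sum += __shfl_xor_sync(0xffffffff, sum, offset, KernelWarpSize);
  }
  return sum;
}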

dst, src, batch_count, softmax_elements_stride, softmax_elements); \
break;

template <typename T, int WARP_BATCH, int WARP_SIZE_SOFTMAX>
Contributor:

  • Template parameters should use CamelCase naming (AxxBxx).
  • WARP_BATCH here presumably means how many batches one warp computes, so it might as well be called something like NumBatch or BatchSize directly.

Contributor Author:
Done.

namespace operators {

#define WARP_SIZE 32
int log2_ceil(int value);
Contributor:
Except for setters/getters inside classes, function names should use CamelCase (AxxBxx); see the Google C++ style guide.

Contributor Author:
Done.

struct LogSoftmaxCUDAFunctor {
void operator()(const DeviceContext& context, const framework::Tensor* X,
framework::Tensor* Out, const int axis) {
int along_axis = (axis < 0) ? axis + X->dims().size() : axis;
Contributor:
CanonicalAxis already converts the axis.

Contributor Author:
Done. Removed.

int inner_size = 1;
for (int i = 0; i < along_axis; i++) outer_size *= X->dims()[i];
for (int i = along_axis + 1; i < X->dims().size(); i++)
inner_size *= X->dims()[i];
Contributor:
SizeToAxis and SizeFromAxis can be used to compute outer_size and inner_size respectively.

Contributor Author:
outer_size can be obtained with SizeToAxis(), but the computation of inner_size differs from SizeFromAxis(); SizeOutAxis() would be the right helper. However, SizeOutAxis() is defined in another .cu file and cannot be called directly here (nvcc is not built with --relocatable-device-code=true --compile; with that enabled it could be called).

So inner_size is kept as a local computation, and SizeToAxis() is used for outer_size.
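For clarity, the index split being discussed, as a standalone hypothetical helper (not the framework's SizeToAxis/SizeOutAxis):

#include <cstdint>
#include <vector>

// For a tensor with the given dims and a canonical axis:
// outer_size = product of dims before axis, dim_size = dims[axis],
// inner_size = product of dims after axis.
inline void SplitDimsAroundAxis(const std::vector<int64_t>& dims, int axis,
                                int64_t* outer_size, int64_t* dim_size,
                                int64_t* inner_size) {
  *outer_size = 1;
  for (int i = 0; i < axis; ++i) *outer_size *= dims[i];
  *dim_size = dims[axis];
  *inner_size = 1;
  for (int i = axis + 1; i < static_cast<int>(dims.size()); ++i) {
    *inner_size *= dims[i];
  }
}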

}

template <typename DeviceContext, typename T>
struct LogSoftmaxCUDAFunctor {
Contributor:
This layer of wrapping doesn't seem necessary.

Contributor Author:
Done.

public:
void Compute(const framework::ExecutionContext& context) const override {
const auto* X = context.Input<framework::Tensor>("X");
auto* Out = context.Output<framework::Tensor>("Out");
Contributor:
Variable naming: axx_bxx.

Contributor Author:
All variable names have been changed to this form.

Contributor:
X and Out haven't been changed yet.

}

template <typename T>
void LogSoftmaxForwardAxisLast(T* dst, const T* src, int softmax_elements,
Contributor:
The main job of this function is to launch a CUDA kernel, so it could be named LaunchLogSoftmaxForwardForLastAxis.

Contributor Author:
Done.

@Xreki (Contributor) commented Mar 17, 2021:
Please make the PR title and description more detailed.

for (int i = 0; i < WARP_BATCH; ++i) {
#pragma unroll
for (int it = 0; it < WARP_ITERATIONS; ++it) {
sum[i] += std::exp(elements[i][it] - max_value[i]);
Contributor:
Will this be a problem for float16?

Contributor Author:
It's because __shfl_xor_sync and __shfl_xor do not support fp16. It should be possible to handle.

Contributor Author:
Done, handled.

@AshburnLee (Contributor Author) left a comment:
The changes requested in the review have been made.

ops::LogSoftmaxKernel<plat::CUDADeviceContext, plat::float16>);
REGISTER_OP_CUDA_KERNEL(log_softmax,
ops::LogSoftmaxKernel<plat::CUDADeviceContext, float>,
ops::LogSoftmaxKernel<plat::CUDADeviceContext, double>);
Contributor Author:
Done, float16 is now supported.

for (int i = 0; i < WARP_BATCH; ++i) {
#pragma unroll
for (int it = 0; it < WARP_ITERATIONS; ++it) {
sum[i] += std::exp(elements[i][it] - max_value[i]);
Contributor Author:
Done, handled.

@@ -1,4 +1,4 @@
// Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
Contributor Author:
Done.

namespace operators {

#define WARP_SIZE 32
int log2_ceil(int value);
Contributor Author:
Done.

dst, src, batch_count, softmax_elements_stride, softmax_elements); \
break;

template <typename T, int WARP_BATCH, int WARP_SIZE_SOFTMAX>
Contributor Author:
Done.

struct LogSoftmaxCUDAFunctor {
void operator()(const DeviceContext& context, const framework::Tensor* X,
framework::Tensor* Out, const int axis) {
int along_axis = (axis < 0) ? axis + X->dims().size() : axis;
Contributor Author:
Done. Removed.

int inner_size = 1;
for (int i = 0; i < along_axis; i++) outer_size *= X->dims()[i];
for (int i = along_axis + 1; i < X->dims().size(); i++)
inner_size *= X->dims()[i];
Contributor Author:
outer_size can be obtained with SizeToAxis(), but the computation of inner_size differs from SizeFromAxis(); SizeOutAxis() would be the right helper. However, SizeOutAxis() is defined in another .cu file and cannot be called directly here (nvcc is not built with --relocatable-device-code=true --compile; with that enabled it could be called).

So inner_size is kept as a local computation, and SizeToAxis() is used for outer_size.

constexpr int KERNEL_WARP_SIZE =
(next_power_of_two < WARP_SIZE) ? next_power_of_two : WARP_SIZE;
constexpr int WARP_ITERATIONS = next_power_of_two / KERNEL_WARP_SIZE;
constexpr int WARP_BATCH = (next_power_of_two <= 128) ? 2 : 1;
Contributor Author:
Done.

public:
void Compute(const framework::ExecutionContext& context) const override {
const auto* X = context.Input<framework::Tensor>("X");
auto* Out = context.Output<framework::Tensor>("Out");
Contributor Author:
All variable names have been changed to this form.

}

template <typename DeviceContext, typename T>
struct LogSoftmaxCUDAFunctor {
Contributor Author:
Done.


#define LAUNCH_SOFTMAX_WARP_FORWARD(L2E) \
case L2E: \
WarpLogSoftmaxForward<T, double, L2E><<<blocks, threads, 0>>>( \
Contributor:
Don't use double everywhere; double will be very slow.

int element_index = local_idx + it * kernel_warp_size;
if (element_index < batch_element_count) {
elements[i][it] =
static_cast<double>(src[i * element_count + it * kernel_warp_size]);
Contributor:
Don't use double everywhere.

// 3.store result
#pragma unroll
for (int i = 0; i < num_batch; ++i) {
if (i >= local_batches) break;
Contributor:
Write this kind of if statement on separate lines, and always add braces.

Contributor Author:
Done.

}
}

template <typename T, typename AccT, int log2_elements>
Contributor:
Template parameters are really constants; name them in CamelCase (AxxBxx) to distinguish them from variables inside the function.

Contributor Author:
Done.

public:
void Compute(const framework::ExecutionContext& context) const override {
const auto* X = context.Input<framework::Tensor>("X");
auto* Out = context.Output<framework::Tensor>("Out");
Contributor:
X and Out haven't been changed yet.

for (int i = axis + 1; i < X->dims().size(); i++)
inner_size *= X->dims()[i];
int outer_size = 1;
outer_size = SizeToAxis(axis, X->dims());
Contributor:
Lines 191 and 192 can be merged into one line.

Contributor Author:
Done.

@@ -12,7 +12,177 @@
// See the License for the specific language governing permissions and
// limitations under the License.

#include <cuda_runtime.h>
Contributor:
cuda_runtime.h will not be found on HIP. You can try deleting this header, which should also work, or write it as:

#ifdef __HIPCC__
#include <hip/hip_runtime.h>
#else
#include <cuda_runtime.h>
#endif

Contributor Author:
Done.

}
int outer_size = SizeToAxis(axis, x->dims());

if (inner_size == 1 && dim_size <= 1024 && dim_size * sizeof(T) <= 4096) {
Contributor:
Why does the if need the extra && dim_size * sizeof(T) <= 4096 condition? Is double not supported?

Contributor Author:
double is supported. If && dim_size * sizeof(T) <= 4096 is removed, the kernel still executes, but the consistency diff goes from 0.0 to 1.0728e-6 (atol=1.00e-6).

&& dim_size <= 1024 is necessary:

  • With outer_size=128 and dim_size=1024: config<<<32, (32, 4)>>>, warp_iter=32, and the kernel runs correctly.
  • With outer_size=128 and dim_size=1025: config<<<32, (32, 4)>>>, warp_iter=64, and no result is produced.

warp_iter corresponds to the number of registers each thread uses; warp_iter=64 presumably exceeds the hardware limit.

break;

template <typename T, int KernelWarpSize>
__device__ __forceinline__ void ReduceSumForWarpBatch(T &sum) {
Contributor:
In C++ one usually passes const T&; modifying the passed-in argument this way is not recommended. Also, since this function has dropped the loop over batches, the name should drop "batch" as well.

Contributor Author:
Changed the return type of the two WarpReduce functions from void to T.

The functions in math_cuda_utils.h could actually be used, except that WARP_SIZE there is a constant 32, while WARP_SIZE here is not necessarily 32.

Renaming: Done.

case near_greater_power_of_two: \
ComputeForwardInWarp<T, double, \
near_greater_power_of_two><<<blocks, threads, 0>>>( \
dst, src, outer_size, dim_size, dim_size); \
Contributor:
Why pass dim_size twice?

Contributor Author:
One is enough. Done.

int kernel_warp_size =
(near_greater_power_of_two < 32) ? near_greater_power_of_two : 32;
int warps_per_block = (threads_per_block / kernel_warp_size);
int blocks = (outer_size + warps_per_block - 1) / warps_per_block;
Contributor:
For inputs [N, 32] and [N, 128], kernel_warp_size=32 and warps_per_block=4. In both cases, is a thread block split into 4 groups, with each group of threads (i.e. one warp) computing 1 batch?

@AshburnLee (Contributor Author), Mar 24, 2021:
Each group of threads does compute 1 batch.

With N fixed, observing how configure<<<blocks, threads>>> changes with dim_size shows that for dim_size > 16, threads is always (32, 4); what changes are blocks and warp_iter, and batch stays 1.

Assuming N=128, the variables change as follows:

  • Input [N, 32]: dim_size: 32, kernel_warp_size: 32, warp_iter: 1, warp_batch: 1, config<<<4, (32, 4)>>>, numElem: 512, numThreads: 512
  • Input [N, 128]: dim_size: 128, kernel_warp_size: 32, warp_iter: 4, warp_batch: 1, config<<<4, (32, 4)>>>, numElem: 2048, numThreads: 2048

Here numThreads counts the threads together with their loop iterations.

Confirmed: each warp computes 1 batch.
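A host-side sketch of how this launch configuration is derived from the snippet above (threads_per_block = 128 is assumed for illustration; the PR's actual constant may differ):

#include <cuda_runtime.h>

// One group of kernel_warp_size threads handles one row (batch), so a
// block of threads_per_block threads covers warps_per_block rows.
inline void GetLaunchConfig(int outer_size, int near_greater_power_of_two,
                            dim3* blocks, dim3* threads) {
  constexpr int threads_per_block = 128;  // assumed value
  int kernel_warp_size =
      (near_greater_power_of_two < 32) ? near_greater_power_of_two : 32;
  int warps_per_block = threads_per_block / kernel_warp_size;
  int num_blocks = (outer_size + warps_per_block - 1) / warps_per_block;
  *blocks = dim3(num_blocks);
  *threads = dim3(kernel_warp_size, warps_per_block);
}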

}

template <typename T, typename AccT, int NearGreaterPowerOfTwo>
__global__ void ComputeForwardInWarp(T *dst, const T *src, int batch_size,
Contributor:
The function name should still reflect the functionality: LogSoftmaxForwardXxx.

Contributor Author:
Done.

paddle/fluid/operators/log_softmax_op.cu (resolved)
constexpr int kernel_warp_size =
(near_greater_power_of_two < 32) ? near_greater_power_of_two : 32;
constexpr int warp_iter = near_greater_power_of_two / kernel_warp_size;
int warp_id = blockDim.y * blockIdx.x + threadIdx.y;
Contributor:
This should be global_batch_id.

Contributor Author:
Based on the previous reply, I think renaming it to global_warp_id is more appropriate.

for (int it = 0; it < warp_iter; ++it) {
int element_index = thread_in_warp_idx + it * kernel_warp_size;
if (element_index < effective_element_count) {
elements[it] = static_cast<double>(
Contributor:
double -> AccT

Contributor Author:
Done.

}
}

template <typename T>
Contributor:
Template setup:

template <typename T, typename AccT>
void LaunchSoftmaxForwardForLastAxis(....) {
    ...
}

The outer call becomes LaunchSoftmaxForwardForLastAxis<T, MPTypeTrait<T>::Type>(...), which removes the hard-coded double from the template instantiation. MPTypeTrait is defined as:

template <typename T>
class MPTypeTrait {
 public:
  using Type = T;
};

template <>
class MPTypeTrait<platform::float16> {
 public:
  using Type = float;
};

Contributor Author:
Done. Thanks for the suggested solution!

const auto *input_data = x->data<T>();
auto *output_data = out->mutable_data<T>(context.GetPlace());

PADDLE_ENFORCE_GT(x->numel(), 0, platform::errors::InvalidArgument(
Contributor:
This check isn't needed; the checks in InferShape generally cover it.

Contributor Author:
Done.

#define LAUNCH_WARP_FORWAR_COMPUTE(near_greater_power_of_two) \
case near_greater_power_of_two: \
ComputeLogSoftmaxForwardInWarp< \
T, AccT, near_greater_power_of_two><<<blocks, threads, 0>>>( \
Contributor:
The CUDA kernel launch needs to pass in the stream.

Contributor Author:
Done.
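For reference, a self-contained sketch of what the requested change means: the stream goes in the fourth launch-configuration slot (DummyKernel is a placeholder, not this PR's kernel):

#include <cuda_runtime.h>

__global__ void DummyKernel(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] += 1.0f;
}

// The fourth launch-configuration argument selects the stream; omitting
// it launches on the default stream, which breaks stream ordering.
void LaunchOnStream(float* data, int n, cudaStream_t stream) {
  int threads = 256;
  int blocks = (n + threads - 1) / threads;
  DummyKernel<<<blocks, threads, 0, stream>>>(data, n);
}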

constexpr int kernel_warp_size =
(near_greater_power_of_two < 32) ? near_greater_power_of_two : 32;
constexpr int warp_iter = near_greater_power_of_two / kernel_warp_size;
int global_warp_id = blockDim.y * blockIdx.x + threadIdx.y;
Contributor:

  • The input is [batch_size, element_count], and near_greater_power_of_two corresponds to element_count. So this kernel uses kernel_warp_size threads to compute one row, which makes global_warp_id the global row index.
  • Also, kernel_warp_size is simply blockDim.x.
  • In CUDA, "warp" is a hardware-level concept; what is called a warp here is really a row, which is conceptually confusing.

Contributor Author:

  • Yes, global_warp_id is the global row index.
  • Yes.
  • One warp handles one row, so global_warp_id has been renamed to batch_id.

paddle/fluid/operators/log_softmax_op.cu (resolved)

// 2.compute max_value. For each thread, loop all registers to find max
AccT max_value;
max_value = elements[0];
Contributor:
Lines 92 and 93 can be merged into one line.

Contributor Author:
Done.

int element_index = thread_in_warp_idx + it * kernel_warp_size;
if (element_index < element_count) {
dst[global_warp_id * element_count + element_index] =
elements[it] - max_value - sum;
Contributor:
When writing the data back, use an explicit static_cast to T.

Contributor Author:
Done.

@Xreki (Contributor) left a comment:
LGTM

@qili93 (Contributor) left a comment:
LGTM
