Log_softmax forward case#1: axis=-1 #31630
Conversation
Update forked PaddlePaddle
Update my fork
update from PaddlePaddle
Update forked paddle repo
Update USERNAME/paddle
update Paddle USERNAME repo
update username repo
Thanks for your contribution!
@@ -1,4 +1,4 @@
// Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
This file wasn't newly added this year; there should be no need to change the copyright.
Done.
ops::LogSoftmaxKernel<plat::CUDADeviceContext, plat::float16>);
REGISTER_OP_CUDA_KERNEL(log_softmax,
                        ops::LogSoftmaxKernel<plat::CUDADeviceContext, float>,
                        ops::LogSoftmaxKernel<plat::CUDADeviceContext, double>);
Why was the float16 type removed?
Done, float16 is now supported.
break;

template <typename T, int WARP_BATCH, int WARP_SIZE_SOFTMAX>
__device__ __forceinline__ void warp_reduce_sum(T* sum) {
A function name should accurately express what the function does, and function naming should follow the Google C++ style guide: warp_reduce_sum -> BatchWarpReduceSum. Same below.
Done.
dst, src, batch_count, softmax_elements_stride, softmax_elements); \
break;

template <typename T, int WARP_BATCH, int WARP_SIZE_SOFTMAX>
- Template parameter names should use the camel-case AxxBxx style.
- WARP_BATCH here presumably means how many batches one warp computes, so it might as well be called NumBatch or BatchSize.
Done.
namespace operators {

#define WARP_SIZE 32
int log2_ceil(int value);
Apart from setters and getters inside classes, function names should all use the AxxBxx naming style; see the Google C++ style guide.
Done.
struct LogSoftmaxCUDAFunctor {
  void operator()(const DeviceContext& context, const framework::Tensor* X,
                  framework::Tensor* Out, const int axis) {
    int along_axis = (axis < 0) ? axis + X->dims().size() : axis;
CanonicalAxis has already converted the axis.
Done. It has been removed.
int inner_size = 1;
for (int i = 0; i < along_axis; i++) outer_size *= X->dims()[i];
for (int i = along_axis + 1; i < X->dims().size(); i++)
  inner_size *= X->dims()[i];
SizeToAxis and SizeFromAxis can compute outer_size and inner_size, respectively.
outer_size can be obtained with SizeToAxis(). inner_size differs from what SizeFromAxis() computes; the right call here would be SizeOutAxis(), but SizeOutAxis() is defined in a different .cu file and cannot be called from this one (nvcc is not run with --relocatable-device-code=true --compile; with that flag enabled it could be).
So the inner_size loop is kept, and outer_size is obtained via SizeToAxis().
}

template <typename DeviceContext, typename T>
struct LogSoftmaxCUDAFunctor {
This layer of wrapping seems unnecessary.
Done.
 public:
  void Compute(const framework::ExecutionContext& context) const override {
    const auto* X = context.Input<framework::Tensor>("X");
    auto* Out = context.Output<framework::Tensor>("Out");
Variable naming: axx_bxx (snake_case).
All variable names have been changed to this form.
X and Out haven't been renamed yet.
}

template <typename T>
void LogSoftmaxForwardAxisLast(T* dst, const T* src, int softmax_elements,
The main job of this function is launching a CUDA kernel, so it could be called LaunchLogSoftmaxForwardForLastAxis.
Done.
Please also make the PR title and description more detailed.
for (int i = 0; i < WARP_BATCH; ++i) {
#pragma unroll
  for (int it = 0; it < WARP_ITERATIONS; ++it) {
    sum[i] += std::exp(elements[i][it] - max_value[i]);
Will this be a problem for float16?
It's because __shfl_xor_sync and __shfl_xor don't support fp16. It should be possible to handle this.
Done, handled.
The changes have been made according to the review comments.
constexpr int KERNEL_WARP_SIZE =
    (next_power_of_two < WARP_SIZE) ? next_power_of_two : WARP_SIZE;
constexpr int WARP_ITERATIONS = next_power_of_two / KERNEL_WARP_SIZE;
constexpr int WARP_BATCH = (next_power_of_two <= 128) ? 2 : 1;
Done.

#define LAUNCH_SOFTMAX_WARP_FORWARD(L2E) \
  case L2E: \
    WarpLogSoftmaxForward<T, double, L2E><<<blocks, threads, 0>>>( \
Don't use double for everything; double will be very slow.
int element_index = local_idx + it * kernel_warp_size;
if (element_index < batch_element_count) {
  elements[i][it] =
      static_cast<double>(src[i * element_count + it * kernel_warp_size]);
Don't use double everywhere.
// 3.store result
#pragma unroll
for (int i = 0; i < num_batch; ++i) {
  if (i >= local_batches) break;
Split this kind of if statement across lines, and always add braces {}.
Done.
  }
}

template <typename T, typename AccT, int log2_elements>
The template parameters are really constants; name them in AxxBxx form to distinguish them from variables inside functions.
Done.
for (int i = axis + 1; i < X->dims().size(); i++)
  inner_size *= X->dims()[i];
int outer_size = 1;
outer_size = SizeToAxis(axis, X->dims());
Lines 191 and 192 can be merged into one line.
Done.
@@ -12,7 +12,177 @@
// See the License for the specific language governing permissions and
// limitations under the License.

#include <cuda_runtime.h>
cuda_runtime.h won't be found on HIP. You could try deleting this header (it should still build and run), or write:
#ifdef __HIPCC__
#include <hip/hip_runtime.h>
#else
#include <cuda_runtime.h>
#endif
Done.
}
int outer_size = SizeToAxis(axis, x->dims());

if (inner_size == 1 && dim_size <= 1024 && dim_size * sizeof(T) <= 4096) {
Why does the if condition include the `&& dim_size * sizeof(T) <= 4096` check? Is double not supported?
double is supported. When `&& dim_size * sizeof(T) <= 4096` is removed, execution is still correct, but the consistency diff grows from 0.0 to 1.0728e-6 (atol=1.00e-6).
`&& dim_size <= 1024` is necessary:
When outer_size=128 and dim_size=1024, config<<<32, (32, 4)>>> with warp_iter=32 executes correctly.
When outer_size=128 and dim_size=1025, config<<<32, (32, 4)>>> with warp_iter=64 produces no result.
warp_iter reflects the number of registers each thread uses; warp_iter=64 presumably exceeds the hardware limit.
break;

template <typename T, int KernelWarpSize>
__device__ __forceinline__ void ReduceSumForWarpBatch(T &sum) {
In C++ one usually passes const T&; modifying the actual argument through a non-const reference like this is not recommended. Also, since this function no longer loops over batches, the name should drop the "batch".
The return type of both WarpReduce functions was changed from void to T.
The functions in math_cuda_utils.h could in principle be called, but their WARP_SIZE is the constant 32, while the warp size here is not necessarily 32.
Renaming: Done.
case near_greater_power_of_two: \
  ComputeForwardInWarp<T, double, \
                       near_greater_power_of_two><<<blocks, threads, 0>>>( \
      dst, src, outer_size, dim_size, dim_size); \
Why pass dim_size twice?
One is enough. Done.
int kernel_warp_size =
    (near_greater_power_of_two < 32) ? near_greater_power_of_two : 32;
int warps_per_block = (threads_per_block / kernel_warp_size);
int blocks = (outer_size + warps_per_block - 1) / warps_per_block;
For inputs [N, 32] and [N, 128], kernel_warp_size=32 and warps_per_block=4. So in both cases one thread block is split into 4 groups, and each group of threads (i.e. one warp) computes 1 batch?
Each group of threads indeed computes 1 batch.
With N fixed, observing how configure<<<blocks, threads>>> changes with dim_size shows that once dim_size>16, threads stays (32, 4); only blocks and warp_iter vary, and batch is 1.
Assuming N=128, the variables change as follows:
- Input [N, 32]: dim_size: 32, kernel_warp_size: 32, warp_iter: 1, warp_batch: 1, config<<<4, (32, 4)>>>, numElem: 512, numThreads: 512
- Input [N, 128]: dim_size: 128, kernel_warp_size: 32, warp_iter: 4, warp_batch: 1, config<<<4, (32, 4)>>>, numElem: 2048, numThreads: 2048
Here numThreads counts the threads multiplied by their loop iterations.
Confirmed: one warp computes 1 batch.
}

template <typename T, typename AccT, int NearGreaterPowerOfTwo>
__global__ void ComputeForwardInWarp(T *dst, const T *src, int batch_size,
The function name should still reflect what it does: LogSoftmaxForwardXxx.
Done.
constexpr int kernel_warp_size =
    (near_greater_power_of_two < 32) ? near_greater_power_of_two : 32;
constexpr int warp_iter = near_greater_power_of_two / kernel_warp_size;
int warp_id = blockDim.y * blockIdx.x + threadIdx.y;
This should be global_batch_id.
Based on the previous reply, I think global_warp_id is the more fitting name.
for (int it = 0; it < warp_iter; ++it) {
  int element_index = thread_in_warp_idx + it * kernel_warp_size;
  if (element_index < effective_element_count) {
    elements[it] = static_cast<double>(
double -> AccT
Done.
  }
}

template <typename T>
Template setup:
template <typename T, typename AccT>
void LaunchSoftmaxForwardForLastAxis(....) {
  ...
}
Call site: LaunchSoftmaxForwardForLastAxis<T, MPTypeTrait<T>::Type>(...), which removes the hard-coded double from the template call. MPTypeTrait is defined in:
Paddle/paddle/fluid/operators/amp/fp16_type_traits.h, lines 23 to 33 (commit 3f66e7d)
template <typename T>
class MPTypeTrait {
 public:
  using Type = T;
};
template <>
class MPTypeTrait<platform::float16> {
 public:
  using Type = float;
};
Done. Thanks for the suggested solution!
const auto *input_data = x->data<T>();
auto *output_data = out->mutable_data<T>(context.GetPlace());

PADDLE_ENFORCE_GT(x->numel(), 0, platform::errors::InvalidArgument(
This check isn't needed; the checks in InferShape generally cover it.
Done.
#define LAUNCH_WARP_FORWAR_COMPUTE(near_greater_power_of_two) \
  case near_greater_power_of_two: \
    ComputeLogSoftmaxForwardInWarp< \
        T, AccT, near_greater_power_of_two><<<blocks, threads, 0>>>( \
A CUDA kernel launch must be passed a stream.
Done.
constexpr int kernel_warp_size =
    (near_greater_power_of_two < 32) ? near_greater_power_of_two : 32;
constexpr int warp_iter = near_greater_power_of_two / kernel_warp_size;
int global_warp_id = blockDim.y * blockIdx.x + threadIdx.y;
- The input is [batch_size, element_count], and near_greater_power_of_two corresponds to element_count. So this kernel uses kernel_warp_size threads to compute one row, and global_warp_id is the global row index.
- Also, kernel_warp_size is simply blockDim.x.
- In CUDA, a warp is a hardware-level concept; what is called a warp here is really one row, which is conceptually confusing.
- Yes, global_warp_id is the global row index.
- Yes.
- One warp processes one row, so global_warp_id was renamed batch_id.

// 2.compute max_value. For each thread, loop all registers to find max
AccT max_value;
max_value = elements[0];
L92 and L93 can be merged into one line.
Done.
int element_index = thread_in_warp_idx + it * kernel_warp_size;
if (element_index < element_count) {
  dst[global_warp_id * element_count + element_index] =
      elements[it] - max_value - sum;
Use static_cast to explicitly convert the data back to T when writing it out.
Done.
LGTM
LGTM
PR types
Performance optimization
PR changes
OPs
Describe
Function
Implements a CUDA version of log_softmax. The forward computation has 3 cases; this PR implements case #1.
Notes
The CUDA implementation supports float16; the original Eigen implementation did not.