Gridtools Reduction Design #1587
Implementations from CUB (https://nvlabs.github.io/cub/) or Thrust (https://github.com/NVIDIA/thrust/, nowadays probably based on CUB) might also be useful, just in case you haven't checked them out yet.
I'll drop some comments here; we can discuss in more detail via VC, or I can expand later:
@havogt Answering in order:
Actually, the reductions from the CUDA samples are non-destructive (https://github.com/NVIDIA/cuda-samples/tree/master/Samples/reduction). Which examples did you look at?
Indeed!! I was fooled by the signatures without …
Otherwise I do not see how to avoid the stencil... and the reduction on a sid is not composed with the stencils anyway, right?
I agree. For me the most intuitive API would be the following (modulo syntactic sugar):
E.g., for the dot product:
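As a rough illustration of such an API (the `reduce` and `transform` spellings here are assumptions for the sake of the example, not @havogt's actual snippet), it could look something like:

```cpp
// Hypothetical illustration only -- the function names are assumed, not quoted from the proposal.
// A dot product expressed as a reduction over the lazy element-wise product of two sids a and b:
auto dot = reduce(std::plus<>{},                        // reduction operator
                  0.0,                                  // initial value
                  transform(std::multiplies<>{}, a, b)  // element-wise a[i] * b[i], evaluated lazily
);
```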
But then `reduce` is not a member function of a sid, right?
We had a Zoom discussion with @havogt. Here is the aftermath:
- We agreed that the …
- An API that is proposed by @havogt will not work as is, because …
- We need a separate concept for an input of the reduction. Let us name it …
- It looks like this solution will cover all reduction use cases that we currently have in mind. However, ideally we want all our current …
- We agreed on the following plan: …
CUB actually does use intrinsics and even inline PTX, for example here: https://github.com/NVlabs/cub/blob/618a46c27764f0e0b86fb3643a572ed039180ad8/cub/warp/specializations/warp_reduce_shfl.cuh#L146. But the code is indeed full of pre-C++11 boilerplate and quite annoying to follow… The advantage of CUB would be that we would not have to update the reduction code for every new hardware generation; NVIDIA would do it for us. And there is even hipCUB, so the same goes for AMD: https://github.com/ROCmSoftwarePlatform/hipCUB (based on https://github.com/ROCmSoftwarePlatform/rocPRIM).
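For reference, delegating to CUB's device-wide reduction follows its standard two-phase call pattern; a minimal sketch of that (plain CUB usage, not GridTools code):

```cpp
#include <cub/cub.cuh>

// Minimal sketch: sum num_items doubles already resident on the device.
// d_in and d_out are device pointers owned by the caller.
void cub_sum(const double *d_in, double *d_out, int num_items) {
  void *d_temp_storage = nullptr;
  size_t temp_storage_bytes = 0;
  // First call: d_temp_storage == nullptr, so CUB only reports the scratch size it needs.
  cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  // Second call: performs the actual (non-destructive) reduction into d_out.
  cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
  cudaFree(d_temp_storage);
}
```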
As mentioned today in the standup, the ability to compute multiple dot products at once can save a lot of memory bandwidth, so here are two pieces from a toy elliptic solver that could serve as a test case:

```python
β = -np.dot(r, r) / np.dot(r, L(r))        # steepest descent
β = -np.dot(r, L(r)) / np.dot(L(r), L(r))  # minimum residual
```
If this is the syntax, we have some work to do to reorder and fuse operations so that only one reduction is performed.
Alternatively, we provide just the interfaces for zipping the reductions, and it becomes the user's responsibility to make it efficient... Which do we choose?
Motivation
We need reductions over 3D data. The primary use case is the scalar product of floating-point numbers. The implementation should be parallel and available for both GPU and CPU; an efficient GPU implementation is the most important.
Analysis
Let us take the NVIDIA-provided parallel reduction algorithm as a basis. It is fast on CUDA GPUs and well optimized. There are two significant features of that algorithm: it is destructive (the input storage is overwritten during the reduction), and it requires the input size to be a power of two.
This means that there is no way to do the reduction on the fly -- we need to maintain dedicated storage that cannot be used for anything else (because of the destructive nature of the algorithm). Additionally, this storage should be padded to a power-of-two size and the padding should be filled with the reduction's initial value.
Proposal
Let us add the concept of `Reducible`. A `Reducible` should model `Sid` (whose `reference_type` is a non-const l-value reference) and additionally should have a `T sid_reduce(Reducible&&)` function available via ADL. This is a host function.
There will be at least two models of `Reducible` -- one for GPU and one for CPU.
Mind the `&&` in the `sid_reduce` signature -- it accepts only an r-value. This reflects the destructive nature of the algorithm: the reduction is the last thing you can do with the data; afterwards it should be thrown away.
Usage Example
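As a rough sketch of how this could be used (the `make_reducible` helper and the exact spellings below are assumptions; only `sid_reduce` and its r-value requirement come from the proposal above):

```cpp
// Hypothetical sketch -- only sid_reduce is part of the proposal above;
// make_reducible and the plus/0.0 parameters are illustrative assumptions.

// 1. Create a Reducible: allocates padded storage and fills it with the initial value.
auto acc = make_reducible(std::plus<>{}, 0.0, domain_size);

// 2. Fill it, e.g. by running a stencil that writes a[i] * b[i] into acc
//    (a Reducible models Sid, so it can be used as a regular output field).

// 3. Consume it. sid_reduce takes an r-value: the reduction destroys the data,
//    so the Reducible has to be moved in and cannot be reused afterwards.
double dot = sid_reduce(std::move(acc));
```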
Details
When a reducible is created, the memory allocation should be padded to a power of two. All of the memory (including the padding) should be filled with the initial value. For a sum reduction the filling can be done efficiently with `cudaMemset`; for other reductions we need to launch a separate kernel at this point.
The memory allocation could be managed by the `sid::cached_allocator` mechanism.
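A rough sketch of that setup step for a sum reduction (hypothetical helper; only `cudaMemset` and the power-of-two padding rule come from the text above):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical sketch of creating the padded, pre-filled storage for a sum reduction.
double *make_sum_reducible_storage(std::size_t n, std::size_t &padded_n) {
  // Round the size up to the next power of two, as the algorithm requires.
  padded_n = 1;
  while (padded_n < n) padded_n *= 2;

  double *ptr = nullptr;
  cudaMalloc(&ptr, padded_n * sizeof(double));
  // For a sum the identity is 0, so the whole buffer (padding included) can be
  // zero-filled with cudaMemset; other reductions would need a dedicated fill kernel.
  cudaMemset(ptr, 0, padded_n * sizeof(double));
  return ptr;
}
```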
mechanizmThe text was updated successfully, but these errors were encountered: