Support max_n and min_n reductions on GPU #1196
Merged
Closes #1177.
This adds support for `max_n` and `min_n` reductions on a GPU, both with and without `dask`. The key change is new CUDA mutex functionality to support CUDA `append` functions (i.e. individual pixel callbacks) that do more than a simple get/set operation. Because of the massively parallel nature of CUDA hardware, multiple threads can access the same `canvas` pixel at the same time, and until now we have been restricted to CUDA atomic operations (https://numba.readthedocs.io/en/stable/cuda/intrinsics.html#supported-atomic-operations) in `append` functions. With the new mutex we can lock access to a particular pixel to a single thread at a time, and thus perform more complicated operations, such as those required for `max_n`, without any race conditions.

In the implementation we need to get the mutex (a `cupy` array) to the CUDA `append` functions. This is achieved within the `expand_aggs_and_cols` framework by appending the mutex array in the `make_info` function, which is where other arrays and/or dataframe columns are extracted and passed to `append` functions. This ensures that there is only ever a single shared mutex, even if multiple reductions need it.

This implementation is limited by what is currently available in
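As a rough illustration of the locking pattern (names here are hypothetical, not datashader's actual API), an `append` function acquires a per-pixel lock, does its multi-step work, then releases. This is a plain Python/NumPy sketch for a single thread; a real CUDA kernel would spin on an atomic compare-and-swap primitive instead of the ordinary reads and writes shown here:

```python
import numpy as np

def lock(mutex, i):
    """Acquire slot i of the mutex array.

    In a real CUDA kernel this would spin on something like
    cuda.atomic.compare_and_swap; a plain read-modify-write stands in
    here because this sketch runs single-threaded on the CPU.
    """
    while mutex[i] != 0:   # spin while another thread holds the lock
        pass
    mutex[i] = 1           # acquire

def unlock(mutex, i):
    mutex[i] = 0           # release

# One mutex slot per canvas pixel (flattened). Under numba 0.56 the
# mutex can effectively only be locked as a whole, i.e. one shared slot.
mutex = np.zeros(1, dtype=np.int32)
lock(mutex, 0)
# ... perform the multi-step append under the lock ...
unlock(mutex, 0)
print(mutex[0])  # 0 after release
```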
`numba` 0.56, which means we can only lock/unlock the mutex as a whole rather than its individual elements/pixels, so the performance will not be great. Numba PR numba/numba#8790 will allow us to lock individual pixels, so when `numba` 0.57 is released I will write another PR to use the fast route if it is available and otherwise drop back to this slower one.

There is no support yet for `where(max_n)` on CUDA, but this will follow in another PR soon.
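For context on why `max_n` needs more than a single atomic operation: each pixel keeps its n largest values in sorted order, so an append must shift existing entries before inserting, which is a multi-step read/write sequence. A minimal CPU sketch (hypothetical helper, not the actual datashader implementation; a CUDA kernel would use an explicit loop rather than slice assignment):

```python
import numpy as np

def append_max_n(agg, x, y, value):
    """Insert `value` into the descending top-n array for pixel (y, x).

    agg has shape (height, width, n), initialised to -inf. Shifting the
    smaller entries down one slot before inserting is the multi-step
    operation that requires a per-pixel lock on the GPU.
    """
    slots = agg[y, x]
    for i in range(slots.shape[0]):
        if value > slots[i]:
            slots[i + 1:] = slots[i:-1]  # shift smaller entries down
            slots[i] = value
            return

agg = np.full((1, 1, 3), -np.inf)
for v in [2.0, 5.0, 1.0, 4.0]:
    append_max_n(agg, 0, 0, v)
print(agg[0, 0])  # [5. 4. 2.]
```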