Buggy Optimization of Simple Kernel using Shared Memory for Inter-Warp Communication #2212
Comments
Just a thought:
Sorry for the late update, both seem to be true. Block sizes <= 256 work, and adding launch bounds makes it work with a block size of 512 as well.
This is because of this change:
I have the same issue on
The following code produces wrong results when compiled with optimization flags other than -O0. Tested on ROCm 3.9 and 4.0, on gfx906 (AMD MI50) and gfx908 (AMD MI100) cards.
Compiled with optimizations disabled, that is, with hipcc -O0 code.cpp, the code produces the expected output, while higher optimization levels (-O1 or higher) fail. Thus, we assume there is an error in one of the compiler's optimization passes.
The bug was detected while debugging a GPU reduction code, where the same pattern is used for inter-warp/wavefront reductions.
On NVIDIA GPUs (directly compiled with nvcc -x cu code.cpp), the expected result is produced.
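For readers without the original reproducer at hand, here is a minimal sketch of the kind of pattern described above: each warp reduces its values, lane 0 publishes the partial result to shared memory, and the first warp combines the partials after a barrier. This is not the code from this issue. The kernel name inter_warp_sum, the constants kWarpSize and kBlockSize, and the test data are all hypothetical. It is written for nvcc; a HIP port would presumably need a wavefront size of 64 on gfx906/gfx908 and the non-sync shuffle variant, which is an assumption and untested here. The launch bounds attribute mirrors the workaround reported in the comments for a block size of 512.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical constants for this sketch (not taken from the issue).
constexpr int kWarpSize  = 32;   // NVIDIA warp size; AMD wavefronts on gfx906/gfx908 are 64 wide
constexpr int kBlockSize = 512;  // block size reported as problematic without launch bounds
constexpr int kNumWarps  = kBlockSize / kWarpSize;

// Sums kBlockSize integers per block, using shared memory to pass
// per-warp partial sums to the first warp.
__global__ void __launch_bounds__(kBlockSize)
inter_warp_sum(const int* in, int* out)
{
    __shared__ int partial[kNumWarps];

    const int tid  = threadIdx.x;
    const int lane = tid % kWarpSize;
    const int warp = tid / kWarpSize;

    // Step 1: reduce within each warp using shuffles.
    int v = in[blockIdx.x * kBlockSize + tid];
    for (int offset = kWarpSize / 2; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffffu, v, offset);

    // Step 2: lane 0 of every warp publishes its partial sum.
    if (lane == 0)
        partial[warp] = v;
    __syncthreads();  // make all partial sums visible to the first warp

    // Step 3: the first warp reduces the partial sums.
    if (warp == 0) {
        int s = (lane < kNumWarps) ? partial[lane] : 0;
        for (int offset = kWarpSize / 2; offset > 0; offset /= 2)
            s += __shfl_down_sync(0xffffffffu, s, offset);
        if (lane == 0)
            out[blockIdx.x] = s;
    }
}

int main()
{
    // Input of all ones, so the expected block sum equals kBlockSize.
    std::vector<int> host_in(kBlockSize, 1);
    int host_out = 0;

    int *dev_in = nullptr, *dev_out = nullptr;
    cudaMalloc(&dev_in, kBlockSize * sizeof(int));
    cudaMalloc(&dev_out, sizeof(int));
    cudaMemcpy(dev_in, host_in.data(), kBlockSize * sizeof(int),
               cudaMemcpyHostToDevice);

    inter_warp_sum<<<1, kBlockSize>>>(dev_in, dev_out);
    cudaMemcpy(&host_out, dev_out, sizeof(int), cudaMemcpyDeviceToHost);

    std::printf("sum = %d (expected %d)\n", host_out, kBlockSize);

    cudaFree(dev_in);
    cudaFree(dev_out);
    return host_out == kBlockSize ? 0 : 1;
}
```

Comparing the printed sum at -O0 and at higher optimization levels, with and without the launch bounds attribute, is one way to check whether a given toolchain shows the behaviour reported here.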