Buggy Optimization of Simple Kernel using Shared Memory for Inter-Warp Communication #2212

Closed
fthaler opened this issue Jan 7, 2021 · 4 comments


fthaler commented Jan 7, 2021

The following code produces wrong results when compiled with optimization flags other than -O0. Tested on ROCm 3.9 and 4.0, on gfx906 (AMD Mi50) and gfx908 (AMD Mi100) cards.

Compiled with optimizations disabled, that is with hipcc -O0 code.cpp, the code produces the expected output, while any higher optimization level (-O1 and above) fails. We therefore assume the error lies in one of the compiler's optimization passes.

The bug was detected while debugging a GPU reduction code, where the same pattern is used for inter-warp/wavefront reductions (a sketch of that pattern follows the reproducer below).

On NVIDIA GPUs (directly compiled with nvcc -x cu code.cpp), the expected result is produced.

#include <iostream>
#include <numeric>
#include <vector>

#ifdef __CUDACC__
#define GPU(x) cuda##x
#else
#include <hip/hip_runtime.h>
using hipDeviceProp = hipDeviceProp_t;
#define GPU(x) hip##x
#endif

template <int BlockSize, int WarpSize>
__global__ void kernel(int const *__restrict__ in, int *__restrict__ out) {
  // one shared value per warp
  __shared__ int shared[BlockSize / WarpSize];

  // thread zero of each warp puts a value into shared mem
  if (threadIdx.x % WarpSize == 0)
    shared[threadIdx.x / WarpSize] = in[threadIdx.x];

  __syncthreads();

  // move result to global memory
  if (threadIdx.x < BlockSize / WarpSize)
    out[threadIdx.x] = shared[threadIdx.x];
}

int main(int argc, char *argv[]) {
  using namespace std;

#define CHECK(x)                                                               \
  if ((x) != GPU(Success)) {                                                   \
    cerr << "Error in " << #x << " on line " << __LINE__ << endl;              \
    return 1;                                                                  \
  }

  int device;
  CHECK(GPU(GetDevice)(&device));
  GPU(DeviceProp) props;
  CHECK(GPU(GetDeviceProperties)(&props, device));

  constexpr int block_size = 512;
  const int warp_size = props.warpSize;

  vector<int> a(block_size), b(block_size / warp_size);
  iota(begin(a), end(a), 0.0f);

  int *ad, *bd;
  CHECK(GPU(Malloc)(&ad, a.size() * sizeof(int)));
  CHECK(GPU(Malloc)(&bd, b.size() * sizeof(int)));

  CHECK(GPU(Memcpy)(ad, a.data(), a.size() * sizeof(int),
                    GPU(MemcpyHostToDevice)));

  switch (warp_size) {
  case 32:
    kernel<block_size, 32><<<1, block_size>>>(ad, bd);
    break;
  case 64:
    kernel<block_size, 64><<<1, block_size>>>(ad, bd);
    break;
  default:
    cerr << "ERROR: unsupported warp size" << endl;
    return 1;
  }
  CHECK(GPU(GetLastError)());
  CHECK(GPU(DeviceSynchronize)());

  CHECK(GPU(Memcpy)(b.data(), bd, b.size() * sizeof(int),
                    GPU(MemcpyDeviceToHost)));

  CHECK(GPU(Free)(ad));
  CHECK(GPU(Free)(bd));

  cout << "Result:  ";
  for (auto x : b)
    cout << " " << x;
  cout << endl;

  cout << "Expected:";
  bool correct = true;
  for (int i = 0; i < b.size(); ++i) {
    int x = i * warp_size;
    cout << " " << x;
    if (b[i] != x)
      correct = false;
  }
  cout << endl;

  if (!correct) {
    cerr << "Wrong result!" << endl;
    return 1;
  }

  return 0;
}
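
For reference, the inter-warp/wavefront reduction that the pattern above was distilled from looks roughly like the following. This is a minimal sketch, not the original reduction code; on CUDA the __shfl_down calls would be __shfl_down_sync with an explicit mask.

template <int BlockSize, int WarpSize>
__global__ void block_reduce(int const *__restrict__ in, int *__restrict__ out) {
  // one partial result per warp/wavefront
  __shared__ int partials[BlockSize / WarpSize];

  int value = in[threadIdx.x];

  // intra-warp reduction using shuffles
  for (int offset = WarpSize / 2; offset > 0; offset /= 2)
    value += __shfl_down(value, offset);

  // lane zero of each warp publishes its partial sum (the pattern in the reproducer)
  if (threadIdx.x % WarpSize == 0)
    partials[threadIdx.x / WarpSize] = value;

  __syncthreads();

  // a single thread combines the per-warp partials, reading shared memory
  // that was written by other warps
  if (threadIdx.x == 0) {
    int sum = 0;
    for (int i = 0; i < BlockSize / WarpSize; ++i)
      sum += partials[i];
    *out = sum;
  }
}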

ex-rzr commented Jan 29, 2021

Just a thought:

  1. Does it work with block_size <= 256?
  2. Does it work if you add __launch_bounds__(512) before the kernel? (See the sketch below.)
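
As an illustration of the second suggestion (a sketch only, not a confirmed fix; here the bound is tied to the BlockSize template parameter rather than hard-coded to 512):

template <int BlockSize, int WarpSize>
__global__ void __launch_bounds__(BlockSize) kernel(int const *__restrict__ in,
                                                    int *__restrict__ out) {
  // __launch_bounds__ tells the compiler the maximum block size this kernel
  // will ever be launched with
  __shared__ int shared[BlockSize / WarpSize];

  if (threadIdx.x % WarpSize == 0)
    shared[threadIdx.x / WarpSize] = in[threadIdx.x];

  __syncthreads();

  if (threadIdx.x < BlockSize / WarpSize)
    out[threadIdx.x] = shared[threadIdx.x];
}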


fthaler commented Apr 15, 2021

Sorry for the late update; both seem to be true. Block sizes <= 256 work, and adding launch bounds also makes it work with a block size of 512.
Even more importantly for us, the issue seems to be fixed in ROCm 4.1.1 (I have no access to 4.1.0, but in 4.0.0 it was still present).

fthaler closed this as completed on Apr 15, 2021

ex-rzr commented Apr 15, 2021

Even more importantly for us, the issue seems to be fixed in ROCm 4.1.1 (I have no access to 4.1.0, but in 4.0.0 it was still present).

This is because of this change:
https://github.com/RadeonOpenCompute/ROCm#performance-impact-for-kernel-launch-bound-attribute


robosina commented Mar 19, 2022

I have the same issue on ROCm 5.0.2 :(
Why does enabling optimization lead to incorrect results? I can see that when I use optimization flags (-O1, -O2, -O3), some of my kernels' calculations are set to zero (e.g. 12134231 => 0), which is incorrect.
