Buggy Optimization of Simple Kernel using Shared Memory for Inter-Warp Communication #2212

Closed
fthaler opened this issue Jan 7, 2021 · 4 comments


fthaler commented Jan 7, 2021

The following code produces wrong results when compiled with optimization flags other than -O0. Tested on ROCm 3.9 and 4.0, on gfx906 (AMD Mi50) and gfx908 (AMD Mi100) cards.

Compiled with optimizations disabled, that is with hipcc -O0 code.cpp, the code produces the expected output, while any higher optimization level (-O1 and above) fails. We therefore assume the error lies in one of the compiler's optimization passes.

The bug was detected while debugging a GPU reduction code, where the same pattern is used for inter-warp/wavefront reductions (a sketch of that pattern follows the reproducer below).

On NVIDIA GPUs (directly compiled with nvcc -x cu code.cpp), the expected result is produced.

#include <iostream>
#include <numeric>
#include <vector>

#ifdef __CUDACC__
#define GPU(x) cuda##x
#else
#include <hip/hip_runtime.h>
using hipDeviceProp = hipDeviceProp_t;
#define GPU(x) hip##x
#endif

template <int BlockSize, int WarpSize>
__global__ void kernel(int const *__restrict__ in, int *__restrict__ out) {
  // one shared value per warp
  __shared__ int shared[BlockSize / WarpSize];

  // thread zero of each warp puts a value into shared mem
  if (threadIdx.x % WarpSize == 0)
    shared[threadIdx.x / WarpSize] = in[threadIdx.x];

  __syncthreads();

  // move result to global memory
  if (threadIdx.x < BlockSize / WarpSize)
    out[threadIdx.x] = shared[threadIdx.x];
}

int main(int argc, char *argv[]) {
  using namespace std;

#define CHECK(x)                                                               \
  if ((x) != GPU(Success)) {                                                   \
    cerr << "Error in " << #x << " on line " << __LINE__ << endl;              \
    return 1;                                                                  \
  }

  int device;
  CHECK(GPU(GetDevice)(&device));
  GPU(DeviceProp) props;
  CHECK(GPU(GetDeviceProperties)(&props, device));

  constexpr int block_size = 512;
  const int warp_size = props.warpSize;

  vector<int> a(block_size), b(block_size / warp_size);
  iota(begin(a), end(a), 0.0f);

  int *ad, *bd;
  CHECK(GPU(Malloc)(&ad, a.size() * sizeof(int)));
  CHECK(GPU(Malloc)(&bd, b.size() * sizeof(int)));

  CHECK(GPU(Memcpy)(ad, a.data(), a.size() * sizeof(int),
                    GPU(MemcpyHostToDevice)));

  switch (warp_size) {
  case 32:
    kernel<block_size, 32><<<1, block_size>>>(ad, bd);
    break;
  case 64:
    kernel<block_size, 64><<<1, block_size>>>(ad, bd);
    break;
  default:
    cerr << "ERROR: unsupported warp size" << endl;
    return 1;
  }
  CHECK(GPU(GetLastError)());
  CHECK(GPU(DeviceSynchronize)());

  CHECK(GPU(Memcpy)(b.data(), bd, b.size() * sizeof(int),
                    GPU(MemcpyDeviceToHost)));

  CHECK(GPU(Free)(ad));
  CHECK(GPU(Free)(bd));

  cout << "Result:  ";
  for (auto x : b)
    cout << " " << x;
  cout << endl;

  cout << "Expected:";
  bool correct = true;
  for (int i = 0; i < b.size(); ++i) {
    int x = i * warp_size;
    cout << " " << x;
    if (b[i] != x)
      correct = false;
  }
  cout << endl;

  if (!correct) {
    cerr << "Wrong result!" << endl;
    return 1;
  }

  return 0;
}
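
For reference, the inter-warp/wavefront reduction that the pattern above was distilled from looks roughly like the following. This is a minimal sketch, not the original reduction code; on CUDA the __shfl_down calls would be __shfl_down_sync with an explicit mask.

template <int BlockSize, int WarpSize>
__global__ void block_reduce(int const *__restrict__ in, int *__restrict__ out) {
  // one partial result per warp/wavefront
  __shared__ int partials[BlockSize / WarpSize];

  int value = in[threadIdx.x];

  // intra-warp reduction using shuffles
  for (int offset = WarpSize / 2; offset > 0; offset /= 2)
    value += __shfl_down(value, offset);

  // lane zero of each warp publishes its partial sum (the pattern in the reproducer)
  if (threadIdx.x % WarpSize == 0)
    partials[threadIdx.x / WarpSize] = value;

  __syncthreads();

  // a single thread combines the per-warp partials, reading shared memory
  // that was written by other warps
  if (threadIdx.x == 0) {
    int sum = 0;
    for (int i = 0; i < BlockSize / WarpSize; ++i)
      sum += partials[i];
    *out = sum;
  }
}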

ex-rzr commented Jan 29, 2021

Just a thought:

  1. Does it work with block_size <= 256?
  2. Does it work if you add __launch_bounds__(512) before the kernel? (See the sketch below.)
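
As an illustration of the second suggestion (a sketch only, not a confirmed fix; here the bound is tied to the BlockSize template parameter rather than hard-coded to 512):

template <int BlockSize, int WarpSize>
__global__ void __launch_bounds__(BlockSize) kernel(int const *__restrict__ in,
                                                    int *__restrict__ out) {
  // __launch_bounds__ tells the compiler the maximum block size this kernel
  // will ever be launched with
  __shared__ int shared[BlockSize / WarpSize];

  if (threadIdx.x % WarpSize == 0)
    shared[threadIdx.x / WarpSize] = in[threadIdx.x];

  __syncthreads();

  if (threadIdx.x < BlockSize / WarpSize)
    out[threadIdx.x] = shared[threadIdx.x];
}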


fthaler commented Apr 15, 2021

Sorry for the late update; both seem to be true. Block sizes <= 256 work, and adding launch bounds also makes it work with a block size of 512.
Even more importantly for us, the issue seems to be fixed in ROCm 4.1.1 (I have no access to 4.1.0, but in 4.0.0 it was still present).

fthaler closed this as completed on Apr 15, 2021

ex-rzr commented Apr 15, 2021

Even more importantly for us, the issue seems to be fixed in ROCm 4.1.1 (I have no access to 4.1.0, but in 4.0.0 it was still present).

This is because of this change:
https://github.com/RadeonOpenCompute/ROCm#performance-impact-for-kernel-launch-bound-attribute


robosina commented Mar 19, 2022

I have the same issue on ROCm 5.0.2 :(
Why does enabling optimization lead to incorrect results? I can see that when I use optimization flags (-O1, -O2, -O3), some of my kernels' calculations are set to zero (e.g. 12134231 => 0), which is incorrect.
