
ROCm gfx10 and gfx11 not supported #1429

Open
Madouura opened this issue Oct 15, 2023 · 30 comments

@Madouura

Looks like only gfx9 and below are supported here ATM.
gfx1010, gfx1012, gfx1030, gfx1100, gfx1101, and gfx1102 are affected.
Here's the error log. I'll link you to the build environment once I submit a PR to nixpkgs.

ginkgo-hpc-hip> In file included from /build/source/build/hip/preconditioner/jacobi_advanced_apply_instantiate.64.hip.cpp:66:
ginkgo-hpc-hip> /build/source/common/cuda_hip/preconditioner/jacobi_advanced_apply_kernel.hpp.inc:84:9: error: no matching function for call to 'tiled_partition'
ginkgo-hpc-hip>         group::tiled_partition<subwarp_size>(group::this_thread_block());
ginkgo-hpc-hip>         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ginkgo-hpc-hip> /build/source/build/hip/preconditioner/jacobi_advanced_apply_instantiate.64.hip.cpp:98:21: note: in instantiation of function template specialization 'gko::kernels::hip::jacobi::kernel::advanced_adaptive_apply<64, 64, 4, float, int>' requested here
ginkgo-hpc-hip>             kernel::advanced_adaptive_apply<max_block_size, subwarp_size,
ginkgo-hpc-hip>                     ^
ginkgo-hpc-hip> /build/source/hip/components/cooperative_groups.hip.hpp:392:5: note: candidate template ignored: requirement '64UL <= kernels::hip::config::warp_size' was not satisfied [with Size = 64]
ginkgo-hpc-hip>     tiled_partition(const Group&)
ginkgo-hpc-hip>     ^
ginkgo-hpc-hip> In file included from /build/source/build/hip/preconditioner/jacobi_advanced_apply_instantiate.64.hip.cpp:66:
ginkgo-hpc-hip> /build/source/common/cuda_hip/preconditioner/jacobi_advanced_apply_kernel.hpp.inc:50:9: error: no matching function for call to 'tiled_partition'
ginkgo-hpc-hip>         group::tiled_partition<subwarp_size>(group::this_thread_block());
ginkgo-hpc-hip>         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ginkgo-hpc-hip> /build/source/build/hip/preconditioner/jacobi_advanced_apply_instantiate.64.hip.cpp:105:21: note: in instantiation of function template specialization 'gko::kernels::hip::jacobi::kernel::advanced_apply<64, 64, 4, float, int>' requested here
ginkgo-hpc-hip>             kernel::advanced_apply<max_block_size, subwarp_size,
ginkgo-hpc-hip>
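For context, here is a minimal sketch (simplified names, not Ginkgo's actual implementation) of the constraint the compiler reports above: tiled_partition<Size> only participates in overload resolution when Size does not exceed the compile-time warp size, so instantiating it with Size = 64 fails once warp_size is 32 on gfx10/gfx11.

#include <type_traits>

namespace config {
// On gfx10/gfx11 (wavefront size 32) this ends up as 32, so 64-wide
// instantiations are rejected by the enable_if below.
constexpr unsigned warp_size = 32;
}  // namespace config

struct thread_block {};  // stand-in for group::this_thread_block()

template <unsigned Size>
struct tiled_partition_t {};

// Simplified analogue of the candidate at cooperative_groups.hip.hpp:392:
// enabled only when Size <= warp_size.
template <unsigned Size, typename Group,
          typename = std::enable_if_t<(Size <= config::warp_size)>>
tiled_partition_t<Size> tiled_partition(const Group&)
{
    return {};
}

int main()
{
    auto ok = tiled_partition<32>(thread_block{});   // fine: 32 <= 32
    // auto bad = tiled_partition<64>(thread_block{});  // error: no matching function,
    //                                                  // requirement 64 <= warp_size not satisfied
    (void)ok;
}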
@upsj
Member

upsj commented Oct 15, 2023

How did you determine the affected versions? It looks like they still use wavefronts of 64 threads, so there might be something going wrong with the macro we use to determine whether we are compiling for CUDA or ROCm.

@Madouura
Author

Madouura commented Oct 15, 2023

Simple. I ran the build about six times and added whichever targets I saw failing in the log to the GPU targets exclude list in my package.

@Madouura
Author

PR: #261155
Relevant files: NixOS/nixpkgs@93464fe

@upsj
Member

upsj commented Oct 15, 2023

Can reproduce this outside NixOS.

@upsj
Member

upsj commented Oct 15, 2023

First observations: this is the failing command:

/opt/rocm-5.6.0/bin/hipcc
-c /home/tribizel/ginkgo/build-tmp/hip/preconditioner/jacobi_advanced_apply_instantiate.64.hip.cpp
-o /home/tribizel/ginkgo/build-tmp/hip/CMakeFiles/ginkgo_hip.dir/preconditioner/./ginkgo_hip_generated_jacobi_advanced_apply_instantiate.64.hip.cpp.o
-std=c++14
-DGKO_COMPILING_HIP
--amdgpu-target=gfx1010
-fPIC
-fPIC
-O3
-DNDEBUG
-I/opt/rocm-5.6.0/include -I/home/tribizel/ginkgo/build-tmp/hip/.. -I/opt/rocm-5.6.0/include -I/opt/rocm-5.6.0/include -I/opt/rocm-5.6.0/include -I/opt/rocm-5.6.0/include -I/opt/rocm-5.6.0/include -I/home/tribizel/ginkgo/build-tmp/include -I/home/tribizel/ginkgo/include -I/home/tribizel/ginkgo -I/usr/local/include -I/home/tribizel/ginkgo/build-tmp/include -I/home/tribizel/ginkgo/include -I/home/tribizel/ginkgo -I/usr/local/include -I/usr/local/include -I/opt/rocm-5.6.0/include -I/opt/rocm-5.6.0/include -I/opt/rocm-5.6.0/include -I/opt/rocm-5.6.0/include -I/opt/rocm-5.6.0/include/hiprand -I/opt/rocm-5.6.0/include -I/opt/rocm-5.6.0/include/rocrand -I/opt/rocm-5.6.0/include -I/opt/rocm-5.6.0/include -I/opt/rocm-5.6.0/include

Removing --amdgpu-target (which is a deprecated flag, BTW; we should change it) fixes it. hipcc runs two compilation passes (one host, one device), where the second one includes the generated GPU binary blob. With --amdgpu-target, the first compilation already fails. This might have to do with /opt/rocm-5.6.0/amdgcn/bitcode/oclc_wavefrontsize64_off.bc, which is being added for gfx1010.

@Madouura
Author

Madouura commented Oct 15, 2023

You mean here?

list(APPEND GINKGO_AMD_ARCH_FLAGS --amdgpu-target=${target})

I already replace it with --offload-arch:

# `--amdgpu-target` is deprecated
substituteInPlace cmake/hip.cmake \
  --replace "--amdgpu-target" "--offload-arch"

@Madouura
Author

Madouura commented Oct 15, 2023

Removing list(APPEND GINKGO_AMD_ARCH_FLAGS --amdgpu-target=${target}) works, but I'm not sure if that should be done.
We specify those flags for a reason, right?

@upsj
Member

upsj commented Oct 15, 2023

Probable fix: per https://llvm.org/docs/AMDGPUUsage.html#target-features, use -mwavefrontsize64 for those targets. Not sure if it has an effect on targets that support size-64 wavefronts natively.

@Madouura
Author

Madouura commented Oct 15, 2023

I'm getting this error when prepending -mwavefrontsize64 to --amdgpu-target.

In file included from /build/source/include/ginkgo/core/base/exception.hpp:41:
ginkgo-hpc-hip> In file included from /build/source/include/ginkgo/core/base/types.hpp:50:
ginkgo-hpc-hip> /nix/store/7y1f77gd62zy52b52ivapa4inkjcb9mq-clr-5.7.0/include/hip/hip_runtime.h:41:2: error: HIP is not supported on the specified GPU ARCH with wavefront size 64
ginkgo-hpc-hip> #error HIP is not supported on the specified GPU ARCH with wavefront size 64
ginkgo-hpc-hip>  ^
ginkgo-hpc-hip> [ 20%] Building CXX object reference/CMakeFiles/ginkgo_reference.dir/solver/cb_gmres_kernels.cpp.o
ginkgo-hpc-hip> 1 warning and 1 error generated when compiling for gfx1010.
ginkgo-hpc-hip> CMake Error at ginkgo_hip_generated_residual_norm_kernels.hip.cpp.o.cmake:200 (message):
ginkgo-hpc-hip>   Error generating file
ginkgo-hpc-hip>   /build/source/build/hip/CMakeFiles/ginkgo_hip.dir/stop/./ginkgo_hip_generated_residual_norm_kernels.hip.cpp.o
ginkgo-hpc-hip> 
ginkgo-hpc-hip> make[2]: *** [hip/CMakeFiles/ginkgo_hip.dir/build.make:399: hip/CMakeFiles/ginkgo_hip.dir/stop/ginkgo_hip_generated_residual_norm_kernels.hip.cpp.o] Error 1
ginkgo-hpc-hip> make[2]: *** Waiting for unfinished jobs....
ginkgo-hpc-hip> [ 21%] Building CXX object reference/CMakeFiles/ginkgo_reference.dir/solver/common_gmres_kernels.cpp.o
#if __HIP_DEVICE_COMPILE__ && !__GFX8__ && !__GFX9__ && __AMDGCN_WAVEFRONT_SIZE == 64
#error HIP is not supported on the specified GPU ARCH with wavefront size 64
#endif
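As an aside, a minimal sketch of the direction a portable fix could take, assuming __AMDGCN_WAVEFRONT_SIZE is available during HIP device compilation (as the guard above implies): derive the wavefront size from the compiler instead of forcing 64.

// Sketch only: pick up the wavefront size from the predefined macro so that
// gfx10/11 (wave32) and gfx9 (wave64) both compile.
#if defined(__AMDGCN_WAVEFRONT_SIZE)
constexpr int device_wavefront_size = __AMDGCN_WAVEFRONT_SIZE;  // 32 on gfx10/11, 64 on gfx9
#else
constexpr int device_wavefront_size = 64;  // conservative fallback for older toolchains
#endif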

@upsj
Member

upsj commented Oct 15, 2023

Thanks for the report, that's disappointing. It should be possible to compile Ginkgo with wavefront size 32, but unfortunately we don't have any gfx10/11 GPUs available right now to test this, and I'm not comfortable claiming support without checking that the tests run correctly. Some dependencies (rocBLAS, rocSPARSE, rocFFT) might also not work on them, though I haven't checked yet.

@Madouura
Author

If you can give me a patch set, I can give you at least a preliminary indication of whether it works.
I'm on gfx1030; theoretically, if it works for that, it should also work for 1010, 1012, 1100, 1101, and 1102.

@upsj
Member

upsj commented Oct 15, 2023

I just noticed we are doing this almost correctly already. As a first check, you could try

diff --git a/hip/CMakeLists.txt b/hip/CMakeLists.txt
index 21b573b6cd..6740a83d99 100644
--- a/hip/CMakeLists.txt
+++ b/hip/CMakeLists.txt
@@ -69,7 +69,3 @@ endif()
 
-if (GINKGO_HIP_PLATFORM MATCHES "${HIP_PLATFORM_NVIDIA_REGEX}")
-    set(GKO_HIP_JACOBI_MAX_BLOCK_SIZE 32)
-else()
-    set(GKO_HIP_JACOBI_MAX_BLOCK_SIZE 64)
-endif()
+set(GKO_HIP_JACOBI_MAX_BLOCK_SIZE 32)
 if(GINKGO_JACOBI_FULL_OPTIMIZATIONS)

and check which tests run correctly. I only see a handful that will definitely fail.

@Madouura
Author

Madouura commented Oct 15, 2023

Looks like that fixed pretty much everything.
Going to try that test with only one GPU target and see if everything passes.
https://github.com/NixOS/nixpkgs/blob/59dabc593819109a57df9bf03731ecf217beb9dc/pkgs/development/libraries/ginkgo-hpc/default.nix#L175-L188
ginkgo-test.log

@upsj
Member

upsj commented Oct 15, 2023

Note that this is not a complete fix; it's intended only as a workaround for consumer-grade GPUs. Though I'm not sure we actually need Jacobi blocks larger than 32.

@Madouura
Author

ginkgo-test.log
Effectively the same test output using only gfx1030.

@upsj
Member

upsj commented Oct 15, 2023

That's how it looks when no kernels are being run in CUDA, or when we compiled for the wrong architecture. We don't seem to be catching errors of that kind explicitly. I guess it's safe to say that for now, we can only support server-grade GPUs (Radeon VII and the Instinct series).

@yhmtsai
Member

yhmtsai commented Oct 15, 2023

Could you also try applying the following change to hip/base/config.hip.hpp?

diff --git a/hip/base/config.hip.hpp b/hip/base/config.hip.hpp
index 8a53037ffa..ea33ce5f09 100644
--- a/hip/base/config.hip.hpp
+++ b/hip/base/config.hip.hpp
@@ -56,7 +56,7 @@ struct config {
      * The type containing a bitmask over all lanes of a warp.
      */
 #if GINKGO_HIP_PLATFORM_HCC
-    using lane_mask_type = uint64;
+    using lane_mask_type = uint32;
 #else  // GINKGO_HIP_PLATFORM_NVCC
     using lane_mask_type = uint32;
 #endif
@@ -66,7 +66,7 @@ struct config {
      * `device_functions.h`.
      */
 #if GINKGO_HIP_PLATFORM_HCC
-    static constexpr uint32 warp_size = warpSize;
+    static constexpr uint32 warp_size = 32;
 #else  // GINKGO_HIP_PLATFORM_NVCC
     static constexpr uint32 warp_size = 32;
 #endif

Setting warpSize to 32 might not be necessary, because AMD should already use 32 for those archs.

@upsj
Member

upsj commented Oct 15, 2023

@yhmtsai that likely won't make a difference, since uint32 and uint64 are inter-convertible. The use of warpSize is why this issue came up in the first place; otherwise we wouldn't be trying to instantiate the 64-block-size Jacobi kernels.
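For completeness, a small host-side probe (hypothetical helper, not part of Ginkgo) that reports the warp size the HIP runtime advertises for the installed GPU; on gfx10/11 it should print 32, on gfx9 it prints 64.

#include <hip/hip_runtime.h>
#include <cstdio>

int main()
{
    hipDeviceProp_t prop;
    if (hipGetDeviceProperties(&prop, 0) != hipSuccess) {
        std::printf("no HIP device found\n");
        return 1;
    }
    // prop.warpSize is 64 on gfx9 (GCN/CDNA) and 32 on gfx10/11 (RDNA)
    std::printf("device 0: %s, warpSize = %d\n", prop.name, prop.warpSize);
    return 0;
}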

@yhmtsai
Member

yhmtsai commented Oct 15, 2023

The voting functions will consider all bits of the mask type.

@upsj
Member

upsj commented Oct 15, 2023

The result of __ballot gets cast from uint32 to uint64, which means the upper 32 bits are zero.

@yhmtsai
Member

yhmtsai commented Oct 15, 2023

Yes, but that only works for checking any; it does not work for all, because some of the bits will be zero, so all will always be false. We do safer operations in the SYCL backend, but in HIP we assume the mask type width equals the wavefront size.
I think ctz or related functions also rely on the length of the mask type.
Of course, there may be things I'm missing, because I haven't checked in detail.
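To make the concern concrete, a small host-side illustration with made-up values (not Ginkgo code): if a 32-lane ballot result is widened to a 64-bit mask and compared against an all-ones 64-bit full mask, the all check can never succeed, while any still works.

#include <cstdint>
#include <cstdio>

int main()
{
    // 32 active lanes, every predicate true: the ballot fills only the low 32 bits.
    const std::uint64_t ballot = std::uint64_t{0xFFFFFFFFu};
    const std::uint64_t full_mask = ~std::uint64_t{0};  // assumes 64 lanes

    const bool any = (ballot & full_mask) != 0;          // true, as expected
    const bool all = (ballot & full_mask) == full_mask;  // always false: upper 32 bits are zero

    std::printf("any = %d, all = %d\n", any, all);
}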

@upsj
Member

upsj commented Oct 15, 2023

I think we're safe there:

        if (Size == config::warp_size) {
            return __all(predicate);
        } else {
            return (__ballot(predicate) & data_.mask) == data_.mask;
        }

but you are right, many of the failing tests use either cooperative groups or rocBLAS.

@Madouura
Author

Could you also try applying the following change to hip/base/config.hip.hpp?

Sorry for the late reply.
With these changes more tests are passing.
ginkgo-test.log

@upsj
Member

upsj commented Oct 16, 2023

That is encouraging. Based on the test output, I would be so bold as to blame all non-MPI failures on rocBLAS, which we use for our GEMM operations in all tests involving the system matrix. Can you try running the rocBLAS test suite?
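If running the full test suite is too heavy, a quicker sanity check is a standalone GEMM smoke test along these lines (a sketch, not an official rocBLAS example; the rocblas/rocblas.h include path and error handling are simplified and may differ between ROCm versions):

#include <hip/hip_runtime.h>
#include <rocblas/rocblas.h>
#include <cstdio>
#include <vector>

int main()
{
    const int n = 4;  // small n x n GEMM: A is all ones, B is all twos
    std::vector<float> a(n * n, 1.0f), b(n * n, 2.0f), c(n * n, 0.0f);

    float *da, *db, *dc;
    hipMalloc(&da, n * n * sizeof(float));
    hipMalloc(&db, n * n * sizeof(float));
    hipMalloc(&dc, n * n * sizeof(float));
    hipMemcpy(da, a.data(), n * n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(db, b.data(), n * n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(dc, c.data(), n * n * sizeof(float), hipMemcpyHostToDevice);

    rocblas_handle handle;
    rocblas_create_handle(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha * A * B + beta * C, column-major n x n matrices
    const rocblas_status status =
        rocblas_sgemm(handle, rocblas_operation_none, rocblas_operation_none,
                      n, n, n, &alpha, da, n, db, n, &beta, dc, n);

    hipMemcpy(c.data(), dc, n * n * sizeof(float), hipMemcpyDeviceToHost);
    // every entry of C should equal n * 1 * 2
    bool ok = (status == rocblas_status_success);
    for (float v : c) ok = ok && (v == 2.0f * n);
    std::printf("rocBLAS sgemm smoke test: %s\n", ok ? "PASS" : "FAIL");

    rocblas_destroy_handle(handle);
    hipFree(da);
    hipFree(db);
    hipFree(dc);
    return ok ? 0 : 1;
}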

@Madouura
Author

I assume you're talking about gtest in https://github.com/ROCmSoftwarePlatform/rocBLAS/tree/develop/clients.
It will take me a bit of time since I'll need to rearrange the derivation, but sure.

@upsj
Member

upsj commented Oct 16, 2023

I'll also have a gfx1102 available soon, so if this is too much effort, I can also take over ;)

@Madouura
Author

I need to rearrange a lot of the rocm derivations to use a layout more like ginkgo in the linked PR.
Basically, I'll get to it. Just might take a while since rocBLAS is one of the later things in the toolchain.
By all means though, if you can get to it faster, please do.

@upsj
Member

upsj commented Oct 16, 2023

I would just build rocBLAS separately outside of nix, since I would assume if it's an issue, it's less likely to be a configuration issue, and more likely AMD not testing rocBLAS extensively on all of their GPUs.

@Madouura
Author

I would just build rocBLAS separately outside of nix, since I would assume if it's an issue, it's less likely to be a configuration issue

While it's likely not a nix issue, I'm not going to discount the possibility. I can probably spin up a VM or something and do some A/B testing, with each setup having a rocBLAS installation, to make sure.

and more likely AMD not testing rocBLAS extensively on all of their GPUs

I've packaged most of their software, and I can tell you that's very likely.
