
gpu: nvidia: Add support for cublaslt matmul #1972

Open · wants to merge 6 commits into main
Conversation

dylan-angus-codeplay (Contributor)

Description

Adds support for using the cublasLt API for IMMA kernels, and for cases where the bias and/or ReLU post-op can be merged into the cublasLt epilogue.

@vpirogov vpirogov added the platform:gpu-nvidia Codeowner: @oneapi-src/onednn-gpu-nvidia label Jun 21, 2024
@vpirogov vpirogov added this to the v3.6 milestone Jun 21, 2024
@dzarukin (Contributor) left a comment:

The common part looks good to me. I didn't review the cuDNN part.

interop_task(matmul_impl_, engine, cgh, cuda_stream, arg_wt,
arg_src, arg_dst, arg_bias, arg_algo_scratch,
arg_bias_scratch, arg_block_a_scratch, arg_block_b_scratch,
arg_block_c_scratch, arg_src_scale, arg_wei_scale,
arg_dst_scale);
Contributor:

IIUC, all these execute functions are almost identical. Would it make sense to use a single execute function in the base class to avoid boilerplate?
This common execute function could use conditionals to guard some pieces of code, for example:

if (has_runtime_dims())
    init_scratch_buffers(bias_scratch_size, algo_scratch_size);

Contributor:

Added a commit which addresses the executors and removes the duplication.

transform_matrix(lt_handle, a_layout_, a, blocked_a_layout_,
block_a_scratch, trans_a_, streamId);
a = block_a_scratch;
}
Contributor:

I guess the optimal approach would be to do the reorders separately from the execute call, to let the user schedule them and reduce copy overheads.

In any case, is there any benefit to doing this reorder inside the execute call?

Contributor:

Without this transform inside the execute, the IMMA kernels for cublasLt will only run with Ab32a for the weights and the output; with the transform we can support other data formats as well.

The user can currently do the transform for the weights by reordering to Ab32a and passing it to the matmul pd as such; in that case the transform will not be called.

For now we do not have a reorder format mapping the input matrix to the Ampere blocked format (due to the interleaved nature of the blocking), hence we need to do the transform inside the execute.

For the output, to schedule a transform at another time, the user can set the output of the matmul to Ab32a and reorder afterwards.
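
For illustration only, here is a minimal sketch of that user-side weight pre-packing flow with the standard oneDNN C++ API: the matmul pd is created with format_tag::any for the weights so it can pick the blocked layout (e.g. Ab32a), and the plain weights are reordered into it once, outside the execute call. Function and variable names are illustrative, not from this PR.

#include "dnnl.hpp"
using namespace dnnl;

// Query the weights layout chosen by a matmul pd that was created with
// weights format_tag::any, and reorder the plain weights into it once.
memory prepack_weights(engine &eng, stream &s, memory &wei_plain,
        const matmul::primitive_desc &pd) {
    memory::desc blocked_md = pd.weights_desc();
    if (blocked_md == wei_plain.get_desc()) return wei_plain; // already blocked
    memory wei_blocked(blocked_md, eng);
    reorder(wei_plain, wei_blocked).execute(s, wei_plain, wei_blocked);
    s.wait(); // the pre-packed weights can now be reused across executions
    return wei_blocked;
}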

(CUdeviceptr)dst_scale, sizeof(float), streamId);
// For eltwise post-ops, apply the dst scale afterward
if (!with_separate_eltwise_) scale /= host_dst_scale;
}
Contributor:

I believe cublasLt can be configured to take device pointers (CUBLASLT_POINTER_MODE_DEVICE).
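
For reference, that attribute is set on the matmul descriptor roughly as below; this is a sketch with error handling omitted, not code from the PR.

#include <cublasLt.h>

// Read alpha/beta from device memory instead of the host.
void set_device_pointer_mode(cublasLtMatmulDesc_t desc) {
    cublasLtPointerMode_t mode = CUBLASLT_POINTER_MODE_DEVICE;
    cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_POINTER_MODE,
            &mode, sizeof(mode));
}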

Contributor:

That is true. However, the src, wei, and dst scales map to the cublasLt matmul alpha parameter and the post-op sum maps to beta, so I do not believe using device pointer mode would work in this case, apart from creating a SYCL kernel to compute
alpha = (1 * src_scale * wei_scale) / dst_scale on the device side before passing it to the matmul.
Would that be the preferred approach?
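
A hypothetical sketch of such a kernel (assuming the scale and alpha pointers are USM device or shared allocations; nothing below is taken from the PR diff):

#include <sycl/sycl.hpp>

// Fold the three scales into the single alpha that cublasLt consumes in
// device pointer mode: alpha = (1 * src_scale * wei_scale) / dst_scale.
void compute_alpha_on_device(sycl::queue &q, const float *src_scale,
        const float *wei_scale, const float *dst_scale, float *alpha) {
    q.single_task([=]() {
        alpha[0] = (src_scale[0] * wei_scale[0]) / dst_scale[0];
    });
}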

Contributor:

It would only work with oneDNN semantics if this implementation did not support post-ops. If it does, the scales should be kept separate, since src and wei scales are applied before post-ops and dst scales are applied at the end.
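
Schematically (my reading of the oneDNN quantization scheme, not wording from this thread): dst = (1 / dst_scale) * post_ops(src_scale * wei_scale * (src x wei)), i.e. the src/wei scales are folded into the accumulation before the post-ops run, while the dst scale is applied only at the very end.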

Contributor:

The latest commits added support for multi-value src/weight scales, as well as dst scaling applied after the post-ops, if any.

@vpirogov (Member):

make test
enable device_gpu
enable thr_sycl
enable thr_cuda

@ShanoToni (Contributor):

As a note to the latest commit: IMMA kernels do not support an int8 output type for cublasLt with CUDA versions prior to 12. This is now handled at compile time with a define: if the CUDA version is less than 12, the primitive descriptor returns unimplemented for those cases.
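
A hedged sketch of what such a compile-time guard can look like; the helper name and the exact check are assumptions, not copied from the diff.

#include <cuda.h> // defines CUDA_VERSION, e.g. 12000 for CUDA 12.0

// Hypothetical helper: IMMA via cublasLt cannot produce int8 output before
// CUDA 12, so reject int8 destinations when building against older toolkits.
bool imma_int8_dst_supported(bool dst_is_s8) {
    (void)dst_is_s8; // silences unused warnings when the guard compiles out
#if defined(CUDA_VERSION) && (CUDA_VERSION < 12000)
    if (dst_is_s8) return false;
#endif
    return true;
}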

PATH_SUFFIXES lib lib64 bin)

include(FindPackageHandleStandardArgs)
find_package_handle_standard_args(
@densamoilov (Contributor), Jul 30, 2024:

If it's possible, can you print out the version of cublasLt?
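
If printing it at configure time is awkward, the linked library can at least report its version at runtime; a small sketch assuming the standard cuBLASLt API:

#include <cublasLt.h>
#include <cstdio>

int main() {
    // cublasLtGetVersion() returns the version of the loaded cublasLt
    // library as a single integer.
    std::printf("cublasLt version: %zu\n", cublasLtGetVersion());
    return 0;
}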

@@ -39,6 +40,7 @@ namespace {
constexpr impl_list_item_t impl_list[] = REG_MATMUL_P({
GPU_INSTANCE_INTEL(intel::ocl::gemm_matmul_t)
GPU_INSTANCE_INTEL_REF(intel::ocl::ref_matmul_t)
GPU_INSTANCE_NVIDIA(nvidia::cudnn_matmul_lt_t)
Contributor:

Are we putting it above the existing cudnn_matmul_t because we expect the new implementation to work better for most cases?

for (auto e : evts) {
e.wait();
}
matmul_impl_->cleanup();
@densamoilov (Contributor), Jul 30, 2024:

Why do we do cleanup in execute? It seems that the cleanup function merely destroys descriptors.

namespace gpu {
namespace nvidia {

struct cudnn_matmul_base_t : public primitive_t {
Contributor:

(nit)

Suggested change:
-    struct cudnn_matmul_base_t : public primitive_t {
+    struct cudnn_matmul_base_t : public gpu::primitive_t {

struct cudnn_matmul_base_t : public primitive_t {
using primitive_t::primitive_t;

struct pd_base_t : public gpu_matmul_pd_t {
Contributor:

We usually use just pd_t in all cases as its meaning is fully defined by the class that encloses it.

Suggested change:
-    struct pd_base_t : public gpu_matmul_pd_t {
+    struct pd_t : public gpu_matmul_pd_t {

auto arg_dst_scale
= CTX_IN_SYCL_MEMORY(DNNL_ARG_ATTR_SCALES | DNNL_ARG_DST);
protected:
void init_scratch_buffers(std::size_t reorder_scratch_size,
Contributor:

I have one concern with the way we work with the buffers here. They are stored in the primitive object, which can be used from multiple threads simultaneously. What is going to happen if we execute the primitive with runtime dimensions from multiple threads? It seems that it could lead to a race condition and result in undefined behavior.

In general, all primitives in oneDNN are immutable, meaning that their state cannot be changed after initialization is done. This is required for thread safety. The implemented approach breaks that rule.

From the SYCL specification, 4.7.2.3 (Buffer synchronization rules):

The basic rule for the blocking behavior of a buffer destructor is that it blocks if there is some data to write back because a write accessor on it has been created.

So, if we move the buffers from the primitive to the execute function, the execute function will become blocking because the buffers will have to be destroyed upon its exit, which doesn't seem like a good solution to me.

I can only see one option to address the issue. We could allocate the buffers in the execute function but not destroy them upon return. Instead, we could destroy them from the host task, right after the cublasLt operation is finished (we would have to use cuStreamSynchronize for that).

Any thoughts?
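
A very rough sketch of that last option, with every name illustrative and without claiming it matches the PR or covers every SYCL buffer-destruction corner case: the scratch buffer is heap-allocated per execute call (created without a host pointer, so its destructor has no write-back to block on) and released inside the host task once cuStreamSynchronize confirms the cublasLt work is done.

#include <sycl/sycl.hpp>
#include <cuda.h>
#include <cstddef>
#include <cstdint>

void execute_sketch(sycl::queue &q, std::size_t scratch_bytes, CUstream cu_stream) {
    // Per-call scratch instead of mutable state stored in the primitive object.
    auto *scratch = new sycl::buffer<uint8_t, 1>(sycl::range<1>(scratch_bytes));

    q.submit([&](sycl::handler &cgh) {
        auto acc = scratch->get_access<sycl::access::mode::read_write>(cgh);
        cgh.host_task([=](sycl::interop_handle ih) {
            // ... obtain the native pointer behind `acc` via ih and enqueue
            //     the cublasLt matmul on cu_stream ...
            cuStreamSynchronize(cu_stream); // GPU work is finished
            delete scratch; // freed here, so execute() itself never blocks
        });
    });
}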
