
gpu: nvidia: Add support for cublaslt matmul #1972

Open · wants to merge 6 commits into main
Conversation

dylan-angus-codeplay (Contributor)

Description

Adds support for using the cublasLt API for IMMA kernels, and for cases where the bias and/or ReLU post-op can be merged into the cublasLt epilogue.

@vpirogov vpirogov added the platform:gpu-nvidia Codeowner: @oneapi-src/onednn-gpu-nvidia label Jun 21, 2024
@vpirogov vpirogov added this to the v3.6 milestone Jun 21, 2024
@dzarukin (Contributor) left a comment:

The common part looks good to me. I didn't review the cuDNN part.

interop_task(matmul_impl_, engine, cgh, cuda_stream, arg_wt,
arg_src, arg_dst, arg_bias, arg_algo_scratch,
arg_bias_scratch, arg_block_a_scratch, arg_block_b_scratch,
arg_block_c_scratch, arg_src_scale, arg_wei_scale,
arg_dst_scale);
Contributor:

IIUC, all these execute functions are almost identical. Would it make sense to use a single execute function in the base class to avoid boilerplate?
This common execute function could use conditionals to guard some pieces of code, for example:

if (has_runtime_dims())
    init_scratch_buffers(bias_scratch_size, algo_scratch_size);

Contributor:

Added a commit which addresses the executors and removes the duplication.

transform_matrix(lt_handle, a_layout_, a, blocked_a_layout_,
block_a_scratch, trans_a_, streamId);
a = block_a_scratch;
}
Contributor:

I guess the optimal approach would be to do the reorders separately from the execute call, to let the user schedule them and reduce copy overheads.

In any case, is there any benefit to doing this reorder inside the execute call?

Contributor:

Without this transform inside the execute, the IMMA kernels for cublasLt will only run with Ab32a for the weights and the output; with the transform we can support other data formats as well.

The user can currently do the transform for the weights by reordering to Ab32a and passing it to the matmul pd as such; in that case the transform will not be called.

For now we do not have a reorder format mapping the input matrix to the Ampere blocked format (due to the interleaved nature of the blocking), hence we need to do the transform inside the execute.

For the output, to schedule a transform at another time, the user can set the output of the matmul to Ab32a and reorder afterwards.
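
For illustration only, here is a minimal sketch of that user-side weight pre-packing flow with the standard oneDNN C++ API: the matmul pd is created with format_tag::any for the weights so it can pick the blocked layout (e.g. Ab32a), and the plain weights are reordered into it once, outside the execute call. Function and variable names are illustrative, not from this PR.

#include "dnnl.hpp"
using namespace dnnl;

// Query the weights layout chosen by a matmul pd that was created with
// weights format_tag::any, and reorder the plain weights into it once.
memory prepack_weights(engine &eng, stream &s, memory &wei_plain,
        const matmul::primitive_desc &pd) {
    memory::desc blocked_md = pd.weights_desc();
    if (blocked_md == wei_plain.get_desc()) return wei_plain; // already blocked
    memory wei_blocked(blocked_md, eng);
    reorder(wei_plain, wei_blocked).execute(s, wei_plain, wei_blocked);
    s.wait(); // the pre-packed weights can now be reused across executions
    return wei_blocked;
}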

(CUdeviceptr)dst_scale, sizeof(float), streamId);
// For eltwise post-ops, apply the dst scale afterward
if (!with_separate_eltwise_) scale /= host_dst_scale;
}
Contributor:

I believe cublasLt can be configured to take device pointers (CUBLASLT_POINTER_MODE_DEVICE).
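
For reference, that attribute is set on the matmul descriptor roughly as below; this is a sketch with error handling omitted, not code from the PR.

#include <cublasLt.h>

// Read alpha/beta from device memory instead of the host.
void set_device_pointer_mode(cublasLtMatmulDesc_t desc) {
    cublasLtPointerMode_t mode = CUBLASLT_POINTER_MODE_DEVICE;
    cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_POINTER_MODE,
            &mode, sizeof(mode));
}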

Contributor:

That is true. However, the src, wei, and dst scales map to the cublasLt matmul alpha parameter and the post-op sum maps to beta, so I do not believe using device pointer mode would work in this case, apart from creating a SYCL kernel to compute
alpha = (1 * src_scale * wei_scale) / dst_scale on the device side before passing it to the matmul.
Would that be the preferred approach?
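
A hypothetical sketch of such a kernel (assuming the scale and alpha pointers are USM device or shared allocations; nothing below is taken from the PR diff):

#include <sycl/sycl.hpp>

// Fold the three scales into the single alpha that cublasLt consumes in
// device pointer mode: alpha = (1 * src_scale * wei_scale) / dst_scale.
void compute_alpha_on_device(sycl::queue &q, const float *src_scale,
        const float *wei_scale, const float *dst_scale, float *alpha) {
    q.single_task([=]() {
        alpha[0] = (src_scale[0] * wei_scale[0]) / dst_scale[0];
    });
}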

Contributor:

It would only work with oneDNN semantics if this implementation did not support post-ops. If it does, the scales should be kept separate, since src and wei scales are applied before post-ops and dst scales are applied at the end.
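
Schematically (my reading of the oneDNN quantization scheme, not wording from this thread): dst = (1 / dst_scale) * post_ops(src_scale * wei_scale * (src x wei)), i.e. the src/wei scales are folded into the accumulation before the post-ops run, while the dst scale is applied only at the very end.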

Contributor:

The latest commits added support for multi-value src/weight scales, as well as dst scaling applied after the post-ops, if any.

@vpirogov (Member):

make test
enable device_gpu
enable thr_sycl
enable thr_cuda

@ShanoToni (Contributor):

As a note to the latest commit: IMMA kernels do not support an int8 output type for cublasLt with CUDA versions prior to 12. This is now handled at compile time with a define: if the CUDA version is less than 12, the primitive descriptor returns unimplemented for those cases.
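
A hedged sketch of what such a compile-time guard can look like; the helper name and the exact check are assumptions, not copied from the diff.

#include <cuda.h> // defines CUDA_VERSION, e.g. 12000 for CUDA 12.0

// Hypothetical helper: IMMA via cublasLt cannot produce int8 output before
// CUDA 12, so reject int8 destinations when building against older toolkits.
bool imma_int8_dst_supported(bool dst_is_s8) {
    (void)dst_is_s8; // silences unused warnings when the guard compiles out
#if defined(CUDA_VERSION) && (CUDA_VERSION < 12000)
    if (dst_is_s8) return false;
#endif
    return true;
}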

PATH_SUFFIXES lib lib64 bin)

include(FindPackageHandleStandardArgs)
find_package_handle_standard_args(
@densamoilov (Contributor), Jul 30, 2024:

If it's possible, can you print out the version of cublasLt?
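
If printing it at configure time is awkward, the linked library can at least report its version at runtime; a small sketch assuming the standard cuBLASLt API:

#include <cublasLt.h>
#include <cstdio>

int main() {
    // cublasLtGetVersion() returns the version of the loaded cublasLt
    // library as a single integer.
    std::printf("cublasLt version: %zu\n", cublasLtGetVersion());
    return 0;
}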

@@ -39,6 +40,7 @@ namespace {
constexpr impl_list_item_t impl_list[] = REG_MATMUL_P({
GPU_INSTANCE_INTEL(intel::ocl::gemm_matmul_t)
GPU_INSTANCE_INTEL_REF(intel::ocl::ref_matmul_t)
GPU_INSTANCE_NVIDIA(nvidia::cudnn_matmul_lt_t)
Contributor:

Are we putting it above the existing cudnn_matmul_t because we expect the new implementation to work better for most cases?

for (auto e : evts) {
e.wait();
}
matmul_impl_->cleanup();
@densamoilov (Contributor), Jul 30, 2024:

Why do we do cleanup in execute? It seems that the cleanup function merely destroys descriptors.

namespace gpu {
namespace nvidia {

struct cudnn_matmul_base_t : public primitive_t {
Contributor:

(nit)

Suggested change:
-    struct cudnn_matmul_base_t : public primitive_t {
+    struct cudnn_matmul_base_t : public gpu::primitive_t {

struct cudnn_matmul_base_t : public primitive_t {
using primitive_t::primitive_t;

struct pd_base_t : public gpu_matmul_pd_t {
Contributor:

We usually use just pd_t in all cases as its meaning is fully defined by the class that encloses it.

Suggested change:
-    struct pd_base_t : public gpu_matmul_pd_t {
+    struct pd_t : public gpu_matmul_pd_t {

auto arg_dst_scale
= CTX_IN_SYCL_MEMORY(DNNL_ARG_ATTR_SCALES | DNNL_ARG_DST);
protected:
void init_scratch_buffers(std::size_t reorder_scratch_size,
Contributor:

I have one concern with the way we work with the buffers here. They are stored in the primitive object, which can be used from multiple threads simultaneously. What is going to happen if we execute the primitive with runtime dimensions from multiple threads? It seems that it could lead to a race condition and result in undefined behavior.

In general, all primitives in oneDNN are immutable, meaning that their state cannot be changed after initialization is done. This is required for thread safety. The implemented approach breaks that rule.

From the SYCL specification, 4.7.2.3 (Buffer synchronization rules):

The basic rule for the blocking behavior of a buffer destructor is that it blocks if there is some data to write back because a write accessor on it has been created.

So, if we move the buffers from the primitive to the execute function, the execute function will become blocking because the buffers will have to be destroyed upon its exit, which doesn't seem like a good solution to me.

I can only see one option to address the issue. We could allocate the buffers in the execute function but not destroy them upon return. Instead, we could destroy them from the host task, right after the cublasLt operation is finished (we would have to use cuStreamSynchronize for that).

Any thoughts?
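
A very rough sketch of that last option, with every name illustrative and without claiming it matches the PR or covers every SYCL buffer-destruction corner case: the scratch buffer is heap-allocated per execute call (created without a host pointer, so its destructor has no write-back to block on) and released inside the host task once cuStreamSynchronize confirms the cublasLt work is done.

#include <sycl/sycl.hpp>
#include <cuda.h>
#include <cstddef>
#include <cstdint>

void execute_sketch(sycl::queue &q, std::size_t scratch_bytes, CUstream cu_stream) {
    // Per-call scratch instead of mutable state stored in the primitive object.
    auto *scratch = new sycl::buffer<uint8_t, 1>(sycl::range<1>(scratch_bytes));

    q.submit([&](sycl::handler &cgh) {
        auto acc = scratch->get_access<sycl::access::mode::read_write>(cgh);
        cgh.host_task([=](sycl::interop_handle ih) {
            // ... obtain the native pointer behind `acc` via ih and enqueue
            //     the cublasLt matmul on cu_stream ...
            cuStreamSynchronize(cu_stream); // GPU work is finished
            delete scratch; // freed here, so execute() itself never blocks
        });
    });
}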
