
[CUDA] Attention kernel provider option #21344

Merged: 13 commits merged into main from tlwu/attention_kernel_cuda_option on Jul 19, 2024

Conversation

tianleiwu (Contributor) commented on Jul 13, 2024

Description

  • Add a CUDA provider option sdpa_kernel to choose which attention kernel to run, for testing purposes.
  • Allow dumping which attention kernel is used per node.
  • Reserve a flag for cuDNN flash attention, which will be added soon.

CUDA provider option sdpa_kernel

Besides the environment variable, we also support setting the kernel choice as a provider option. Note that the setting is global for the session, which helps when benchmarking each kernel individually.
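For instance, here is a minimal Python sketch of passing the option when creating a session. The flag value and model path are placeholders; the actual flag constants are defined in the CUDA execution provider sources:

import onnxruntime as ort

# Hypothetical integer flag selecting a single SDPA kernel; consult the CUDA
# execution provider sources for the real values.
SDPA_KERNEL_FLAG = 1

session = ort.InferenceSession(
    "model.onnx",  # placeholder model containing attention nodes
    providers=[
        ("CUDAExecutionProvider", {"sdpa_kernel": SDPA_KERNEL_FLAG}),
        "CPUExecutionProvider",
    ],
)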

Attention Kernel Debug Info

Set the environment variable ORT_ENABLE_ATTENTION_KERNEL_DEBUG_INFO=1, and ORT will print the SDPA kernel used for each node.

For example:

ORT_ENABLE_ATTENTION_KERNEL_DEBUG_INFO=1 ./onnxruntime_test_all --gtest_filter=MultiHeadAttentionTest*

It will show debug information about the kernel used in each test:

[ RUN      ] MultiHeadAttentionTest.SelfAttention_Batch2_HeadSize32_NoBias_NoMask_PackedQKV
AttentionKernelOptions: FLASH_ATTENTION=0 EFFICIENT_ATTENTION=0 TRT_FUSED_ATTENTION=1 CUDNN_FLASH_ATTENTION=0 TRT_FLASH_ATTENTION=1 TRT_CROSS_ATTENTION=0 TRT_CAUSAL_ATTENTION=0 MATH=1
Operator=MultiHeadAttention Node=node1 DataType=fp16 TRT_FUSED_ATTENTION=1
AttentionKernelOptions: FLASH_ATTENTION=0 EFFICIENT_ATTENTION=1 TRT_FUSED_ATTENTION=0 CUDNN_FLASH_ATTENTION=0 TRT_FLASH_ATTENTION=0 TRT_CROSS_ATTENTION=0 TRT_CAUSAL_ATTENTION=0 MATH=1
Operator=MultiHeadAttention Node=node1 DataType=fp16 EFFICIENT_ATTENTION=1

In this test case, the debug info shows that one session uses TRT fused attention and the other session uses efficient attention.
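As a sketch of reproducing such a comparison outside the unit tests (assuming a Python workflow; the flag values below are hypothetical), two sessions can be created with different sdpa_kernel settings while the debug variable is set:

import os
import onnxruntime as ort

# Enable per-node attention kernel debug output before creating any session.
os.environ["ORT_ENABLE_ATTENTION_KERNEL_DEBUG_INFO"] = "1"

# Hypothetical flag values; the real constants are defined in the CUDA EP sources.
TRT_FUSED_ATTENTION = 4
EFFICIENT_ATTENTION = 2

for flag in (TRT_FUSED_ATTENTION, EFFICIENT_ATTENTION):
    session = ort.InferenceSession(
        "mha_model.onnx",  # placeholder model with MultiHeadAttention nodes
        providers=[("CUDAExecutionProvider", {"sdpa_kernel": flag}),
                   "CPUExecutionProvider"],
    )
    # Running the model prints which attention kernel each node selected.
    # session.run(None, feeds)  # feeds depend on the model's inputs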

Motivation and Context

tianleiwu marked this pull request as draft on July 13, 2024 01:18
tianleiwu marked this pull request as ready for review on July 18, 2024 01:19
tianleiwu merged commit 6ffaaeb into main on Jul 19, 2024
90 of 97 checks passed
tianleiwu deleted the tlwu/attention_kernel_cuda_option branch on July 19, 2024 20:58