GQA Rotary and Packed QKV with Flash #18906

aciddelgado · 2023-12-21T17:23:22Z

Description

These changes add rotary embedding and packed qkv input to gqa. As of now, the changes are only supported with Flash-Attention (SM >= 80) but should soon be supported with Memory Efficient Attention as well.

Motivation and Context

With the fusion of rotary embedding into this Attention op, we hope to observe some perf gain. The packed QKV should also provide some perf gain in the context of certain models, like Llama2, that would benefit from running ops on the fused QKV matrix, rather than the separate Q, K, and V.

onnxruntime/core/graph/contrib_ops/bert_defs.cc

onnxruntime/contrib_ops/cuda/bert/group_query_attention_impl.cu

onnxruntime/core/graph/contrib_ops/bert_defs.cc

tools/ci_build/requirements-transformers-test.txt

onnxruntime/contrib_ops/cuda/bert/group_query_attention_helper.h

onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_api.cc

onnxruntime/contrib_ops/cuda/bert/group_query_attention.h

### Description These changes add rotary embedding and packed qkv input to gqa. As of now, the changes are only supported with Flash-Attention (SM >= 80) but should soon be supported with Memory Efficient Attention as well. ### Motivation and Context With the fusion of rotary embedding into this Attention op, we hope to observe some perf gain. The packed QKV should also provide some perf gain in the context of certain models, like Llama2, that would benefit from running ops on the fused QKV matrix, rather than the separate Q, K, and V. --------- Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>

### Description This PR updates the replacement of MultiHeadAttention (MHA) with GroupQueryAttention (GQA). It is related to the changes in [this PR](#18906). ### Motivation and Context The updated replacement of MHA with GQA includes the following fusion changes. - Apply sliding window within GQA - Fuse the rotary embeddings within GQA - Fuse the 3 MatMuls into 1 packed MatMul if possible - Fuse the 3 Adds into 1 packed Add if possible

aciddelgado added 30 commits October 12, 2023 13:46

squash merge

ef82d4d

add unit test and fix build

54f2526

undo work in attention_impl file

e2e1157

reduce tests and change default behavior for past-kv is nullptr

415440b

test compatibility w/ no cuda

0e84d2d

exclude from amd

9ba6963

fix test script

6573133

work on local attention flash

bd79b6d

vscode idk

fb8e386

make kernels more efficient and make present output required

a87f211

Merge branch 'main' into aciddelgado/gqa_memeff_v2

16bda28

merge main and memeff changes

08f553d

address comments

eb12522

update ContribOperators.md

6c6aead

merge main

6e540e3

Merge branch 'aciddelgado/gqa_memeff_v2' into aciddelgado/gqa_local

b3a9d0f

local working with flash not memeff

3d7f3bf

clarify input and output formats memory efficient attention

db307f3

max sequence length for memory efficient attention

e7a50ee

clang and fix test file

bc1cf0a

undo clang on unrelated files

44ca857

Merge branch 'main' into aciddelgado/gqa_memeff_v2

f08495f

check value and key inputs

660c8fd

key and value dont check for nullptr since they are required

b9c4d15

fix up test script

afe22a4

fix packedmha, clean test, merge gqa_memeff_v2 branch changes

0125678

Merge branch 'main' into aciddelgado/gqa_local

353c4f5

local working with recent changes

94b5efb

no local w memeff

ddb7a66

undo unnecessary changes

d33c69b

tianleiwu reviewed Jan 2, 2024

View reviewed changes

onnxruntime/core/graph/contrib_ops/bert_defs.cc Outdated Show resolved Hide resolved

tianleiwu reviewed Jan 4, 2024

View reviewed changes

onnxruntime/contrib_ops/cuda/bert/group_query_attention_impl.cu Outdated Show resolved Hide resolved

tianleiwu reviewed Jan 4, 2024

View reviewed changes

onnxruntime/contrib_ops/cuda/bert/group_query_attention_impl.cu Show resolved Hide resolved

aciddelgado added 8 commits January 11, 2024 08:43

Merge branch 'main' into aciddelgado/gqa_rotary_packed

a821588

test

b473283

run pipeline

152d920

retrigger checks

f271a74

conflict and requirements change

ba13a3f

disable transformers test

723637e

add todo and format

87533ef

fix lint issue

615d500

yufenglee reviewed Jan 13, 2024

View reviewed changes

onnxruntime/core/graph/contrib_ops/bert_defs.cc Outdated Show resolved Hide resolved

merge conflict

dba1e7e

tianleiwu reviewed Jan 22, 2024

View reviewed changes

tools/ci_build/requirements-transformers-test.txt Show resolved Hide resolved

tianleiwu reviewed Jan 22, 2024

View reviewed changes

onnxruntime/contrib_ops/cuda/bert/group_query_attention_helper.h Show resolved Hide resolved

onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_api.cc Outdated Show resolved Hide resolved

onnxruntime/contrib_ops/cuda/bert/group_query_attention.h Outdated Show resolved Hide resolved

tianleiwu previously approved these changes Jan 22, 2024

View reviewed changes

address comments

e7863b3

aciddelgado dismissed tianleiwu’s stale review via e7863b3 January 22, 2024 21:27

tianleiwu previously approved these changes Jan 22, 2024

View reviewed changes

lintrunner

5b55424

aciddelgado dismissed tianleiwu’s stale review via 5b55424 January 22, 2024 22:50

tianleiwu approved these changes Jan 23, 2024

View reviewed changes

aciddelgado merged commit cbb29d8 into main Jan 24, 2024
95 of 98 checks passed

aciddelgado deleted the aciddelgado/gqa_rotary_packed branch January 24, 2024 00:34

tianleiwu added the release:1.17.0 label Jan 26, 2024

tianleiwu mentioned this pull request Jan 30, 2024

[ORT 1.17.0 Release] Cherry-pick Final Round #19327

Merged

kunal-vaishnavi mentioned this pull request Mar 13, 2024

Update replacing MultiHeadAttention with GroupQueryAttention #19882

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GQA Rotary and Packed QKV with Flash #18906

GQA Rotary and Packed QKV with Flash #18906

aciddelgado commented Dec 21, 2023

GQA Rotary and Packed QKV with Flash #18906

GQA Rotary and Packed QKV with Flash #18906

Conversation

aciddelgado commented Dec 21, 2023

Description

Motivation and Context