Update replacing MultiHeadAttention with GroupQueryAttention #19882

Merged

Conversation

kunal-vaishnavi
Contributor

Description

This PR updates the replacement of MultiHeadAttention (MHA) with GroupQueryAttention (GQA). It is related to the changes in PR #18906.
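
For context, the replacement targets ONNX Runtime's `com.microsoft.GroupQueryAttention` contrib op. Below is a minimal, hand-written sketch of what a fused GQA node could look like after this replacement, built with `onnx.helper`; the input layout, attribute values, and tensor names (`packed_qkv`, the caches) are illustrative assumptions, not code from this PR:

```python
from onnx import helper

# Hypothetical fused GQA node: rotary embeddings and the sliding window
# are handled inside the op instead of by separate upstream nodes.
gqa_node = helper.make_node(
    "GroupQueryAttention",
    inputs=[
        "packed_qkv",             # packed Q/K/V from one MatMul (assumed name)
        "",                       # key   - empty when QKV is packed
        "",                       # value - empty when QKV is packed
        "past_key",
        "past_value",
        "seqlens_k",
        "total_sequence_length",
        "cos_cache",              # rotary caches consumed directly by GQA
        "sin_cache",
    ],
    outputs=["attn_output", "present_key", "present_value"],
    name="GroupQueryAttention_0",
    domain="com.microsoft",
    num_heads=32,
    kv_num_heads=8,
    do_rotary=1,              # rotary embedding fused into GQA
    local_window_size=4096,   # sliding-window attention applied within GQA
)
```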

Motivation and Context

The updated replacement of MHA with GQA includes the following fusion changes:

  • Apply the sliding window within GQA
  • Fuse the rotary embeddings within GQA
  • Fuse the 3 MatMuls into 1 packed MatMul if possible
  • Fuse the 3 Adds into 1 packed Add if possible (the packing idea behind the last two items is sketched below)
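
As a rough illustration of the packing, here is a minimal NumPy sketch (not the fusion code itself) showing why the three Q/K/V projections can collapse into one MatMul and one Add: concatenating the weight matrices along the output axis makes a single matrix product compute the same columns, and splitting the packed result recovers identical Q, K, and V. The toy shapes are arbitrary:

```python
import numpy as np

# Toy shapes: 8 query heads, 2 KV heads, head size 64 (GQA-style).
hidden, num_heads, kv_heads, head = 512, 8, 2, 64
rng = np.random.default_rng(0)

x = rng.standard_normal((3, hidden))                 # (tokens, hidden)
Wq = rng.standard_normal((hidden, num_heads * head))
Wk = rng.standard_normal((hidden, kv_heads * head))
Wv = rng.standard_normal((hidden, kv_heads * head))
bq, bk, bv = (rng.standard_normal(W.shape[1]) for W in (Wq, Wk, Wv))

# Unfused pattern: 3 MatMuls + 3 Adds.
q, k, v = x @ Wq + bq, x @ Wk + bk, x @ Wv + bv

# Fused pattern: 1 packed MatMul + 1 packed Add over concatenated weights.
W_qkv = np.concatenate([Wq, Wk, Wv], axis=1)
b_qkv = np.concatenate([bq, bk, bv])
qkv = x @ W_qkv + b_qkv

# Splitting the packed output recovers the same Q, K, and V.
q2, k2, v2 = np.split(qkv, [num_heads * head,
                            (num_heads + kv_heads) * head], axis=1)
assert np.allclose(q, q2) and np.allclose(k, k2) and np.allclose(v, v2)
```

The fusion is conditional ("if possible") because the three MatMuls must share the same input and have weights and biases that can be concatenated, e.g. matching dtypes and a common input dimension.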

kunal-vaishnavi merged commit 4ac98d6 into microsoft:main on Mar 13, 2024
94 checks passed
YUNQIUGUO pushed a commit that referenced this pull request Mar 21, 2024