
[ORTModule] ATen Efficient Attention and Triton Flash Attention #17959

Merged 4 commits into main from weicwang/attn on Oct 27, 2023

Conversation

@er3x3 (Contributor) commented Oct 16, 2023

This PR adds support for efficient attention and flash attention in ORTModule, including:

  • Using ATen to call efficient attention, which requires PyTorch 2.2.0 dev or newer. Set ORTMODULE_USE_EFFICIENT_ATTENTION=1 to enable it.
  • Integrating Triton flash attention, which requires triton==2.0.0.dev20221202 and an A100 or H100 GPU. Set ORTMODULE_USE_FLASH_ATTENTION=1 to enable it.
  • A Python transformer tool that matches sub-graphs by config so new transformers can be written quickly.

The current transformers support an attention mask for both efficient attention and flash attention, and dropout for efficient attention only. To support more training scenarios (such as the causal mask in GPT2), more transformers need to be added.

The feature is guarded by system environment variables and won't affect any current behavior if not enabled. Since it requires specific PyTorch/Triton versions, related tests are not added for now.
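For readers who want to try this out, here is a minimal sketch (not part of this PR) of how the environment-variable gating could be exercised with an ORTModule-wrapped model. The toy module, tensor shapes, and the assumption that `scaled_dot_product_attention` is the pattern being rewritten are illustrative only; the PyTorch 2.2.0 dev and onnxruntime-training requirements come from the PR description above.

```python
# Minimal usage sketch (assumptions noted inline), not the PR's own test code.
# Requires a PyTorch 2.2.0 dev build and an onnxruntime-training install with CUDA.
import os

# Set the guard variable before ORTModule builds its training graph.
os.environ["ORTMODULE_USE_EFFICIENT_ATTENTION"] = "1"
# os.environ["ORTMODULE_USE_FLASH_ATTENTION"] = "1"  # Triton path; needs A100/H100.

import torch
from onnxruntime.training.ortmodule import ORTModule


class ToyAttention(torch.nn.Module):
    def forward(self, q, k, v):
        # scaled_dot_product_attention is the usual PyTorch entry point for
        # efficient attention; whether this exact pattern is what the graph
        # transformer matches is an assumption for this sketch.
        return torch.nn.functional.scaled_dot_product_attention(q, k, v)


model = ORTModule(ToyAttention().cuda())
# Illustrative shapes: (batch, heads, sequence, head_dim).
q = k = v = torch.randn(2, 8, 128, 64, device="cuda", requires_grad=True)
out = model(q, k, v)
out.sum().backward()
```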

er3x3 requested a review from askhade on October 16, 2023 08:27

@github-advanced-security (bot) left a comment


lintrunner found more than 10 potential problems in the proposed changes. Check the Files changed tab for more details.

er3x3 merged commit b7408f7 into main on Oct 27, 2023
87 of 90 checks passed
er3x3 deleted the weicwang/attn branch on October 27, 2023 02:29
kleiti pushed a commit to kleiti/onnxruntime that referenced this pull request Mar 22, 2024