
Add flash attention v2 and INT4 CUDA for LLaMA E2E benchmarking #20149

Merged

Conversation

kunal-vaishnavi
Contributor

### Description

This PR adds flash attention v2 and INT4 CUDA support for LLaMA end-to-end benchmarking in PyTorch.

### Motivation and Context

The [flash attention v2](https://github.com/Dao-AILab/flash-attention) algorithm helps improve model performance in PyTorch. Support for INT4 CUDA in PyTorch is provided through the [`bitsandbytes`](https://github.com/TimDettmers/bitsandbytes) package.
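For reference, below is a minimal sketch (not code from this PR) of how flash attention v2 and INT4 weights via `bitsandbytes` are commonly enabled when loading a LLaMA model with Hugging Face `transformers` in PyTorch; the model name and exact options are illustrative and may differ from what the benchmarking scripts use.

```python
# Hedged sketch: enabling flash attention v2 and 4-bit (INT4) bitsandbytes weights
# when loading a LLaMA-style model in PyTorch. Requires the flash-attn and
# bitsandbytes packages to be installed; the model name is illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # assumption: any causal LLaMA checkpoint

# INT4 CUDA weights through the bitsandbytes package.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # use flash attention v2 kernels
    torch_dtype=torch.float16,
    device_map="auto",
)

# Quick generation to exercise the attention and quantized matmul paths.
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```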

kunal-vaishnavi merged commit a0ebd5f into microsoft:main on Mar 30, 2024
94 checks passed
YUNQIUGUO pushed a commit that referenced this pull request Apr 2, 2024
TedThemistokleous pushed a commit to TedThemistokleous/onnxruntime that referenced this pull request May 7, 2024