
IFU to v2.0.4 #14
Merged 289 commits into flash_attention_for_rocm on Nov 3, 2023

Conversation

@jayz0123 commented Sep 19, 2023

  • Renamed "unpadded" -> "varlen"; the APIs are now mha_varlen_fwd & mha_varlen_bwd (see the sketch after this list).
  • Changed mha_fwd & mha_bwd to take inputs where every sequence in the batch has the same length.
  • setup.py now installs for either a CUDA or a ROCm system.
  • Renamed test_flash_attn -> test_flash_attn_rocm for the ROCm unit tests.
  • Benchmark testing.
  • Synced to PR bwd optimizing based on profiling #15.
  • Added unit tests for mha_fwd & mha_bwd.
  • MQA/GQA support.
  • All unit tests pass.
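
For orientation, here is a minimal sketch of how the variable-length ("varlen") path is typically called from Python. The function name flash_attn_varlen_func, its import path, and the exact signature follow the upstream flash-attn 2.x interface and are assumptions here; this fork may expose the mha_varlen_fwd / mha_varlen_bwd bindings differently.

```python
import torch
from flash_attn import flash_attn_varlen_func  # assumed import path and name

# Two sequences of lengths 3 and 5, packed into one (total_tokens, nheads, headdim)
# tensor and described by cumulative sequence lengths instead of padding.
cu_seqlens = torch.tensor([0, 3, 8], dtype=torch.int32, device="cuda")
total_tokens, nheads, headdim = 8, 4, 64
q = torch.randn(total_tokens, nheads, headdim, dtype=torch.float16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=5, max_seqlen_k=5,
    dropout_p=0.0, causal=True,
)  # out: (total_tokens, nheads, headdim)
```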

Current unit test result (PyTorch 2.0.0; ROCm 5.6): 3968 passed, 63 skipped.

Current performance on MI250 (docker pull rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1):

|      | fwd TFLOPS | bwd TFLOPS | total TFLOPS |
| ---- | ---------- | ---------- | ------------ |
| fp16 | 52.16      | 39.93      | 42.49        |
| bf16 | 52.36      | 30.25      | 34.21        |

tridao and others added 30 commits (December 25, 2022 14:29 onward), including:
  • Follow xFormers's DISTPATCH_BOOL. Haven't tested it on Windows.
  • fixed cross attention typeerror
@jayz0123 (Author):

A new environment variable, FLASH_ATTENTION_INTERNAL_ENABLE_TIME_KERNEL, toggles whether kernel running times are reported.
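
As a usage sketch only: the comment does not say whether the flag is read at build time or at run time, or which values it accepts, so the snippet below assumes a run-time on/off switch with "1" meaning enabled.

```python
import os

# Assumed usage: enable kernel-time reporting before the extension is imported.
# Both the value "1" and the run-time semantics are assumptions, not confirmed here.
os.environ["FLASH_ATTENTION_INTERNAL_ENABLE_TIME_KERNEL"] = "1"

import flash_attn  # kernel running times would then be reported during execution
```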

@jayz0123 (Author) commented Oct 26, 2023

[BUG] Previously, in the older version of FA, the z and softmax_lse tensors for the grouped GEMM were created with the maximum sequence lengths and no padding, so their strides differed from batch to batch. This behaviour caused wrong results from CK. Fixing it.
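
A small illustration (not the PR's actual code) of the failure mode described above: per-batch tensors sized to each batch's own sequence length end up with different strides, whereas one buffer padded to the global maximum gives every batch the same stride, which is the layout a grouped kernel expecting a uniform per-batch stride can consume.

```python
import torch

# Hypothetical shapes for illustration only.
seqlens = [3, 5]   # two batches with different sequence lengths
nheads = 8

# Unpadded layout: one tensor per batch, each sized to its own length.
unpadded = [torch.empty(nheads, s) for s in seqlens]
print([t.stride() for t in unpadded])   # [(3, 1), (5, 1)] -- strides differ per batch

# Padded layout: a single buffer sized to the maximum length, so every batch
# shares the same stride.
padded = torch.empty(len(seqlens), nheads, max(seqlens))
print(padded.stride())                  # (40, 5, 1) -- identical for every batch
```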

@fsx950223: Please remove *_hip.hpp

.gitignore Outdated
@@ -24,7 +24,10 @@ var/
.vscode/settings.

# Generated files
csrc/flash_attn_rocm/src/*hip*
@fsx950223 (Oct 30, 2023): better to use *_hip.*?

@jayz0123 (Author): makes sense

@fsx950223 (Oct 30, 2023): Yes, file names such as hip_flash_attention.cpp or hip_hacks.hpp are ignored too?
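
For context on the two patterns being discussed, a short sketch of how they differ; the example file names are the ones mentioned in the thread, and which files the build actually generates is not restated here.

```
# Pattern from the diff: matches any file name containing "hip", so it would also
# hide hand-written sources such as hip_flash_attention.cpp or hip_hacks.hpp.
csrc/flash_attn_rocm/src/*hip*

# Suggested narrower pattern: only files ending in "_hip.<ext>",
# i.e. the generated *_hip.cpp / *_hip.hpp files.
csrc/flash_attn_rocm/src/*_hip.*
```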

@sabreshao merged commit edc7698 into flash_attention_for_rocm on Nov 3, 2023.