
[Mistral] Add Flash Attention-2 support for mistral #26464

Merged: 28 commits merged into huggingface:main on Oct 3, 2023

Conversation

younesbelkada
Contributor

What does this PR do?

Adds Flash Attention 2 for Mistral (causal LM) - we still need to discuss how to integrate it with local attention.

cc @ArthurZucker @LysandreJik

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    use_flash_attention_2=True,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True
).to(0)

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

text = "Hello my name is"
inputs = tokenizer(text, return_tensors="pt").to(0)

out = model.generate(**inputs, max_new_tokens=4096, use_cache=True, do_sample=True)
print(tokenizer.batch_decode(out, skip_special_tokens=True))

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Sep 28, 2023

The documentation is not available anymore as the PR was closed or merged.

@timlacroix
Contributor

Might be worth adding support for sliding window attention? The parameters are window_size_left = config.sliding_window - 1 and window_size_right = -1, I believe.
See this PR adding support in FlashAttention v2:
Dao-AILab/flash-attention@083e8f5
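
For context, a minimal sketch of what such a call could look like with flash-attn >= 2.3 (which added the window_size argument); the shapes and the Mistral window value below are illustrative assumptions, not the merged code:

import torch
from flash_attn import flash_attn_func

batch, seq_len, num_heads, head_dim = 1, 8192, 32, 128
sliding_window = 4096  # e.g. Mistral-7B's config.sliding_window

q = torch.randn(batch, seq_len, num_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Each query attends to at most the previous `sliding_window` positions and nothing in the future.
out = flash_attn_func(
    q, k, v,
    causal=True,
    window_size=(sliding_window - 1, -1),  # (left, right) as suggested above; (-1, -1) disables windowing
)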


query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)

use_sliding_windows = _is_flash_using_slicing_windows and kv_seq_len > self.config.sliding_window and not self.training
Contributor

typo, should be "_is_flash_using_sliding_windows"

Contributor

Also, I don't think this bool is needed.

If self.config.sliding_window is not None -> use sliding_window always, whether training or inference, no?

Contributor Author

Hmm, good point. For some reason I thought that feature only worked for inference (in the source code's readme, https://github.com/mistralai/mistral-src#sliding-window-to-speed-up-inference-and-reduce-memory-pressure, I read "speed up inference", so I assumed it was inference-only). Will remove that condition.

Comment on lines 438 to 464
if not use_sliding_windows:
    attn_output_unpad = flash_attn_varlen_func(
        query_states,
        key_states,
        value_states,
        cu_seqlens_q=cu_seqlens_q,
        cu_seqlens_k=cu_seqlens_k,
        max_seqlen_q=max_seqlen_in_batch_q,
        max_seqlen_k=max_seqlen_in_batch_k,
        dropout_p=dropout,
        softmax_scale=softmax_scale,
        causal=True,
    )
else:
    attn_output_unpad = flash_attn_varlen_func(
        query_states,
        key_states,
        value_states,
        cu_seqlens_q=cu_seqlens_q,
        cu_seqlens_k=cu_seqlens_k,
        max_seqlen_q=max_seqlen_in_batch_q,
        max_seqlen_k=max_seqlen_in_batch_k,
        dropout_p=dropout,
        softmax_scale=softmax_scale,
        causal=True,
        window_size=(self.config.sliding_window, self.config.sliding_window),
    )
Contributor

Could probably be factored? Something like window_size=(self.config.sliding_window or -1, -1)?
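
A hedged sketch of how that factoring might look, reusing the names from the hunk above (not the merged code):

window_size = (
    (self.config.sliding_window, self.config.sliding_window)
    if use_sliding_windows
    else (-1, -1)  # flash-attn's default, i.e. no windowing
)

attn_output_unpad = flash_attn_varlen_func(
    query_states,
    key_states,
    value_states,
    cu_seqlens_q=cu_seqlens_q,
    cu_seqlens_k=cu_seqlens_k,
    max_seqlen_q=max_seqlen_in_batch_q,
    max_seqlen_k=max_seqlen_in_batch_k,
    dropout_p=dropout,
    softmax_scale=softmax_scale,
    causal=True,
    window_size=window_size,
)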

    )
else:
    attn_output = flash_attn_func(
        query_states, key_states, value_states, dropout, softmax_scale=softmax_scale, causal=True, window_size=(self.config.sliding_window // 2, self.config.sliding_window // 2)
Contributor

Same comment on factoring.
I have no idea what's going on with the // 2 here, but then I don't know what padding_mask is :|

Contributor Author

Oh yeah, ignore the // 2, I used it for testing purposes :D

Contributor Author

padding_mask is the "pure" attention mask (not the causal mask): 0 for padding tokens, 1 otherwise. I use it in the control flow of the flash attention modules to decide whether I need to pad / unpad or not.
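
As an illustration of that control flow, a rough sketch of the unpad -> varlen attention -> repad path driven by such a 0/1 padding mask (helper names from flash_attn.bert_padding; the exact return signature varies across flash-attn versions, so treat this as an assumption-laden sketch):

from flash_attn import flash_attn_varlen_func
from flash_attn.bert_padding import pad_input, unpad_input

def attention_with_padding_mask(q, k, v, padding_mask, softmax_scale=None):
    # q, k, v: (batch, seq_len, num_heads, head_dim); padding_mask: (batch, seq_len), 0 = pad, 1 = real token
    batch, seq_len, _, _ = q.shape
    q_unpad, indices, cu_seqlens, max_seqlen = unpad_input(q, padding_mask)
    k_unpad, _, _, _ = unpad_input(k, padding_mask)
    v_unpad, _, _, _ = unpad_input(v, padding_mask)
    # flash-attn only ever sees the real tokens, concatenated across the batch.
    out_unpad = flash_attn_varlen_func(
        q_unpad, k_unpad, v_unpad,
        cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
        max_seqlen_q=max_seqlen, max_seqlen_k=max_seqlen,
        causal=True, softmax_scale=softmax_scale,
    )
    # Scatter the outputs back to their original padded positions.
    return pad_input(out_unpad, indices, batch, seq_len)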

@younesbelkada
Contributor Author

Sharing some results here vs the official Mistral implementation, which uses xformers.memory_efficient_attention:

| Setting (max_new_tokens=512) | Implementation | Latency (s) | Tokens / s | Max allocated memory (bytes) |
|---|---|---|---|---|
| Context length = 12, bs=1 | HF transformers + FA-2 | 15.12 | 33.85 | 15,218,032,640 |
| Context length = 12, bs=1 | Mistral + mem efficient | 17.23 | 29.71 | 14,636,799,488 |
| Context length = 11K, bs=1 | HF transformers + FA-2 | 16.50 | 31.04 | 18,673,463,808 |
| Context length = 11K, bs=1 | Mistral + mem efficient | 22.51 | 22.75 | 17,303,250,944 |
| Context length = 11K, bs=2, 11K padding tokens on the 2nd sequence | HF transformers + FA-2 | 33.96 | 15.08 | 22,320,273,408 |
| Context length = 11K, bs=2, 11K padding tokens on the 2nd sequence | Mistral + mem efficient | 30.41 | 16.84 | 17,841,224,192 |
| Context length = 11K, bs=4, 11K padding tokens on the 2nd, 3rd and 4th sequences | HF transformers + FA-2 | 48.86 | 10.48 | 29,610,738,688 |
| Context length = 11K, bs=4, 11K padding tokens on the 2nd, 3rd and 4th sequences | Mistral + mem efficient | 45.27 | 11.31 | 18,914,968,576 |

--> Obviously the pad / unpad overhead takes over for the HF implementation, whereas the official repository deals with padding tokens differently. Note also that the max allocated memory increases when padding tokens are added. Also note that the current cache slicing mechanism assumes padding_side=left: generation should be performed with padding_side=left, whereas this has no impact on training since the cache is not used there.
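
For reference, a minimal sketch of preparing left-padded batched inputs (the pad-token choice is an assumption, since Mistral's tokenizer defines no pad token by default):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer.pad_token = tokenizer.eos_token  # assumption: reuse EOS as the pad token
tokenizer.padding_side = "left"

inputs = tokenizer(
    ["Hello my name is", "A much longer second prompt that forces padding on the first one"],
    return_tensors="pt",
    padding=True,
)
# inputs["attention_mask"] is 0 on the left-padded positions and 1 on real tokens.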

Here is a plot that compares pure forward on HF native vs HF + FA-2

(Two screenshots of the forward-pass benchmark plots.)

@younesbelkada
Contributor Author

younesbelkada commented Oct 2, 2023

For the sake of completeness,

Script I used to benchmark transformers + FA2: https://gist.github.com/younesbelkada/691c1dec3da2f0a7de29c1d1096d860f

Script I used to benchmark mistral original source code: https://gist.github.com/younesbelkada/ada0d9c2c48ab034486dbaaf95d29fae (assuming you have cloned their repository and run it under the root folder of the repo)

Collaborator

@ArthurZucker ArthurZucker left a comment


Thanks a lot! Looking good.

docs/source/en/perf_infer_gpu_one.md (resolved)
src/transformers/models/mistral/modeling_mistral.py (outdated, resolved)
src/transformers/models/mistral/modeling_mistral.py (outdated, resolved)
Comment on lines 365 to 369
if not _is_flash_using_sliding_windows:
    logger.warning_once(
        "The current flash attention version does not support sliding window attention, for a more memory efficient implementation"
        " make sure to upgrade flash-attn library."
    )
Collaborator

should go in the import instead of here

Contributor Author

I don't think so, because it would raise the warning whenever FA is installed, even if you don't use the Flash Attention version of the Mistral model.

Comment on lines 408 to 423
# In PEFT, usually we cast the layer norms in float32 for training stability reasons
# therefore the input hidden states gets silently casted in float32. Hence, we need
# cast them back in float16 just to be sure everything works as expected.
# This might slowdown training & inference so it is recommended to not cast the LayerNorms
# in fp32. (LlamaRMSNorm handles it correctly)
input_dtype = query_states.dtype
if input_dtype == torch.float32:
    logger.warning_once(
        "The input hidden states seems to be silently casted in float32, this might be related to"
        " the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in"
        " float16."
    )

    query_states = query_states.to(torch.float16)
    key_states = key_states.to(torch.float16)
    value_states = value_states.to(torch.float16)
Collaborator

Not sure if this should be included since it's PEFT-only, and it always casts to float16 even if the input is bfloat16.

Contributor Author

MistralRMSNorm behaves exactly like LlamaRMSNorm, so it will silently cast the hidden states to fp32; therefore this is needed. As mentioned offline, I will address a proper fix for the bf16 issues.

Comment on lines 879 to 883
logger.warning_once(
    "You are attempting to perform batched generation with padding_side='right'"
    " this may lead to unexpected behaviour for Flash Attention version of Mistral. Make sure to "
    " call `tokenizer.padding_side = 'left'` before tokenizing the input. "
)
Collaborator

Given how bad the outputs were, it might be a good idea to force the padding side / raise an error.

Contributor

@younesbelkada the implementation here doesn't seem specific to generation in any way. Is the error message wrong or is the implementation wrong (or am I missing something)? That is, should I be able to run forward with right padding and flash2?
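
Purely as an illustration of the kind of guard being discussed (not necessarily the exact check that was merged), right padding can be detected from the last column of the padding mask, which is all real tokens only under left padding:

# batch_size and padding_mask as in the attention forward above (hypothetical placement)
if padding_mask is not None and padding_mask[:, -1].sum().item() != batch_size:
    raise ValueError(
        "Right padding detected: call `tokenizer.padding_side = 'left'` before tokenizing "
        "when using the Flash Attention 2 version of Mistral."
    )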

@younesbelkada younesbelkada marked this pull request as ready for review October 3, 2023 08:51
@younesbelkada younesbelkada changed the title [Mistral] Add mistral + FA 2 [Mistral] Add Flash Attention-2 support for mistral Oct 3, 2023
Collaborator

@ArthurZucker ArthurZucker left a comment


Thanks for iterating!

@younesbelkada younesbelkada merged commit ae9a344 into huggingface:main Oct 3, 2023
18 checks passed
@younesbelkada younesbelkada deleted the add-mistral-fa-2 branch October 3, 2023 11:44
@younesbelkada younesbelkada mentioned this pull request Oct 4, 2023
4 tasks
if past_key_value is not None:
    # Activate slicing cache only if the config has a value `sliding_windows` attribute
    if hasattr(self.config, "sliding_window") and kv_seq_len > self.config.sliding_window:
        slicing_tokens = kv_seq_len - self.config.sliding_window


@younesbelkada sorry to bother again, but if kv_seq_len > N times sliding_window, for instance seq_len = 9000 and sliding_window = 4096, shouldn't slicing_tokens be 9000 - 2 x 4096 instead of 9000 - 4096?

Contributor Author

Hmm possibly yes, I need to double check with the original code, I'll get back to you on this!

Contributor Author

@vince62s I could not find any relevant piece of code in the Mistral source code to confirm your statement; can you help me identify the place where you think we indeed need to slice only N*4096?


I think I was mistaken. We are adding keys/values one by one here, so kv_seq_len is never > self.config.sliding_window + 1; it is exactly equal to it. I was misled by line 374.
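
A toy walk-through of the incremental case described above (numbers are illustrative and follow the quoted hunk):

sliding_window = 4096
# During generation the cache grows one token per step, so the slice triggers
# exactly when the cache already holds sliding_window tokens and one new token arrives.
kv_seq_len = sliding_window + 1
slicing_tokens = kv_seq_len - sliding_window
assert slicing_tokens == 1  # only the single oldest cached token gets dropped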

blbadger pushed a commit to blbadger/transformers that referenced this pull request Nov 8, 2023
…26464)

* add FA-2 support for mistral

* fixup

* add sliding windows

* fixing few nits

* v1 slicing cache - logits do not match

* add comment

* fix bugs

* more mem efficient

* add warning once

* add warning once

* oops

* fixup

* more comments

* copy

* add safety checker

* fixup

* Update src/transformers/models/mistral/modeling_mistral.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* copied from

* up

* raise when padding side is right

* fixup

* add doc + few minor changes

* fixup

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
EduardoPach pushed a commit to EduardoPach/transformers that referenced this pull request Nov 18, 2023
…26464)