
[core / attention] Fix fused attention generation with newest transformers version #146

Merged 4 commits into main on Nov 3, 2023

Conversation

younesbelkada
Collaborator

What does this PR do?

Currently, with the latest transformers release, using AutoAWQ + fused attention with the cache is broken.
In huggingface/transformers#25242 the caching logic changed slightly: when using the transformers cache with a past key value length of 1 (as done here), the input ids are now sliced as follows:

input_ids = input_ids[:, 1:]

This means the `if seqlen == 1:` assumption used to handle the transformers cache case needs to be adapted. One can instead check whether `past_key_values` is present in kwargs and contains valid tensors, and slice out only the last token in that case (see the sketch below).
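
For illustration, here is a minimal sketch of that check, assuming the fused attention forward receives the cache via a `past_key_values` kwarg and a `[batch, seqlen, hidden]` hidden-states tensor; the exact kwarg names and cache layout in AutoAWQ's fused attention may differ:

```python
import torch

def slice_for_transformers_cache(hidden_states: torch.Tensor, **kwargs) -> torch.Tensor:
    """Keep only the newest token when a populated transformers cache is passed in."""
    # `past_key_values` mirrors the kwarg name from the PR description; the
    # actual name and cache structure inside AutoAWQ may differ.
    past_key_values = kwargs.get("past_key_values")
    has_cache = past_key_values is not None and len(past_key_values) > 0

    if has_cache and hidden_states.shape[1] > 1:
        # With a valid cache, only the last position needs to be computed,
        # regardless of how transformers sliced the input ids upstream.
        hidden_states = hidden_states[:, -1:, :]
    return hidden_states
```

Such a helper would be called at the top of the fused attention forward in place of the old `if seqlen == 1:` check.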

cc @casper-hansen

@younesbelkada
Collaborator Author

I also checked out this commit in transformers: huggingface/transformers#26162 (before huggingface/transformers#25242) and can confirm it works in both cases.

@casper-hansen
Owner

Tested and looks good. No performance regression on my end.

@casper-hansen merged commit 92a403b into main on Nov 3, 2023
@casper-hansen
Owner

I take back the remark about there being no performance regression. I tested using my benchmark.py script found in examples and saw no difference, but with the .generate() function it is 50% slower.

Slicing the hidden states in every attention layer for every token adds a lot of overhead. We should instead slice at a higher level, e.g. in the model. However, that requires implementing a LlamaModel, MistralModel, and AquilaModel. This is probably the right solution, but it requires a bit of work, which I will look into (see the sketch below).
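
A rough sketch of what slicing once at the model level could look like, assuming a thin model wrapper that owns the embedding and the decoder layers; `LlamaLikeModel`, `embed_tokens`, and `layers` are hypothetical placeholders, not the actual AutoAWQ classes:

```python
import torch
import torch.nn as nn

class LlamaLikeModel(nn.Module):
    """Hypothetical model-level wrapper used only to illustrate the idea."""

    def __init__(self, vocab_size: int = 32000, hidden_size: int = 4096, layers=None):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        self.layers = nn.ModuleList(layers if layers is not None else [])

    def forward(self, input_ids: torch.Tensor, past_key_values=None) -> torch.Tensor:
        # Slice once here, at the model level, instead of re-slicing the hidden
        # states inside every fused attention layer on every generated token.
        if past_key_values is not None and len(past_key_values) > 0:
            input_ids = input_ids[:, -1:]

        hidden_states = self.embed_tokens(input_ids)
        for layer in self.layers:
            hidden_states = layer(hidden_states, past_key_values=past_key_values)
        return hidden_states
```

The per-layer check from the merged fix could then be dropped, since every attention layer would already receive a single-token hidden state during cached generation.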

@younesbelkada
Collaborator Author

Thanks for benchmarking! Yes, slicing only once at the model level makes sense!
