[Bettertransformer] Transformers 4.41.0 (torch.SDPA-Bert) breaks BetterTransformer BERT, but works in Transformers 4.40.2 #1902
Comments
Yeah, it looks like BetterTransformer might be expecting a different shape for the attention mask. Can you try the "eager" attention implementation with BetterTransformer to see if it fixes things?
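For reference, a minimal sketch of that workaround, assuming the optimum BetterTransformer API and an illustrative BERT checkpoint:

```python
import torch
from transformers import AutoModel, AutoTokenizer
from optimum.bettertransformer import BetterTransformer

# Load BERT with the "eager" attention implementation instead of the new
# SDPA default, then apply the BetterTransformer conversion on top of it.
model = AutoModel.from_pretrained("bert-base-uncased", attn_implementation="eager")
model = BetterTransformer.transform(model)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer(["hello world", "a longer example sentence"], padding=True, return_tensors="pt")
with torch.inference_mode():
    outputs = model(**inputs)
```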
Thanks for the fast response. Eager works, but it's a breaking change if you don't add it!
Yeah, unfortunately I think there might be cause to put the BetterTransformer optimizations directly into Transformers and deprecate BetterTransformer support for BERT. This means adding BERT here:
It might be better for you to just skip the BetterTransformer conversion. You mentioned that BetterTransformer is still 1.5x faster, where did you get that metric?
@hackyon Might be unusual, but it should give a pretty good idea of end-to-end throughput. The two configurations compared:
- BetterTransformer (eager -> torch._transformer_encoder_fwd)
- SDPA without BetterTransformer
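A rough sketch of this kind of end-to-end comparison (batch size, sequence lengths, and model name are illustrative assumptions, not the original benchmark):

```python
import time
import torch
from transformers import AutoModel, AutoTokenizer
from optimum.bettertransformer import BetterTransformer

texts = ["short text", "a somewhat longer input sentence"] * 16
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer(texts, padding=True, return_tensors="pt")

def bench(model, n=50):
    # Average latency of n forward passes over the same padded batch.
    with torch.inference_mode():
        model(**inputs)  # warmup
        start = time.perf_counter()
        for _ in range(n):
            model(**inputs)
    return (time.perf_counter() - start) / n

sdpa_model = AutoModel.from_pretrained("bert-base-uncased")  # SDPA path in 4.41
bt_model = BetterTransformer.transform(
    AutoModel.from_pretrained("bert-base-uncased", attn_implementation="eager")
)

print("SDPA, no BetterTransformer:", bench(sdpa_model))
print("BetterTransformer:", bench(bt_model))
```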
Result: Please don't remove the option to use BetterTransformer! I do rely on the BetterTransformer patch for BERT.
Hi @michaelfeil, I believe what you are benefiting from is the nested tensor support in BetterTransformer, which allows speedups in batched inference by avoiding padding. This is not integrated in Transformers. Indeed, having breaking changes is not ideal; although SDPA is now supported in Transformers, the nested tensor path above is not.
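To illustrate the padding point, a small conceptual sketch (not the actual BetterTransformer internals):

```python
import torch

# Two sequences of very different lengths, hidden size 768.
seqs = [torch.randn(8, 768), torch.randn(128, 768)]

# Padded batch: every sequence is stretched to the longest length, so the
# short sequence carries 120 rows of padding that still get computed on.
padded = torch.nn.utils.rnn.pad_sequence(seqs, batch_first=True)
print(padded.shape)    # torch.Size([2, 128, 768])
print(padded.numel())  # 2 * 128 * 768

# Nested tensor: each sequence keeps its own length, no padding at all.
nested = torch.nested.nested_tensor(seqs)
print(sum(t.numel() for t in nested.unbind()))  # 8*768 + 128*768
```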
System Info
Who can help?
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction (minimal, reproducible, runnable)
Installing torch==2.3.1 and transformers==4.41.0 (or transformers==4.40.2 for the fix).
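The reproduction snippet itself is not shown above; a minimal sketch of the kind of script that triggers the issue, assuming the optimum BetterTransformer API and an illustrative checkpoint:

```python
import torch
from transformers import AutoModel, AutoTokenizer
from optimum.bettertransformer import BetterTransformer

model = BetterTransformer.transform(AutoModel.from_pretrained("bert-base-uncased"))
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer(["hello", "a longer padded sentence"], padding=True, return_tensors="pt")

# With transformers==4.41.0 this snippet reportedly fails (the attention mask
# arrives in a shape BetterTransformer does not expect); with 4.40.2 it runs.
with torch.inference_mode():
    model(**inputs)
```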
Output:
Solution
Works:
Expected behavior
BetterTransformer is still 1.5x faster than torch.sdpa, so I'm stuck pinning huggingface transformers to <=4.40.2 for now.