suggestion for symmetric alibi implementation. #10

Closed
seastar105 opened this issue Aug 18, 2023 · 7 comments

Comments

@seastar105

as you mentioned, alibi is not trivial to apply to a bidirectional encoder.

there's some work from meta, data2vec 2.0, where they used alibi with a bidirectional encoder, specifically for the audio encoder.

voicebox adopted a similar architecture, combining convolutional positional encoding with alibi, and it appears they used a symmetric version of alibi. you can see the alibi bias part of the data2vec 2.0 code here:

https://github.com/facebookresearch/fairseq/blame/100cd91db19bb27277a06a25eb4154c805b10189/examples/data2vec/models/modalities/base.py#L568-L577

what do you think about applying symmetric alibi?
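for readers who haven't seen it, here is a minimal sketch of what a symmetric alibi bias could look like. it is not copied from the fairseq code linked above; the function names are made up for illustration, and the slopes assume the geometric schedule from the original ALiBi paper with a power-of-two head count.

```python
def alibi_slopes(num_heads):
    # Geometric slopes from the ALiBi paper for a power-of-two head count:
    # 2^(-8/n), 2^(-16/n), ..., 2^(-8). Other head counts are out of scope here.
    assert num_heads > 0 and num_heads & (num_heads - 1) == 0, \
        "this sketch assumes a power-of-two head count"
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]


def symmetric_alibi_bias(num_heads, seq_len):
    # bias[h][i][j] = -slope_h * |i - j|
    # The bias is added to the attention logits before the softmax, so keys
    # far from the query are penalized equally on the left and on the right.
    slopes = alibi_slopes(num_heads)
    return [
        [[-m * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]
        for m in slopes
    ]


# hypothetical usage: 8 heads, sequence of 4 frames
bias = symmetric_alibi_bias(num_heads=8, seq_len=4)
```

with 8 heads the steepest slope is 0.5, so the first head subtracts 0.5 from the logit for every step of distance, in either direction.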

@lucidrains
Owner

@seastar105 hey Haesung, i've tried symmetric Alibi before, but the best results i've had are still with dynamic positional bias. that is what i would use if i had to use attention bias

since flash attention came on the scene though, it is not preferable to use attention bias. rotary embeddings should be a good fit here, given it has been proven out in a number of significant models (palm, llama)
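as a rough illustration of the dynamic positional bias idea mentioned above, the sketch below maps each signed relative distance through a tiny MLP to get one bias value per head. everything here is a toy assumption: the function names are hypothetical, the one-hidden-layer ReLU MLP is a simplification, and random weights stand in for parameters that would be learned in a real model.

```python
import random


def make_mlp(hidden, num_heads, seed=0):
    # Random weights stand in for learned parameters (assumption for illustration).
    rng = random.Random(seed)
    w1 = [rng.uniform(-1, 1) for _ in range(hidden)]  # scalar input -> hidden
    b1 = [rng.uniform(-1, 1) for _ in range(hidden)]
    w2 = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(num_heads)]

    def mlp(rel):
        # One hidden layer with ReLU, then a linear map to one value per head.
        h = [max(0.0, w * rel + b) for w, b in zip(w1, b1)]
        return [sum(w * x for w, x in zip(row, h)) for row in w2]

    return mlp


def dynamic_position_bias(mlp, num_heads, seq_len):
    # Evaluate the MLP once per signed relative distance (key index minus
    # query index), then gather the results into a (heads, query, key) table.
    table = {rel: mlp(float(rel)) for rel in range(-(seq_len - 1), seq_len)}
    return [
        [[table[j - i][h] for j in range(seq_len)] for i in range(seq_len)]
        for h in range(num_heads)
    ]


# hypothetical usage: 2 heads, sequence of 5 positions
bias = dynamic_position_bias(make_mlp(hidden=16, num_heads=2), num_heads=2, seq_len=5)
```

because the MLP sees the signed distance, nothing forces the bias for a key one step to the left to equal the bias for a key one step to the right, which is the flexibility a fixed symmetric slope lacks.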

@lucidrains
Owner

@seastar105 actually, i spoke too soon - let me read how they did their implementation this morning. do you know if there was a paper with the necessary data to show the effectiveness of their proposal?

@lucidrains
Owner

lucidrains commented Aug 18, 2023

@seastar105 ok, just took a quick look. i would say that is not correct. how would the network differentiate between left and right given the same relative distance? my past attempt involved allowing the network to learn different slopes for left and right, but it didn't work as well as just letting the bias be completely parameterized by an MLP (like NeRF)
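to make the left/right point concrete, here is a small sketch of an alibi variant with separate slopes per direction. it is not the owner's actual past implementation; the function name is hypothetical and the fixed numbers stand in for slopes that would be learned.

```python
def left_right_alibi_bias(left_slopes, right_slopes, seq_len):
    # Keys to the left of the query (j < i) use left_slopes[h];
    # keys to the right (j > i) use right_slopes[h].
    # When both slopes are equal this reduces to symmetric alibi.
    bias = []
    for ml, mr in zip(left_slopes, right_slopes):
        head = [
            [-(ml if j < i else mr) * abs(i - j) for j in range(seq_len)]
            for i in range(seq_len)
        ]
        bias.append(head)
    return bias


# one head, hypothetical slopes: 0.5 looking left, 0.25 looking right
asym = left_right_alibi_bias([0.5], [0.25], seq_len=3)
sym = left_right_alibi_bias([0.5], [0.5], seq_len=3)

# query at position 1: left neighbour is key 0, right neighbour is key 2
print(asym[0][1][0], asym[0][1][2])  # the two neighbours get different biases
print(sym[0][1][0], sym[0][1][2])    # the two neighbours get the same bias
```

in the symmetric case the bias alone carries no signal about which side a key is on; any direction information has to come from somewhere else in the model.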

@lucidrains
Owner

i don't think i'm going to go with this until i see a follow up paper

@seastar105
Author

seastar105 commented Aug 18, 2023

actually i have no idea, and i don't know of any papers showing why symmetric alibi bias would work better than dynamic positional bias, and especially rope, in a bidirectional encoder, or why symmetric alibi would be better than the asymmetric way. i've just noticed alibi is used for audio in data2vec 2.0, and also voicebox.

my hypothesis was that a learned alibi bias could be a lightweight alternative to conformer, since it penalizes attention scores so that attention stays local (at least, 0.5 is quite a high penalty for attention scores). conformer performs better than vanilla transformers in many works, even for speech generation, as shown in espnet's report (https://arxiv.org/pdf/2010.13956.pdf). so, the suggestion is not backed by concrete experiments.
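a quick worked example of the locality argument, assuming a plain softmax over logits with a linear bias of -slope * distance added (the helper name is made up):

```python
import math


def relative_weight(slope, distance):
    # Factor by which a bias of -slope * distance scales a key's unnormalized
    # softmax weight, compared with an otherwise identical key at distance 0.
    return math.exp(-slope * distance)


for d in (1, 2, 5, 10):
    print(d, round(relative_weight(0.5, d), 4))
```

with a slope of 0.5, a key one step away keeps roughly 61% of its weight, while a key ten steps away is scaled by exp(-5), which is under 1%. so unless the content logits are large, that head attends almost entirely to nearby frames.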

i really appreciate you sharing your experience with alibi bias in bidirectional transformers. i'm gonna compare conformer models trained for ASR against a fine-tuned alibi feature encoder. if i get any decent results, i'll report them here.

@apoorv2904

apoorv2904 commented Feb 22, 2024

@lucidrains Sorry this is a bit late, but I'm writing this for completeness. We updated the paper with details on Alibi bias. We did find that Alibi bias converges faster than fixed positional embeddings or no positional embeddings. It is similar to the symmetric option here: ofirpress/attention_with_linear_biases#5

Note: Flash Attention v2 also supports Alibi bias now.

@lucidrains
Owner

@apoorv2904 sounds good, i think i will stick with rotary for this repo
