suggestion for symmetric alibi implementation. #10
Comments
@seastar105 hey Haesung, i've tried symmetric alibi before, but the best results i've had are still with dynamic positional bias. that is what i would use if i had to use an attention bias. since flash attention came on the scene, though, it is not preferable to use an attention bias at all. rotary embeddings should be a good fit here, given they have been proven out in a number of significant models (PaLM, LLaMA)
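For reference, a dynamic positional bias of the kind mentioned above can be sketched in plain Python (all names, sizes, and weights here are hypothetical, chosen only for illustration): a tiny MLP maps the signed relative distance between query and key to one bias value per attention head, so left and right context can receive different biases.

```python
import math
import random

random.seed(0)

# sketch of a dynamic positional bias: a tiny 2-layer MLP maps the signed
# relative distance (i - j) to one bias value per attention head.
# HEADS, HIDDEN, and the random weights are hypothetical stand-ins for
# learned parameters.
HEADS, HIDDEN = 4, 8
w1 = [random.gauss(0.0, 0.5) for _ in range(HIDDEN)]
b1 = [random.gauss(0.0, 0.5) for _ in range(HIDDEN)]
w2 = [[random.gauss(0.0, 0.5) for _ in range(HIDDEN)] for _ in range(HEADS)]

def dynamic_pos_bias(rel_dist):
    """Return one bias per head for a signed relative distance."""
    hidden = [math.tanh(w * rel_dist + b) for w, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

# because the input is signed, the key one step left and the key one step
# right of a query can get different biases in each head
left, right = dynamic_pos_bias(-1.0), dynamic_pos_bias(+1.0)
```

In a real model these weights would be trained end to end and the resulting bias added to the attention logits before the softmax.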
@seastar105 actually, i spoke too soon - let me read through their implementation this morning. do you know if there was a paper with the data needed to show the effectiveness of their proposal?
@seastar105 ok, just took a quick look. i would say that is not correct. how would the network differentiate between left and right given the same relative distance? my past attempt involved allowing the network to learn different slopes for left and right, but that didn't work as well as letting the bias be completely parameterized by an MLP (like NeRF)
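To make the left/right point concrete, here is a tiny sketch (plain Python, hypothetical function name) of a symmetric alibi bias. the penalty depends only on |i - j|, so the two keys flanking a query are indistinguishable to the bias:

```python
def symmetric_alibi_bias(seq_len, slope):
    """bias[i][j] = -slope * |i - j|: identical penalty on either side of i."""
    return [[-slope * abs(i - j) for j in range(seq_len)]
            for i in range(seq_len)]

bias = symmetric_alibi_bias(5, slope=0.5)

# the keys one step left and one step right of query position 2 receive
# the exact same bias, so the bias alone cannot encode direction
assert bias[2][1] == bias[2][3] == -0.5
```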
i don't think i'm going to go with this until i see a follow up paper |
actually i have no idea, and i don't know of papers suggesting why a symmetric alibi bias can work better than dynamic positional bias in a bidirectional encoder, especially versus rope, or why symmetric alibi would beat the asymmetric version. i've just noticed alibi is used for audio in data2vec 2.0, and also in voicebox. my hypothesis was… i really appreciate you sharing your experience with alibi bias in bidirectional transformers. i'm going to compare conformer models trained for ASR with a fine-tuned alibi feature encoder, and if i get any decent results i'll report them here.
@lucidrains Sorry this is a bit late, but writing this for completeness. We updated the paper with details on the alibi bias. We did find alibi bias to converge faster than fixed positional embeddings or no positional embeddings. It is similar to the symmetric option here: ofirpress/attention_with_linear_biases#5. Note: Flash Attention v2 also supports alibi bias now.
@apoorv2904 sounds good, i think i will stick with rotary for this repo |
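Since rotary embeddings are the direction chosen here, a minimal sketch of the core property may be useful (plain Python, hypothetical helper names): after rotating query and key vectors by their absolute positions, their dot product depends only on the relative offset, which is why rope encodes position without any explicit attention bias.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive dim pairs of vec by angles pos * base**(-k/d)."""
    d, out = len(vec), []
    for k in range(0, d, 2):
        theta = pos * base ** (-k / d)
        x, y = vec[k], vec[k + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.0, 0.5, -0.5]
k = [0.3, 0.7, -0.2, 0.1]

# the attention score depends only on the relative offset: 5 - 3 == 2 - 0
relative_only = abs(dot(rope(q, 3), rope(k, 5))
                    - dot(rope(q, 0), rope(k, 2))) < 1e-9
```

Position 0 leaves the vector unchanged (all rotation angles are zero), so rope degrades gracefully to plain attention at the origin.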
as you mentioned, alibi is not trivial to apply to a bidirectional encoder.
there's some work from meta, data2vec 2.0, where they employed alibi with a bidirectional encoder, specifically for the audio encoder.
voicebox adopted a similar architecture, incorporating both convolutional positional encoding and alibi, and it appears they utilized a symmetric version of alibi. you can see the alibi bias part of the data2vec 2.0 code here:
https://github.com/facebookresearch/fairseq/blame/100cd91db19bb27277a06a25eb4154c805b10189/examples/data2vec/models/modalities/base.py#L568-L577
what do you think about applying symmetric alibi?
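For reference, the symmetric alibi discussed above can be sketched in plain Python (this is an illustrative sketch, not the linked fairseq code): one slope per head following the geometric schedule from the alibi paper, applied to the absolute distance |i - j| so the bias is the same in both directions.

```python
def alibi_slopes(num_heads):
    """Geometric slope schedule 2^(-8/n), 2^(-16/n), ... from the alibi
    paper (exact for power-of-two head counts)."""
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]

def symmetric_alibi(num_heads, seq_len):
    """bias[h][i][j] = -slope_h * |i - j|: the symmetric (bidirectional)
    variant, added to attention logits before the softmax."""
    return [[[-s * abs(i - j) for j in range(seq_len)]
             for i in range(seq_len)]
            for s in alibi_slopes(num_heads)]

bias = symmetric_alibi(num_heads=2, seq_len=4)
```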