`last_hidden_state` has a different shape than `hidden_states[-1]` in the output of `SeamlessM4Tv2SpeechEncoder` if adapter layers are present #31946
Hey @anferico, thanks for opening this issue and for the thorough explanations!

First, please note that the speech encoder of M4T v2 has been open-sourced, and that we added it as a separate model (Wav2Vec2-BERT). Note that the attention mask is downsampled when passed through the adapter layers.

That said, you're indeed correct in saying that `last_hidden_state` and `hidden_states[-1]` end up with different shapes. What could be interesting, though, is to pass the downsampled attention mask in the output of the speech encoder of M4T v2 and W2V2-BERT, to avoid recomputing the attention mask every time. What do you think of this? Of course, I'm open to discussing this if you have compelling reasons! Hope that it helps!
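For readers following along, here is a minimal sketch (not the actual transformers code) of what "the attention mask is downsampled when passed through the adapter layers" means in practice. All layer parameters below are illustrative assumptions, not the real SeamlessM4Tv2 adapter configuration:

```python
import torch
import torch.nn as nn

# Hypothetical adapter-style strided convolution; kernel_size/stride/padding
# are illustrative values only.
kernel_size, stride, padding, dilation = 8, 2, 3, 1
conv = nn.Conv1d(1024, 1024, kernel_size, stride=stride, padding=padding)

hidden_states = torch.randn(2, 100, 1024)  # (batch, seq_len, hidden_dim)
downsampled = conv(hidden_states.transpose(1, 2)).transpose(1, 2)
print(downsampled.shape[1])  # 50: the sequence length was roughly halved

# Recomputing the mask at the new resolution: map each example's true length
# through the Conv1d output-length formula, then rebuild the padding mask.
attention_mask = torch.ones(2, 100, dtype=torch.long)
attention_mask[1, 60:] = 0  # second example carries 40 frames of padding
seq_lens = attention_mask.sum(dim=-1)
new_lens = (seq_lens + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1
new_mask = torch.arange(downsampled.shape[1])[None, :] < new_lens[:, None]
```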
Thanks for looking into this @ylacombe!

Regarding your second point, that's exactly what I'm proposing. The main problem for me is that I have to manually compute the attention mask every time I run a forward pass of the speech encoder, which is not ideal. So would you be in favor of defining a new output field for the downsampled attention mask?
Precisely! Don't hesitate to ping me when you do it!

The way I understand it, facebook/w2v-bert-2.0 is the pre-trained checkpoint used in SeamlessM4Tv2 to initialize the speech encoder. It's indeed without adapters, because they were added during the M4T training. Note that one big difference is that the M4T speech encoder is NC-licensed, whereas the license of W2V is much more permissive. That's also why we pushed the latter checkpoint rather than the former.

It's really interesting to see the difference in performance though. Out of curiosity, how large is the gap between the two models (and for what languages)? Also, would you be interested in finding ways to bridge this gap?
Sure, will update you once I'm done. Now I see the difference between the two, thank you for the insight 👍🏼

I haven't properly measured the performance gap between w2v-bert-2.0 and SeamlessM4Tv2. What I can tell you is that on an English ASR task (LibriTTS), the architecture based on w2v-bert-2.0 simply didn't converge within a reasonable time, whereas the one based on SeamlessM4Tv2 did.

Definitely, but what do you mean exactly? I'm open to any sort of collaboration if that's what you mean 😀
Hey @anferico, this is weird, have you made sure to add an adapter layer? See how it's done in the blog post:

```python
from transformers import Wav2Vec2BertForCTC

# `processor` is assumed to have been built earlier (as in the blog post),
# e.g. a Wav2Vec2BertProcessor wrapping a tokenizer and a feature extractor.
model = Wav2Vec2BertForCTC.from_pretrained(
    "facebook/w2v-bert-2.0",
    attention_dropout=0.0,
    hidden_dropout=0.0,
    feat_proj_dropout=0.0,
    mask_time_prob=0.0,
    layerdrop=0.0,
    ctc_loss_reduction="mean",
    add_adapter=True,  # appends the convolutional adapter layers
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
```
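(For reference: if I'm reading `Wav2Vec2BertConfig` correctly, `add_adapter=True` is what appends the strided convolutional adapter layers on top of the encoder, with the stride controlled by `adapter_stride`; that strided convolution is exactly the source of the downsampling discussed in this issue.)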
I kinda did, but mine was a custom setting (I did not use `Wav2Vec2BertForCTC`).
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Commenting to keep this alive until I find some time to work on it 😞

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info

- `transformers` version: 4.43.0.dev0

Who can help?

@sanchit-gandhi @ylacombe
Information

Tasks

- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
1. Instantiate a `SeamlessM4Tv2SpeechEncoder` with 1 or more adapter layer(s) having stride > 1, for example as in the sketch below.
2. Observe that the shape of `last_hidden_state` is different than the shape of `hidden_states[-1]`.
3. Observe that the shape of `last_hidden_state` is not compatible with the shape of `attentions[-1]`.
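A sketch of steps 1-3, assuming the `facebook/seamless-m4t-v2-large` checkpoint; the `speech_encoder` attribute and the 160-dim mel feature size are my assumptions:

```python
import torch
from transformers import SeamlessM4Tv2Model

model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")
encoder = model.speech_encoder  # has adapter layers with stride > 1

input_features = torch.randn(2, 100, 160)  # (batch, frames, mel features)
attention_mask = torch.ones(2, 100, dtype=torch.long)

out = encoder(
    input_features,
    attention_mask=attention_mask,
    output_hidden_states=True,
    output_attentions=True,
)

print(out.last_hidden_state.shape)  # (batch, seq_len_1, dim): downsampled by the adapter
print(out.hidden_states[-1].shape)  # (batch, seq_len_2, dim): pre-adapter resolution
print(out.attentions[-1].shape)     # also at the pre-adapter resolution seq_len_2
```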
Expected behavior

Why this is a problem (in my view)
1. The shape of `last_hidden_state` is different than the shape of `hidden_states[-1]`.
2. Say `SeamlessM4Tv2SpeechEncoder` is used as a speech encoder in a model architecture used for ASR, the full model architecture being speech encoder + custom text decoder. If we train this model with batch size > 1, the speech encoder will be fed padded audio sequences. As a result, when feeding encoded audio sequences (output of the speech encoder) to the custom text decoder, we have to construct a proper `attention_mask` to make sure padded positions are treated as such. Normally, to do this, we would take `speech_encoder_output.attentions` from the speech encoder output, convert them to an `attention_mask` by looking at which elements are > 0 (i.e. which positions in the sequence have an attention weight > 0), then apply the obtained `attention_mask` to `speech_encoder_output.last_hidden_state`. However, this cannot be done since, as mentioned above, `seq_len_1 != seq_len_2`.
3. Because of 2), the only way to apply an `attention_mask` to `speech_encoder_output.last_hidden_state` is to manually figure out the correct shape of the `attention_mask` by considering how many convolutional layers are present in `speech_encoder.adapter` (an instance of `SeamlessM4Tv2ConformerAdapter`) and what their `padding`, `dilation`, `kernel_size` and `stride` parameters are, then compute the output length (`seq_len_1`) as a function of the input length (`seq_len_2`), as shown below.
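Presumably this is the standard `torch.nn.Conv1d` output-length computation, applied once per strided convolution in the adapter; a sketch:

```python
def conv_out_length(seq_len_2: int, padding: int, dilation: int, kernel_size: int, stride: int) -> int:
    # Standard Conv1d output-length formula (see the torch.nn.Conv1d docs);
    # iterate it over every conv layer in the adapter to obtain seq_len_1.
    return (seq_len_2 + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1
```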
Proposed workaround

Instead of what is currently done at the end of `SeamlessM4Tv2SpeechEncoder.forward()`, do something like the sketch that follows.
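A sketch of the kind of change being proposed, assuming the forward pass currently ends by building a `Wav2Vec2BaseModelOutput`; the subclass name and the `downsampled_attention_mask` variable are hypothetical, not existing transformers API:

```python
from dataclasses import dataclass
from typing import Optional

import torch
from transformers.modeling_outputs import Wav2Vec2BaseModelOutput


@dataclass
class Wav2Vec2BaseModelOutputWithAttentionMask(Wav2Vec2BaseModelOutput):
    # New field carrying the mask at the adapter's output resolution (seq_len_1).
    attention_mask: Optional[torch.Tensor] = None


# At the end of SeamlessM4Tv2SpeechEncoder.forward(), instead of (roughly):
#
#     return Wav2Vec2BaseModelOutput(
#         last_hidden_state=hidden_states,
#         hidden_states=encoder_outputs.hidden_states,
#         attentions=encoder_outputs.attentions,
#     )
#
# return the downsampled mask alongside the hidden states:
#
#     return Wav2Vec2BaseModelOutputWithAttentionMask(
#         last_hidden_state=hidden_states,
#         hidden_states=encoder_outputs.hidden_states,
#         attentions=encoder_outputs.attentions,
#         attention_mask=downsampled_attention_mask,
#     )
```

This way, downstream code (e.g. a custom text decoder) could consume `speech_encoder_output.attention_mask` directly instead of recomputing it from the adapter's convolution parameters on every forward pass.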