BART & FSMT: fix decoder not returning hidden states from the last layer #8597
Conversation
yay, a first fsmt user that found an issue! Thank you! OK, here I literally copied the bart implementation, and it didn't have that line you added: transformers/src/transformers/models/bart/modeling_bart.py, lines 627 to 629 at 36a1991.
So most likely, if this is indeed a bug, it affects many models. Now let us diagnose what's going on, starting from this block: transformers/src/transformers/models/bart/modeling_bart.py, lines 597 to 600 at 36a1991.
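For readers without the permalink handy, the pattern in question is roughly the following (a simplified, from-memory sketch with toy layers, not the actual modeling_bart.py code):

```python
import torch
from torch import nn

# Toy stand-in for the decoder layers; illustrative only.
layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(3)])
x = torch.randn(2, 5, 8)  # (batch, seq, hidden), e.g. the embedding output
output_hidden_states = True

all_hidden_states = () if output_hidden_states else None
for decoder_layer in layers:
    if output_hidden_states:
        all_hidden_states += (x,)  # records the state *entering* the layer
    x = decoder_layer(x)
# nothing is appended after the loop, so the last layer's output x is lost

print(len(all_hidden_states))  # 3 entries, but 4 (embeddings + 3 layers) are expected
```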
Looking closely, the current code doesn't add the output of the last layer to the collected hidden states. The only thing I'm not sure about is whether we need the very first entry (the state going into layer 0) in the output at all. Either way, I'd code it differently: I'd add x right where each layer produces it, with the initial x added before the loop. Adding it after the loop is likely to cause other bugs in the future where the wrong x will be added. Could you please share the use case so that we could write a test for it? Or if you could write the test, that's even better - either way works. I didn't have a use case for this myself, so I relied on the copied bart implementation being correct. Thank you!
So this is what I propose, which does the same as your PR, but respects the locality rule better, if that makes sense.
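Roughly (a toy sketch rather than the exact diff; the real decoder layers take more arguments):

```python
import torch
from torch import nn

layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(3)])
x = torch.randn(2, 5, 8)
output_hidden_states = True

# Record the embedding output before the loop, then append each layer's
# output right where it is produced, so the last layer can no longer be missed.
all_hidden_states = (x,) if output_hidden_states else None
for decoder_layer in layers:
    x = decoder_layer(x)
    if output_hidden_states:
        all_hidden_states += (x,)

print(len(all_hidden_states))  # 4 == num_layers + 1
```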
@patrickvonplaten, how should we proceed - solve this for fsmt and then replicate to the other copy-cats, or solve it all at once in a new PR? And we need to create a new common test, I suppose. I, unfortunately, have no perms to make suggestions directly in the code, so I'm passing the flag to you if we go with the former.
Thanks, Stas @stas00! I implemented a fix the way I did just to be consistent with how the analogous code is written in other places (e.g. FSMTEncoder, the BERT model, etc.): transformers/src/transformers/models/bert/modeling_bert.py, lines 491 to 492 at dd52804.
However, I would also personally prefer adding the contextualized embedding before the loop and then collecting hidden states at the end of each iteration, just like you described. It just has to be changed for all the models in the repo if we want to keep the codebase consistent. The test might check that the size of the list of output hidden states matches what we expect based on the model configuration; it would catch the error and be general enough for many use cases (a sketch of the idea follows at the end of this comment). It is just that it is a job for a bigger PR if we want to cover all the models in the repo. Regarding whether to return the hidden states of layer 0 (the input embeddings): GPT-2, for example, already returns them: transformers/src/transformers/models/gpt2/modeling_gpt2.py, lines 618 to 620 at 5cf9c79.
Also, the decoder input embeddings from layer 0 get fed into further complex layers, analogously to how it is done for encoders. And for all the encoders in the lib (like BERT) we do return the outputs from this layer. So I would vote for not dropping it for the decoder.
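For concreteness, such a check might look roughly like this (an illustrative sketch, not the actual common test; the tiny config values are arbitrary and only meant to keep it fast):

```python
import torch
from transformers import BartConfig, BartModel

config = BartConfig(
    vocab_size=100, d_model=16,
    encoder_layers=2, decoder_layers=2,
    encoder_attention_heads=2, decoder_attention_heads=2,
    encoder_ffn_dim=32, decoder_ffn_dim=32,
)
model = BartModel(config)  # randomly initialized, no download needed
input_ids = torch.randint(0, 100, (1, 7))

outputs = model(input_ids, output_hidden_states=True, return_dict=True)

# Expected: one entry for the embeddings plus one per layer, for both stacks.
assert len(outputs.encoder_hidden_states) == config.encoder_layers + 1
assert len(outputs.decoder_hidden_states) == config.decoder_layers + 1
```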
Well, based on the research that you shared, it's easy then - keep them all. So we just need to decide whether to keep the current style and simply add the missing line after the loop, or to restructure the loop as proposed above - and whether to do that only here or across all the models.
Since I wasn't there when the code was written, and since it impacts the whole project, let's see what @LysandreJik, @patrickvonplaten, and @sgugger think. Thank you for the detailed answer, the research, and the suggestion on how to write the missing test, @maksym-del!
I would avoid changing the existing code since it produces the desired output; I think we can all employ our time on more meaningful contributions to the library :-) I don't think one implementation is better than the other, in the sense that you have to remember to add either the first hidden state or the last. On the models that do not produce the desired outputs, you can fix it the way you prefer. The modeling files don't need to do everything the exact same way, and since you're the contributor fixing things, you get to choose which one you like better. What interests me more, however, is how this got the tests passing, since the number of hidden states is tested and we're discovering there is one missing, which is something the common tests should have caught.
While I disagree with the suggestion that the two ways are equal, since the current implementation is a bug waiting to happen should some code ever be added after the loop but before the last layer's hidden state is appended (especially with all the code copying), I am in agreement with the rest. To clarify, regarding your question about how this got past the tests:
My intuition is that since the test counts the hidden states, it counted the "incoming" hidden state as one of the layer hidden states. If this is a common test, then the correctly implemented models should have failed it instead. But I will need to look at the actual test to tell for sure.
@maksym-del thanks so much for finding this bug -> you are correct, this should be corrected. I think we should do two things here (@maksym-del let me or @stas00 know if you need help): keep the fix for BART and FSMT as done in this PR, and add a test that checks the decoder returns the correct number of hidden states.
If you have trouble adding the test, ping me or @stas00 again and we'll finish the PR for you! Thanks for spotting the bug :-)
Thanks a lot for rebasing this! I think the only thing left to do now is to add a test as explained above :-)
Thanks, @patrickvonplaten, @stas00, and @sgugger! I added the test and think this PR is ready to be merged.
Great job!
Great for the new test, thanks a lot!
LGTM! Thank you, @maksym-del
I just removed a superfluous new line that was added by accident.
Unrelated to this PR, as it's replicating the existing approach, but won't it be simpler to replace:
with just:
@patrickvonplaten, replying inside my comment: this doesn't work.
@patrickvonplaten - I edited your editing of my comment to make it readable. Otherwise it made no sense, as you made it look like I was saying something and then saying that it is not so. Thank you for the clarification! P.S. GitHub doesn't send notifications on edits, so this is probably not the most ideal way to reply ;)
Oh, I'm sorry. I meant to reply to your comment :D
BART & FSMT: fix decoder not returning hidden states from the last layer (huggingface#8597)
* Fix decoder not returning hidden states from the last layer
* Resolve conflict
* Change the way to gather hidden states
* Add decoder hidden states test
* Make pytest and black happy
* Remove redundant line
* remove new line
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
What does this PR do?
The activations from the last decoder layer were accidentally not included in the output.
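A quick way to see the effect (an illustrative snippet; the checkpoint name is just an example):

```python
from transformers import BartTokenizer, BartModel

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartModel.from_pretrained("facebook/bart-base")

inputs = tokenizer("Hello world", return_tensors="pt")
outputs = model(**inputs, output_hidden_states=True, return_dict=True)

# With the fix, the decoder returns one hidden state per layer plus the
# embedding output; before the fix, the last layer's entry was missing.
assert len(outputs.decoder_hidden_states) == model.config.decoder_layers + 1
```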
Who can review?
@patrickvonplaten
@stas00