
Losing pooling layer parameters after fine-tuning #8793

Closed
wlhgtc opened this issue Nov 26, 2020 · 7 comments
wlhgtc commented Nov 26, 2020

According to the code, if we fine-tune BERT with a language modeling head, the pooling layer is not initialized.
So we lose the original parameters (pre-trained by Google) if we save the fine-tuned model and reload it.
Usually we then use this model for a downstream task (e.g. text classification), and this may lead to worse results.
add_pooling_layer should always be True, even if the pooler parameters are never updated during fine-tuning.
@thomwolf @LysandreJik
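
A minimal sketch of the behaviour being described, assuming the current BertForMaskedLM implementation (the model name and save path below are only placeholders):

```python
from transformers import BertForMaskedLM

# BertForMaskedLM builds its encoder without the pooling layer, so nothing we save
# from this model will contain pooler weights.
mlm_model = BertForMaskedLM.from_pretrained("bert-base-uncased")
print(mlm_model.bert.pooler)  # None: the encoder was created with add_pooling_layer=False

# Hypothetical output directory for the fine-tuned weights; the saved state dict
# has no bert.pooler.* entries.
mlm_model.save_pretrained("./bert-mlm-finetuned")
```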

LysandreJik commented Nov 27, 2020

The pooling layer is not used during fine-tuning when doing MLM, so gradients are not back-propagated through that layer and its parameters are not updated.

wlhgtc commented Nov 28, 2020

@LysandreJik The pooling parameters are indeed not needed for MLM fine-tuning. But usually we use MLM to fine-tune BERT on our own corpus and then load the saved weights (which are missing the pooling parameters) for a downstream task.
It is unreasonable to randomly initialize the pooling parameters at that point; we should be able to reload Google's original pooling parameters (even though they were not updated during MLM fine-tuning).
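
For illustration, this is roughly what that downstream step looks like; the checkpoint path is the hypothetical one from the sketch above, and the exact wording of the warning depends on the transformers version:

```python
from transformers import BertForSequenceClassification

# Loading the MLM-fine-tuned checkpoint into a classification model: the checkpoint
# no longer contains bert.pooler.dense.weight / bert.pooler.dense.bias, so the pooler
# (and the new classifier head) is randomly initialized, and transformers warns that
# some weights are "newly initialized".
clf_model = BertForSequenceClassification.from_pretrained("./bert-mlm-finetuned", num_labels=2)
```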

LysandreJik commented

I see, thank you for explaining! In that case, would using the BertForPreTraining model fit your needs? You would only need to pass the masked LM labels, not the NSP labels, but you would still have all the layers that were used for the pre-training.

This is something we had not taken into account when implementing the add_pooling_layer argument cc @patrickvonplaten @sgugger
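
If it helps, here is a rough sketch of that suggestion (a toy example rather than a full training loop; the model name and save path are placeholders). The MLM loss is built by hand from the prediction logits, so no NSP labels are involved, while the pooler stays in the model and ends up in the saved checkpoint:

```python
import torch
from transformers import BertForPreTraining, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")  # keeps the pooler and both heads

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
labels = inputs.input_ids.clone()  # in a real run, set non-masked positions to -100

# Masked-LM loss only; the NSP head and its labels are simply ignored.
outputs = model(**inputs)
loss_fct = torch.nn.CrossEntropyLoss()
mlm_loss = loss_fct(
    outputs.prediction_logits.view(-1, model.config.vocab_size),
    labels.view(-1),
)
mlm_loss.backward()

# This checkpoint still contains the (untouched) pooler weights.
model.save_pretrained("./bert-further-pretrained")
```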

mscherrmann commented

Hi @LysandreJik,

I also tried to further pre-train BERT on new, domain-specific text data using the recommended run_mlm_wwm.py script, since I read a paper that outlines the benefits of this approach. I also got the warning that the pooling layer is not initialized from the model checkpoint. I have a few follow-up questions:

  • Does that mean that the final hidden vector of the [CLS] token is randomly initialized? That would be an issue for me, since I need it in my downstream application.
  • If the former point is true: why isn't at least the hidden vector of the source model copied over?
  • I think that to get a proper hidden vector for [CLS], NSP would be needed. If I understand your answers in issue BertForPreTraining with NSP #6330 correctly, you don't support the NSP objective due to the results of the RoBERTa paper. Does that mean there is no code for pre-training BERT in the whole Hugging Face library that yields meaningful final [CLS] hidden vectors?
  • Is there an alternative to [CLS] for downstream tasks that use sentence/document embeddings rather than token embeddings?

I would really appreciate any kind of help. Thanks a lot!

wlhgtc commented Jan 22, 2021

The [CLS] token is not randomly initialized; it is a regular token in the BERT vocabulary.
What we are discussing here is the pooling layer, i.e. the dense-plus-tanh transformation applied on top of the final [CLS] hidden state.
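
For reference, here is a minimal re-implementation of what that pooling layer does (paraphrasing the library's BertPooler; an illustrative sketch, not the library class itself):

```python
import torch
import torch.nn as nn

class Pooler(nn.Module):
    """Dense layer plus tanh applied to the hidden state of the first ([CLS]) token."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        first_token = hidden_states[:, 0]  # hidden state at the [CLS] position
        return self.activation(self.dense(first_token))

# Toy usage: (batch=2, seq_len=16, hidden=768) -> pooled output of shape (2, 768)
pooled = Pooler(hidden_size=768)(torch.randn(2, 16, 768))
```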

mscherrmann commented

Oh okay, I see. Only the weight matrix and the bias vector of that feed-forward operation on the [CLS] vector are randomly initialized, not the [CLS] vector itself. I misunderstood a comment in another forum. Thanks for the clarification, @wlhgtc!

github-actions bot commented Mar 6, 2021

This issue has been automatically marked as stale and closed because it has not had recent activity. Thank you for your contributions.

If you think this still needs to be addressed please comment on this thread.
