Losing pooling layer parameters after fine-tuning #8793
Comments
The pooling layer is not used during fine-tuning when doing MLM, so gradients are not back-propagated through that layer and its parameters are not updated.
@LysandreJik The pooling parameters are not needed for MLM fine-tuning. But usually we use MLM to fine-tune BERT on our own corpus, and then use the saved model weights (now missing the pooling parameters) in a downstream task.
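A minimal sketch of the problem being described, assuming the Hugging Face transformers library (the paths and the quoted warning text below are illustrative, not verbatim): `BertForMaskedLM` builds its encoder with `add_pooling_layer=False`, so a checkpoint saved from it contains no `bert.pooler.*` tensors, and reloading that checkpoint for classification re-initializes the pooler randomly.

```python
# Sketch of the reported problem (paths are hypothetical examples).
from transformers import BertForMaskedLM, BertForSequenceClassification

# BertForMaskedLM constructs BertModel(config, add_pooling_layer=False),
# so the checkpoint it writes contains no bert.pooler.* weights.
mlm_model = BertForMaskedLM.from_pretrained("bert-base-uncased")
# ... fine-tune on your own corpus with the MLM objective ...
mlm_model.save_pretrained("./bert-mlm-finetuned")

# Reloading for a downstream task logs a warning along the lines of:
#   "Some weights of BertForSequenceClassification were not initialized from
#    the checkpoint: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']"
clf_model = BertForSequenceClassification.from_pretrained("./bert-mlm-finetuned")
```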
I see, thank you for explaining! In that case, would using the […] This is something we had not taken into account when implementing the […]
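The reply above is cut off, but one architecture that does retain the pooler is `BertForPreTraining`: its next-sentence-prediction head consumes the pooled output, so its checkpoints keep (and train) the `bert.pooler.*` weights. A hedged sketch, assuming the same library:

```python
# BertForPreTraining keeps both the MLM and NSP heads; the NSP head uses
# the pooler, so saved checkpoints retain the bert.pooler.* weights.
from transformers import BertForPreTraining

model = BertForPreTraining.from_pretrained("bert-base-uncased")
print(model.bert.pooler)            # BertPooler: a Linear layer + Tanh
model.save_pretrained("./bert-pt")  # hypothetical path; pooler included
```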
Hi @LysandreJik, I also tried to further pre-train BERT on new, domain-specific text data using the recommended run_mlm_wwm.py script, since I read a paper which outlines the benefits of this approach. I also got the warning that the pooling layer weights are not initialized from the model checkpoint. I have a few follow-up questions to that: […]
I would really appreciate any kind of help. Thanks a lot!
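One quick way to confirm what the warning implies, sketched here under the assumption of a PyTorch checkpoint written by the script to a hypothetical output directory, is to inspect the saved state dict for pooler tensors:

```python
# Sketch: check whether a saved checkpoint still contains pooler weights.
# The path is a hypothetical run_mlm_wwm.py --output_dir; adjust as needed.
import torch

state_dict = torch.load("./bert-dapt/pytorch_model.bin", map_location="cpu")
pooler_keys = [k for k in state_dict if "pooler" in k]
print(pooler_keys)  # empty for an MLM-only checkpoint: the pooler was dropped
```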
The [CLS] token was not randomly initialized. It's a token in the BERT vocabulary.
Oh okay, I see. Only the weight matrix and the bias vector of that feed-forward operation on the [CLS] vector are randomly initialized, not the [CLS] vector itself. I misunderstood a comment in another forum. Thanks for the clarification @wlhgtc!
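To make that distinction concrete, here is a minimal re-implementation mirroring what transformers' `BertPooler` does (the class name and sizes here are illustrative): a single dense layer plus tanh applied to the final hidden state at the [CLS] position. Only the dense layer's weight and bias get randomly re-initialized; the [CLS] token's embedding lives in the pre-trained embedding matrix.

```python
import torch
import torch.nn as nn

class Pooler(nn.Module):
    """Illustrative stand-in for transformers' BertPooler."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)  # the re-initialized weights
        self.activation = nn.Tanh()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden); index 0 is the [CLS] position
        return self.activation(self.dense(hidden_states[:, 0]))

pooled = Pooler()(torch.randn(2, 16, 768))  # -> shape (2, 768)
```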
This issue has been automatically marked as stale and been closed because it has not had recent activity. Thank you for your contributions. If you think this still needs to be addressed please comment on this thread.
According to the code, if we fine-tune BERT with an LM objective, the pooling layer is not initialized.
So we lose the original parameters (pre-trained by Google) if we save the fine-tuned model and reload it.
Mostly we then use this model for a downstream task (e.g. text classification), and this may lead to a worse result.
The `add_pooling_layer` argument should be `True` at all times, even if we don't update those weights during fine-tuning. @thomwolf @LysandreJik
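If changing the default is not an option, one possible workaround (a sketch, not an official fix; the paths are illustrative) is to restore the pooler weights from the original pre-trained checkpoint after loading the MLM-fine-tuned weights into the downstream model:

```python
from transformers import BertForSequenceClassification, BertModel

# Load the downstream model from the MLM-fine-tuned checkpoint; the pooler
# is randomly initialized here, since that checkpoint lacks bert.pooler.*.
clf = BertForSequenceClassification.from_pretrained("./bert-mlm-finetuned")

# Copy the original, pre-trained pooler weights back in.
original = BertModel.from_pretrained("bert-base-uncased")
clf.bert.pooler.load_state_dict(original.pooler.state_dict())
```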