
Losing pooling layer parameters after fine-tuning #8793

Closed
wlhgtc opened this issue Nov 26, 2020 · 7 comments
wlhgtc commented Nov 26, 2020

According to the code, if we fine-tune BERT with a language modeling head, the pooling layer is not initialized.
So we lose the original parameters (pre-trained by Google) if we save the fine-tuned model and reload it.
Usually we then use this model for a downstream task (e.g. text classification), and this may lead to worse results.
add_pooling_layer should always be True, even if the pooler parameters are never updated during fine-tuning.
@thomwolf @LysandreJik
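
A minimal sketch of the behaviour being described, assuming the current BertForMaskedLM implementation (the model name and save path below are only placeholders):

```python
from transformers import BertForMaskedLM

# BertForMaskedLM builds its encoder without the pooling layer, so nothing we save
# from this model will contain pooler weights.
mlm_model = BertForMaskedLM.from_pretrained("bert-base-uncased")
print(mlm_model.bert.pooler)  # None: the encoder was created with add_pooling_layer=False

# Hypothetical output directory for the fine-tuned weights; the saved state dict
# has no bert.pooler.* entries.
mlm_model.save_pretrained("./bert-mlm-finetuned")
```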

LysandreJik commented Nov 27, 2020

The pooling layer is not used during fine-tuning when doing MLM, so gradients are not back-propagated through that layer and its parameters are not updated.

wlhgtc commented Nov 28, 2020

@LysandreJik The pooling parameters are indeed not needed for MLM fine-tuning. But usually we use MLM to fine-tune BERT on our own corpus and then load the saved weights (which are missing the pooling parameters) for a downstream task.
It is unreasonable to randomly initialize the pooling parameters at that point; we should be able to reload Google's original pooling parameters (even though they were not updated during MLM fine-tuning).
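
For illustration, this is roughly what that downstream step looks like; the checkpoint path is the hypothetical one from the sketch above, and the exact wording of the warning depends on the transformers version:

```python
from transformers import BertForSequenceClassification

# Loading the MLM-fine-tuned checkpoint into a classification model: the checkpoint
# no longer contains bert.pooler.dense.weight / bert.pooler.dense.bias, so the pooler
# (and the new classifier head) is randomly initialized, and transformers warns that
# some weights are "newly initialized".
clf_model = BertForSequenceClassification.from_pretrained("./bert-mlm-finetuned", num_labels=2)
```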

LysandreJik commented

I see, thank you for explaining! In that case, would using the BertForPreTraining model fit your needs? You would only need to pass the masked LM labels, not the NSP labels, but you would still have all the layers that were used for the pre-training.

This is something we had not taken into account when implementing the add_pooling_layer argument cc @patrickvonplaten @sgugger
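
If it helps, here is a rough sketch of that suggestion (a toy example rather than a full training loop; the model name and save path are placeholders). The MLM loss is built by hand from the prediction logits, so no NSP labels are involved, while the pooler stays in the model and ends up in the saved checkpoint:

```python
import torch
from transformers import BertForPreTraining, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")  # keeps the pooler and both heads

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
labels = inputs.input_ids.clone()  # in a real run, set non-masked positions to -100

# Masked-LM loss only; the NSP head and its labels are simply ignored.
outputs = model(**inputs)
loss_fct = torch.nn.CrossEntropyLoss()
mlm_loss = loss_fct(
    outputs.prediction_logits.view(-1, model.config.vocab_size),
    labels.view(-1),
)
mlm_loss.backward()

# This checkpoint still contains the (untouched) pooler weights.
model.save_pretrained("./bert-further-pretrained")
```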

mscherrmann commented

Hi @LysandreJik,

I also tried to further pre-train BERT on new, domain-specific text data using the recommended run_mlm_wwm.py script, since I read a paper that outlines the benefits of this approach. I also got the warning that the pooling layer is not initialized from the model checkpoint. I have a few follow-up questions:

  • Does that mean that the final hidden vector of the [CLS] token is randomly initialized? That would be an issue for me, since I need it in my downstream application.
  • If the former point is true: why isn't at least the hidden vector of the source model copied over?
  • I think that to get a proper hidden vector for [CLS], NSP would be needed. If I understand your answers in issue BertForPreTraining with NSP #6330 correctly, you don't support the NSP objective due to the results of the RoBERTa paper. Does that mean there is no code for pre-training BERT in the whole Hugging Face library that yields meaningful final [CLS] hidden vectors?
  • Is there an alternative to [CLS] for downstream tasks that use sentence/document embeddings rather than token embeddings?

I would really appreciate any kind of help. Thanks a lot!

wlhgtc commented Jan 22, 2021

The [CLS] token is not randomly initialized; it is a regular token in the BERT vocabulary.
What we are discussing here is the pooling layer, i.e. the dense-plus-tanh transformation applied on top of the final [CLS] hidden state.
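
For reference, here is a minimal re-implementation of what that pooling layer does (paraphrasing the library's BertPooler; an illustrative sketch, not the library class itself):

```python
import torch
import torch.nn as nn

class Pooler(nn.Module):
    """Dense layer plus tanh applied to the hidden state of the first ([CLS]) token."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        first_token = hidden_states[:, 0]  # hidden state at the [CLS] position
        return self.activation(self.dense(first_token))

# Toy usage: (batch=2, seq_len=16, hidden=768) -> pooled output of shape (2, 768)
pooled = Pooler(hidden_size=768)(torch.randn(2, 16, 768))
```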

mscherrmann commented

Oh okay, I see. Only the weight matrix and the bias vector of that feed-forward operation on the [CLS] vector are randomly initialized, not the [CLS] vector itself. I misunderstood a comment in another forum. Thanks for the clarification, @wlhgtc!

github-actions bot commented Mar 6, 2021

This issue has been automatically marked as stale and closed because it has not had recent activity. Thank you for your contributions.

If you think this still needs to be addressed please comment on this thread.
