Add TF Funnel Transformer #7029
Conversation
@@ -598,8 +598,8 @@ def __init__(self, config, block_index):
         self.attention = FunnelRelMultiheadAttention(config, block_index)
         self.ffn = FunnelPositionwiseFFN(config)

-    def forward(self, q, k, v, attention_inputs, output_attentions=False):
-        attn = self.attention(q, k, v, attention_inputs, output_attentions=output_attentions)
+    def forward(self, query, key, value, attention_inputs, output_attentions=False):
@patrickvonplaten This is just for you :)
haha :-)
src/transformers/file_utils.py
@@ -133,7 +133,7 @@
 MODEL_CARD_NAME = "modelcard.json"

-MULTIPLE_CHOICE_DUMMY_INPUTS = [[[0], [1]], [[0], [1]]]
+MULTIPLE_CHOICE_DUMMY_INPUTS = [[[0, 1, 2, 3], [1, 2, 3, 4]] * 2]
The Funnel Transformer model pools twice, so it needs a sequence length of at least 4 to work properly.
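(A back-of-the-envelope sketch of the constraint, not code from the PR: each of the two pooling stages roughly halves the sequence, so length 4 is the shortest input that survives both.)

```python
seq_len = 4  # minimum length used by the new dummy inputs
for stage in range(2):  # Funnel pools twice
    seq_len = seq_len // 2
print(seq_len)  # 4 -> 2 -> 1: still a non-empty sequence after both poolings
```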
Codecov Report
@@ Coverage Diff @@
## master #7029 +/- ##
==========================================
+ Coverage 78.37% 80.90% +2.52%
==========================================
Files 164 165 +1
Lines 31026 31767 +741
==========================================
+ Hits 24318 25702 +1384
+ Misses 6708 6065 -643
Continue to review full report at Codecov.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome work!! I really like the fact that you have done the TensorFlow version as well, and also created each task model. Thank you very much.
I left a few minor comments to address, but overall it is great.
Nevertheless, the TF part should wait for my output refactoring before merging, or you can start to integrate it :)
def call(self, query, key, value, attention_inputs, output_attentions=False, training=False):
    # query has shape batch_size x seq_len x d_model
    # key and value have shapes batch_size x context_len x d_model
    position_embeds, token_type_mat, attention_mask, cls_mask = attention_inputs
This needs to be handled more carefully: what if attention_inputs gets only one value? At least check the size in order to be safe.
The error message if something of the wrong length is passed is clear enough. Adding an assert on the length would add nothing more for the end user.
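(For illustration, the built-in unpacking error already names the mismatch:)

```python
# Unpacking a 2-tuple into the four expected names fails loudly:
position_embeds, token_type_mat, attention_mask, cls_mask = (1, 2)
# ValueError: not enough values to unpack (expected 4, got 2)
```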
attention_inputs = self.attention_structure.init_attention_inputs(
    inputs_embeds,
    attention_mask=attention_mask,
    token_type_ids=token_type_ids,
    training=training,
)
This is not a proper way to call a layer. See first comment.
return tf.reshape(logits, [batch_size, length, self.vocab_size])


class TFFunnelAttentionStructure(tf.keras.layers.Layer):
This class is not TF compliant. A layer has to have a call method in order to respect the requirements for custom layers; otherwise this layer cannot be used properly by TF. For example, what if someone wants to reuse this class as-is for their own model with the usual compile + fit approach? It simply won't work.
As it is composed of multiple functions, it should be split into multiple layers. Otherwise you have to make it a utils class.
It's just a bunch of util functions. The reason it ended up as a Keras layer is that it's an nn.Module on the PyTorch side, because it has two dropouts (to handle training/eval). It should work as a util class here since the training flag is passed along as a parameter.
I think making it a util class is the best compromise, and it won't lead to ambiguous usage 👍
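(A minimal sketch of that compromise, with illustrative class and method names rather than the PR's actual code: a plain Python class with no call(), applying dropout functionally and threading the training flag through explicitly.)

```python
import tensorflow as tf

class AttentionStructureUtils:
    """Plain util class, not a tf.keras.layers.Layer, so no call() is required."""

    def __init__(self, dropout_rate):
        self.dropout_rate = dropout_rate

    def init_attention_inputs(self, inputs_embeds, training=False):
        # Dropout is applied functionally; the training flag is passed as a
        # parameter instead of relying on the Keras layer lifecycle.
        if training:
            inputs_embeds = tf.nn.dropout(inputs_embeds, rate=self.dropout_rate)
        return inputs_embeds
```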
all_attentions = () if output_attentions else None

for block_index, block in enumerate(self.blocks):
    pooling_flag = hidden.shape[1] > (2 if self.separate_cls else 1)
Be careful: tensor.shape won't work the same way depending on the execution mode. To get the shape, there is the shape_list function in modeling_tf_utils.
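(For reference, the idea behind shape_list is roughly the following — a sketch, not the exact library code: use the static shape where known and fall back to the dynamic shape for dimensions that are None in graph mode.)

```python
import tensorflow as tf

def shape_list(x):
    # Static shape may contain None entries in graph mode; the dynamic shape
    # is always resolvable at runtime, so fall back to it per dimension.
    static = x.shape.as_list()
    dynamic = tf.shape(x)
    return [dynamic[i] if dim is None else dim for i, dim in enumerate(static)]
```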
pooling_flag = hidden.shape[1] > (2 if self.separate_cls else 1)
pooling_flag = pooling_flag and block_index > 0
if pooling_flag:
    pooled_hidden, attention_inputs = self.attention_structure.pre_attention_pooling(
Same thing here. See first comment.
)
hidden = layer_output[0]
if do_pooling:
    attention_inputs = self.attention_structure.post_attention_pooling(attention_inputs)
Same thing here. See first comment.
all_hidden_states = (hidden,) if output_hidden_states else None
all_attentions = () if output_attentions else None

attention_inputs = self.attention_structure.init_attention_inputs(
Same thing here. See first comment.
    output_type=TFBaseModelOutput,
    config_class=_CONFIG_FOR_DOC,
)
def call(self, inputs, **kwargs):
Should be replaced by:

def call(
    self,
    inputs,
    attention_mask=None,
    token_type_ids=None,
    inputs_embeds=None,
    output_attentions=None,
    output_hidden_states=None,
    return_dict=None,
    training=False,
):

Otherwise the kwargs are not handled in TFFunnelBaseLayer.
@jplu The kwargs are passed to the base layer, so they're handled by the TFFunnelBaseLayer. We do the same thing for TFBertModel, see here:
transformers/src/transformers/modeling_tf_bert.py, lines 792 to 806 at 48ff6d5:
class TFBertModel(TFBertPreTrainedModel):
    def __init__(self, config, *inputs, **kwargs):
        super().__init__(config, *inputs, **kwargs)
        self.bert = TFBertMainLayer(config, name="bert")

    @add_start_docstrings_to_callable(BERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)"))
    @add_code_sample_docstrings(
        tokenizer_class=_TOKENIZER_FOR_DOC,
        checkpoint="bert-base-cased",
        output_type=TFBaseModelOutputWithPooling,
        config_class=_CONFIG_FOR_DOC,
    )
    def call(self, inputs, **kwargs):
        outputs = self.bert(inputs, **kwargs)
        return outputs
True, but I'm not a big fan of the usual TF signature. I prefer to be as explicit as possible and to have the same signature across all the models, and I certainly plan to do this for the others.
Anyway, this is really minor and doesn't really matter for this PR :)
    output_type=TFBaseModelOutput,
    config_class=_CONFIG_FOR_DOC,
)
def call(self, inputs, **kwargs):
Same thing here for the arguments.
Very cool, great work!
def _resize_token_embeddings(self, new_num_tokens):
    raise NotImplementedError  # Not implemented yet in the library for TF 2.0 models
It is implemented! See #4351
Oh then we need to update the template ;-)
Ah, we really do!
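(For context, with the stub removed the standard resize API from #4351 applies. A usage sketch, with the checkpoint name as an assumption, using the model added in this PR:)

```python
from transformers import TFFunnelModel

model = TFFunnelModel.from_pretrained("funnel-transformer/small")
# Grow the embedding matrix, e.g. after adding new tokens to the tokenizer.
model.resize_token_embeddings(model.config.vocab_size + 2)
```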
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Add TF Funnel Transformer
* Proper dummy input
* Formatting
* Update src/transformers/modeling_tf_funnel.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Address review comments
* One review comment forgotten
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
This reverts commit d198047.
This adds the TF implementation of the model. I will upload the TF checkpoints while the PR is under review.