
#3095: add dual encoder #3208

Merged: 16 commits into master from dual_encoder on Apr 27, 2023

Conversation

whoisjones (Member):

closes #3095.

  • Adds a Dual Encoder class
  • Two minor typing fixes in decoder.py (Prototypical Decoder)
  • Added a no_header attribute to WordEmbeddings. This changes nothing except making it possible to also load GloVe vectors that are not in word2vec format (see the usage sketch after this list)
  • Made create_internal_label_dictionary from TokenClassifier static so that it can be used outside the class (required for the Dual Encoder on token level)
  • Removed self.training in get_data_points_from_sentences; otherwise spans would be created incorrectly in eval mode. Also added the possibility for BIO encoding
  • Removed a TODO in model.py since flake8 complained
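A rough usage sketch for the no_header option mentioned above; the parameter name is taken from this PR's description and the file path is a placeholder, so treat it as an assumption rather than a confirmed API:

```python
from flair.embeddings import WordEmbeddings

# Hypothetical usage of the new no_header option: load a raw GloVe text file
# that has no word2vec-style "<vocab_size> <dim>" header line.
# "custom_glove.txt" is a placeholder path.
glove = WordEmbeddings("custom_glove.txt", no_header=True)
```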

@@ -108,18 +109,22 @@ def _get_embedding_for_data_point(self, prediction_data_point: Token) -> torch.T

     def _get_data_points_from_sentence(self, sentence: Sentence) -> List[Token]:
         # special handling during training if this is a span prediction problem
-        if self.training and self.span_prediction_problem:
+        if self.span_prediction_problem:  # do we need self.training here?
Collaborator:
The conversion is only necessary during training: we take Span labels and encode them as Token-level labels. During prediction, this is not necessary.
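To illustrate the conversion this comment describes, here is a minimal, illustrative sketch of encoding span-level labels as token-level BIO labels during training. The helper, the "ner" label type, and the example labels are made up for illustration; the PR's actual conversion lives inside _get_data_points_from_sentence:

```python
from flair.data import Sentence

def spans_to_bio(sentence: Sentence, label_type: str = "ner") -> None:
    """Illustrative only: encode span labels as token-level BIO labels."""
    for span in sentence.get_spans(label_type):
        value = span.get_label(label_type).value
        for i, token in enumerate(span.tokens):
            prefix = "B-" if i == 0 else "I-"
            token.add_label(label_type, prefix + value)

sentence = Sentence("George Washington went to Washington .")
sentence[0:2].add_label("ner", "PER")  # span-level label
spans_to_bio(sentence)                 # tokens now carry B-PER / I-PER labels
```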

@alanakbik (Collaborator) left a comment:

Thanks for adding this! Changes requested since there are some important type declarations missing, and one unnecessary if-statement.

Additionally: have you tested "cosine-similarity"? Is that working?



class LabelVerbalizerDecoder(torch.nn.Module):
    def __init__(self, label_encoder, label_dictionary: Dictionary, decoding: str = "dot-product"):
Collaborator:

Type hint for label_encoder is missing. Perhaps also rename to label_embedding for clarity?
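A possible revised signature along the lines of this suggestion; the Embeddings type follows the author's note further below that DocumentEmbeddings triggers a circular import, so this is only a sketch, not the merged signature:

```python
import torch

from flair.data import Dictionary
from flair.embeddings import Embeddings

class LabelVerbalizerDecoder(torch.nn.Module):
    def __init__(
        self,
        label_embedding: Embeddings,   # renamed from label_encoder, now type-hinted
        label_dictionary: Dictionary,
        decoding: str = "dot-product",
    ):
        super().__init__()
        self.label_embedding = label_embedding
```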

Collaborator:

Should be DocumentEmbeddings.

Member Author (whoisjones):

DocumentEmbeddings raises an ImportError; it looks like a circular dependency.
Embeddings, as used in nn/model.py, works instead of DocumentEmbeddings. If we do not want to use Embeddings, I can open a new issue to inspect this.

tests/conftest.py:6: in <module>
    import flair
flair/__init__.py:28: in <module>
    from . import (  # noqa: E402 import after setting device
flair/models/__init__.py:1: in <module>
    from .clustering import ClusteringModel
flair/models/clustering.py:14: in <module>
    from flair.embeddings import DocumentEmbeddings
flair/embeddings/__init__.py:13: in <module>
    from .document import (
flair/embeddings/document.py:21: in <module>
    from flair.nn import LockedDropout, WordDropout
flair/nn/__init__.py:1: in <module>
    from .decoder import LabelVerbalizerDecoder, PrototypicalDecoder
flair/nn/decoder.py:15: in <module>
    from flair.embeddings import DocumentEmbeddings
E   ImportError: cannot import name 'DocumentEmbeddings' from 'flair.embeddings' (/Users/jgolde/PycharmProjects/flair/flair/embeddings/__init__.py)
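One common way around such an import cycle, shown here only as a general sketch (not how this PR resolved it; the author uses Embeddings instead), is to import the class only for type checking so there is no runtime dependency in flair/nn/decoder.py:

```python
from typing import TYPE_CHECKING

import torch

if TYPE_CHECKING:
    # only seen by type checkers, so no import cycle at runtime
    from flair.embeddings import DocumentEmbeddings

class LabelVerbalizerDecoder(torch.nn.Module):
    def __init__(self, label_embedding: "DocumentEmbeddings"):
        super().__init__()
        self.label_embedding = label_embedding
```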


label_tensor = torch.stack([label.get_embedding() for label in self.verbalized_labels])

if self.training or not self.label_encoder._everything_embedded(self.verbalized_labels):
Collaborator:

The second condition is not needed: during training, always store embeddings; otherwise, do not.

Member Author (whoisjones):

How should the decoder know about the embeddings_storage_mode? The Trainer stores embeddings depending on that mode, so we would need to pass this parameter into the forward loss called in trainer.train().
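For context, the reviewer's suggested simplification would look roughly like the sketch below; whether the stored embeddings are still available outside of training then depends on the embeddings_storage_mode question raised above. This is a sketch of the suggestion, not the merged code:

```python
# Rough sketch: re-embed the verbalized labels only while training and
# otherwise reuse previously computed (stored) embeddings.
if self.training:
    self.label_encoder.embed(self.verbalized_labels)

label_tensor = torch.stack([label.get_embedding() for label in self.verbalized_labels])
```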

if decoding not in ["dot-product", "cosine-similarity"]:
    raise RuntimeError("Decoding method needs to be one of the following: dot-product, cosine-similarity")
self.label_encoder = label_encoder
self.verbalized_labels = self.verbalize_labels(label_dictionary)
Collaborator:

The type should be declared so it becomes easier to understand what verbalized_labels is (List[Sentence])
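A minimal sketch of the requested declaration, assuming verbalize_labels indeed returns a List[Sentence] as the reviewer suggests (List comes from typing, Sentence from flair.data):

```python
# inside LabelVerbalizerDecoder.__init__: declare the attribute type explicitly
self.verbalized_labels: List[Sentence] = self.verbalize_labels(label_dictionary)
```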

- type hints added
- removed unnecessary checks
- renamed attributes
@whoisjones (Member Author):

I have removed the cosine logic; it worked, but only if we adjust some functions of the default model. I am currently experimenting with it and will open a new branch for it.
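For reference, the difference between dot-product and cosine-similarity decoding is essentially whether both sides are L2-normalized before the matrix product. A generic, hypothetical sketch, not the removed code from this branch:

```python
import torch

def score_labels(token_embeddings: torch.Tensor,
                 label_embeddings: torch.Tensor,
                 decoding: str = "dot-product") -> torch.Tensor:
    """Return a (num_tokens, num_labels) score matrix."""
    if decoding == "cosine-similarity":
        # normalize both sides so the dot product becomes cosine similarity
        token_embeddings = torch.nn.functional.normalize(token_embeddings, dim=-1)
        label_embeddings = torch.nn.functional.normalize(label_embeddings, dim=-1)
    return token_embeddings @ label_embeddings.T
```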

@alanakbik alanakbik merged commit d7acfd6 into master Apr 27, 2023
@alanakbik alanakbik deleted the dual_encoder branch April 27, 2023 11:55
Development

Successfully merging this pull request may close these issues.

[Feature]: Add Dual Encoder
2 participants