
#3095: add dual encoder #3208

Merged: 16 commits into master from dual_encoder on Apr 27, 2023

Conversation

whoisjones (Member):

closes #3095.

  • Adds a Dual Encoder class
  • Two minor typing fixes in decoder.py (Prototypical Decoder)
  • Added a no_header attribute to WordEmbeddings. This changes nothing except making it possible to also load GloVe vectors that are not in word2vec format (see the usage sketch after this list)
  • Made create_internal_label_dictionary from TokenClassifier static so that it can be used outside the class (required for the Dual Encoder on token level)
  • Removed self.training in get_data_points_from_sentences; otherwise spans would be created incorrectly in eval mode. Also added the possibility for BIO encoding
  • Removed a TODO in model.py since flake8 complained
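A rough usage sketch for the no_header option mentioned above; the parameter name is taken from this PR's description and the file path is a placeholder, so treat it as an assumption rather than a confirmed API:

```python
from flair.embeddings import WordEmbeddings

# Hypothetical usage of the new no_header option: load a raw GloVe text file
# that has no word2vec-style "<vocab_size> <dim>" header line.
# "custom_glove.txt" is a placeholder path.
glove = WordEmbeddings("custom_glove.txt", no_header=True)
```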

@@ -108,18 +109,22 @@ def _get_embedding_for_data_point(self, prediction_data_point: Token) -> torch.T

     def _get_data_points_from_sentence(self, sentence: Sentence) -> List[Token]:
         # special handling during training if this is a span prediction problem
-        if self.training and self.span_prediction_problem:
+        if self.span_prediction_problem:  # do we need self.training here?
Collaborator:
The conversion is only necessary during training: we take Span labels and encode them as Token-level labels. During prediction, this is not necessary.
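To illustrate the conversion this comment describes, here is a minimal, illustrative sketch of encoding span-level labels as token-level BIO labels during training. The helper, the "ner" label type, and the example labels are made up for illustration; the PR's actual conversion lives inside _get_data_points_from_sentence:

```python
from flair.data import Sentence

def spans_to_bio(sentence: Sentence, label_type: str = "ner") -> None:
    """Illustrative only: encode span labels as token-level BIO labels."""
    for span in sentence.get_spans(label_type):
        value = span.get_label(label_type).value
        for i, token in enumerate(span.tokens):
            prefix = "B-" if i == 0 else "I-"
            token.add_label(label_type, prefix + value)

sentence = Sentence("George Washington went to Washington .")
sentence[0:2].add_label("ner", "PER")  # span-level label
spans_to_bio(sentence)                 # tokens now carry B-PER / I-PER labels
```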

@alanakbik (Collaborator) left a comment:

Thanks for adding this! Changes requested since there are some important type declarations missing, and one unnecessary if-statement.

Additionally: have you tested "cosine-similarity"? Is that working?



class LabelVerbalizerDecoder(torch.nn.Module):
    def __init__(self, label_encoder, label_dictionary: Dictionary, decoding: str = "dot-product"):
Collaborator:

Type hint for label_encoder is missing. Perhaps also rename to label_embedding for clarity?
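A possible revised signature along the lines of this suggestion; the Embeddings type follows the author's note further below that DocumentEmbeddings triggers a circular import, so this is only a sketch, not the merged signature:

```python
import torch

from flair.data import Dictionary
from flair.embeddings import Embeddings

class LabelVerbalizerDecoder(torch.nn.Module):
    def __init__(
        self,
        label_embedding: Embeddings,   # renamed from label_encoder, now type-hinted
        label_dictionary: Dictionary,
        decoding: str = "dot-product",
    ):
        super().__init__()
        self.label_embedding = label_embedding
```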

Collaborator:

Should be DocumentEmbeddings.

Member Author (whoisjones):

DocumentEmbeddings raises an ImportError; it looks like a circular dependency.
Embeddings, as used in nn/model.py, works instead of DocumentEmbeddings. If we do not want to use Embeddings, I can open a new issue to inspect this.

tests/conftest.py:6: in <module>
    import flair
flair/__init__.py:28: in <module>
    from . import (  # noqa: E402 import after setting device
flair/models/__init__.py:1: in <module>
    from .clustering import ClusteringModel
flair/models/clustering.py:14: in <module>
    from flair.embeddings import DocumentEmbeddings
flair/embeddings/__init__.py:13: in <module>
    from .document import (
flair/embeddings/document.py:21: in <module>
    from flair.nn import LockedDropout, WordDropout
flair/nn/__init__.py:1: in <module>
    from .decoder import LabelVerbalizerDecoder, PrototypicalDecoder
flair/nn/decoder.py:15: in <module>
    from flair.embeddings import DocumentEmbeddings
E   ImportError: cannot import name 'DocumentEmbeddings' from 'flair.embeddings' (/Users/jgolde/PycharmProjects/flair/flair/embeddings/__init__.py)
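One common way around such an import cycle, shown here only as a general sketch (not how this PR resolved it; the author uses Embeddings instead), is to import the class only for type checking so there is no runtime dependency in flair/nn/decoder.py:

```python
from typing import TYPE_CHECKING

import torch

if TYPE_CHECKING:
    # only seen by type checkers, so no import cycle at runtime
    from flair.embeddings import DocumentEmbeddings

class LabelVerbalizerDecoder(torch.nn.Module):
    def __init__(self, label_embedding: "DocumentEmbeddings"):
        super().__init__()
        self.label_embedding = label_embedding
```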


label_tensor = torch.stack([label.get_embedding() for label in self.verbalized_labels])

if self.training or not self.label_encoder._everything_embedded(self.verbalized_labels):
Collaborator:

The second condition is not needed: during training, always store embeddings; otherwise, do not.

Member Author (whoisjones):

How should the decoder know about the embeddings_storage_mode? The Trainer stores embeddings depending on that mode, so we would need to pass this parameter into the forward loss called in trainer.train().
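For context, the reviewer's suggested simplification would look roughly like the sketch below; whether the stored embeddings are still available outside of training then depends on the embeddings_storage_mode question raised above. This is a sketch of the suggestion, not the merged code:

```python
# Rough sketch: re-embed the verbalized labels only while training and
# otherwise reuse previously computed (stored) embeddings.
if self.training:
    self.label_encoder.embed(self.verbalized_labels)

label_tensor = torch.stack([label.get_embedding() for label in self.verbalized_labels])
```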

if decoding not in ["dot-product", "cosine-similarity"]:
    raise RuntimeError("Decoding method needs to be one of the following: dot-product, cosine-similarity")
self.label_encoder = label_encoder
self.verbalized_labels = self.verbalize_labels(label_dictionary)
Collaborator:

The type should be declared so it becomes easier to understand what verbalized_labels is (List[Sentence])
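A minimal sketch of the requested declaration, assuming verbalize_labels indeed returns a List[Sentence] as the reviewer suggests (List comes from typing, Sentence from flair.data):

```python
# inside LabelVerbalizerDecoder.__init__: declare the attribute type explicitly
self.verbalized_labels: List[Sentence] = self.verbalize_labels(label_dictionary)
```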

- type hints added
- removed unnecessary checks
- renamed attributes
@whoisjones (Member Author):

I have removed the cosine logic; it worked, but only if we adjust some functions of the default model. I am currently experimenting with it and will open a new branch for it.
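For reference, the difference between dot-product and cosine-similarity decoding is essentially whether both sides are L2-normalized before the matrix product. A generic, hypothetical sketch, not the removed code from this branch:

```python
import torch

def score_labels(token_embeddings: torch.Tensor,
                 label_embeddings: torch.Tensor,
                 decoding: str = "dot-product") -> torch.Tensor:
    """Return a (num_tokens, num_labels) score matrix."""
    if decoding == "cosine-similarity":
        # normalize both sides so the dot product becomes cosine similarity
        token_embeddings = torch.nn.functional.normalize(token_embeddings, dim=-1)
        label_embeddings = torch.nn.functional.normalize(label_embeddings, dim=-1)
    return token_embeddings @ label_embeddings.T
```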

@alanakbik alanakbik merged commit d7acfd6 into master Apr 27, 2023
@alanakbik alanakbik deleted the dual_encoder branch April 27, 2023 11:55
Development

Successfully merging this pull request may close these issues.

[Feature]: Add Dual Encoder
2 participants