Added support for biomedical datasets with multiple entity types #3387

Merged: 20 commits, Feb 12, 2024

@WangXII (Collaborator) commented Dec 21, 2023

This pull request adds support for datasets that annotate multiple entity types, such as BC5CDR, which covers chemical and disease entities.

This enables training models that recognize multiple entity types at once, without needing a specialized model for each entity type. The class HUNER_ALL_BIORED exemplifies the implementation: it tags genes, chemicals, cell lines, diseases and species in a single corpus.

@mariosaenger (Collaborator) left a comment

@WangXII thank you very much for implementing all the changes and going through all datasets.

Concerning the integration of the task description, however, I would opt for another solution (and not materializing the prompts into the conll files), since in the end the user should be able to use the following API:

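from flair.data import Sentence
from flair.nn import Classifier

# Load the pretrained tagger and tag a plain sentence directly
ner_tagger = Classifier.load("hunflair2")

sentence = Sentence("TP53 is an onco-gene.")
ner_tagger.predict(sentence)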

This means the end user should not be responsible for, or bothered with, prepending the task prompts etc. I will commit a suggestion for an implementation right away.
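To illustrate the idea of hiding the prompt from the user, here is a minimal sketch in plain Python. It is not Flair's implementation: the class `PromptingTagger`, the `tag_fn` callable, and the prompt tokens are all hypothetical and exist only to show a prompt being prepended at prediction time and stripped from the result.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PromptingTagger:
    """Hypothetical wrapper (not part of Flair) that hides the task prompt."""

    # Any function mapping a token list to one tag per token (assumption).
    tag_fn: Callable[[List[str]], List[str]]
    # Placeholder prompt tokens; the real prompt format is not given in this thread.
    task_prompt: List[str] = field(default_factory=lambda: ["[", "Tag", "genes", "]"])

    def predict(self, tokens: List[str]) -> List[str]:
        # Prepend the prompt internally so the caller never has to.
        augmented = self.task_prompt + tokens
        tags = self.tag_fn(augmented)
        # Return tags for the user's tokens only, dropping the prompt positions.
        return tags[len(self.task_prompt):]
```

With such a wrapper the caller passes only the sentence tokens, and nothing about the prompt needs to be written into the training files.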

@@ -46,6 +46,9 @@

SENTENCE_TAG = "[__SENT__]"

MULTI_TASK_LEARNING = False

I think a global flag isn't a proper solution. We should refactor this, so that it becomes a parameter of the ConllWriter. Moreover, I'm not completely convinced by the strategy to materialize the task description, i.e., write the task prompt to the conll file.
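The refactoring proposed here could look roughly like the following sketch. The class `SketchConllWriter` is a hypothetical stand-in, not the actual ConllWriter from the code base, and what the flag controls (prepending a placeholder prompt) is an assumption made only to give the parameter something to do.

```python
from typing import Iterable, List, TextIO, Tuple

Token = Tuple[str, str]  # (token text, tag)


class SketchConllWriter:
    """Hypothetical writer: the former global flag is now a constructor argument."""

    def __init__(self, multi_task_learning: bool = False,
                 task_prompt: str = "[ Tag genes ]") -> None:
        # Each instance carries its own configuration instead of reading a module global.
        self.multi_task_learning = multi_task_learning
        self.task_prompt = task_prompt  # placeholder prompt, an assumption

    def write(self, sentences: Iterable[List[Token]], out: TextIO) -> None:
        for sentence in sentences:
            if self.multi_task_learning:
                # Prompt tokens are written with the outside tag "O".
                for word in self.task_prompt.split():
                    out.write(f"{word} O\n")
            for text, tag in sentence:
                out.write(f"{text} {tag}\n")
            out.write("\n")  # a blank line separates sentences
```

Two writers with different settings can then coexist in one process, which a module-level flag does not allow.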

@@ -46,6 +46,9 @@

SENTENCE_TAG = "[__SENT__]"

MULTI_TASK_LEARNING = False
IGNORE_NEGATIVE_SAMPLES = False

If I'm getting it right, this flag is used to remove sentences without any labels. Why is this necessary? Are there (strong) performance differences?
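For reference, removing sentences without any labels can be sketched as below. This is an illustration under the reading given above, not the code from this pull request; the function name `drop_negative_samples` and the `(text, tag)` sentence representation are assumptions.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (token text, tag)


def drop_negative_samples(sentences: List[List[Token]],
                          outside_tag: str = "O") -> List[List[Token]]:
    """Keep only sentences that contain at least one non-outside tag."""
    return [
        sentence
        for sentence in sentences
        if any(tag != outside_tag for _, tag in sentence)
    ]
```

Whether such filtering helps is exactly the open question: it shifts the ratio of labeled to unlabeled sentences seen during training.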

@alanakbik merged commit d55c0e9 into master on Feb 12, 2024
1 check passed

@alanakbik (Collaborator) commented

Cool new feature! Thanks @WangXII and @mariosaenger for adding this!
