Added support for biomedical datasets with multiple entity types #3387

WangXII · 2023-12-21T12:48:13Z

In this pull request, we've added support for datasets that annotate multiple entity types, such as BC5CDR, which annotates both chemical and disease entities.

This enables training models that recognize multiple entity types at once, without needing a specialized model for each entity type. The class HUNER_ALL_BIORED is an example of this implementation: it tags genes, chemicals, cell lines, diseases and species in a single corpus.

mariosaenger

@WangXII thank you very much for implementing all the changes and going through all datasets.

Concerning the integration of the task description, however, I would opt for another solution (and not materializing the prompts into the conll files), since in the end the user should be able to use the following API:

from flair.data import Sentence
from flair.nn import Classifier

ner_tagger = Classifier.load("hunflair2")

sentence = Sentence("TP53 is an onco-gene.")
ner_tagger.predict(sentence)

This means the end user should not be responsible for, or bothered with, prepending the task prompts etc. I will commit a suggestion for an implementation right away.
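To illustrate the idea of hiding the task prompt from the end user, here is a minimal sketch in plain Python (no flair dependency). The helper names `build_prefix`, `with_prefix` and `strip_prefix`, as well as the prompt format, are hypothetical and only for illustration; they are not the implementation that was committed. The prefix is added to the tokens right before the model sees them, and labels predicted for the prefix are dropped again so that the user only ever sees labels for the original tokens.

```python
from typing import List, Sequence, Tuple


def build_prefix(entity_types: Sequence[str]) -> List[str]:
    # Hypothetical prompt format; the real task description may look different.
    return ["[", "Tag", *entity_types, "]"]


def with_prefix(tokens: Sequence[str], entity_types: Sequence[str]) -> Tuple[List[str], int]:
    # Return the model input (prefix + original tokens) and the prefix length.
    prefix = build_prefix(entity_types)
    return prefix + list(tokens), len(prefix)


def strip_prefix(labels: Sequence[str], prefix_len: int) -> List[str]:
    # Drop the labels that belong to the prefix tokens.
    return list(labels[prefix_len:])


tokens = ["TP53", "is", "an", "onco-gene", "."]
model_input, offset = with_prefix(tokens, ["gene", "disease"])

# Stand-in for a model prediction over the prefixed input (made-up labels).
fake_labels = ["O"] * offset + ["B-gene", "O", "O", "O", "O"]
token_labels = strip_prefix(fake_labels, offset)
```

With this kind of wrapping inside the tagger, the caller passes a plain sentence and gets back one label per original token.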

mariosaenger · 2024-01-05T13:23:46Z

flair/datasets/biomedical.py

@@ -46,6 +46,9 @@

 SENTENCE_TAG = "[__SENT__]"

+MULTI_TASK_LEARNING = False


I think a global flag isn't a proper solution. We should refactor this, so that it becomes a parameter of the ConllWriter. Moreover, I'm not completely convinced by the strategy to materialize the task description, i.e., write the task prompt to the conll file.
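As a rough sketch of the first point (a constructor parameter instead of a module-level flag): the class `SketchConllWriter` below is a hypothetical stand-in written in plain Python, not flair's `ConllWriter`, and the `task_description` parameter name and the prompt string are assumptions. It only shows how the behaviour can be chosen per writer instance; it does not address the separate concern about materializing the prompt in the file at all.

```python
import io
from typing import Iterable, List, Optional, TextIO, Tuple

TaggedSentence = List[Tuple[str, str]]


class SketchConllWriter:
    """Illustrative stand-in, not flair's ConllWriter."""

    def __init__(self, task_description: Optional[str] = None) -> None:
        # Behaviour is configured per instance rather than via a global flag.
        self.task_description = task_description

    def write(self, sentences: Iterable[TaggedSentence], out: TextIO) -> None:
        for sentence in sentences:
            if self.task_description is not None:
                # Prompt tokens are written with the outside tag "O".
                for word in self.task_description.split():
                    out.write(f"{word} O\n")
            for token, tag in sentence:
                out.write(f"{token} {tag}\n")
            out.write("\n")  # blank line separates sentences


sentences = [[("TP53", "B-Gene"), ("is", "O")]]
plain, prefixed = io.StringIO(), io.StringIO()
SketchConllWriter().write(sentences, plain)
SketchConllWriter(task_description="[ Tag genes ]").write(sentences, prefixed)
```

Two writers with different settings can then coexist in the same process, which a global flag does not allow.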

mariosaenger · 2024-01-05T13:25:22Z

flair/datasets/biomedical.py

@@ -46,6 +46,9 @@

 SENTENCE_TAG = "[__SENT__]"

+MULTI_TASK_LEARNING = False
+IGNORE_NEGATIVE_SAMPLES = False


If I'm getting it right, this flag is used to remove sentences without any labels. Why is this necessary? Are there (strong) performance differences?
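For reference, the behaviour described in this reading of the flag could look roughly like the following plain-Python sketch. The function names and the `ignore_negative_samples` parameter are hypothetical, and the sketch assumes that "no labels" means every token carries the outside tag "O".

```python
from typing import List, Sequence, Tuple

TaggedSentence = List[Tuple[str, str]]


def has_entity(sentence: TaggedSentence) -> bool:
    # A sentence is a positive sample if at least one token is not tagged "O".
    return any(tag != "O" for _, tag in sentence)


def drop_negative_samples(
    sentences: Sequence[TaggedSentence], ignore_negative_samples: bool
) -> List[TaggedSentence]:
    # With the option off, all sentences are kept unchanged.
    if not ignore_negative_samples:
        return list(sentences)
    return [sentence for sentence in sentences if has_entity(sentence)]


corpus = [
    [("TP53", "B-Gene"), ("is", "O"), ("mutated", "O")],
    [("No", "O"), ("entities", "O"), ("here", "O")],
]
kept = drop_negative_samples(corpus, ignore_negative_samples=True)
unfiltered = drop_negative_samples(corpus, ignore_negative_samples=False)
```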

alanakbik · 2024-02-12T08:30:44Z

Cool new feature! Thanks @WangXII and @mariosaenger for adding this!

Xing Wang added 7 commits April 20, 2023 10:37

recent changes pushed

db028d5

added support for training directly on one given BigBio dataset. TODO…

6190bb4

…: Directly training on multiple BigBio datasets?

removed debug link for tmvar_v3 dataset

97d7905

cleaning up repository

a4deab5

updated for formatting changes

3f40196

Merge branch 'master' into multi_entity_hunflair

73986e0

merged master and prepare for push request

280701b

WangXII requested a review from mariosaenger December 21, 2023 12:48

mariosaenger reviewed Jan 9, 2024

Mario Sänger and others added 13 commits January 9, 2024 18:12

Add first implementation for a sequence tagger model which can work w…

b9abf0b

…ith task description-augmented sentences.

fixed prediction and evaluation on new API

2fe2fd0

removed try error case

fdb5ec9

Fix model loading + add more unit tests

1ea549a

Fix code style and mypy issues

96c1eaf

Simplify persisting and loading models

897b3c2

small changes

3f3361a

Fix formatting

46d8909

Make black happy

da25bb1

Make mypy happy

d2dfb11

Rename AugmentedSentenceSequenceTagger to PrefixedSequenceTagger and …

d501f58

…move to own package

Rename AugmentedSentence to PrefixedSentence

037862e

Rename classes also in comments

8e9eeb9

alanakbik merged commit d55c0e9 into master Feb 12, 2024
1 check passed
