For custom RE dataset with entities marked in advance #166

KennyNg-19 · 2021-08-17T12:00:54Z

Hi, as a newcomer, I would like to ask some basic questions about fine-tuning on a custom RE dataset with entities marked in advance:

Do we need to constrain the kind of entity marker or dummy word used for BioNLP when marking the entities (e.g. @disease$, [e] some disease [/e])?
When preprocessing, do we need to add code to help the model tokenize the entity markers? For example, if we use [E1] to mark an entity, should we let the tokenizer know about it:

tokenizer.add_tokens(['[E1]', '[/E1]', '[E2]', '[/E2]', '[BLANK]'])

Hi Chloe,

Yes, you need to input task_name. If your dataset is a task of binary classification, you can use either of them. Basically, euadr and gad are processed in the same way (using BioBERTProcessor).

biobert/run_re.py

Lines 914 to 917 in 37599fb

"gad": BioBERTProcessor,
"polysearch": BioBERTProcessor,
"mirnadisease": BioBERTProcessor,
"euadr": BioBERTProcessor,

Please note, however, that the chemprot dataset is a multi-class classification task. Hence it is processed differently, and the same holds for the evaluation script.
Thank you for your interest in our work!
Best,
WonJin


wonjininfo · 2021-10-27T08:30:49Z

Hi Kenny,
Thank you for your interest in our paper and my apologies for the delay in response.

You can use any entity marker or dummy word, but please avoid words that commonly occur in ordinary text.
In my case, I used synthetic words like ENToGENEoMK. You need to register these words in the vocab (please see the next paragraph).

To add a custom token to the tokenizer (for this repo, which is the TensorFlow version), you need to modify vocab.txt. If you open the vocab.txt file, you can see the reserved [unused1] tokens at the beginning. You can replace these tokens with your custom tokens.
I think your code tokenizer.add_tokens(['[E1]', '[/E1]', '[E2]', '[/E2]', '[BLANK]']) is for the HuggingFace framework. Please check https://github.com/dmis-lab/biobert-pytorch for the PyTorch/HuggingFace version of the code.
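
The vocab.txt edit described above could be sketched roughly as follows. This is only an illustration under some assumptions: vocab.txt holds one token per line, reserved slots look like [unused1], [unused2], and so on, and the helper name replace_unused_tokens and the second marker ENToDISEASEoMK are made up for this example (they are not part of this repo). Replacing slots in place keeps the vocabulary size and the indices of all other tokens unchanged.

```python
import re

# Matches reserved placeholder entries such as "[unused1]" (assumed format).
UNUSED_SLOT = re.compile(r"\[unused\d+\]")


def replace_unused_tokens(vocab_lines, custom_tokens):
    """Return a copy of the vocab with [unusedN] slots replaced by custom tokens.

    Hypothetical helper for illustration; tokens already in the vocab are skipped.
    """
    vocab = list(vocab_lines)
    existing = set(vocab)
    pending = [tok for tok in custom_tokens if tok not in existing]
    slots = [i for i, tok in enumerate(vocab) if UNUSED_SLOT.fullmatch(tok)]
    if len(pending) > len(slots):
        raise ValueError("not enough [unused] slots for the custom tokens")
    # Overwrite the earliest free slots, leaving every other index untouched.
    for index, token in zip(slots, pending):
        vocab[index] = token
    return vocab


# Tiny in-memory stand-in for vocab.txt, so the sketch runs without any files.
demo_vocab = ["[PAD]", "[unused1]", "[unused2]", "[unused3]", "[UNK]", "the"]
new_vocab = replace_unused_tokens(demo_vocab, ["ENToGENEoMK", "ENToDISEASEoMK"])
print(new_vocab)
```

To apply this to a real file, you would read vocab.txt into a list of stripped lines, pass it through the function, and write the result back one token per line (keeping a backup of the original file).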

Thank you and once again, sorry for the delay in response.
