Empty dev.tsv files in the GAD dataset after download #41

Open
MaxwellWibert opened this issue Aug 8, 2023 · 2 comments

Comments

@MaxwellWibert

I cloned the repo and ran download.sh, and found a dev.tsv file in each of the numbered folders; however, each of those files was completely empty. Is there another preprocessing script that is responsible for populating these files?

@wonjininfo
Member

wonjininfo commented Aug 8, 2023

Hi Maxwell,
For the GAD dataset, we chose to evaluate our model using 10-fold cross-validation, as it is a very small dataset. Therefore, there is no fixed Train-Dev-Test split, nor do we have a dev.tsv.

Unfortunately, GAD might not be an ideal resource for evaluating LMs. However, in the five years since the BioBERT paper was published, there have been significant efforts to create resources for relation extraction in NLP. This has led to the availability of other, relatively abundant resources for BioRE (to name a few: DrugProt, BioRED).

@MaxwellWibert
Author

MaxwellWibert commented Aug 9, 2023

Thank you for your response! We agree that GAD is not ideal as an LM evaluation dataset; however, BioBERT and its derivatives have become common benchmarks in the field, so we often have to recreate your original datasets. I'm afraid my institution's decision to use GAD as a benchmarking set is above my pay grade.

Maybe this is a silly question, but did you generate the 10-fold cross-validation by simply looping over the 10 train.tsv files and using the current file as the validation set?

In other words, is the k-fold structured as follows?
First iteration: 1/train.tsv for validation, 2/train.tsv through 10/train.tsv for training
...
i-th iteration: i/train.tsv for validation, j/train.tsv for all j != i for training
...
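
For concreteness, the scheme asked about above could be sketched roughly as follows. This only illustrates the question; it is not a confirmed description of how the BioBERT folds were built. The directory layout `<root>/<i>/train.tsv`, the root folder name `GAD`, and the function name `leave_one_folder_out` are all assumptions made for the example.

```python
from pathlib import Path
from typing import Iterator, List, Tuple


def leave_one_folder_out(root: str, k: int = 10) -> Iterator[Tuple[Path, List[Path]]]:
    """Yield (validation_file, training_files) for each of the k iterations.

    Assumes a hypothetical layout: root/1/train.tsv ... root/k/train.tsv.
    Only paths are built here; no file is opened or read.
    """
    files = [Path(root) / str(i) / "train.tsv" for i in range(1, k + 1)]
    for i, val_file in enumerate(files):
        # Every file except the i-th one is used for training in this iteration.
        train_files = files[:i] + files[i + 1:]
        yield val_file, train_files


if __name__ == "__main__":
    # "GAD" is a placeholder root directory, not a path taken from the repo.
    for val_file, train_files in leave_one_folder_out("GAD"):
        print(val_file, "held out;", len(train_files), "files for training")
```

Each iteration holds out exactly one folder's train.tsv and trains on the remaining nine, which is the structure the question describes.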
