Create a dataset loader for GAD corpus #608

Closed
ruisi-su opened this issue May 21, 2022 · 4 comments
Labels
CC BY 4.0 Licence High Priority RE Task

Comments

@ruisi-su
Collaborator

Adding a Dataset

  • Name: Genetic Association Database Corpus
  • Description: a corpus identifying associations between genes and diseases by a semi-automatic annotation procedure based on the Genetic Association Database
  • Task: RE
  • Paper: paper
  • Data: data from this issue
  • License: CC BY 4.0
  • Motivation: part of BLURB
@ruisi-su added the High Priority, CC BY 4.0 Licence, and RE Task labels on May 21, 2022
@tmabraham

FYI: the issue you linked clearly highlights that this dataset is weakly labeled and that its labels are questionable.

The author of the paper also said:

To conclude, I agree that the GAD and EUADR datasets are weakly supervised (distant supervision) datasets. And since we now have multiple high-quality BioRE datasets, I personally suggest that we need to refrain from using weakly labeled datasets and move to use other datasets such as ChemProt, DrugProt, or other human-labeled datasets for evaluating BioLMs.

@ruisi-su
Collaborator Author

ruisi-su commented May 23, 2022

Thanks @tmabraham for looking into this. I remember GAD's labels causing confusion when I was tracking down this dataset. This issue was created to stay consistent with the BLURB benchmark. However, I think your point (along with others' concerns about this dataset) is very valid. We will discuss and get back to you on this!

@jason-fries
Member

Actually @ruisi-su @tmabraham, can we keep this as high priority for implementation? The only valid reason to deprioritize a dataset used in a standard benchmark is if that dataset isn’t public. More generally, as a research question, we’re interested in models trained on labels of different provenance (e.g., weakly supervised) to measure performance tradeoffs. From this perspective, datasets like these are quite valuable.

@SamuelCahyawijaya
Contributor

#self-assign

ruisi-su pushed a commit to ruisi-su/biomedical that referenced this issue May 27, 2022
* add GAD dataset

* update metadata on gad.py