Create a dataset loader for GAD corpus #608

ruisi-su · 2022-05-21T15:28:59Z

Adding a Dataset

Name: Genetic Association Database Corpus
Description: A corpus of gene-disease associations, identified by a semi-automatic annotation procedure based on the Genetic Association Database.
Task: RE
Paper: paper
Data: data from this issue
License: CC BY 4.0
Motivation: part of BLURB

tmabraham · 2022-05-23T08:03:37Z

FYI: the issue you linked makes clear that this is a weakly labeled dataset with questionable labels.

The author of the paper also said:

To conclude, I agree that the GAD and EUADR datasets are weakly supervised (distant supervision) datasets. And since we now have multiple high-quality BioRE datasets, I personally suggest that we need to refrain from using weakly labeled datasets and move to use other datasets such as ChemProt, DrugProt, or other human-labeled datasets for evaluating BioLMs.

RE datasets dmis-lab/biobert#162 (comment)

ruisi-su · 2022-05-23T13:58:23Z

Thanks @tmabraham for looking into this. I remember GAD's labels causing confusion when I was tracking down this dataset. This issue was created to stay consistent with the BLURB benchmark. However, I think your point (along with others' concerns about this dataset) is very valid. We will discuss and get back to you on this!

jason-fries · 2022-05-23T14:38:38Z

Actually @ruisi-su @tmabraham, can we keep this as high priority for implementation? The only valid reason to deprioritize a dataset used in a standard benchmark is if that dataset isn’t public. More generally, as a research question, we’re interested in models trained on labels of different provenance (e.g., weakly supervised) to measure performance tradeoffs. From this perspective, datasets like these are quite valuable.

SamuelCahyawijaya · 2022-05-26T04:45:33Z

#self-assign

ruisi-su added High Priority CC BY 4.0 Licence RE Task labels May 21, 2022

github-actions bot assigned SamuelCahyawijaya May 26, 2022

ruisi-su closed this as completed in 76e1374 May 26, 2022

ruisi-su pushed a commit to ruisi-su/biomedical that referenced this issue May 27, 2022

Closes bigscience-workshop#608 | GAD Corpus (bigscience-workshop#634)

c3ae057

* add GAD dataset
* update metadata on gad.py
