Current State
Looking at the GAD RE data as used in the BioBert paper and linked to in this repo, I can't find any way of making sense of it. Specifically, what are labels 1 and 0 supposed to mean? The options I've considered are:
one of them (presumably 1) means the sentence supports a relation (of any polarity) between the identified gene and disease, and the other means it does not.
one of them (presumably 1) means the sentence supports a positive relation, and the other means it does not (i.e. negative relation or no relation).
Neither of these line up with the data I see.
Here are a few examples from the first 2 cross-validation splits (i.e. subfolders 1 and 2). These are not cherry picked, my only bias was looking for short sentences that are easy to wrap my head around :)
label sentence
0 No evidence was obtained that the studied polymorphism
in @GENE$ is a determinant of the coumarin-associated @DISEASE$.
0 There is no allelic association between @DISEASE$ and @GENE$
gene polymorphisms.
1 C1014T SNP of @GENE$ does not appear to be associated with
susceptibility to @DISEASE$ in Japanese patients.
1 These data strongly suggest that @GENE$ is not a significant
susceptibility allele for @DISEASE$.
0 our study does not support the notion that @GENE$ and HTR1A gene
variants are major contributors to @DISEASE$-, anger-, or
aggression-related behaviors in our sample.
1 it is unlikely that common variants in MLH1, MLH3, @GENE$, MSH2,
MSH3 and MSH6 contribute significantly to @DISEASE$ susceptibility.
0 The ESRRA23 and Pro116Pro variants of the gene encoding @GENE$ are
not associated with @DISEASE$, type 2 diabetes or related
quantitative traits in the examined Danish whites.
-----
0 Abnormal @GENE$ gene copy numbers are a genetic risk factor in @DISEASE$.
1 Presence of the @GENE$ gene promoter polymorphisms was found to
be a negative prognostic parameter in patients with @DISEASE$.
0 We conclude that the @GENE$ gene Bst U I polymorphism is a
suitable genetic marker of @DISEASE$.
0 We identified a polymorphism in the @GENE$ gene associated with @DISEASE$.
0 We conclude that @GENE$ is associated with both the development
of @DISEASE$ and ABO incompatibility.
1 The @GENE$ gene is likely to be involved in the genetic
vulnerability for @DISEASE$.
0 The @GENE$ Asp allele may be a genetic risk factor for @DISEASE$,
and might influence the course of Alzheimer disease, even though
effects vary in different studies.
1 We conclude that the @GENE$ gene may be a susceptibility gene for @DISEASE$.
1 In conclusion, we have replicated the association of the @GENE$ P2
promoter haplotype with @DISEASE$ in a U.K. Caucasian population
where there is no evidence of linkage to 20q.
0 Polymorphisms related to a functional decrease in ligand binding
activity of @GENE$ are associated with @DISEASE$ in U.
1 Variants of the ADRB2, ADRA1d and @GENE$ genes may be related to a
predisposition to @DISEASE$.
---
0 (*) The presence of the @GENE$ Met66 allele does not contribute to the
decreased level of @DISEASE$(67) mRNA expression in the prefrontal
cortex of subjects with schizophrenia.
Notes:
The first group are negative associations (i.e. when I read them as a human, I conclude negative association).
The second group are positive associations. As you can see there seems to be no rhyme or reason to the 0s and 1s.
The last row, marked (*), is the closest I've found to an example of neither a positive nor a negative association (just co-occurrence, probably caused by NER misclassification).
Desired State
Either:
all samples in first half have the same label and all samples in second half have the opposite label.
all samples with the exception of (*) have the same label, and the (*) sample has the opposite label.
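To make the mismatch concrete, here is a minimal sketch that checks both desired states against the examples quoted above. The labels are copied from the table in this issue, and the "negative" / "positive" / "none" readings are my own human judgments as described in the notes; the function names are made up for illustration.

```python
from collections import Counter

# (dataset_label, my_reading) for each example quoted above, in order.
# The readings are my own judgment, not part of the dataset.
EXAMPLES = [
    # first group: sentences I read as negative associations
    (0, "negative"), (0, "negative"), (1, "negative"), (1, "negative"),
    (0, "negative"), (1, "negative"), (0, "negative"),
    # second group: sentences I read as positive associations
    (0, "positive"), (1, "positive"), (0, "positive"), (0, "positive"),
    (0, "positive"), (1, "positive"), (0, "positive"), (1, "positive"),
    (1, "positive"), (0, "positive"), (1, "positive"),
    # the (*) row: co-occurrence only
    (0, "none"),
]


def labels_by_reading(examples):
    """Count dataset labels separately for each human reading."""
    counts = {}
    for label, reading in examples:
        counts.setdefault(reading, Counter())[label] += 1
    return counts


def is_pure(counter):
    """True if every example in the group carries the same label."""
    return len(counter) == 1


def polarity_state_holds(examples):
    """First desired state: negative and positive groups get opposite labels."""
    counts = labels_by_reading(examples)
    neg, pos = counts["negative"], counts["positive"]
    return is_pure(neg) and is_pure(pos) and set(neg) != set(pos)


def any_relation_state_holds(examples):
    """Second desired state: everything but (*) shares one label, (*) gets the other."""
    counts = labels_by_reading(examples)
    related = counts["negative"] + counts["positive"]
    unrelated = counts["none"]
    return is_pure(related) and is_pure(unrelated) and set(related) != set(unrelated)


if __name__ == "__main__":
    print(labels_by_reading(EXAMPLES))
    print("polarity state holds:", polarity_state_holds(EXAMPLES))
    print("any-relation state holds:", any_relation_state_holds(EXAMPLES))
```

Both checks fail on these examples: each group contains a mix of 0s and 1s.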
Where did the GAD RE dataset come from?
The GAD RE dataset has taken on a life of its own (e.g. BLURB also uses it, and defers to BioBert for details), yet it's unclear to me what its origins are. It would be valuable to the community if the maintainers/authors (@wonjininfo ?) could elaborate on how this dataset came to be, or at least confirm my understanding described below.
Here's what I could gather about the genealogy of the GAD RE dataset:
Becker et al. (2004) presented GAD (I'll call this NIH GAD), a semi-automatically curated repository of associations (positive and negative -- i.e. evidence of lack of association) between genes and human diseases. At the time of publication (2004), it contained >5,000 data points. The original GAD did not have supporting text, but just pubmed IDs.
The original NIH GAD was retired in 2014. There is a zip file dump of the latest state of the dataset, and a cursory look at it confirms my understanding that NIH GAD did not have supporting text, only pubmed IDs.
Bravo et al. (2015) presented BeFree for biomedical RE. They also present a further-processed version of GAD (I'll call this BeFree GAD) to evaluate their model. They did some non-trivial work on the original NIH GAD data to produce a corpus of sentences with true/false labels (unlike NIH GAD's positive/negative labels) with this logic:
A (sentence, gene, disease) tuple is true if (roughly) NIH GAD contains a positive or negative assertion about that (gene, disease) pair citing the pubmed article containing that sentence.
The same is false if: despite gene and disease appearing in the sentence, NIH GAD does not note any positive/negative associations between them citing that article.
The original link for the BeFree GAD corpus is dead (but fwiw, the data presumably exists, in a new shape and form, in the DisGeNet project).
Lee, Yoon, et al. 2019 (i.e. BioBert) cite Bravo et al. (2015) for their usage of GAD without any further details, presumably because all they did was divide it up for 10-fold cross-validation. That is what is available at the link in the README of this repo.
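The BeFree GAD true/false rule, as I understand it from the description above, can be sketched as follows. This is my reading of the logic, not code from Bravo et al.; `GadEntry`, `befree_label`, and the placeholder identifiers are all hypothetical.

```python
from typing import Iterable, NamedTuple


class GadEntry(NamedTuple):
    """One NIH GAD record as I understand it (hypothetical field names)."""
    pmid: str         # pubmed ID of the cited article
    gene: str
    disease: str
    association: str  # "positive" or "negative" (evidence of no association)


def befree_label(pmid: str, gene: str, disease: str,
                 nih_gad: Iterable[GadEntry]) -> bool:
    """Label a (sentence, gene, disease) tuple, where `pmid` identifies the
    article containing the sentence.

    True if NIH GAD holds ANY assertion (positive or negative) about this
    (gene, disease) pair citing this article; False otherwise. Note that the
    polarity of the association and the sentence text are never consulted.
    """
    cited = {(e.pmid, e.gene, e.disease) for e in nih_gad}
    return (pmid, gene, disease) in cited


# Placeholder data, not real GAD records.
nih_gad = [
    GadEntry("PMID_A", "GENE_X", "DISEASE_Y", "negative"),
    GadEntry("PMID_B", "GENE_X", "DISEASE_Z", "positive"),
]

# A sentence from PMID_A mentioning GENE_X and DISEASE_Y is "true" even
# though the recorded association is negative.
print(befree_label("PMID_A", "GENE_X", "DISEASE_Y", nih_gad))
# The same pair mentioned in an article NIH GAD never cites is "false",
# whatever the sentence actually says.
print(befree_label("PMID_C", "GENE_X", "DISEASE_Y", nih_gad))
```

Under this reading, the label depends only on whether NIH GAD happens to cite the article for that pair, which would explain why sentences with clear negative or positive associations can carry either label.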
A Potential Issue
If all this is correct, the main issue I see is the definition of true/false as set by Bravo et al: the absence of an entry from NIH GAD is probably a poor proxy for whether a sentence supports an association for the purposes of RE.
The curators of NIH GAD could not have conceivably looked at all the extra "false" articles included in BeFree GAD. Additionally, NIH GAD was originally curated in 2004, and only updated until 2014. By its nature, NIH GAD probably has nothing to say about the majority of articles for most given (gene, disease) associations.
amirkdv changed the title from "Questionable true labels in GAD RE dataset" to "Questionable training labels in GAD RE dataset" on Dec 30, 2020
I've also just noticed this: the 0/1 labels seem to be assigned completely at random.
It has been 6 months since this question was asked, with no response from the authors. To me this seems like a major issue with the relation extraction problem/results.
Unless I am missing something glaringly obvious about what relationship is supposed to be extracted...