Question: How to establish which entities the Source and Target in a GO-CAMS relation point to? (Newbie question) #466

Open
zwurgl opened this issue Dec 12, 2023 · 5 comments

Comments

@zwurgl

zwurgl commented Dec 12, 2023

I hope I can quickly ask a question about some of the available GO-CAMS data (https://geneontology.org/docs/download-go-cams/). I am new to GO, so maybe this is all pretty evident, but anyway:

I'm busy right now generating formal representations of relations between genes, proteins, molecules and diseases found in text documents (publications, etc.) in a public research project. We apply machine learning methods to read text and generate relations ("A upregulates B", "C inhibits D", ...). We have recently done that successfully on corpora in other formats (BEL, ...).

Now, a collaborator pointed out the resources in https://geneontology.org/docs/download-go-cams/ to me and suggested checking whether they can also be used as a training data set. In the ttl part of the page above, one finds, among other things, a few thousand relations like the one below:

_:b40 <http://purl.org/pav/providedBy>  "http://www.wormbase.org"  ;
     <http://geneontology.org/lego/evidence>
     <http://model.geneontology.org/568b0f9600000284/5ce58dde00000278>  ;
     <http://www.w3.org/2002/07/owl#annotatedProperty>
     <http://purl.obolibrary.org/obo/RO_0002629>  ;
     <http://www.w3.org/2002/07/owl#annotatedSource>
     <http://model.geneontology.org/568b0f9600000284/57ec3a7e00000079>  ;
     <http://www.w3.org/2002/07/owl#annotatedTarget>
     <http://model.geneontology.org/568b0f9600000284/57ec3a7e00000109>  ;
     <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
     <http://www.w3.org/2002/07/owl#Axiom>  ;
     <http://www.w3.org/2000/01/rdf-schema#comment>  "ES
     (PMID:15625192): A weak interaction between Flag-tagged TIR-1 and
     T7-tagged NSY-1 could be detected, but this interaction appeared
     inefficient compared to the TIR-1/UNC-43 interaction (data not
     shown)." ;
     <http://purl.org/dc/elements/1.1/contributor>
     "http://orcid.org/0000-0002-3013-9906"  ;
     <http://purl.org/dc/elements/1.1/date>   "2019-05-31" .

It seems we have a "directly positively regulates" relation here (http://purl.obolibrary.org/obo/RO_0002629) between two entities (Source and Target) mentioned in the corresponding sentence (rdf-schema#comment).

My question is purely technical: How can I determine what these two entities, Source and Target, are (since the URLs do not really point to a page that tells the browser what entities are behind them)? Of course, in RDF a URL does not necessarily need to point to a resource that can be accessed as is; it can also be a DB identifier etc.

Being new to GO, I'd appreciate a hint on this one detail: How do I get from the URIs of Source and Target (for source http://model.geneontology.org/568b0f9600000284/57ec3a7e00000079 and for target http://model.geneontology.org/568b0f9600000284/57ec3a7e00000109) to the specific entities that these URIs refer to? Looking at the sentence above, maybe the Source is "TIR-1" and the Target "NSY-1"? Or the other way round? Or "UNC-43"? There is certainly a formal link between the URLs and these entities in the text. But which link is that?

Ideally after identifying these entities we can use the GO-CAMS ttl data as further training data, meaning for the example above:

  • we have the original sentence (rdf-schema#comment)
  • we have the type of the relation ("directly positively regulates")
  • and we have the entities from the sentence (source and target) between which the relation holds and can infer/compute the precise positions in the sentences for these two entities.

Any hint regarding these questions would be highly appreciated.

@balhoff
Member

balhoff commented Dec 12, 2023

Hi @zwurgl, for some reason the Turtle you pasted looks a little garbled. Here is that section from the file:

_:b40   <http://purl.org/pav/providedBy>  "http://www.wormbase.org" ;
        <http://geneontology.org/lego/evidence>  <http://model.geneontology.org/568b0f9600000284/5ce58dde00000278> ;
        <http://www.w3.org/2002/07/owl#annotatedProperty>  <http://purl.obolibrary.org/obo/RO_0002629> ;
        <http://www.w3.org/2002/07/owl#annotatedSource>  <http://model.geneontology.org/568b0f9600000284/57ec3a7e00000079> ;
        <http://www.w3.org/2002/07/owl#annotatedTarget>  <http://model.geneontology.org/568b0f9600000284/57ec3a7e00000109> ;
        <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>  <http://www.w3.org/2002/07/owl#Axiom> ;
        <http://www.w3.org/2000/01/rdf-schema#comment>  "ES (PMID:15625192): A weak interaction between Flag-tagged TIR-1 and T7-tagged NSY-1 could be detected, but this interaction appeared inefficient compared to the TIR-1/UNC-43 interaction (data not shown)." ;
        <http://purl.org/dc/elements/1.1/contributor>  "http://orcid.org/0000-0002-3013-9906" ;
        <http://purl.org/dc/elements/1.1/date>  "2019-05-31" .

For those two nodes, you can find other triples in the data. For example:

<http://model.geneontology.org/568b0f9600000284/57ec3a7e00000079> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>  <http://purl.obolibrary.org/obo/GO_0035591>

and

<http://model.geneontology.org/568b0f9600000284/57ec3a7e00000109> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>  <http://purl.obolibrary.org/obo/GO_0004709>

So it is an instance of signaling adaptor activity which is directly positively regulating an instance of MAP kinase kinase kinase activity. You would need to look at further graph connections to reach the gene products enabling these activities. It's all in there, but not that convenient maybe for someone not using OWL or RDF directly. We have been working on a report format which simplifies some of this, but it's not part of the public outputs yet (needs some refinement). The text there comes from the publication which the curator used to construct this part of the graph. It doesn't necessarily describe the exact edge we're looking at.
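As a rough illustration of the lookup described above, here is a minimal Python sketch; it is not part of any GO tooling. It works on plain (subject, predicate, object) string triples so that it runs without extra dependencies, and the optional `load_triples` helper assumes the third-party `rdflib` package is installed. Following `RO_0002333` ("enabled by") from an activity node to a gene product node is an assumption about how the models are laid out, and all function names are made up for this example.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL = "http://www.w3.org/2002/07/owl#"
OWL_AXIOM = OWL + "Axiom"
OWL_NAMED_INDIVIDUAL = OWL + "NamedIndividual"
# Assumption: activity nodes point to gene product nodes via "enabled by".
ENABLED_BY = "http://purl.obolibrary.org/obo/RO_0002333"


def load_triples(path):
    """Optional helper: read a GO-CAM .ttl file into string triples."""
    import rdflib  # third-party; only needed when reading real files

    graph = rdflib.Graph()
    graph.parse(path, format="turtle")
    return [(str(s), str(p), str(o)) for s, p, o in graph]


def index_by_subject(triples):
    """Build a subject -> predicate -> [objects] lookup table."""
    index = {}
    for s, p, o in triples:
        index.setdefault(s, {}).setdefault(p, []).append(o)
    return index


def objects(index, node, predicate):
    """Return all objects of (node, predicate), or an empty list."""
    return index.get(node, {}).get(predicate, [])


def types_of(index, node):
    """Return the classes of a node, ignoring owl:NamedIndividual."""
    return [t for t in objects(index, node, RDF_TYPE) if t != OWL_NAMED_INDIVIDUAL]


def enablers_of(index, node):
    """Return the classes of whatever the node is 'enabled by' (assumed layout)."""
    return [t for gp in objects(index, node, ENABLED_BY) for t in types_of(index, gp)]


def describe_edges(triples):
    """Resolve each annotated owl:Axiom to the types of its source and target."""
    index = index_by_subject(triples)
    edges = []
    for props in index.values():
        if OWL_AXIOM not in props.get(RDF_TYPE, []):
            continue
        relation = next(iter(props.get(OWL + "annotatedProperty", [])), None)
        for source in props.get(OWL + "annotatedSource", []):
            for target in props.get(OWL + "annotatedTarget", []):
                edges.append({
                    "relation": relation,
                    "source_types": types_of(index, source),
                    "target_types": types_of(index, target),
                    "source_enabled_by": enablers_of(index, source),
                    "target_enabled_by": enablers_of(index, target),
                })
    return edges


if __name__ == "__main__":
    model = "http://model.geneontology.org/568b0f9600000284/"
    obo = "http://purl.obolibrary.org/obo/"
    # Triples taken from this thread; "b40" stands in for the blank node.
    example = [
        ("b40", RDF_TYPE, OWL_AXIOM),
        ("b40", OWL + "annotatedProperty", obo + "RO_0002629"),
        ("b40", OWL + "annotatedSource", model + "57ec3a7e00000079"),
        ("b40", OWL + "annotatedTarget", model + "57ec3a7e00000109"),
        (model + "57ec3a7e00000079", RDF_TYPE, obo + "GO_0035591"),
        (model + "57ec3a7e00000109", RDF_TYPE, obo + "GO_0004709"),
    ]
    for edge in describe_edges(example):
        print(edge)
```

With a real file, `describe_edges(load_triples("some-model.ttl"))` would be the entry point, where the file name is a placeholder. The "enabled by" lists stay empty for the excerpt above because the thread does not show those triples.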

@zwurgl
Author

zwurgl commented Dec 14, 2023

Thanks @balhoff for the explanation. That makes sense and allows me to explore the next steps. When you say "the text ... doesn't necessarily describe the exact edge we're looking at", that means that in general it cannot be assumed that the text (the sentence etc.) that led to this edge is contained in this dataset at all, right? That of course limits the usefulness of the dataset for the purpose I had in mind (as useful as it certainly is for other purposes). Thanks a lot again, and I will watch out for your report with the simplified data format :-)

@balhoff
Member

balhoff commented Dec 14, 2023

@zwurgl I will ask @vanaukenk to comment about how the text in that comment ("ES (PMID:15625192): A weak interaction between Flag-tagged TIR-1 and T7-tagged NSY-1 could be detected, but this interaction appeared inefficient compared to the TIR-1/UNC-43 interaction (data not shown).") is added to the model. I don't think it is standard content that you will find connected to axioms in GO-CAMs. @vanaukenk, this is a worm model so I thought you might know where the text extract comes from.

@balhoff
Member

balhoff commented Dec 14, 2023

@zwurgl I wonder if you could collect all the PMIDs from a particular model and associate text from those papers with the overall OWL representation in the GO-CAM. Would that be too indirect for your purposes?
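Collecting the PMIDs of a model could look roughly like the sketch below. It assumes the model has already been read into (subject, predicate, object) string triples, and it simply scans every object value for the `PMID:<digits>` pattern seen in the comment above, because which predicates carry PMIDs in a given model is not stated in this thread. The function name `collect_pmids` is made up for the example.

```python
import re

# Matches references written like "PMID:15625192".
PMID_PATTERN = re.compile(r"PMID:\s*(\d+)")


def collect_pmids(triples):
    """Return the distinct PMIDs found in any object value, sorted numerically."""
    pmids = set()
    for _subject, _predicate, obj in triples:
        pmids.update(PMID_PATTERN.findall(str(obj)))
    return sorted(pmids, key=int)


if __name__ == "__main__":
    example = [
        (
            "b40",
            "http://www.w3.org/2000/01/rdf-schema#comment",
            "ES (PMID:15625192): A weak interaction between Flag-tagged TIR-1 "
            "and T7-tagged NSY-1 could be detected ...",
        ),
    ]
    print(collect_pmids(example))
```

Scanning all values is deliberately broad; once it is clear where a model stores its references, restricting the scan to those predicates would avoid picking up PMIDs that are only mentioned in passing.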

@zwurgl
Author

zwurgl commented Dec 15, 2023

@balhoff thanks. I will check. Taking your hint from above into account, what we have in this GO-CAMS dataset then is: the source entity, the target entity, the type of relation and the PMID. So the next step for me is to identify source and target (or any of their known variants or synonyms) in the text of the PMID (hopefully in the abstract) and get their positions (start / end). Then I also stipulate a NO-RELATION between all other pairs of entities in that PMID abstract, and we have a training corpus ... A bit of fiddling, but feasible.
Besides the TTL data, the GO-CAMS download page also lists data in SIF and a blazegraph DB. The SIF data doesn't seem to contain a reference to the underlying text. The blazegraph DB is huge; I will dive into it in due time to see whether it is useful as well.
Thanks again for your insights above!
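The corpus-building step outlined above could be sketched as follows. This is a minimal illustration under stated assumptions, not a tested pipeline: the synonym table, the function names, and in particular the choice of (TIR-1, NSY-1) as the positive pair are placeholders, since the thread does not establish which entities the source and target actually are.

```python
import re
from itertools import permutations


def find_mentions(text, synonyms):
    """Return sorted (start, end, entity) spans for every synonym hit in the text."""
    mentions = []
    for entity, names in synonyms.items():
        for name in names:
            # Do not match inside longer tokens such as "XTIR-1" or "TIR-10".
            pattern = r"(?<![\w-])" + re.escape(name) + r"(?![\w-])"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                mentions.append((match.start(), match.end(), entity))
    return sorted(mentions)


def build_examples(text, synonyms, positive_pairs):
    """Label every ordered pair of mentions of two different entities.

    positive_pairs maps (source_entity, target_entity) to a relation label;
    every other pair is labelled NO-RELATION.
    """
    examples = []
    for source, target in permutations(find_mentions(text, synonyms), 2):
        if source[2] == target[2]:
            continue  # two mentions of the same entity form no pair
        label = positive_pairs.get((source[2], target[2]), "NO-RELATION")
        examples.append({"source": source, "target": target, "label": label})
    return examples


if __name__ == "__main__":
    sentence = (
        "A weak interaction between Flag-tagged TIR-1 and T7-tagged NSY-1 "
        "could be detected, but this interaction appeared inefficient "
        "compared to the TIR-1/UNC-43 interaction (data not shown)."
    )
    # Hypothetical synonym table; real synonyms would come from a gene database.
    synonyms = {"TIR-1": ["TIR-1"], "NSY-1": ["NSY-1"], "UNC-43": ["UNC-43"]}
    # Placeholder direction, chosen only to make the example run.
    positives = {("TIR-1", "NSY-1"): "directly positively regulates"}
    for example in build_examples(sentence, synonyms, positives):
        print(example)
```

Pairs are generated per mention rather than per entity, so an entity that occurs twice in the text yields two examples for each partner. Overlapping synonyms would produce overlapping spans, which a real pipeline would need to deduplicate.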

3 participants