Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get_label* issues #212

Open
joeflack4 opened this issue Jul 27, 2022 · 6 comments
Open

get_label* issues #212

joeflack4 opened this issue Jul 27, 2022 · 6 comments

Comments

@joeflack4
Copy link
Contributor

joeflack4 commented Jul 27, 2022

Summary

I have a list of terms and I'm trying to get their labels using, but I got errors in some cases, and no label in others.

1-3: Using SqlImplementation

1. No get_label_by_uri() or get_labels_by_uris()

I appreciate the CURIE methods, and will likely use those more often; I'd imagine these URI ones are somewhere on the to-do list.

2. get_label_by_curie() errors out

Python code:

        labels = []
        for curie in list(df_i['term_id']):
            label = si.get_label_by_curie(curie)
            labels.append(label)

Short err: sqlalchemy.exc.DatabaseError: (sqlite3.DatabaseError) file is not a database
Long err:

/Users/joeflack4/virtualenvs/mondo-ingest/bin/python /Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/pydevd.py --multiproc --qt-support=auto --client 127.0.0.1 --port 49437 --file /Users/joeflack4/projects/mondo-ingest/src/analysis/excluded_terms_in_mondo.py -o reports/excluded_terms_in_mondo_xrefs.tsv
Connected to pydev debugger (build 211.7628.24)
Traceback (most recent call last):
  File "/Users/joeflack4/virtualenvs/mondo-ingest/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1819, in _execute_context
    self.dialect.do_execute(
  File "/Users/joeflack4/virtualenvs/mondo-ingest/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
    cursor.execute(statement, parameters)
sqlite3.DatabaseError: file is not a database

The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/Users/joeflack4/virtualenvs/mondo-ingest/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1862, in _execute_context
    self._handle_dbapi_exception(
  File "/Users/joeflack4/virtualenvs/mondo-ingest/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2043, in _handle_dbapi_exception
    util.raise_(
  File "/Users/joeflack4/virtualenvs/mondo-ingest/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
    raise exception
  File "/Users/joeflack4/virtualenvs/mondo-ingest/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1819, in _execute_context
    self.dialect.do_execute(
  File "/Users/joeflack4/virtualenvs/mondo-ingest/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.DatabaseError: (sqlite3.DatabaseError) file is not a database
[SQL: SELECT value FROM rdfs_label_statement WHERE subject = ?]
[parameters: ('ICD10CM:H18.1',)]
(Background on this error at: https://sqlalche.me/e/14/4xp6)

3. get_labels_for_curies() errors out & prints out a lot to stderr

Code: labels = [x for x in si.get_labels_for_curies(list(df_i['term_id']))]

Error:

 ...'UMLS:C0024588', 'SCTID:720344007', 'OMIM:254500', 'OMIM:156530', 'MESH:D010300', 'ICD10WHO:H53.31')]
(Background on this error at: https://sqlalche.me/e/14/4xp6)

It showed a lot more than this; more than my IDE's terminal could show. I couldn't see the actual stacktrace. Also looked like it printed the full list of terms in the ontology.

### 4. UX: get_labels_for_curies() returns a generator
I think the documentation shows specifically how to use it, but I wonder if this is the best design. I may just not be sure about the various benefits, but I'd definitely prefer if this defaulted to a (a) flat list, or (b) dict of CURIE -> label.

Edit: This is a design choice.

Additional information

How I'm initializing:

from oaklib.resource import OntologyResource
from oaklib.implementations.sqldb.sql_implementation import SqlImplementation

resource = OntologyResource(slug='myOwlOntology.owl', local=True)
si = SqlImplementation(resource)

Relevant dependency versions:

oaklib==0.1.34
sssom==0.3.11
sssom-schema==0.9.4

All dependency versions: mondo-ingest-pip-freeze.txt

4. Using ProntoImplementation got no labels() when passing CURIE

Example class

    <owl:Class rdf:about="http://www.orpha.net/ORDO/Orphanet_10">
        <rdfs:subClassOf rdf:resource="http://www.orpha.net/ORDO/Orphanet_377789"/>
        <rdfs:subClassOf rdf:resource="http://www.orpha.net/ORDO/Orphanet_557493"/>
        <obo:IAO_0000115 xml:lang="en">A rare sex chromosome ... </obo:IAO_0000115>
        ...
        <rdfs:label>48,XXYY syndrome</rdfs:label>
    </owl:Class>

Code snippet

from oaklib.resource import OntologyResource
from oaklib.implementations import ProntoImplementation

ontology = SqlImplementation(OntologyResource(slug='ordo.owl', local=True))
ontology.label('ORDO:10')  # None
ontology.label('http://www.orpha.net/ORDO/Orphanet_10')  # 48,XXYY syndrome

Suggestions

  1. Allow user to pass prefix_map when initializing ontology. This may solve cases where it can't find on CURIE, because it can't expand CURIE properly.
  2. Update ProntoImplementation def type definitions, e.g. label(self, curie: CURIE) -> str: and def _entity(self, curie: CURIE, strict=False): to Union[CURIE, URI]. In my test case, URI worked successfully.
@cmungall
Copy link
Collaborator

cmungall commented Aug 9, 2022

Thanks @joeflack4! It looks like the problem here has nothing to do with get_labels, the problem is further upstream when connecting to a sqlite database.

@joeflack4
Copy link
Contributor Author

@cmungall For 2 of these sub-issues, I agree with you that it is a DB connection issue.

I just added this to the bottom of my issue. This is how I'm initializing my usage of OAK:

from oaklib.resource import OntologyResource
from oaklib.implementations.sqldb.sql_implementation import SqlImplementation

resource = OntologyResource(slug='myOwlOntology.owl', local=True)
si = SqlImplementation(resource)

I just noticed that the front page of the OAK GitHub seems to indicate that I should be passing a sqlite .db file into OAK. I hope this isn't the case. Do I need to pre-create a .db file from OWL first? I hope not. I was under the impression that OAK would do this for me.

If a .db file is required, can we have SqlImplementation raise an error if something other than a .db is passed into the local param? I can make a quick PR for that if you like.

@cmungall
Copy link
Collaborator

cmungall commented Aug 9, 2022

Thanks, this helps.

We will work on improving the documentation for initialization of SqlImplementation. Yes, the canonical way to do this is to pass in a path to a ready-made sqlite database. See SqlImplementation.

However, you are correct in that if you pass in an OWL file, it will build the sqlite database for you. There are a few caveats here.

  • You will need to have the necessary tools installed for building the database. Currently this is a bit of a faff, but the plan is to have a pure python implementation: Make everything installable via PyPI semantic-sql#41
  • The input MUST be RDF/XML, because that is what one of the dependent tools (rdftab) expects. This will be resolved by the above issue
  • It may be hard to debug things that go wrong with the build. As part of your debugging process I would try:
    1. Try making the sqlite database outside of OAK first (I think you know how to do this from your mondo work)
    2. Check the database is built correctly by running a few queries on it
    3. If you can pass this .db file directly as input and you get the same errors, you know the problem is downstream. Otherwise it is upstream, and we can analyze the problem and fix it

I'm sorry, I missed the comment about iterators/generators before. This is indeed independent of the sqlite issues you are facing. The use of iterators is an intentional choice. We have a very preliminary FAQ entry about this, we will expand it.

The reason to use iterators is because OAK is designed for ontologies large and small, for endpoints local and remote, for big queries and small queries. Using an iterator allows the client to start doing useful work on initial results before all results have been computed.

There is indeed a tradeoff here, in that it requires slightly more defensive programming, and it's easy to make mistakes like len(oi.some_method(...)). If you really need to operate on the whole list, just wrap in list(...). But for most operations you can just operate on the iterator as if it were a list, e.g in for loops

I made a separate issue for this #221

@joeflack4
Copy link
Contributor Author

joeflack4 commented Aug 9, 2022

Ok, sorry to take up your times w/ the iterators concern. I struck-through that part of the issue.

Hmmm... no I haven't created a SQL DB in my Mondo work / from RDF or OWL yet, and I mostly have access to files in OWL, not RDF/XML. This will require a bit more work on my end. I'll converse with Nico and see how we want to use OAK in light of this.

Just adding again that I think SqlImplementation should raise an error then if it is initialized from a file with extension other than .db or .rdf (or some other means of determining this without looking at the file extension).

Thanks for your very detailed explanations.

@cmungall
Copy link
Collaborator

Just adding again that I think SqlImplementation should raise an error then if it is initialized from a file with extension other than .db or .rdf (or some other means of determining this without looking at the file extension).

I partially agree - I think the principle here is that it should raise an error if the input is something that it is unable to handle, and error reporting should be more intuitive. I made a separate issue for this: #223

However, I don't think it's good to unwaveringly assume a relationship between suffix and format/model

  • There are dozens of RDF formats. .rdf could mean turtle, it could mean rdf/xml
  • .owl is also a reasonable suffix for any RDF syntax file that is intended to be interpreted as OWL
  • there are all kinds of reasons why someone may make a file with no suffix or some other kind of suffix, they may in fact have no control

So while it's good to guess-with-consent based on suffix, it must be forgiving of going against convention

@joeflack4
Copy link
Contributor Author

@cmungall I have updated this issue with another problem case. To make best use of your time, please look at new suggestions at the bottom of the original post.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants