`get_label*` issues #212

joeflack4 · 2022-07-27T02:36:33Z

Summary

I have a list of terms and I'm trying to get their labels using, but I got errors in some cases, and no label in others.

1-3: Using `SqlImplementation`

1. No `get_label_by_uri()` or `get_labels_by_uris()`

I appreciate the CURIE methods, and will likely use those more often; I'd imagine these URI ones are somewhere on the to-do list.

2. `get_label_by_curie()` errors out

Python code:

        labels = []
        for curie in list(df_i['term_id']):
            label = si.get_label_by_curie(curie)
            labels.append(label)

Short err: sqlalchemy.exc.DatabaseError: (sqlite3.DatabaseError) file is not a database
Long err:

/Users/joeflack4/virtualenvs/mondo-ingest/bin/python /Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/pydevd.py --multiproc --qt-support=auto --client 127.0.0.1 --port 49437 --file /Users/joeflack4/projects/mondo-ingest/src/analysis/excluded_terms_in_mondo.py -o reports/excluded_terms_in_mondo_xrefs.tsv
Connected to pydev debugger (build 211.7628.24)
Traceback (most recent call last):
  File "/Users/joeflack4/virtualenvs/mondo-ingest/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1819, in _execute_context
    self.dialect.do_execute(
  File "/Users/joeflack4/virtualenvs/mondo-ingest/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
    cursor.execute(statement, parameters)
sqlite3.DatabaseError: file is not a database

The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/Users/joeflack4/virtualenvs/mondo-ingest/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1862, in _execute_context
    self._handle_dbapi_exception(
  File "/Users/joeflack4/virtualenvs/mondo-ingest/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2043, in _handle_dbapi_exception
    util.raise_(
  File "/Users/joeflack4/virtualenvs/mondo-ingest/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
    raise exception
  File "/Users/joeflack4/virtualenvs/mondo-ingest/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1819, in _execute_context
    self.dialect.do_execute(
  File "/Users/joeflack4/virtualenvs/mondo-ingest/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.DatabaseError: (sqlite3.DatabaseError) file is not a database
[SQL: SELECT value FROM rdfs_label_statement WHERE subject = ?]
[parameters: ('ICD10CM:H18.1',)]
(Background on this error at: https://sqlalche.me/e/14/4xp6)

3. `get_labels_for_curies()` errors out & prints out a lot to `stderr`

Code: labels = [x for x in si.get_labels_for_curies(list(df_i['term_id']))]

Error:

 ...'UMLS:C0024588', 'SCTID:720344007', 'OMIM:254500', 'OMIM:156530', 'MESH:D010300', 'ICD10WHO:H53.31')]
(Background on this error at: https://sqlalche.me/e/14/4xp6)

It showed a lot more than this; more than my IDE's terminal could show. I couldn't see the actual stacktrace. Also looked like it printed the full list of terms in the ontology.

~~### 4. UX: get_labels_for_curies() returns a generator~~
I think the documentation shows specifically how to use it, but I wonder if this is the best design. I may just not be sure about the various benefits, but I'd definitely prefer if this defaulted to a (a) flat list, or (b) dict of CURIE -> label.

Edit: This is a design choice.

Additional information

How I'm initializing:

from oaklib.resource import OntologyResource
from oaklib.implementations.sqldb.sql_implementation import SqlImplementation

resource = OntologyResource(slug='myOwlOntology.owl', local=True)
si = SqlImplementation(resource)

Relevant dependency versions:

oaklib==0.1.34
sssom==0.3.11
sssom-schema==0.9.4

All dependency versions: mondo-ingest-pip-freeze.txt

4. Using `ProntoImplementation` got no labels() when passing CURIE

Example class

    <owl:Class rdf:about="http://www.orpha.net/ORDO/Orphanet_10">
        <rdfs:subClassOf rdf:resource="http://www.orpha.net/ORDO/Orphanet_377789"/>
        <rdfs:subClassOf rdf:resource="http://www.orpha.net/ORDO/Orphanet_557493"/>
        <obo:IAO_0000115 xml:lang="en">A rare sex chromosome ... </obo:IAO_0000115>
        ...
        <rdfs:label>48,XXYY syndrome</rdfs:label>
    </owl:Class>

Code snippet

from oaklib.resource import OntologyResource
from oaklib.implementations import ProntoImplementation

ontology = SqlImplementation(OntologyResource(slug='ordo.owl', local=True))
ontology.label('ORDO:10')  # None
ontology.label('http://www.orpha.net/ORDO/Orphanet_10')  # 48,XXYY syndrome

Suggestions

Allow user to pass prefix_map when initializing ontology. This may solve cases where it can't find on CURIE, because it can't expand CURIE properly.
Update ProntoImplementation def type definitions, e.g. label(self, curie: CURIE) -> str: and def _entity(self, curie: CURIE, strict=False): to Union[CURIE, URI]. In my test case, URI worked successfully.

The text was updated successfully, but these errors were encountered:

cmungall · 2022-08-09T15:59:37Z

Thanks @joeflack4! It looks like the problem here has nothing to do with get_labels, the problem is further upstream when connecting to a sqlite database.

joeflack4 · 2022-08-09T18:56:06Z

@cmungall For 2 of these sub-issues, I agree with you that it is a DB connection issue.

I just added this to the bottom of my issue. This is how I'm initializing my usage of OAK:

from oaklib.resource import OntologyResource
from oaklib.implementations.sqldb.sql_implementation import SqlImplementation

resource = OntologyResource(slug='myOwlOntology.owl', local=True)
si = SqlImplementation(resource)

I just noticed that the front page of the OAK GitHub seems to indicate that I should be passing a sqlite .db file into OAK. I hope this isn't the case. Do I need to pre-create a .db file from OWL first? I hope not. I was under the impression that OAK would do this for me.

If a .db file is required, can we have SqlImplementation raise an error if something other than a .db is passed into the local param? I can make a quick PR for that if you like.

cmungall · 2022-08-09T19:23:59Z

Thanks, this helps.

We will work on improving the documentation for initialization of SqlImplementation. Yes, the canonical way to do this is to pass in a path to a ready-made sqlite database. See SqlImplementation.

However, you are correct in that if you pass in an OWL file, it will build the sqlite database for you. There are a few caveats here.

You will need to have the necessary tools installed for building the database. Currently this is a bit of a faff, but the plan is to have a pure python implementation: Make everything installable via PyPI semantic-sql#41
The input MUST be RDF/XML, because that is what one of the dependent tools (rdftab) expects. This will be resolved by the above issue
It may be hard to debug things that go wrong with the build. As part of your debugging process I would try:
1. Try making the sqlite database outside of OAK first (I think you know how to do this from your mondo work)
2. Check the database is built correctly by running a few queries on it
3. If you can pass this .db file directly as input and you get the same errors, you know the problem is downstream. Otherwise it is upstream, and we can analyze the problem and fix it

I'm sorry, I missed the comment about iterators/generators before. This is indeed independent of the sqlite issues you are facing. The use of iterators is an intentional choice. We have a very preliminary FAQ entry about this, we will expand it.

The reason to use iterators is because OAK is designed for ontologies large and small, for endpoints local and remote, for big queries and small queries. Using an iterator allows the client to start doing useful work on initial results before all results have been computed.

There is indeed a tradeoff here, in that it requires slightly more defensive programming, and it's easy to make mistakes like len(oi.some_method(...)). If you really need to operate on the whole list, just wrap in list(...). But for most operations you can just operate on the iterator as if it were a list, e.g in for loops

I made a separate issue for this #221

joeflack4 · 2022-08-09T19:38:16Z

Ok, sorry to take up your times w/ the iterators concern. I struck-through that part of the issue.

Hmmm... no I haven't created a SQL DB in my Mondo work / from RDF or OWL yet, and I mostly have access to files in OWL, not RDF/XML. This will require a bit more work on my end. I'll converse with Nico and see how we want to use OAK in light of this.

Just adding again that I think SqlImplementation should raise an error then if it is initialized from a file with extension other than .db or .rdf (or some other means of determining this without looking at the file extension).

Thanks for your very detailed explanations.

cmungall · 2022-08-10T00:41:53Z

Just adding again that I think SqlImplementation should raise an error then if it is initialized from a file with extension other than .db or .rdf (or some other means of determining this without looking at the file extension).

I partially agree - I think the principle here is that it should raise an error if the input is something that it is unable to handle, and error reporting should be more intuitive. I made a separate issue for this: #223

However, I don't think it's good to unwaveringly assume a relationship between suffix and format/model

There are dozens of RDF formats. .rdf could mean turtle, it could mean rdf/xml
.owl is also a reasonable suffix for any RDF syntax file that is intended to be interpreted as OWL
there are all kinds of reasons why someone may make a file with no suffix or some other kind of suffix, they may in fact have no control

So while it's good to guess-with-consent based on suffix, it must be forgiving of going against convention

joeflack4 · 2022-09-30T02:04:27Z

@cmungall I have updated this issue with another problem case. To make best use of your time, please look at new suggestions at the bottom of the original post.

joeflack4 mentioned this issue Jul 27, 2022

PR: Which terms outside the current term scopes are xref'd? monarch-initiative/mondo-ingest#35

Merged

cmungall mentioned this issue Aug 10, 2022

Provide more informative error messages if the input for SqlImplementation is not a valid sqlite file #223

Open

joeflack4 mentioned this issue Sep 30, 2022

Slurp pipeline: Initial design monarch-initiative/mondo-ingest#36

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`get_label*` issues #212

`get_label*` issues #212

joeflack4 commented Jul 27, 2022 •

edited

Loading

cmungall commented Aug 9, 2022

joeflack4 commented Aug 9, 2022

cmungall commented Aug 9, 2022 •

edited

Loading

joeflack4 commented Aug 9, 2022 •

edited

Loading

cmungall commented Aug 10, 2022

joeflack4 commented Sep 30, 2022

get_label* issues #212

get_label* issues #212

Comments

joeflack4 commented Jul 27, 2022 • edited Loading

Summary

1-3: Using SqlImplementation

1. No get_label_by_uri() or get_labels_by_uris()

2. get_label_by_curie() errors out

3. get_labels_for_curies() errors out & prints out a lot to stderr

Additional information

4. Using ProntoImplementation got no labels() when passing CURIE

Suggestions

cmungall commented Aug 9, 2022

joeflack4 commented Aug 9, 2022

cmungall commented Aug 9, 2022 • edited Loading

joeflack4 commented Aug 9, 2022 • edited Loading

cmungall commented Aug 10, 2022

joeflack4 commented Sep 30, 2022

`get_label*` issues #212

`get_label*` issues #212

joeflack4 commented Jul 27, 2022 •

edited

Loading

1-3: Using `SqlImplementation`

1. No `get_label_by_uri()` or `get_labels_by_uris()`

2. `get_label_by_curie()` errors out

3. `get_labels_for_curies()` errors out & prints out a lot to `stderr`

4. Using `ProntoImplementation` got no labels() when passing CURIE

cmungall commented Aug 9, 2022 •

edited

Loading

joeflack4 commented Aug 9, 2022 •

edited

Loading