Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimization: SqlImplementation.relationships_metadata() #659

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

joeflack4
Copy link
Contributor

@joeflack4 joeflack4 commented Sep 27, 2023

Updates

Optimization: SqlImplementation.relationships_metadata()

  • Add: SqlImplementation._axiom_annotations_all()
  • Update: SqlImplementation.relationships_metadata(): Increased performance to O(log n)
  • Update: Added docstrings for several related methods

Addresses

@joeflack4 joeflack4 marked this pull request as draft September 27, 2023 01:16
@joeflack4 joeflack4 added the enhancement New feature or request label Sep 27, 2023
Copy link
Contributor Author

@joeflack4 joeflack4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmungall This draft is ready for your eyes.
You recently added ability to retrieve axiom annotations on relationships (#656), but there is a significant performance bottleneck. I noticed this during my mondo-ingest synchronization pipeline work. This operation was taking 104 minutes to complete.

I've sped this up to 30 seconds to 10 minutes, depending on which implementation you prefer. I prefer the faster method.

I also thought of a possible implementation that uses sqlite and/or SqlAlchemy moreso than Python, but Python is easier for me.

I know that before you can merge this, you need at the very least a test case and codestyle adherence. I will meet with @matentzn and see if he wants me to complete this PR. If not, then this is here for you / your team as an example of what could / should be done.

FYI even though there's no test case, I was able to check that the code works as expected in an ad hoc way, by editing/running this code in the local version of oaklib in the site-packages of mondo-ingest and using it in my synchronization PR.

@@ -1065,8 +1065,58 @@ def _rbox_relationships(
def relationships_metadata(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommended implementation

n_relationships seconds_taken
500 21
5000 23
37129 27

I wasn't sure if you'd like this approach because it is heavier on memory during the function's runtime, but it seems to work well. I added a simple method ._axiom_annotations_all() which returns all annotations from the DB. Then the rest of the work is done in Python.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if n_relationships > 1m?

Copy link
Contributor Author

@joeflack4 joeflack4 Sep 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If n_relationships > 1m, then this further increases memory burden and that could be too significant for some machines. The higher the n is though, the better this implementation performs compared to the "alternative implementation", although there are other possible implementations, e.g. leaning more on sqlite as you say.

Copy link
Contributor Author

@joeflack4 joeflack4 Sep 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just in interest of keeping comments in threads / resolvable, replying to your summary comment here:

Your 2nd approach is on the right lines.

I thought you might feel that way. It's currently much less performant than this one, but you have an idea to speed that up:

I'd simplify; How about instead making a method like _axiom_annotations_multi that takes in a list of tuples, and let sqlite do the optimization?

I was thinking of that too, but it might take me significantly more time due to my lack of familiarity. I might not have that time, but I'll ask Nico; maybe I do. If I don't, I was wondering if you wouldn't mind finishing up what I've started?

I'm not sure we need two variations of ._axiom_annotations_multi(). Why don't we increase its performance? Here's my ideas so far:

Details: ._axiom_annotations_multi() optimizations

These would be modifications to semsql rather than oaklib, unless we went with (b) and tore down those mutations at the end of the function.

a. Indexing

Are columns subject, object, and predicate indexed? If not, I wonder if adding indexes might be all we need to make this performant.

b. (Temporary) lookup column

We can basically do the same thing as my first approach, but in SQL rather than Python. I am creating a lookup using RELATIONSHIPs as dictionary keys. We might do the same in sqlite, creating either/both columns that concat sub-pred-obj and sub-pred:

sub-pred-obj sub-obj sub pred obj MORE_COLS
M:123-is_a-M:456 M:123-M:456 M:123 is_a M:456 ...
M:123-is_a-M:456 M:123-M:456 M:123 is_a M:456 ...

Will have some duplicate rows due to multiple axioms per edge, but can still make for faster queries.

Alternatively, we could create a new table with only 1 row per edge, and collect all the axioms for that edge as a JSON serialization in a single column. That should be really fast.

for rel, anns in rel_anns.items():
yield rel, anns

def relationships_metadata2(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternate implementation

n_relationships seconds_taken
500 2
5000 53
37129 516

This approach splits relationships into groups based on predicates. Then it uses existing . _axiom_annotations_multi() method to fetch all annotations for each group of relationships.

# Fetch all annotations
anns: List[om.Annotation] = self._axiom_annotations_all()
# Simplify objects
anns: List[Dict[str, str]] = [{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor declutter refactor

This point needs the least of Chris' attention.

This function can be simplified if we move this conversion of "List[Dict[str, str]] (annotation objects) --> Dict[RELATIONSHIP, List[Tuple[PRED_CURIE, Any]]]" from .relationships_metadata() to the private axiom methods ._axiom_annotations_multi() and my new ._axiom_annotations_all().

I did things this way because there is an inconsistency between ._axiom_annotations() (returns List[om.Annotation]) and ._axiom_annotations_multi() (returns Iterator[OwlAxiomAnnotation]), and so I wasn't sure which pattern I should adhere to.

@codecov-commenter
Copy link

codecov-commenter commented Sep 27, 2023

Codecov Report

Attention: 29 lines in your changes are missing coverage. Please review.

Comparison is base (aab87c5) 76.63% compared to head (1bfe896) 76.56%.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #659      +/-   ##
==========================================
- Coverage   76.63%   76.56%   -0.08%     
==========================================
  Files         251      251              
  Lines       29214    29250      +36     
==========================================
+ Hits        22389    22395       +6     
- Misses       6825     6855      +30     
Files Coverage Δ
...oaklib/implementations/sqldb/sql_implementation.py 79.77% <6.45%> (-1.70%) ⬇️

... and 2 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

) -> Iterator[Tuple[RELATIONSHIP, List[Tuple[PRED_CURIE, Any]]]]:
"""For given relationships, returns axiom annotation metadata"""
# Split based on predicate
preds: Set[PRED_CURIE] = {x[1] for x in relationships}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remember the input is an Iterable, so if it's an iterator, not a collection, this will consume it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I added a line above adding rels and using that variable instead for the rest of the function.

rels: List[RELATIONSHIP] = [x for x in relationships]

anns = [(ann.predicate, ann.object) for ann in self._axiom_annotations(*rel)]
"""For given relationships, returns axiom annotation metadata"""
# Fetch all annotations
anns: List[om.Annotation] = self._axiom_annotations_all()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are making assumptions about the distributions of numbers of these. Consider a version of omop where every axiom has an annotation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By distribution, do you rather mean, "the potentially high number of annotations (returned)", as you query in this comment?:

what if n_relationships > 1m?

If that's what you mean, let's continue discussion in that thread and mark this one resolved. If you mean something else though, can you clarify?

rel_anns_all = {}
for pred in preds:
# Fetch annotations for given relationships
subs, objs = [x[0] for x in relationships], [x[2] for x in relationships]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't you want to filter this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch! This is something I didn't need for my use case so I added it after and hadn't tested it.

I added:

            # Filter rels
            rels = [x for x in relationships if x[1] == pred]

And changed references in that for loop to use rels instead of relationships

Copy link
Collaborator

@cmungall cmungall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate the time taken here

Your 2nd approach is on the right lines.

I'd simplify

How about instead making a method like _axiom_annotations_multi that takes in a list of tuples, and let sqlite do the optimization?

@joeflack4 joeflack4 force-pushed the sql-relationships-metadata-performance branch from bc3a543 to 6b0be57 Compare September 27, 2023 22:48
- Add: SqlImplementation._axiom_annotations_all()
- Update: SqlImplementation.relationships_metadata(): Increased performance to O(log n)
- Update: Added docstrings for several related methods
@joeflack4 joeflack4 force-pushed the sql-relationships-metadata-performance branch from 6b0be57 to 1bfe896 Compare September 27, 2023 22:53
@matentzn
Copy link
Contributor

Nice I didnt realise you were working on this! Cool!

@joeflack4
Copy link
Contributor Author

@matentzn we discussed it a little bit at one of our recent meetings. I thought you had seen the PR. But yeah basically just got a figure out from you. If you think it's worth my time to complete this. It will speed up any of our scripts that use this function by about an hour or two.

I'm not actually sure now if we will even need this method for the synchronization pipeline now that I understand the work better.

Chris has a suggested implementation but I think it involves using SQL alchemy a bit more. However, I think the bulk of the time on this will take a lot of back and forth with me and Chris on the implementation. I don't know if you will like the idea that I have in mind, which would be creating an on-the-fly column , basically an index.

Alternative solution to that is to create additional indexes, but that would require an update to semantic SQL, not oak. And Chris hasn't gotten back to me yet on what he thinks about that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optimization: SqlImplementation.relationships_metadata()
4 participants