Synchronization: SubClassOf #363

joeflack4 · 2023-09-10T23:57:38Z

Updates

Synchronization: SubClassOf
Pipeline to keep subclass relationships in sync between Mondo and sources.
Currently handled: Case 1, 3, 5

Add: makefile goals: sync, sync-subclassof, and more.
Update: makefile goal: build-mondo-ingest
Add: src/scripts/sync_subclassof.py
Add: Outputs to reports/

General

Bugfix: utils.py remove_angle_brackets(): Now correctly returns a str if receives a str.
Update: config/prefixes.csv: Add: New entries
Add: utils.py: get_monarch_curies_converter()
Update: Python requirements: oaklib upgrade
Update: Minor refactor in _get_all_owned_terms()
Update: .gitignore: entries, ordering, and comments

Related to other PRs

Update: metadata/icd10who.yml: The 2 prefix maps in this file were inconsistent with each other. Corrected here and also in ICD10WHO prefix updates #377

src/ontology/mondo-ingest.Makefile

src/scripts/sync_subclassof.py

src/scripts/utils.py

src/ontology/config/prefixes.csv

src/scripts/sync_subclassof.py

src/ontology/config/prefixes.csv

src/scripts/sync_subclassof.py

matentzn

Some general things we should clean up in a future refactoring (cc @twhetzel maybe we should start developing a style guide for these):

Docstrings and Comments: The script includes thorough comments and docstrings, which is good for understanding the purpose and functionality of each function. However, some comments are excessively verbose and could be more concise.
TODO Comments: There are several 'todo' comments scattered throughout the script. Leaving numerous todos can indicate unfinished or potentially unstable code.
Use of Global Variables: There's extensive use of global variables (e.g., EX_DEFAULTS, ONTOLOGY_DIR, etc.). It's generally better to encapsulate global variables or move configuration settings to a separate config file or environment variables, especially if they are subject to change.
Function Length: Some functions are quite long and perform multiple tasks (e.g., sync_subclassof). Consider breaking down these functions into smaller, more manageable pieces. This improves readability and testability.
Hardcoded Values: There are hardcoded values and paths (like EX_ONTO_NAME = 'ordo'). I don't really know what their role is in the finished script - IMO they should not exist/ or exist as default values in the argparse.
Unused Code: There are commented-out sections and potentially unused variables (e.g., # rels_indirect_source_source = ...). It's important to remove unused code to maintain clarity.

src/ontology/reports/out_of_sync_mondo_ids.tsv

tests/input/merged.owl

src/scripts/sync_subclassof.py

matentzn · 2023-12-21T11:47:09Z

src/scripts/sync_subclassof.py

+    sync_subclassof(**d)
+
+
+def run_defaults(use_cache=True):  # todo: #remove-temp-defaults


what is the point of this method? I don't see it used anywhere and I would rather we don't add any code that is not needed..

Default settings @matentzn

Review and decide.

Details

Basically there are some toggleable default settings that can be run when running the Python script directly rather than in the makefile.

This is very helpful during development. It helps me quickly run and debug things. And there are some things I can do with this defaults setup that I can't do using a debug config with saved params.

We've discussed this type of thing in the past and you've indicated that you prefer to leave out such defaults code from the final product. That's what this todo is. It's an indicator that this stuff should be removed when development is done.

Of course, this feature is too big for a single PR, and there are perhaps it will undergo development for many months, so I'd like to leave the defaults there so that I can toggle their usage when I develop further.

Note that these are off by default. So they will have no effect unless I'm doing local development.

Options:
a. Delete them now; Will slow Joe down when he continues to develop in the future, but he can temporarily add them back in another PR.
b. Leave them for now. Delete at some point in the future when other cases are completed.
c. Simply leave them indefinitely into the future. They are no harm in production, and useful for further development and debugging.

Thank you for the detailed analysis. This is TMD for me, I will trust you to make the right call!

src/scripts/sync_subclassof.py

joeflack4 · 2023-12-29T01:22:53Z

src/ontology/mondo-ingest.Makefile

@@ -498,6 +499,41 @@ slurp-all-no-updates: $(foreach n,$(ALL_COMPONENT_IDS), slurp-no-updates-$(n))
 .PHONY: slurp-all
 slurp-all: $(foreach n,$(ALL_COMPONENT_IDS), slurp-$(n))

+
+#############################
+###### Synchronization ######


General review comment discussion

@matentzn Just responding to your general / review closing comment as a thread instead so that it can be marked resolved

Context (Nico's review comment):

Details

Some general things we should clean up in a future refactoring (cc @twhetzel maybe we should start developing a style guide for these):

Docstrings and Comments: The script includes thorough comments and docstrings, which is good for understanding the purpose and functionality of each function. However, some comments are excessively verbose and could be more concise.

TODO Comments: There are several 'todo' comments scattered throughout the script. Leaving numerous todos can indicate unfinished or potentially unstable code.

Use of Global Variables: There's extensive use of global variables (e.g., EX_DEFAULTS, ONTOLOGY_DIR, etc.). It's generally better to encapsulate global variables or move configuration settings to a separate config file or environment variables, especially if they are subject to change.

Function Length: Some functions are quite long and perform multiple tasks (e.g., sync_subclassof). Consider breaking down these functions into smaller, more manageable pieces. This improves readability and testability.

Hardcoded Values: There are hardcoded values and paths (like EX_ONTO_NAME = 'ordo'). I don't really know what their role is in the finished script - IMO they should not exist/ or exist as default values in the argparse.

Unused Code: There are commented-out sections and potentially unused variables (e.g., # rels_indirect_source_source = ...). It's important to remove unused code to maintain clarity.

Joe's responses:

Docstrings and Comments

The only long one is in _convert_edge_namespace():

failed_lookups: Each entry in this set can be 1 of 2 cases: (a) the edge exists in one ontology, represented by the namespace of IDs in rels & the keys in ns_map, but it does not exist in the mapped otology, represented by the namespace of IDs in rels2 & the values in ns_map; it may simply be that there are some terms in the 1st ontology that do not exist in the 2nd ontology, or (b) the edges should be in both ontologies, but if the inputs to this pipeline are out of sync, the mapped terms did not appear in ns_map.

Probably could be a bit more concise. But generally, I am fond of detailed docstrings. If you look at popular libraries, they often have very long docstrings. I don't mind them because they can be collapsed more easily, and there is really nowhere better to put them. You could put in external docs, but that's a hassle because it's not linked to the code, and it also just pollutes the docs with stuff that's not going to be relevant to most people. I think erring on the side of verbose is good. Generally people only read these on the rare occasion where they're digging deeply for something, and when it comes to digging deeply, you want to have access to as much information as possible.

todo comments

I have a convention where I distinguish between TODO and todo. I use TODO for something that is important; e.g. it is necessary for completing the development objective, or it is related to a bug. I use todo for things like optimizations, refactoring, additional testing, and ideas.

Everything that's left here is a todo, so not a major worry.

Also I should note that I differ from other developers in that I am in favor of leaving (lots of) todos even in production. For these low priority things, there are basically 3 choices:
a. Don't leave anything written about them
b. Put in a GitHub issue
c. Leave in the code
d. Leave written elsewhere

For 'b'. Basically, I don't think everything is important enough to go into a GitHub issue, though perhaps these kind of things could go into 1+ issues marked such as "low priority misc".

I think 'd' is a bad idea.

I think 'a' is not a good idea. Similar to my advocation of long docstrings, I think it is better when more is said than not.

So I advocate for 'c'.

Global variables

Not confident on the best solutions but I offer my thoughts here.
Mostly all of the global variables are used only within the global dictionary EX_DEFAULTS. The design decision for this being a global variable, even though it's really only used in cli(), is because it is essentially a kind of configuration, configs are typically placed at the top of a file by convention for easy access.

~~I would advocate against moving to a separate file in this case because this config is specific to the sync subclass script and no other files.~~ (edit: Changed; see update below)

I suppose it is actually not great that you can't tell just looking at the code whether or not the globals are re-used in a lot of places. I am not a fan of global variables, myself.

A potential solution is to create a config() or defaults() function and place everything in there. I don't like it aesthetically, but perhaps it is better.

I'd advocate against environmental variables because this makes collaboration more difficult. Even for me, I program on multiple machines sometime, and keeping these files in sync would add additional hassle for me.

Also there can probably be a little cleanup here. Some of these vars could be removed.

Update: I created a config file and moved these variables there. Reason I did that was because I split the CLI into 2 files, and so this change became necessary. Further DRYification is also possible, actually. Many of these path variables already exist in utils.py and are being imported by other scripts. I should have utilized that before. But I didn't refactor further at this time because (i) really those variables should be in a config.py anyway, (ii) those variables are str and not Path objects, so it would require more refactoring, decision making, and possibly testing to make that change.

Function length

Yeah I think sync_subclassof() is really the only long one. Actually I was thinking it needed to be broken down too. It's actually only about 200 lines, and about 50% of that is comments, so it is not actually that long. But it is pretty complex.

I've been thinking about making it smaller, and this was the idea I had. Maybe you agree?: Each of the "cases" (about 20 lines each) gets its own function.

Beyond that, I think it would be difficult to make this much smaller, though I do have a todo related to doing an OO refactor that would simplify things a bit / reduce lines here, but it would not be a quick change.

Hardcoded Values

I put EX_ to stand for "example". The concerns you bring up here, I have addressed in this comment.

Unused Code

I generally agree with this, and I have a todo about this at the top of the block of unused code. Much of this code would become relevant should we do some additional cases. But I can stash this code in my notes somewhere.

Thank you for your thoughtful comments! I wanted you to be aware of my thoughts, and trust you with the execution! I am quite liberal, maybe to the border of anarchical, when it comes to code, and @twhetzel may be implementing stricter code standards moving forward. This was the last PR where I set the tone of the review, so I am happy if you are.

joeflack4 · 2024-01-06T01:16:33Z

src/scripts/sync_subclassof.py

+
+    # Case 3: SCR is direct in the source, but not at all in Mondo
+    subheader = deepcopy(ROBOT_SUBHEADER)
+    subheader[0]['object_mondo_id'] = 'AI MONDO:excluded_subClassOf'


Dual bugfix on case 3 'added' template

Addresses:

object_mondo_id is currently SC % in non-obsolete template

AI missing from MONDO:excluded_subClassOf

~~I did the bug fixes but I did not have the time to re-run the outputs with my limited time this weekend and with the urgent work on fixing the pipeline.~~

I reran and updated the outputs.

Suggested change

subheader[0]['object_mondo_id'] = 'AI MONDO:excluded_subClassOf'

subheader[0]['object_mondo_id'] = 'AI obo:mondo#excluded_subClassOf'

This was not the exactly correct IRI, but this one is (I also tested it).

Oh, very good! I just changed it now via amended commit.

By "tested it", I take it that this means that a CURIE is fine here; doesn't need to be a URI.

Do you want to do the honors for merging?

Trish and I decided today it was time to merge this!

Pipeline to keep subclass relationships in sync between Mondo and sources. Currently handled: Case 1, 3, 5 - Add: makefile goals: sync, sync-subclassof, and more. - Update: makefile goal: build-mondo-ingest - Add: src/scripts/sync_subclassof.py - Add: Outputs to reports/ General - Bugfix: utils.py remove_angle_brackets(): Now correctly returns a str if receives a str. - Update: config/prefixes.csv: Add: New entries - Add: utils.py: get_monarch_curies_converter() - Update: Python requirements: oaklib upgrade - Update: Minor refactor in _get_all_owned_terms() - Update: .gitignore: entries, ordering, and comments Related to other PRs - Update: metadata/icd10who.yml: The 2 prefix maps in this file were inconsistent with each other. Corrected here and also in #377

joeflack4 marked this pull request as draft September 10, 2023 23:58

joeflack4 self-assigned this Sep 10, 2023

joeflack4 linked an issue Sep 10, 2023 that may be closed by this pull request

Synchronization: subClassOf axioms (MVP) #92

Closed

13 tasks

joeflack4 added the enhancement New feature or request label Sep 10, 2023

joeflack4 commented Sep 10, 2023

View reviewed changes

src/ontology/mondo-ingest.Makefile Show resolved Hide resolved

joeflack4 force-pushed the sync-subclass branch from 5b9873e to bf07365 Compare September 11, 2023 00:47

joeflack4 commented Sep 11, 2023

View reviewed changes

src/ontology/mondo-ingest.Makefile Show resolved Hide resolved

joeflack4 force-pushed the sync-subclass branch from bf07365 to 771c2dc Compare September 11, 2023 00:58

joeflack4 commented Sep 11, 2023

View reviewed changes

src/scripts/sync_subclassof.py Outdated Show resolved Hide resolved

joeflack4 force-pushed the sync-subclass branch 4 times, most recently from 32979c1 to 763e959 Compare September 16, 2023 00:23

joeflack4 commented Sep 16, 2023

View reviewed changes

src/scripts/sync_subclassof.py Outdated Show resolved Hide resolved

joeflack4 force-pushed the sync-subclass branch 3 times, most recently from c36db42 to b156653 Compare September 16, 2023 22:17

joeflack4 commented Sep 16, 2023

View reviewed changes

src/scripts/utils.py Show resolved Hide resolved

joeflack4 force-pushed the sync-subclass branch from b156653 to f92a5c2 Compare September 16, 2023 23:06

joeflack4 commented Sep 16, 2023

View reviewed changes

src/ontology/config/prefixes.csv Outdated Show resolved Hide resolved

joeflack4 force-pushed the sync-subclass branch from f92a5c2 to 64bdfe6 Compare September 16, 2023 23:19

joeflack4 commented Sep 16, 2023

View reviewed changes

src/scripts/sync_subclassof.py Outdated Show resolved Hide resolved

src/ontology/config/prefixes.csv Outdated Show resolved Hide resolved

joeflack4 commented Sep 16, 2023

View reviewed changes

src/scripts/sync_subclassof.py Outdated Show resolved Hide resolved

joeflack4 commented Sep 17, 2023

View reviewed changes

src/scripts/sync_subclassof.py Outdated Show resolved Hide resolved

joeflack4 commented Sep 17, 2023

View reviewed changes

src/scripts/sync_subclassof.py Outdated Show resolved Hide resolved

joeflack4 force-pushed the sync-subclass branch 2 times, most recently from bef16dd to 4336a2f Compare September 17, 2023 01:58

joeflack4 commented Sep 17, 2023

View reviewed changes

src/scripts/sync_subclassof.py Outdated Show resolved Hide resolved

joeflack4 force-pushed the sync-subclass branch from 4336a2f to 927936c Compare September 17, 2023 02:27

This comment was marked as outdated.

Sign in to view

joeflack4 force-pushed the sync-subclass branch 8 times, most recently from 893c817 to 0d206cd Compare December 19, 2023 03:00

joeflack4 commented Dec 19, 2023

View reviewed changes

src/scripts/sync_subclassof.py Outdated Show resolved Hide resolved

src/scripts/sync_subclassof.py Show resolved Hide resolved

joeflack4 commented Dec 19, 2023

View reviewed changes

src/scripts/sync_subclassof.py Show resolved Hide resolved

src/scripts/sync_subclassof.py Show resolved Hide resolved

joeflack4 force-pushed the sync-subclass branch from 0d206cd to 3518287 Compare December 19, 2023 03:04

matentzn reviewed Dec 21, 2023

View reviewed changes

joeflack4 force-pushed the sync-subclass branch 4 times, most recently from 0015e59 to 348f5c5 Compare December 29, 2023 01:00

joeflack4 commented Dec 29, 2023

View reviewed changes

joeflack4 force-pushed the sync-subclass branch 4 times, most recently from 8f56a9a to 01b3033 Compare January 5, 2024 17:13

joeflack4 force-pushed the sync-subclass branch 2 times, most recently from 3a90496 to ba99144 Compare January 6, 2024 01:09

joeflack4 commented Jan 6, 2024

View reviewed changes

joeflack4 force-pushed the sync-subclass branch from ba99144 to 8e9ca4b Compare January 6, 2024 03:07

joeflack4 force-pushed the sync-subclass branch from 8e9ca4b to 1169129 Compare January 6, 2024 23:03

joeflack4 mentioned this pull request Jan 11, 2024

Synchronization: subClassOf axioms (misc follow-ups) #392

Open

3 tasks

joeflack4 merged commit edbf75a into main Jan 12, 2024

joeflack4 deleted the sync-subclass branch January 12, 2024 00:10

This pull request was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Synchronization: SubClassOf #363

Synchronization: SubClassOf #363

joeflack4 commented Sep 10, 2023 •

edited

Loading

This comment was marked as outdated.

matentzn left a comment

matentzn Dec 21, 2023

joeflack4 Dec 29, 2023

matentzn Jan 2, 2024

joeflack4 Dec 29, 2023 •

edited

Loading

matentzn Jan 2, 2024

joeflack4 Jan 6, 2024 •

edited

Loading

matentzn Jan 6, 2024

joeflack4 Jan 6, 2024 •

edited

Loading

joeflack4 Jan 12, 2024

		sync_subclassof(**d)


		def run_defaults(use_cache=True): # todo: #remove-temp-defaults

	subheader[0]['object_mondo_id'] = 'AI MONDO:excluded_subClassOf'
	subheader[0]['object_mondo_id'] = 'AI obo:mondo#excluded_subClassOf'

Synchronization: SubClassOf #363

Synchronization: SubClassOf #363

Conversation

joeflack4 commented Sep 10, 2023 • edited Loading

Updates

This comment was marked as outdated.

matentzn left a comment

Choose a reason for hiding this comment

matentzn Dec 21, 2023

Choose a reason for hiding this comment

joeflack4 Dec 29, 2023

Choose a reason for hiding this comment

Default settings @matentzn

matentzn Jan 2, 2024

Choose a reason for hiding this comment

joeflack4 Dec 29, 2023 • edited Loading

Choose a reason for hiding this comment

General review comment discussion

matentzn Jan 2, 2024

Choose a reason for hiding this comment

joeflack4 Jan 6, 2024 • edited Loading

Choose a reason for hiding this comment

Dual bugfix on case 3 'added' template

matentzn Jan 6, 2024

Choose a reason for hiding this comment

joeflack4 Jan 6, 2024 • edited Loading

Choose a reason for hiding this comment

joeflack4 Jan 12, 2024

Choose a reason for hiding this comment

joeflack4 commented Sep 10, 2023 •

edited

Loading

joeflack4 Dec 29, 2023 •

edited

Loading

joeflack4 Jan 6, 2024 •

edited

Loading

joeflack4 Jan 6, 2024 •

edited

Loading