
Function for converting NWB file to AssetMeta instance #226

Merged: 27 commits, Dec 2, 2020
Conversation

@jwodder (Member) commented Aug 27, 2020

This PR requires PR #206.

@satra Is this what you had in mind?

@codecov bot commented Aug 27, 2020

Codecov Report

Merging #226 (c6c5329) into master (194ce14) will decrease coverage by 0.21%.
The diff coverage is 82.85%.


@@            Coverage Diff             @@
##           master     #226      +/-   ##
==========================================
- Coverage   82.26%   82.05%   -0.22%     
==========================================
  Files          55       56       +1     
  Lines        4623     5020     +397     
==========================================
+ Hits         3803     4119     +316     
- Misses        820      901      +81     
Flag Coverage Δ
unittests 82.05% <82.85%> (-0.22%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
dandi/tests/fixtures.py 96.49% <ø> (ø)
dandi/metadata.py 74.17% <78.24%> (+25.39%) ⬆️
dandi/validate.py 81.25% <84.21%> (-4.47%) ⬇️
dandi/models.py 81.25% <100.00%> (+0.52%) ⬆️
dandi/pynwb_utils.py 87.00% <100.00%> (+0.84%) ⬆️
dandi/tests/test_metadata.py 100.00% <100.00%> (ø)
dandi/tests/test_validate.py 100.00% <100.00%> (ø)
dandi/utils.py 76.20% <0.00%> (+0.37%) ⬆️
... and 3 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 194ce14...c6c5329.

@satra (Member) commented Aug 27, 2020

@jwodder - kind of. unless we change get_metadata to already do the mapping, you will likely need a transform or function that computes the asset field given the metadata. so perhaps something like: setattr(asset, field, get_asset_field(metadata, field))

def get_asset_field(metadata, field):
    if field == "wasDerivedFrom":
        biosample = Biosample.unvalidated()
        for biosample_field in biosample.__fields__:
            setattr(biosample, biosample_field, get_asset_field(metadata, biosample_field))
        # validate the assembled instance
        return Biosample(**biosample.dict())
    if field == "age":
        age = metadata["age"]
        # convert age to ISO 8601
        # create property value
        return property_value
    ...

and then after it is done you can check validation by: AssetMeta(**asset.dict())

@yarikoptic (Member) commented

FWIW: I merged #206 into master.

@yarikoptic (Member) commented

So, overall flow I think should be

NWB -1-> metadata dict (unharmonized) -2-> AssetMeta

-1-> is pretty much our get_metadata, but we might need to extend it to get more fields from the NWB file, so that the dict is sufficient for -2->.

I think having an explicit -1-> (instead of just feeding -2-> with the nwb file) would be useful for initial harmonization if later we start also dealing with other file types (neuroimaging etc).

A somewhat "tricky" part is that we estimate some fields, e.g. Digests, during upload (not yet on the fly, as we already do in download) and enhance the metadata record with them. So we should indeed allow -2-> without full validation right away -- or better, with validation but allowing some fields (such as Digests) to remain undefined until we estimate them.

As for -2-> -- it could indeed be the function @satra suggested, but as we briefly touched upon yesterday, it might be worth setting the mappings up within our model classes right away? E.g. instead of

contentSize: str = Field(nskey="schema")
path: str = Field(None, nskey="dandi")

have something like

contentSize: str = Field(nskey="schema", adapter="size")
path: str = Field(None, nskey="dandi", adapter="path")

for basic 1-to-1 mappings, but also allow for callables like

def adapter_age(metadata):
    age = metadata["age"]
    # convert age to ISO 8601
    # create property value
    return property_value

and then have

age: Optional[PropertyValue] = Field(...allwhatithasnow, adapter=adapter_age)

Then DandiBaseModel would define a class method def from_metadata(cls, metadata, allow_missing=None) which would just iterate over cls attributes, use the specified adapters to compose values, validate (allowing for missing fields), and return the structure. While iterating, it would not fail right away if a KeyError is raised, but collect them all to raise at the end as a single informative exception telling which keys (and maybe for which attributes) are still required.
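A minimal sketch of that idea (hypothetical, not necessarily what this PR merged), assuming pydantic v1, where extra Field() kwargs such as adapter end up in field.field_info.extra:

from pydantic import BaseModel

class DandiBaseModel(BaseModel):
    @classmethod
    def from_metadata(cls, metadata, allow_missing=None):
        allow_missing = set(allow_missing or ())
        values = {}
        missing = []
        for name, field in cls.__fields__.items():
            adapter = field.field_info.extra.get("adapter")
            try:
                if callable(adapter):
                    # callable adapters compose the value, e.g. adapter_age
                    values[name] = adapter(metadata)
                elif adapter is not None:
                    # string adapters are plain 1-to-1 key mappings
                    values[name] = metadata[adapter]
            except KeyError as e:
                if name not in allow_missing:
                    missing.append((name, str(e)))
        if missing:
            # a single informative exception listing all still-required keys
            raise ValueError(f"Metadata keys still required: {missing}")
        return cls(**values)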

NB. from_metadata could have avoided dealing with validation altogether, with an additional helper provided to validate while allowing for missing attributes; but then we would not be able to raise that informative message about which metadata fields are needed, which IMHO would be quite useful, especially now while we are just establishing the mappings.

Re BioSample: while looping, whenever it discovers that an attribute is of (or is a list of) a type which is a subclass of DandiBaseModel (like BioSample is), it would just delegate to that class's from_metadata(metadata, allow_missing=None).

  • maybe "allow_missing" will need to allow hierarchical specification so it can be passed into such nested calls, but I do not think that is needed ATM -- Digests is the only case I know of

@satra @jwodder - does the above make sense? ;)

@satra (Member) commented Aug 28, 2020

makes sense to me. the only reason i would suggest the standalone function for now instead of an adapter is that we may want to push nwb to align as closely as possible in the future.

and if ever we move to attrs from pydantic, it already has a notion of converter. the only reason we are using pydantic is to stay close to jsonschema and fastapi for now.

@jwodder (Member, Author) commented Aug 31, 2020

I have enhanced the code according to the above suggestions. However, I would appreciate it if someone could provide a comprehensive example of NWB metadata (as extracted with get_metadata()) and the AssetMeta that it should be converted to so that I know that I'm handling & converting everything correctly.

@satra (Member) commented Aug 31, 2020

@jwodder - a starting point would be to extract the nwb metadata for a few files from each of the datasets we have access to rather than something completely comprehensive. to save time, i think @yarikoptic was going to do this on his backup system.

@jwodder (Member, Author) commented Aug 31, 2020

@satra I'm interested in the AssetMeta, not the base metadata. For example, given a dict with all of the following fields:

metadata_nwb_file_fields = (
    "experiment_description",
    "experimenter",
    "identifier",  # note: required arg2 of NWBFile
    "institution",
    "keywords",
    "lab",
    "related_publications",
    "session_description",  # note: required arg1 of NWBFile
    "session_id",
    "session_start_time",
)
metadata_nwb_subject_fields = (
    "age",
    "date_of_birth",
    "genotype",
    "sex",
    "species",
    "subject_id",
)
metadata_nwb_dandi_fields = ("cell_id", "slice_id", "tissue_sample_id", "probe_ids")
metadata_nwb_computed_fields = (
    "number_of_electrodes",
    "number_of_units",
    "nwb_version",
    "nd_types",
)
metadata_nwb_fields = (
    metadata_nwb_file_fields
    + metadata_nwb_subject_fields
    + metadata_nwb_dandi_fields
    + metadata_nwb_computed_fields
)

... where should everything end up in the resulting AssetMeta? Which fields need special parsing? Etc.

@satra (Member) commented Aug 31, 2020

@jwodder - you kind of have to go field by field and ask whether that information is somewhere in the base metadata. and remember that some fields are themselves structures, like biosample.

easiest place to start is likely metadata_nwb_subject_fields, most of which go into biosample.

we may also need to augment the model to allow for other things like number of electrodes, etc.; there is no specific place for this in the model.

other aspects such as experimenter could be mapped to contributor with role Researcher.

I think most of the fields have some description, but feel free to ask if something is unclear.

@jwodder (Member, Author) commented Aug 31, 2020

@satra Specific questions, then:

  • How are TypeModel classes like SexType supposed to be used? What should be identifier, and what should be name?
  • What sort of input values are expected for assayType and anatomy, and how do they map to lists of values?
  • How should a plain age like "5 years" be converted to a PropertyValue? Would it just be PropertyValue(value="P5Y")?
  • What should be done for the various required/defaultless fields like BioSample.assayType and AssetMeta.dataType that have no corresponding field in get_metadata()?

@satra (Member) commented Aug 31, 2020

How are TypeModel classes like SexType supposed to be used? What should be identifier, and what should be name?

identifier should be the corresponding url from an ontology, name would be the label. more generally we can translate a few of those to enum types as we have done here:

https://github.com/dandi/dandi-cli/blob/master/dandi/models.py#L87
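For illustration, a hypothetical instance following that identifier/name convention (the import path and the PATO term URL and label are my assumptions, not taken from this thread):

from dandi.models import SexType  # assumed import path

male = SexType(
    identifier="http://purl.obolibrary.org/obo/PATO_0000384",  # PATO term for "male" (assumed)
    name="Male",
)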

What sort of input values are expected for assayType and anatomy, and how do they map to lists of values?

assayType should come from OBI (which is a large ontology for biomedical investigation). the question here is how to map nwb "datatypes" to this.

How should a plain age like "5 years" be converted to a PropertyValue? Would it just be PropertyValue(value="P5Y")?

correct, with unitText set to "Years from birth" for now, or "Weeks from conception" for gestational weeks.
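In code, that suggestion would look roughly like this (assuming PropertyValue is importable from dandi.models):

from dandi.models import PropertyValue  # assumed import path

# "5 years" -> ISO 8601 duration, with the suggested unitText hint
age = PropertyValue(value="P5Y", unitText="Years from birth")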

What should be done for the various required/defaultless fields like BioSample.assayType and AssetMeta.dataType that have no corresponding field in get_metadata()?

let's ask a few folks for hints.

@tgbugs - we are back to "where do we find enumerations for all kinds of things", and @bendichter - we are looking at how we get people to put the right info into nwb, or get it out of existing nwb (which does not have any ontology attached yet). the specific considerations here are:

the types are all listed here:

https://github.com/dandi/dandi-cli/blob/master/dandi/models.py#L87

and here:

https://github.com/dandi/dandi-cli/blob/master/dandi/models.py#L159

@bendichter (Member) commented

@satra

I think modality and measurementTechnique would mostly come from which neurodata_types are being used in the NWB file. For instance,

If ElectricalSeries is used, it's "electrophysiology." If the file contains an IntracellularElectrode, I guess that's also electrophysiology? Does the ontology differentiate between intracellular and extracellular electrophysiology? NWB does.

If Position is used, it's a "behavioral approach."

If TwoPhotonSeries is used, it's a "cell population imaging."

Many experiments are combinations of these. Does this schema allow multiple modalities/measurementTechniques?
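A hypothetical sketch of that heuristic, keyed on the neurodata_types present in the file (the mapping and the names below are illustrative, not the merged implementation):

# Illustrative neurodata_type -> modality mapping built from the examples above
ND_TYPE_TO_MODALITY = {
    "ElectricalSeries": "electrophysiology",
    "IntracellularElectrode": "electrophysiology",
    "Position": "behavioral approach",
    "TwoPhotonSeries": "cell population imaging",
}

def guess_modalities(nd_types):
    # keep all matches, since experiments can combine modalities
    return sorted({ND_TYPE_TO_MODALITY[t] for t in nd_types if t in ND_TYPE_TO_MODALITY})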

@bendichter (Member) commented

Would you be able to provide an example of assayType or dataType? I'm not sure if NWB stores this information.

@yarikoptic (Member) commented

Does this schema allow multiple modalities/measurementTechniques?

yes, https://github.com/dandi/dandi-cli/blob/master/dandi/models.py#L574 -- all those are List:

    modality: List[ModalityType] = Field(readonly=True, nskey="dandi")
    measurementTechnique: List[MeasurementTechniqueType] = Field(
        readonly=True, nskey="schema"
    )
    variableMeasured: List[PropertyValue] = Field(readonly=True, nskey="schema")

I think we have decided to postpone dealing with assets which have no useful data recorded (hence these are not Optional -- at least a single entry is required in each list), and I guess that might help push us to provide heuristics for everything we care about ATM ;) (since we cannot just leave a list empty if we find no relevant entry)

@tgbugs commented Aug 31, 2020

All the techniques and modalities are in InterLex now. @tmsincomb have you had a chance to ingest the subClassOf for the modalities (the modelling of the techniques isn't amenable to ingesting those right at the moment)?

@tmsincomb commented

@tgbugs
I've started the superclass update for modality. I was going to fix the update first, but I'll do that now to avoid issues.

@jwodder (Member, Author) commented Aug 31, 2020

@satra I've added a test case for conversion of a metadata dict to an AssetMeta to dandi/tests/test_metadata.py. Please take a look at it and tell me if it does what you want.

@tmsincomb commented

@tgbugs The modalities superclasses are in now.

@tgbugs commented Aug 31, 2020

@tmsincomb thanks! The subclassOf closure for modalities (experimental approaches) is visible here.

@tgbugs commented Aug 31, 2020

@bendichter at the technique level the ontology distinguishes between intracellular and extracellular.

@yarikoptic (Member) commented

FWIW: I have submitted information to get an RRID for dandi-cli. I will report back whenever it gets registered.

@yarikoptic (Member) commented

I think it might be a good time to also introduce here an RF of validate.py (and possibly cmd_validate.py) to use this new schema/function. I guess a good start is validate_dandi_nwb. Then the new functionality could easily be tested on a sample of real files from dandiarchive dandisets.

* upstream/master: (116 commits)
  ENH: basic duecredit support
  DOC: strip away duplicate with the handbook information
  Update CHANGELOG.md [skip ci]
  Tweak pubschemata commit message
  Adjust workflow path triggers
  Add comment to numpy API warning ignore setting
  Workflow for publishing model schemata to dandi/schema
  [#275] Ignore numpy API warning
  Support h5py 3.0
  Add healthchecks for the Postgres and minio Docker containers
  Include item path in "Multiple files found for item" message
  Copy files with `cp --reflink=auto` where supported
  Test the rest of keyring_lookup()
  More keyring_lookup() tests
  Test askyesno()
  Test that keyring backends specified with env vars take precedence over config files
  Test keyring_lookup() when backend is set via config file
  Test asking for an API key twice via input()
  Test keyring_lookup() when backend is set via env var
  Basic test of getting an API key via input()
  ...
@yarikoptic mentioned this pull request Dec 1, 2020
satra and others added 3 commits December 1, 2020 10:27
* upstream/master:
  Update CHANGELOG.md [skip ci]
  change from disease to disorder
  Fix publish-schemata workflow
  updated just models
  BF: add h5py.__version__ into the list of tokens for caching
[resolved review thread on dandi/metadata.py, outdated]
@satra (Member) commented Dec 1, 2020

@jwodder @yarikoptic - do you think we should add the dandiset metadata mapper from old version to new in this PR or a different PR?

@yarikoptic (Member) commented

This PR already tunes validate_dandi_nwb, so if released we would start enforcing the schema on users trying to validate/upload. I am not sure we are ready for that yet... we could add temporary treatment of a DANDI_SCHEMA env variable: if it is defined, use the new logic; if not, the old. That would allow us to merge and even let curious folks use it, and eventually we would just remove that treatment when "we are ready". WDYT?
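A sketch of the proposed gating (the helper names are hypothetical; per the later comments in this thread, the change that landed does check os.environ.get("DANDI_SCHEMA")):

import os

def _validate_with_new_schema(filepath):
    ...  # hypothetical: new schema/model-based validation

def _validate_legacy(filepath):
    ...  # hypothetical: the pre-PR validation logic

def validate_dandi_nwb(filepath):
    # use the new schema-based validation only when DANDI_SCHEMA is set,
    # otherwise keep the old behavior
    if os.environ.get("DANDI_SCHEMA"):
        return _validate_with_new_schema(filepath)
    return _validate_legacy(filepath)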

@yarikoptic (Member) left a review comment

Let's right away make the mapping a bit more "robust" by at least catching ambiguous items.

An alternative would be:

  • establish some schema_mappings.yaml which would be hosted within this repo and contain exact values we encounter in data to be mapped to the corresponding term. We will update it whenever we encounter new loose values
  • make dandi-cli fetch updated version once in a while (or forcefully if requested) and cache it locally on user's drive to be used instead of the copy shipped within. This way we could let people update mapping without updating dandi-cli (although since its releases are quite automated now, we could even not bother about this feature for now)

edit: let's forget about the .yaml file idea; it probably would not be scalable etc., so IMHO we should keep it a .py, but maybe RF later on so we could indeed fetch a fresher "mapping" file to be used instead of the shipped one.
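As a sketch of what "catching ambiguous items" in a shipped .py mapping could look like (the values and names here are illustrative, not from the actual PR):

# Map known loose values to canonical terms; raise on anything unrecognized
# so new values surface for addition to the mapping instead of being guessed.
KNOWN_SPECIES = {
    "mouse": "Mus musculus",
    "house mouse": "Mus musculus",
    "mus musculus": "Mus musculus",
}

def map_species(value):
    try:
        return KNOWN_SPECIES[value.strip().lower()]
    except KeyError:
        raise ValueError(f"Unknown/ambiguous species value: {value!r}")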

[two resolved review threads on dandi/metadata.py, outdated]
satra and others added 3 commits December 1, 2020 21:19
Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
@satra (Member) commented Dec 2, 2020

i have added the converter for dandiset metadata as well now.

@yarikoptic (Member) commented

@jwodder please

@jwodder (Member, Author) commented Dec 2, 2020

@yarikoptic I've satisfied the linters and added an if os.environ.get("DANDI_SCHEMA"): block to validate_dandi_nwb() to control whether to use the behavior from before or after this PR.

@yarikoptic added the "enhancement" (New feature or request) and "minor" (Increment the minor version when merged) labels on Dec 2, 2020
@yarikoptic (Member) commented

Great, thank you @jwodder -- let's proceed
