Extract metadata from asset on remote server #1196
Comments
that is correct. (the iterate part could use the glob API to only get nwb assets). i was hoping to do this on the server side so we could run through in admin mode and be very careful that we are dealing with draft versions only. so a dry run option would probably be good to have. |
here is a code snippet showing how to access a remote asset on the archive with pynwb via fsspec: https://pynwb.readthedocs.io/en/stable/tutorials/advanced_io/streaming.html#streaming-method-2-fsspec . @jwodder - could you please look into the feasibility of enhancing the remote asset class for nwb files with the ability to extract metadata from them? |
@yarikoptic As far as I can tell, the biggest problem would be the fact that our metadata-computation routines require the input to be a local file in several places (in order to set |
interesting... so indeed in this case it is not just a matter of getting it from nwb via fsspec; we would also need to reuse that metadata from the asset's metadata record. |
if the checksum is the same, does a PUT request on the api server allow changing blobdatemodified/contentsize? cc:ing @AlmightyYakob . from a function perspective one could write a shim that leverages those fields from the dandi asset metadata. |
Regarding For Does that answer your question? |
@AlmightyYakob - in the server code there are some metadata fields that are not allowed to be modified by certain specific requests. so if i don't change the digest, i'm simply asking how many of the fields related to the digest are not allowed to be modified by a PUT request. an example here: https://github.com/dandi/dandi-archive/blob/master/dandiapi/api/models/asset.py#L298 . we should take the rest of this discussion offline, so this one can focus on the task at hand (which your suggestion should be fine to do). |
@jwodder - could you please work out a code snippet, it doesn't need to be formalized within RemoteAsset and could be nwb specific for now, so it just gets metadata for the asset, re-extracts metadata from its URL on s3 using fsspec, submits updated metadata (mints a new asset). |
@yarikoptic I don't think this can be done in just a "snippet." Too much of the current code assumes we have a local file path, and we'd have to either rewrite a bunch of stuff or just duplicate a bunch of code with a few changes. |
Indeed, lots of the code is intertwined with the assumption of having a file system. Some notes from looking at the code, without yet deciding which way to proceed:
|
I believe the fsspec file classes implement all or most of Python's standard IO methods, so we might be able to adjust the relevant functions that currently take file paths to instead just take open file objects. This still wouldn't address getting the file size, mtime, and MIME type, though. |
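A minimal sketch of that idea, assuming a digest routine as the example: the function takes an already-open binary file object instead of a path, so it does not care whether the object came from `open()`, fsspec, or an in-memory buffer. The name `sha256_of_stream` is hypothetical and is not taken from the dandi code base.

```python
import hashlib
import io
from typing import BinaryIO


def sha256_of_stream(fp: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Hash everything readable from an already-open binary file object."""
    digest = hashlib.sha256()
    # iter() with a sentinel keeps calling read() until it returns b"" (EOF)
    for chunk in iter(lambda: fp.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Any object with a compatible read() works; BytesIO stands in for a remote file here
stream = io.BytesIO(b"example payload")
print(sha256_of_stream(stream))
```

As noted above, this only covers the IO side; size, mtime, and MIME type still have to come from somewhere else.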
that is why I wondered if it would be better handled by our own custom adapter class(es), which would use the fsspec classes for the IO methods and appropriate other means for size and mtime. |
@yarikoptic What exactly is the next step here? Is it to devise a |
I think so -- devise the hierarchy and use it throughout the code used to extract metadata. |
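One way such a hierarchy could look, as a rough sketch only. All class and function names here (`Readable`, `LocalReadable`, `RemoteReadable`, `summarize`) are hypothetical. The remote variant assumes that size and mtime are supplied by the caller, for example from the asset's existing metadata record, and that an `opener` callable returns a file object (with fsspec this would presumably wrap the remote open call).

```python
import io
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import BinaryIO, Callable, Optional


class Readable(ABC):
    """Hypothetical base: anything metadata extraction can read from."""

    @abstractmethod
    def open(self) -> BinaryIO: ...

    @abstractmethod
    def get_size(self) -> int: ...

    @abstractmethod
    def get_mtime(self) -> Optional[datetime]: ...


@dataclass
class LocalReadable(Readable):
    path: Path

    def open(self) -> BinaryIO:
        return self.path.open("rb")

    def get_size(self) -> int:
        return self.path.stat().st_size

    def get_mtime(self) -> Optional[datetime]:
        return datetime.fromtimestamp(self.path.stat().st_mtime, tz=timezone.utc)


@dataclass
class RemoteReadable(Readable):
    opener: Callable[[], BinaryIO]  # assumption: returns an open remote file
    size: int                       # assumption: taken from asset metadata
    mtime: Optional[datetime] = None

    def open(self) -> BinaryIO:
        return self.opener()

    def get_size(self) -> int:
        return self.size

    def get_mtime(self) -> Optional[datetime]:
        return self.mtime


def summarize(r: Readable) -> dict:
    """Extraction code depends only on the Readable interface."""
    with r.open() as fp:
        head = fp.read(4)
    return {"size": r.get_size(), "mtime": r.get_mtime(), "head": head}


# BytesIO stands in for a remote file object in this sketch
remote = RemoteReadable(opener=lambda: io.BytesIO(b"abcdef"), size=6)
print(summarize(remote))
```

Keeping the remote size and mtime as plain fields avoids any extra request to the server in this sketch; a real implementation might fetch them lazily instead.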
@yarikoptic Possible API:
In addition, we give
|
Sounds good. Some notes:
|
@yarikoptic To be clear: In your first point, you're saying that |
yes. I do appreciate the possible point of confusion; I even thought about whether we could simply extend the API of our existing hierarchy by mixing in this functionality. |
@yarikoptic I'd rather have these as separate classes rather than "polluting" the As for a less confusing name for |
Sounds good |
@yarikoptic I presume that remote files should be opened using the code shown here.
|
|
@yarikoptic So, if |
what about:
|
@yarikoptic What should we do about Also, do we want |
ideally we create some local adapter to trigger fscacher wrapped function if it is a local path instance and proceed without fscacher otherwise. But I do not see ATM an easy solution there ... I am afraid we need to improve Meanwhile - just disable those decorators I guess :-/ re
so -- my concern is primarily .zarr directory. |
|
But that alone wouldn't be sufficient to work with File object, in effect making fscaching defunct on local files too, or am I missing something? |
May be if it |
@yarikoptic Alternatively, we give |
Oh cool! We can even subclass os.PathLike so it can be explicitly isinstance-checked |
@yarikoptic I believe the builtin functions that work with paths accept anything that implements |
cool, indeed talks about "protocol" there and not inheritance. So overall -- sounds great. |
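The protocol point can be checked with a small sketch: a class that defines `__fspath__` without inheriting from `os.PathLike` still passes the `isinstance` check and is accepted by the builtin path functions. The class name `PathBackedReadable` is hypothetical.

```python
import os
import tempfile


class PathBackedReadable:
    """Does not inherit from os.PathLike; only implements __fspath__."""

    def __init__(self, path: str) -> None:
        self._path = path

    def __fspath__(self) -> str:
        return self._path


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "sample.txt")
    with open(target, "w") as fp:
        fp.write("hello")

    wrapped = PathBackedReadable(target)
    # os.PathLike recognizes any class that defines __fspath__
    print(isinstance(wrapped, os.PathLike))  # True
    # Builtin path consumers call os.fspath() on their argument internally
    print(os.path.getsize(wrapped))          # 5
    with open(wrapped) as fp:
        print(fp.read())                     # hello
```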
@yarikoptic Due to some changes I made while implementing this, I'm currently getting type-checking errors involving |
sorry for the delay. https://github.com/NeurodataWithoutBorders/nwb-schema/blob/HEAD/core/nwb.file.yaml#L9 says it should be |
Support re-extracting metadata from remote assets
A satellite to dandi/dandi-archive#1450 . We need to work out a way to update the metadata of draft dandisets, at least for NWB assets, without downloading them or going through datalad-fuse. PyNWB works fine with fsspec or via the ROS3 driver to get remote access to the asset.
So, if we set aside for now the use case of an NWB file belonging to a BIDS dataset (not a DANDI-layout dandiset), where extra metadata might need to be extracted using BIDS utils, we need a helper that, given a dandiset ID and server instance (name)
Correct @satra ?
Note that we will pretty much double the number of assets in the archive (we now have 240323), and some of the old ones would become orphans (we still do not have GC AFAIK) if they do not belong to any published dandiset version (it seems only half of our 170 non-empty dandisets are published, ref: dandi/dandi-archive#1455)