
Extract metadata from asset on remote server #1196

Closed
yarikoptic opened this issue Jan 30, 2023 · 35 comments · Fixed by #1214

@yarikoptic
Member

A satellite to dandi/dandi-archive#1450. We need to work out a way to update the metadata of draft dandisets, at least for NWB assets, without downloading them or going through datalad-fuse. PyNWB works fine with fsspec or via the ROS3 driver for remote access to the asset.

So if we set aside for now the use case of an NWB file belonging to a BIDS (not DANDI-layout) dandiset, where extra metadata might need to be extracted using BIDS utils, we need a helper that, given a dandiset ID and server instance (name), would:

  • verify that the dandiset is not a BIDS dandiset (i.e. that there is no dataset_description.json); if it is BIDS, skip
  • iterate through the assets, and if an asset is a .nwb file:
    • extract metadata from the remote copy of the file by accessing it via fsspec, not a local file
    • (optional, probably not really needed) check whether the resulting metadata record differs in anything besides the version of the dandi-cli which extracted it
    • mint a new asset for the prior blob ID / path but with the new metadata

Correct, @satra?

Note that we will pretty much double the number of assets in the archive (now we have 240323); some of the old ones would become orphans (we still do not have GC, AFAIK) if not belonging to any published dandiset version (it seems only about half of the 170 non-empty dandisets are published; ref: dandi/dandi-archive#1455)
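The selection logic described above could be sketched as a small planning helper (a minimal sketch; plan_updates and has_bids_marker are hypothetical names, and the actual iteration and asset minting would go through the DANDI API client):

```python
def plan_updates(asset_paths, has_bids_marker):
    """Select assets whose metadata we would re-extract, per the steps above.

    asset_paths: paths of all assets in the draft dandiset
    has_bids_marker: True if the dandiset contains dataset_description.json,
    i.e. it is a BIDS dandiset and we skip it entirely.
    """
    if has_bids_marker:
        return []
    # Only .nwb assets get their metadata re-extracted via fsspec;
    # for each of these a new asset would be minted with the same blob.
    return [p for p in asset_paths if p.lower().endswith(".nwb")]
```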

@satra
Member

satra commented Jan 30, 2023

that is correct. (the iterate part could use the glob API to only get nwb assets).

i was hoping to do this on the server side so we could run through in admin mode and be very careful that we are dealing with draft versions only. so a dry-run option would probably be good to have.

@yarikoptic
Member Author

here is a code snippet showing how to access a remote asset on the archive with pynwb via fsspec: https://pynwb.readthedocs.io/en/stable/tutorials/advanced_io/streaming.html#streaming-method-2-fsspec . @jwodder - could you please look into the feasibility of enhancing the remote asset classes for NWB files with the ability to extract metadata from them?
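For reference, the linked tutorial's pattern boils down to something like the following (a sketch assuming pynwb, h5py, and fsspec are installed, and that s3_url points at an .nwb blob; asset_filename is just a small offline helper added here for illustration):

```python
import posixpath
from urllib.parse import urlparse


def asset_filename(url: str) -> str:
    """Base filename parsed from a content URL (pure helper, works offline)."""
    return posixpath.basename(urlparse(url).path)


def read_remote_session_description(s3_url: str) -> str:
    # Deferred imports: fsspec/h5py/pynwb are only needed when actually streaming.
    import fsspec
    import h5py
    import pynwb

    # Open the remote file over HTTP and hand the file-like object to h5py/pynwb.
    with fsspec.open(s3_url, "rb") as f:
        with h5py.File(f, "r") as h5f:
            with pynwb.NWBHDF5IO(file=h5f, load_namespaces=True) as io:
                return io.read().session_description
```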

@jwodder
Member

jwodder commented Feb 1, 2023

@yarikoptic As far as I can tell, the biggest problem would be the fact that our metadata-computation routines require the input to be a local file in several places (in order to set blobDateModified, contentSize, and the digest).

@yarikoptic
Member Author

interesting... so indeed in this case it is not just a matter of getting it from the NWB via fsspec - we would also need to reuse that metadata from the asset's metadata record.

@satra
Member

satra commented Feb 1, 2023

if the checksum is the same, does a PUT request on the api server allow changing blobDateModified/contentSize? cc:ing @AlmightyYakob .

from a function perspective one could write a shim that leverages those fields from the dandi asset metadata.

@jjnesbitt
Member

> if the checksum is the same, does a PUT request on the api server allow changing blobdatemodified/contentsize? cc:ing @AlmightyYakob .
>
> from a function perspective one could write a shim that leverages those fields from the dandi asset metadata.

Regarding dateModified, if you're talking about the field within an asset's metadata, then you could make a simple PUT request, with the body containing exactly what was returned from a GET, with the dateModified field updated. If you're talking about the modified field on the asset model itself, that same approach would also work. However, you must update something, otherwise the asset won't change (i.e. if a PUT request on an asset is made with exactly the same data as exists already, nothing will be done). One thing to note is that in any case, the asset_id will change, since we never update assets, we just create new ones.

For contentSize, it's different, since that's derived from the blob/zarr itself. So if you wanted to update that, you'd actually need to upload a new file with content of that size, and update the asset to point to that new asset blob.

Does that answer your question?
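A minimal illustration of the "you must update something" point above: prepare the PUT body from the GET response by bumping dateModified (a sketch only; the surrounding GET/PUT requests against the archive API are elided):

```python
from datetime import datetime, timezone


def bump_date_modified(metadata):
    """Return a copy of the asset metadata with dateModified refreshed,
    so that a subsequent PUT is not a no-op (a PUT with an identical body
    changes nothing; any accepted change mints a new asset_id)."""
    updated = dict(metadata)
    updated["dateModified"] = datetime.now(timezone.utc).isoformat()
    return updated
```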

@satra
Member

satra commented Feb 1, 2023

@AlmightyYakob - in the server code there are some metadata fields that are not allowed to be modified by certain specific requests. so if i don't change the digest, i'm simply asking how many of the fields related to the digest are not allowed to be modified by a PUT request. an example here: https://github.com/dandi/dandi-archive/blob/master/dandiapi/api/models/asset.py#L298 . we should take the rest of this discussion offline, so this one can focus on the task at hand (which your suggestion should be fine to do).

@yarikoptic
Member Author

@jwodder - could you please work out a code snippet? It doesn't need to be formalized within RemoteAsset and could be NWB-specific for now: it would just get the metadata for the asset, re-extract metadata from its URL on S3 using fsspec, and submit the updated metadata (minting a new asset).

@jwodder
Member

jwodder commented Feb 3, 2023

@yarikoptic I don't think this can be done in just a "snippet." Too much of the current code assumes we have a local file path, and we'd have to either rewrite a bunch of stuff or just duplicate a bunch of code with a few changes.

@yarikoptic
Member Author

Indeed, lots of the code is intertwined with the assumption of a local file system. Looking at the code, some notes, without yet deciding which way to proceed:

  • We already have https://github.com/dandi/dandi-cli/blob/HEAD/tools/update-assets-on-server circa July 2021
    • that was when we still had a local backup of all assets, so it relies on having access to a blob file to 1. get the digest, 2. call nwb2asset.
    • so we already had an ad-hoc script which can no longer be used. That proves the point that we need a somewhat thought-through solution.
  • nwb2asset (which calls into get_metadata, which switches logic for different file types / BIDS or not) would at large need the following:
    • https://github.com/dandi/dandi-cli/blob/HEAD/dandi/metadata.py#L124 : we attempt to import 2 possible known extensions (ndx_dandi_icephys and allensdk.brain_observatory.ecephys.nwb) if _get_pynwb_metadata fails to open the file / load metadata. That might no longer be needed: since eaaa19e we use load_namespaces=True, and I believe that should be sufficient to just load the metadata.
    • we can adjust _get_pynwb_metadata to handle a path which, if it is a URL, is opened via fsspec
    • similarly adjust get_neurodata_types
    • either update get_nwb_version, or do not use/update it (just take the known nwb_version field)
    • overall it feels that some abstraction on top of NWBHDF5IO and h5py.File, opening them via fsspec when given a URL, would help here
      • but it all might interfere with the use of .memoize_path from our fscacher, which relies on the argument being a file. I don't remember what it does (just passes through?) if it can't stat the file on the drive (as would be the case for a URL)
  • add_common_metadata needs access to the file for blobDateModified, contentSize, and encodingFormat. Here we might indeed need some ad-hoc handling: either copy those from an existing metadata record, or have some abstraction level which would switch between local/remote files.
  • overall it does feel like we could provide some minimal abstraction of File -> LocalFile (not that we don't already have something similar: LocalFileAsset -> NWBAsset) and RemoteFile, and have:
    • an .open context manager to abstract opening the file via fsspec or just passing the filepath down
    • basic interfaces in use to get the size and date modified
    • all those functions use this abstraction instead of a plain str for the path.

@jwodder
Member

jwodder commented Feb 13, 2023

I believe the fsspec file classes implement all or most of Python's standard IO methods, so we might be able to adjust the relevant functions that currently take file paths to instead just take open file objects. This still wouldn't address getting the file size, mtime, and MIME type, though.

@yarikoptic
Member Author

that is why I wondered if it would be better done by our custom adapter class(es), which would use fsspec classes for the IO methods and appropriate ones for size and mtime.

@jwodder
Member

jwodder commented Feb 14, 2023

@yarikoptic What exactly is the next step here? Is it to devise a File/LocalFile/RemoteFile hierarchy?

@yarikoptic
Member Author

I think so -- devise the hierarchy and use it throughout the code used to extract metadata.

@jwodder
Member

jwodder commented Feb 14, 2023

@yarikoptic Possible API:

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from contextlib import AbstractContextManager
from datetime import datetime
from pathlib import Path
from typing import IO


class File(ABC):
    @abstractmethod
    def open(self) -> AbstractContextManager[IO[bytes]]:
        ...

    @abstractmethod
    def get_size(self) -> int:
        ...

    @abstractmethod
    def get_mtime(self) -> datetime:
        ...

    @abstractmethod
    def get_filename(self) -> str:
        # Returns just the base filename; needed for determining MIME type
        # based on file extensions
        ...


class LocalFile(File):
    filepath: Path

    def __init__(self, path: str | Path) -> None:
        ...

    # Methods use the obvious local implementations


class RemoteFile(File):
    url: str
    # Possibly add an additional optional parameter for the filename for when
    # it can't be parsed from the URL

    def __init__(self, url: str) -> None:
        ...

    # Methods use fsspec
```
In addition, we give BaseRemoteBlobAsset an as_file() -> RemoteFile method (possibly with keyword parameters that are forwarded to get_content_url()), and we give LocalFileAsset an as_file() -> LocalFile method.

  • We do not add an as_file() method to BaseRemoteAsset or LocalAsset, as Zarrs cannot be represented as files or URLs.
  • Question: Should the File hierarchy and the as_file() methods be publicly exposed, or should they only be used internally by dandi-cli for metadata extraction purposes?
  • Do we even want the implementation of RemoteFile to exist in dandi-cli (as opposed to a separate script), as adding fsspec to the project's dependencies would just drag in more stuff that's rarely used?
    • Perhaps we could make the fsspec dependency an extra?
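For concreteness, the local side of this hierarchy could look something like the following (a minimal sketch; the File ABC is elided, and the method names simply follow the proposal above, not any existing dandi-cli code):

```python
from datetime import datetime, timezone
from pathlib import Path


class LocalFile:
    """Sketch of the proposed LocalFile using the obvious local implementations."""

    def __init__(self, path) -> None:
        self.filepath = Path(path)

    def open(self):
        # Path.open() already returns a context-managed binary file object
        return self.filepath.open("rb")

    def get_size(self) -> int:
        return self.filepath.stat().st_size

    def get_mtime(self) -> datetime:
        return datetime.fromtimestamp(self.filepath.stat().st_mtime, timezone.utc)

    def get_filename(self) -> str:
        # Just the base filename, for MIME-type determination
        return self.filepath.name
```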

@yarikoptic
Member Author

Sounds good. Some notes:

  • I think we'd better make RemoteFile more DANDI-specific, e.g. as RemoteDANDIAsset.__init__(self, url: str), so that it obtains size and mtime from the asset metadata -- the mtime derived from the URL would correspond to the time the blob was uploaded to AWS, which would differ. Here url is the URL to a file in some instance of the DANDI archive, like the ones we parse with parse_dandi_url; otherwise we would need to provide instance_name plus (dandiset, path) or asset_id.
  • Re the Question: ATM I see them only for internal extraction, BUT I think it would be a great addition to the public API, largely removing the need for some boilerplate code in https://pynwb.readthedocs.io/en/stable/tutorials/advanced_io/streaming.html#getting-the-location-of-the-file-on-dandi I believe. So if you do not see a problem with it -- it would be nice to expose, IMHO.
  • Indeed, I would not make fsspec a requirement -- I would just make it an extra dependency (in extras, which has duecredit ATM) and delay the import until the single code location where we use it.

@jwodder
Member

jwodder commented Feb 14, 2023

@yarikoptic To be clear: In your first point, you're saying that RemoteFile should instead be called RemoteDANDIAsset (seems too easy to confuse with other *Asset classes) and that its constructor should fetch the size & mtime from the asset metadata?

@yarikoptic
Member Author

yes. I do appreciate the potential for confusion; I even thought about whether we could simply extend the API of our existing hierarchy by mixing in this functionality.

@jwodder
Member

jwodder commented Feb 16, 2023

@yarikoptic I'd rather have these as separate classes than "pollute" the BaseRemoteBlobAsset and LocalFileAsset namespaces with the relevant methods. In particular, LocalFileAsset already has size and modified properties, so if we went the "extra methods on extant classes" route, we'd want to give BaseRemoteBlobAsset properties of the same names as well; but those can easily fail due to HTTP errors or the metadata simply not having the relevant fields (not sure how possible that one is), so they semantically make more sense as methods than as properties.

As for a less confusing name for RemoteDANDIAsset, how about RemoteAssetFile?

@yarikoptic
Member Author

Sounds good

@jwodder
Member

jwodder commented Feb 16, 2023

@yarikoptic I presume that remote files should be opened using the code shown here.

  • Should the fsspec file system object be a caching file system or just a plain HTTP file system? If caching, where should the cache be?
  • How should creation of the file system object be managed? Should there be one global lazily-constructed file system used by all assets, or should each asset get its own instance, or something in between?
  • Is there any preference about which content URL to use for reading assets?
  • If the size and mtime cannot be retrieved from the asset metadata (because the contentSize/blobDateModified field was absent/malformed — not sure how possible that is), should the get_size() and get_mtime() methods fall back to using the Content-Length and Last-Modified HTTP headers of the URL (and, if so, what if those are absent/malformed) or something else?

@yarikoptic
Member Author

  • I would not bother with caching for now -- it might also get used by dandi-archive, so caching might be trickier, etc. If we see the need, we can add it later.
  • without caching, we do not need to bother explicitly creating a filesystem instance -- we can just use the with fsspec.open(url, "rb") as f: context manager.
  • the one directly on S3
  • hm, please log a warning for now if the size is missing, and indeed fetch it from the S3 key -- should be the same ;) If blobDateModified is missing, let it be missing; it seems to be optional according to https://github.com/dandi/dandi-schema/blob/71a8ea3f8231cc8bf4fadcc51ea76db86171925f/dandischema/models.py#L1081

@jwodder
Member

jwodder commented Feb 16, 2023

@yarikoptic So, if blobDateModified is missing, what should be returned by get_mtime() when add_common_metadata() needs a "modified" time?

@yarikoptic
Member Author

what about:

  • get_mtime() returns None whenever it doesn't know
  • add_common_metadata() doesn't assign blobDateModified at all if the returned mtime is None
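That convention could be as small as the following (a sketch; apply_blob_date_modified is a hypothetical stand-in for the relevant part of add_common_metadata):

```python
def apply_blob_date_modified(metadata, mtime):
    """Assign blobDateModified only when an mtime is actually known;
    get_mtime() returning None means 'leave the field out'."""
    updated = dict(metadata)
    if mtime is not None:
        updated["blobDateModified"] = mtime.isoformat()
    return updated
```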

@jwodder
Member

jwodder commented Feb 16, 2023

@yarikoptic What should we do about get_metadata(), get_neurodata_types(), and nwb_has_external_links(), which use fscacher, which assumes that the first argument is something that can be passed to os.realpath()?

Also, do we want get_metadata() to still support being called with a directory path? That doesn't seem to be used anywhere.

@yarikoptic
Member Author

> What should we do about get_metadata(), get_neurodata_types(), and nwb_has_external_links(), which use fscacher, which assumes that the first argument is something that can be passed to os.realpath()?

ideally we would create some local adapter to trigger the fscacher-wrapped function if it is a local path instance and proceed without fscacher otherwise. But I do not see an easy solution there ATM... I am afraid we need to improve fscacher to be able to handle some non-str objects which have some magical methods (e.g. .as_path) or something non-pretty like that. Hm, maybe you have better ideas?

Meanwhile -- just disable those decorators, I guess :-/

re get_metadata:

  • can't it be given a path to a .zarr?
  • I thought it should have also worked if we pointed it at a dandiset, to load from dandiset.yaml, but it seems to just not work at all:

```
❯ dandi -l error ls ../dandisets/000003/dandiset.yaml
- errors: 1
  path: ../dandisets/000003/dandiset.yaml
  size: 3614

❯ dandi -l error ls ../dandisets/000003/
- path: ../dandisets/000003/
```

so -- my concern is primarily the .zarr directory.

@jwodder
Member

jwodder commented Feb 17, 2023

@yarikoptic

  • My idea is to update fscacher to wrap the os.realpath() call in try, and if a TypeError is raised, we bypass the cache.
  • get_metadata() doesn't do anything with Zarrs.
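The first idea could look roughly like this (a sketch of the proposed fscacher behavior, not its actual implementation -- here a plain dict stands in for fscacher's persistent cache):

```python
import os
from functools import wraps


def memoize_path(func):
    """Cache on the resolved path when the first argument is path-like;
    a TypeError from os.path.realpath() means 'not a path', and the call
    goes through uncached."""
    cache = {}

    @wraps(func)
    def wrapper(arg, *args, **kwargs):
        try:
            key = os.path.realpath(arg)
        except TypeError:
            # Not path-like (e.g. a remote-file object): bypass the cache
            return func(arg, *args, **kwargs)
        if key not in cache:
            cache[key] = func(arg, *args, **kwargs)
        return cache[key]

    return wrapper
```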

@yarikoptic
Member Author

> @yarikoptic
>
> • My idea is to update fscacher to wrap the os.realpath() call in try, and if a TypeError is raised, we bypass the cache.

But that alone wouldn't be sufficient to work with a File object, in effect making fscacher defunct on local files too -- or am I missing something?

@yarikoptic
Member Author

Maybe if it applied str() to that argument first, and we ensured that the remote construct's str() is very unlikely to correspond to a local path?

@jwodder
Member

jwodder commented Feb 17, 2023

@yarikoptic Alternatively, we give LocalFile an __fspath__ method so that it can be passed to os.realpath().

@yarikoptic
Member Author

Oh cool! It can even subclass os.PathLike so it can be explicitly isinstance-checked.

@jwodder
Member

jwodder commented Feb 17, 2023

@yarikoptic I believe the builtin functions that work with paths accept anything that implements __fspath__ without requiring them to actually subclass PathLike; see "path-like object" in the glossary and the docs for os.fspath(), which make no mention of PathLike. Hence, instead of using isinstance, fscacher should just check for a TypeError.
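Concretely, the protocol looks like this (hypothetical minimal classes, just to illustrate the point):

```python
import os


class LocalFile:
    """Implementing __fspath__ is all that's needed for os.fspath(),
    os.path.realpath(), open(), etc. -- no PathLike subclassing required."""

    def __init__(self, path):
        self.path = path

    def __fspath__(self):
        return self.path


class RemoteFile:
    """No __fspath__: passing this to a path function raises TypeError,
    which fscacher could catch in order to bypass its cache."""

    def __init__(self, url):
        self.url = url
```

os.fspath(LocalFile("sub-1.nwb")) returns "sub-1.nwb", while os.fspath(RemoteFile(...)) raises TypeError.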

@yarikoptic
Member Author

cool, indeed it talks about a "protocol" there and not inheritance. So overall -- sounds great.

@jwodder
Member

jwodder commented Feb 20, 2023

@yarikoptic Due to some changes I made while implementing this, I'm currently getting type-checking errors involving _sanitize_nwb_version(). This function currently accepts strs and ints, stringifying the latter, but any other type is returned as-is (with a warning logged). This messes up type-checking, as there does not seem to be a guarantee that the input (coming from h5file.attrs["nwb_version"]) is in fact a string. Would it be acceptable to stringify inputs of non-str, non-int types? And similarly change the lambda in this line to stringify v?

@yarikoptic
Member Author

sorry for the delay. https://github.com/NeurodataWithoutBorders/nwb-schema/blob/HEAD/core/nwb.file.yaml#L9 says it should be text. So, assuming that we are getting away from the wild west of prior versions where it was specified more loosely, yeah - let's just stringify there. I don't think we use get_nwb_version with sanitize=True anywhere, so we might as well deprecate that option while at it.
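The agreed-upon stringification could be as simple as the following (illustrating only that aspect; the real _sanitize_nwb_version in dandi-cli also normalizes legacy version strings and logs a warning for unexpected inputs):

```python
def stringify_nwb_version(value):
    """Coerce whatever came out of h5file.attrs["nwb_version"] to str,
    per the nwb-schema declaration that the attribute should be text."""
    return value if isinstance(value, str) else str(value)
```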

yarikoptic added a commit that referenced this issue Feb 28, 2023
Support re-extracting metadata from remote assets