
upload: dandi-api - no mtimes are uploaded, so we can neither provide "refresh" nor fast check #367

Closed
yarikoptic opened this issue Feb 9, 2021 · 3 comments · Fixed by #452

Comments

@yarikoptic
Member

In interactions with the girder server, besides the metadata record, we uploaded an uploaded_mtime record, which was used to

  • quickly assess whether a file on the remote is potentially different from the one we have locally (when taken together with size), so that with a large collection of local files we could quickly decide whether any of them need to be uploaded
  • quickly judge whether the file on the server is older than the one we have locally, and upload only the newer one in "--existing=refresh" mode (see the sketch after this list)
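
For illustration, a minimal sketch (Python; `remote_size` and `remote_mtime` are hypothetical stand-ins for the size and uploaded_mtime records the girder server kept per asset) of what such a cheap check looked like:

```python
from pathlib import Path

def needs_upload(local_path: Path, remote_size: int, remote_mtime: float) -> bool:
    """Decide whether a local file should be (re)uploaded using only cheap
    stat() information, without re-reading the file contents.

    remote_size / remote_mtime are hypothetical values standing in for what
    the girder-era records (size + uploaded_mtime) provided per asset.
    """
    st = local_path.stat()
    if st.st_size != remote_size:
        return True  # sizes differ, so the content certainly differs
    # "--existing=refresh": upload only if the local copy is newer
    return st.st_mtime > remote_mtime
```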

With the new dandi-api server we do not upload mtime as part of the AssetMeta, which precludes both of the aforementioned modes of operation. Possible ways forward:

1: introduce modifiedTime to AssetMeta. I dislike this since it is highly volatile and per se not metadata of the asset as stored within the file. Maybe there is some "last modified" timestamp in NWB, but that would make it NWB-specific and thus also not good, so it should be avoided.
2: we could introduce that "objectId" from NWB into the metadata somewhere, which would allow assessing somewhat quickly whether a file has changed, under the assumption that objectId is changed by any modification: well -- direct HDF5 manipulations would not do that. Again -- NWB-specific.
3: keep the model as is, forget about "refresh" mode, and, to avoid lengthy re-digesting of every file, memoize the result of computing the digest per path (see the sketch after this list). Should generally work, and be slow only on the initial sensing/upload of the files (to actually compute the digest). But I feel a bit unsettled about memoizing checksums; not sure if it should be the default (with an option to disable) or some explicit --fast mode.
4: as an alternative to the easy-to-do memoization of (only) checksums -- per each "local" copy of a dandiset, keep information about each uploaded file (the same mtime, inode, size used during memoization) and thus be able to check those mtimes later. But IMHO too much hassle, and it does not help with "racy" uploads from multiple locations.
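
A minimal sketch of the memoization in option 3, assuming a simple JSON cache keyed by path with a (mtime, inode, size) fingerprint; the cache location and layout here are illustrative, not the actual dandi implementation:

```python
import hashlib
import json
from pathlib import Path

CACHE_FILE = Path("~/.cache/dandi/digests.json").expanduser()  # hypothetical location

def _fingerprint(path: Path) -> list:
    # the same cheap attributes mentioned above: mtime, inode, size
    st = path.stat()
    return [st.st_mtime, st.st_ino, st.st_size]

def memoized_sha256(path: Path) -> str:
    """Return the sha256 of ``path``, re-reading the file only when its
    stat fingerprint (mtime, inode, size) has changed since the last call."""
    cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    key = str(path.resolve())
    fp = _fingerprint(path)
    entry = cache.get(key)
    if entry and entry["fingerprint"] == fp:
        return entry["sha256"]  # fast path: no file data is read
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    cache[key] = {"fingerprint": fp, "sha256": h.hexdigest()}
    CACHE_FILE.parent.mkdir(parents=True, exist_ok=True)
    CACHE_FILE.write_text(json.dumps(cache))
    return h.hexdigest()
```

Option 4 would keep essentially the same fingerprint per uploaded file, rather than per locally computed digest.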

So I do not see any generically nice way to keep dandi upload usable on large volumes of data while helping to avoid useless re-uploads (the same file re-uploaded back and forth from two different locations), etc. Option 3 is probably the most reasonable, but it would not provide "refresh" mode.

Any additional ideas @satra ?

@satra
Member

satra commented Feb 9, 2021

mtime can be destroyed quite easily, but it is indeed available in many settings.

i think 3 is your best bet. the issue is not with upload but with local checksum computation. since the digest allows for multiple checksums, perhaps use the fastest digest algorithm as well and store that. to verify whether a blob is already uploaded you will always need a local store of the sha256.

thus technically that + the asset metadata should allow you to check if the asset is up there. but i don't know how the check would happen.
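
A sketch of computing several digests in a single pass over the file, so that a fast digest could be stored alongside the sha256 without an extra read of the data (the algorithm names are illustrative, and the mapping onto the AssetMeta digest field is not shown):

```python
import hashlib

def multi_digest(path, algorithms=("sha256", "md5")):
    """Compute several digests in one pass over the file, so storing a fast
    digest alongside sha256 does not require re-reading the data."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            for h in hashers.values():
                h.update(chunk)
    return {name: h.hexdigest() for name, h in hashers.items()}
```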

@satra
Member

satra commented Feb 9, 2021

perhaps use something like this: https://cyan4973.github.io/xxHash/
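
For reference, a minimal sketch using the Python bindings for the suggested library (the xxhash package, assumed to be installed separately):

```python
# pip install xxhash  -- Python bindings for the xxHash library linked above
import xxhash

def xxh64_file(path, chunk_size=1 << 20):
    """Stream a file through xxHash64; much faster than sha256, although it
    still requires reading all of the data."""
    h = xxhash.xxh64()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```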

@yarikoptic
Member Author

It is not about hash speed per se, but rather about needing to go through all the data to compute it -- that gets to be too much too quickly. Also, it would need to be the one we have in the metadata on the server, which is sha256 ATM. Oh well, I guess 3 is the way.
