
upload: dandi-api - no mtimes are uploaded, so we can neither provide "refresh" nor fast check #367

Closed
yarikoptic opened this issue Feb 9, 2021 · 3 comments · Fixed by #452

Comments

@yarikoptic
Member

In interactions with the girder server, besides the metadata record, we uploaded an uploaded_mtime record, which was used to

  • quickly assess whether a file on the remote is potentially different from the one we have locally (when taken together with size), so that with a large collection of local files we could quickly decide whether any of them need to be uploaded
  • quickly judge whether the file on the server is older than the one we have locally, and upload only the newer one in "--existing=refresh" mode (see the sketch after this list)
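
For illustration, a minimal sketch (Python; `remote_size` and `remote_mtime` are hypothetical stand-ins for the size and uploaded_mtime records the girder server kept per asset) of what such a cheap check looked like:

```python
from pathlib import Path

def needs_upload(local_path: Path, remote_size: int, remote_mtime: float) -> bool:
    """Decide whether a local file should be (re)uploaded using only cheap
    stat() information, without re-reading the file contents.

    remote_size / remote_mtime are hypothetical values standing in for what
    the girder-era records (size + uploaded_mtime) provided per asset.
    """
    st = local_path.stat()
    if st.st_size != remote_size:
        return True  # sizes differ, so the content certainly differs
    # "--existing=refresh": upload only if the local copy is newer
    return st.st_mtime > remote_mtime
```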

With the new dandi-api server we do not upload mtime as part of the AssetMeta, which precludes both of the aforementioned modes of operation. Possible ways forward:

1: introduce modifiedTime to AssetMeta. I dislike this since it is highly volatile and per se not metadata of the asset as stored within the file. Maybe there is some "last modified" timestamp in NWB, but that would make it NWB-specific and thus also not good, so it should be avoided.
2: we could introduce that "objectId" from NWB into the metadata somewhere, which would allow assessing somewhat quickly whether a file has changed, under the assumption that objectId is changed by any modification: well -- direct HDF5 manipulations would not do that. Again -- NWB-specific.
3: keep the model as is, forget about "refresh" mode, and, to avoid lengthy re-digesting of every file, memoize the result of computing the digest per path (see the sketch after this list). Should generally work, and be slow only on the initial sensing/upload of the files (to actually compute the digest). But I feel a bit unsettled about memoizing checksums; not sure if it should be the default (with an option to disable) or some explicit --fast mode.
4: as an alternative to the easy-to-do memoization of (only) checksums -- per each "local" copy of a dandiset, keep information about each uploaded file (the same mtime, inode, size used during memoization) and thus be able to check those mtimes later. But IMHO too much hassle, and it does not help with "racy" uploads from multiple locations.
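
A minimal sketch of the memoization in option 3, assuming a simple JSON cache keyed by path with a (mtime, inode, size) fingerprint; the cache location and layout here are illustrative, not the actual dandi implementation:

```python
import hashlib
import json
from pathlib import Path

CACHE_FILE = Path("~/.cache/dandi/digests.json").expanduser()  # hypothetical location

def _fingerprint(path: Path) -> list:
    # the same cheap attributes mentioned above: mtime, inode, size
    st = path.stat()
    return [st.st_mtime, st.st_ino, st.st_size]

def memoized_sha256(path: Path) -> str:
    """Return the sha256 of ``path``, re-reading the file only when its
    stat fingerprint (mtime, inode, size) has changed since the last call."""
    cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    key = str(path.resolve())
    fp = _fingerprint(path)
    entry = cache.get(key)
    if entry and entry["fingerprint"] == fp:
        return entry["sha256"]  # fast path: no file data is read
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    cache[key] = {"fingerprint": fp, "sha256": h.hexdigest()}
    CACHE_FILE.parent.mkdir(parents=True, exist_ok=True)
    CACHE_FILE.write_text(json.dumps(cache))
    return h.hexdigest()
```

Option 4 would keep essentially the same fingerprint per uploaded file, rather than per locally computed digest.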

So I do not see any generically nice way to keep dandi upload usable on large volumes of data while helping to avoid useless re-uploads (the same file re-uploaded back and forth from two different locations), etc. Option 3 is probably the most reasonable, but it would not provide "refresh" mode.

Any additional ideas @satra ?

@satra
Member

satra commented Feb 9, 2021

mtime can be destroyed quite easily, but it is indeed available in many settings.

i think 3 is your best bet. the issue is not with upload but with local checksum computation. since the digest allows for multiple checksums, perhaps use the fastest digest algorithm as well and store that. to verify whether a blob is already uploaded you will always need a local store of the sha256.

thus technically that + the asset metadata should allow you to check if the asset is up there. but i don't know how the check would happen.
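
A sketch of computing several digests in a single pass over the file, so that a fast digest could be stored alongside the sha256 without an extra read of the data (the algorithm names are illustrative, and the mapping onto the AssetMeta digest field is not shown):

```python
import hashlib

def multi_digest(path, algorithms=("sha256", "md5")):
    """Compute several digests in one pass over the file, so storing a fast
    digest alongside sha256 does not require re-reading the data."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            for h in hashers.values():
                h.update(chunk)
    return {name: h.hexdigest() for name, h in hashers.items()}
```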

@satra
Member

satra commented Feb 9, 2021

perhaps use something like this: https://cyan4973.github.io/xxHash/
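
For reference, a minimal sketch using the Python bindings for the suggested library (the xxhash package, assumed to be installed separately):

```python
# pip install xxhash  -- Python bindings for the xxHash library linked above
import xxhash

def xxh64_file(path, chunk_size=1 << 20):
    """Stream a file through xxHash64; much faster than sha256, although it
    still requires reading all of the data."""
    h = xxhash.xxh64()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```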

@yarikoptic
Member Author

It is not about hash speed per se, but rather about needing to go through all the data to compute it -- that gets to be too much too quickly. Also, it would need to be the one we have in the metadata on the server, which is sha256 ATM. Oh well, I guess 3 is the way.
