Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Video files organize #841

Merged
merged 76 commits into from
Feb 8, 2022
Merged

Conversation

Saksham20
Copy link
Contributor

@Saksham20 Saksham20 commented Nov 24, 2021

fixes #785

This PR adds the functionality to upload on DANDI additional files of different extensions that are be linked to the nwbfile with the external_file argument.

Prior to upload this requires editing the nwbfile's external_file argument to reflect the new location of the video in the DANDI re-organized folder structure as it would be on the cloud. This would keep external links consistent for another user that downloads that dandiset. This is a multi-step process: we need a strategy to organise the video files like Dandi organise, edit the nwbfile and then upload this.

Copy link
Member

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Saksham20!

Very preliminary review to possible help align ongoing development. Extra overall

  • I think we need docstrings for all those functions since even though name is descriptive somewhat, the purpose and what is returned is not clear
    • also please use type annotations
  • Need to come up with unittest(s)

dandi/organize.py Outdated Show resolved Hide resolved
dandi/organize.py Outdated Show resolved Hide resolved
dandi/consts.py Outdated Show resolved Hide resolved
dandi/cli/cmd_organize.py Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Nov 24, 2021

Codecov Report

Merging #841 (d1a71e5) into master (2c2cb73) will increase coverage by 0.08%.
The diff coverage is 91.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #841      +/-   ##
==========================================
+ Coverage   86.57%   86.65%   +0.08%     
==========================================
  Files          61       61              
  Lines        7291     7435     +144     
==========================================
+ Hits         6312     6443     +131     
- Misses        979      992      +13     
Flag Coverage Δ
unittests 86.65% <91.33%> (+0.08%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
dandi/cli/cmd_organize.py 75.60% <65.00%> (-1.27%) ⬇️
dandi/pynwb_utils.py 83.40% <86.20%> (+0.41%) ⬆️
dandi/organize.py 86.98% <93.93%> (+1.21%) ⬆️
dandi/consts.py 100.00% <100.00%> (ø)
dandi/tests/fixtures.py 96.12% <100.00%> (+0.80%) ⬆️
dandi/tests/test_metadata.py 100.00% <100.00%> (ø)
dandi/tests/test_organize.py 94.66% <100.00%> (+0.96%) ⬆️
dandi/download.py 86.96% <0.00%> (-0.35%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2c2cb73...d1a71e5. Read the comment docs.

@Saksham20
Copy link
Contributor Author

@bendichter @yarikoptic @rly @oruebel This is the strategy that I am currently thinking of to implement supporting video files on dandi:

Currently there are 3 CLI options during the dandi organize command:

  1. files_mode=copy/move/symlink/hardlink (existed prior)
  2. rewrite=external_file (new, optional, would rewrite all the external_file attributes in the ImageSeries)
  3. external_files_mode=copy/move/symlink/hardlink (new, optional, depends on specifying rewrite)

Since we would need to change the nwbfile (rewrite the external_file) we may need different ways to deal with external_files_mode values.

  1. external_files_mode=copy/move: in this case the videos would exist in the new dandiset organised folder. Thus updating the nwbfiles' external_file attribute (with the agreed object_id naming scheme) would not break the linkage. This PR takes care of this: it enforces the user to use a copy/move for the external_files_mode when rewrite=external_file is set, else throws a ValueError. dandi upload would upload the videos along with the nwbfiles.
  2. external_files_mode=symlink/hardlink: in this case using the new value of video files location in external_file would break the link since we do not move them into the new folder structure. Keeping the local nwbfile as is would maintain this yet we need the updated value for the files uploaded on dandi. One way to work around this is to use a sidecar JSON file being implemented here: Modify dset/attr builders based on sidecar JSON hdmf-dev/hdmf#677. We could them simply upload the unchanged nwbfiles with the JSON containing the updated external_file argument. dandi upload would thus upload the json to the dandiset also. This feature can be implemented in another PR.
  3. Whether we have assisting sidecar JSON (feature 2) or feature 1, streaming the video with ros3 driver would still not be possible. We would need a way for the dandi api to generate a S3 path from the relative path of the video in the dandiset. This is an advanced feature and could be part of another future PR.

@Saksham20
Copy link
Contributor Author

Saksham20 commented Dec 3, 2021

todo:

@satra
Copy link
Member

satra commented Dec 3, 2021

@Saksham20 - have you considered the external name in NWB to be some form of a template possibly following a URI scheme. for example: {host}:video/{basename}.mp4 this would take the current file location/name and replace the name with video/{basename}.mp4. the set of keywords allowed can be discussed. but this would allow creating nwb files without having to rewrite them as the name changes. and then the only consistency required is external.

the potential advantage of such a scheme is that it could use either relative or absolute paths. an appropriate tool (e.g., pynwb, dandi cli) could also determine how to interpret host properly.

this does require interpreting the location, but even currently the user has to do that.

@yarikoptic
Copy link
Member

yarikoptic commented Dec 6, 2021

@satra (and attn @bendichter who was participating straight from an airport seat) In NWB-DANDI sync call we arrived at an agreement that the best way to proceed ATM would be

@satra
Copy link
Member

satra commented Dec 6, 2021

@yarikoptic - does that mean no possibility for templating?

@yarikoptic
Copy link
Member

I would say, no plan to support it ATM. Possibility can always exist! ;-) what particular extra use cases, in presence of sidecar file support, templating would allow for?

@satra
Copy link
Member

satra commented Dec 7, 2021

@yarikoptic - just a quick pointer to an old comment :) - #70 (comment) - more seriously, external files are not interpreted by pynwb at present as anything other than a string right? as in there is no representation of seeking into a file, or that it is a file. i'm assuming that the link could be a url and pynwb won't complain. so having a template which is resolved by a reader would be fine without even any changes to pynwb (i think). and this would allow for external referencing of a related file and potentially even bytestream and offset without the need to rewrite any component of the file. i think this becomes a better standard operating procedure moving forward than a fixed file. if one wanted to encode specific version, one could optionally embed the checksum into the link as well. using a schema uri would offer a lot of benefits in the long run i think and would still support an absolute path in the short term.

@oruebel
Copy link

oruebel commented Dec 9, 2021

more seriously, external files are not interpreted by pynwb at present as anything other than a string right?

This is true only for the special case of ImageSeries types which define an external reference to a non-HDF5 file. External links to HDF5 files are resolved automatically by HDF5.

so having a template which is resolved by a reader would be fine without even any changes to pynwb (i think)

In principle yes, but this is only because PyNWB currently does not check that the paths are valid (but it arguably should). In terms of the schema, I would argue that this sort of thing would require an update to the schema, as it explitly states this should be "Paths to one or more external file(s)."

a template which is resolved by a reader

Could you provide an example of how such a template would look like (i.e., what are the components of the template) and how a reader (e.g., PyNWB) would be able to determine the parameters for the template from the file without additional input from the user?

@Saksham20
Copy link
Contributor Author

@Saksham20 - have you considered the external name in NWB to be some form of a template possibly following a URI scheme. for example: {host}:video/{basename}.mp4 this would take the current file location/name and replace the name with video/{basename}.mp4. the set of keywords allowed can be discussed. but this would allow creating nwb files without having to rewrite them as the name changes. and then the only consistency required is external.

Wouldn't the URI change once uploaded on DANDI? Then we would have to rewrite the value in the nwbfile

@satra
Copy link
Member

satra commented Dec 9, 2021

@Saksham20 - the ideal scheme would require no renaming inside the nwb file at all. i'm not sure we can achieve ideal, but some examples are below.

Could you provide an example of how such a template would look like

sure @oruebel . it may be useful to read this article on URIs and browse this list of schemes. if you look at the list of schemes, what a scheme is completely up to us to define and convey.

example 1: nwbext://{host}/{basedir}/video/{basename}-<objectid>.mp4 --> this would be the string stored in the imageseries object.

locally pynwb or other reader would interpret this as host=localhost; basedir=current file base directory; basename=name of current file; objectid=imageseries object id. if this was being derived from https://api.dandiarchive.org/api/dandisets/<dandiset>/versions/<version>/assets/<uuid>/download with a content disposition of name=somename.nwb similar notions would hold. e.g., that the file is available in some path relative to this file. to truly make this work we would need to expose a path api ( a la github ) as opposed to the current path api. (but that's true of any current implementation as well). we would not allow the global assets endpoint to do path referencing.

example 2: more bids like: nwbext://{host}/{basedir}/{basebidsname}_video+<objectid>.mp4

basebidsname would remove the suffix, which is about technique/assay type typically in bids.

example 3: add offset nwbext://{host}/{basedir}/video/{basebidsname}_video+<objectid>.mp4;offset=4

example 4: relative path: nwbext://{host}/{basedir}/../up/down/video/megasession_video+<objectid>.mp4;offset=100

example 5: add checksum: nwbext://{host}/{basedir}/../up/down/video/megasession_video+<objectid>.mp4;offset=100;multihash=0x123456789

my use of ; simply indicates the notion that we could make this whatever. we could also use the more standard ? and &

example 5alt: add checksum: nwbext://{host}/{basedir}/../up/down/video/megasession_video+<objectid>.mp4?offset=100&multihash=0x123456789

example 6: absolute file nwbext:///some/path/to/abc.mp4 obviously this wouldn't work with dandi and the upload validator should complain, but would work for any local reader.

@oruebel
Copy link

oruebel commented Dec 9, 2021

@satra thanks for those examples. It wasn't clear to me that you were thinking about URI schemes. It seems this is effectively means designing a REST API around ImageSeries. If we do this, I think it may be worth considering whether we need to assign a neurodata_type for this kind of dataset. I'm not fully convinced yet if we really need a complex path scheme if we have sidecar file. @rly @ajtritt thoughts?

to truly make this work we would need to expose a path api

Does this mean, the user would need to know the original source of the file (e.g., that this file was orignially downloaded from DANDI) to interpret the path?

@satra
Copy link
Member

satra commented Dec 9, 2021

@oruebel - i'm not sure this would require a REST API. this is more like jinja templating.

regarding the link between files for local or downloading, that's going to be the case independent of this suggestion. if you are local, the external file needs to be in a specific location relative to the current file (unless an absolute path is used). if the relative or absolute path changes, the link breaks. if you are remote same thing holds and something needs to know/represent where the other file is.

in either case a reader can inform the user what the filename should look like, so in the worst case scenario, a user could then download the file and place it manually.

I'm not fully convinced yet if we really need a complex path scheme if we have sidecar file.

i don't quite follow what you mean by a sidecar file? are you talking about an external json file? or the video file itself. and the path scheme can be as simple as pointing to a file, but can be as complex as providing a template for a file.

@oruebel
Copy link

oruebel commented Dec 9, 2021

i don't quite follow what you mean by a sidecar file?

Sorry, I was referring to the ability to update datasets/attributes via a JSON file, that we are discussing here hdmf-dev/hdmf#677

@jwodder
Copy link
Member

jwodder commented Feb 4, 2022

@Saksham20 This PR previously contained a function for validating a movie file using cv2, which was removed when PR #853 was merged into this branch. Was the removal intentional or accidental?

dandi/cli/cmd_organize.py Outdated Show resolved Hide resolved
@bendichter
Copy link
Member

@Saksham20 This PR previously contained a function for validating a movie file using cv2, which was removed when PR #853 was merged into this branch. Was the removal intentional or accidental?

@jwodder we decided to remove that in order to make this PR smaller and easier to integrate. That section of code added opencv as a requirement and we thought that might be too heavy for this package. We are happy to add that back in a future PR.

dandi/cli/cmd_organize.py Outdated Show resolved Hide resolved
dandi/cli/cmd_organize.py Outdated Show resolved Hide resolved
dandi/cli/cmd_organize.py Outdated Show resolved Hide resolved
dandi/cli/cmd_organize.py Outdated Show resolved Hide resolved
dandi/cli/cmd_organize.py Outdated Show resolved Hide resolved
dandi/pynwb_utils.py Outdated Show resolved Hide resolved
dandi/tests/fixtures.py Outdated Show resolved Hide resolved
dandi/tests/fixtures.py Outdated Show resolved Hide resolved
dandi/tests/fixtures.py Outdated Show resolved Hide resolved
dandi/tests/test_organize.py Show resolved Hide resolved
dandi/pynwb_utils.py Outdated Show resolved Hide resolved
Copy link
Member

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments.

dandi/pynwb_utils.py Outdated Show resolved Hide resolved
dandi/pynwb_utils.py Outdated Show resolved Hide resolved
dandi/pynwb_utils.py Show resolved Hide resolved
Saksham20 and others added 3 commits February 8, 2022 13:39
Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
Co-authored-by: John T. Wodder II <jwodder@users.noreply.github.com>
@yarikoptic
Copy link
Member

Thank you @Saksham20 !!! All comments were addressed, @jwodder has approved the PR. I can now only do the honors of pressing the green button ;-)

@yarikoptic yarikoptic merged commit d061e3a into dandi:master Feb 8, 2022
@CodyCBakerPhD CodyCBakerPhD deleted the video_files_organize branch March 20, 2022 17:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
minor Increment the minor version when merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

organizing video files using dandi organize
6 participants