Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RF/ENH: download - reimplementation to support original and upcoming UI/API #134

Merged
merged 39 commits into from
Aug 12, 2020

Conversation

yarikoptic
Copy link
Member

@yarikoptic yarikoptic commented Jul 22, 2020

RFed downloaders. I thought to make it pretty but it ended up quite messy code. Default UI though looks familiar since uses pyout as for ls and upload.

Should be well tested before releasing!

TODOs

  • finish support logic for actually downloading via new API
  • on the fly (during download) digesting (checksum) of the data
  • implement frontend (pyout) analogous to upload
  • verify that upload or anything else got broken (we have some upload unittests going, so let's hope it is all good)
  • May be later (for now "static" rendering from pyout if not interactive should suffice): provide an alternative, simpler "format" for really basic reporting so could be used in CI etc. Probably will just report as json stream or alike.

TODOs after

  • (not here) possibly add proper "sync" etc functionality
  • while at it simplify all the mess here. May be it is not worth bothering trying to start downloading while still traversing remote dataset (actually I am not longer sure it actually works that way since I think I have not seen counts in the summary having > - I must have broke it).
  • ETA would be wrong ATM for downloads where some files are already present. Probably while reworking for "sync" functionality - a fix would come naturally.

@codecov
Copy link

codecov bot commented Jul 22, 2020

Codecov Report

Merging #134 into master will increase coverage by 2.13%.
The diff coverage is 83.69%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #134      +/-   ##
==========================================
+ Coverage   78.11%   80.24%   +2.13%     
==========================================
  Files          42       45       +3     
  Lines        3303     3817     +514     
==========================================
+ Hits         2580     3063     +483     
- Misses        723      754      +31     
Flag Coverage Δ
#unittests 80.24% <83.69%> (+2.13%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
dandi/girder.py 82.01% <74.72%> (+12.70%) ⬆️
dandi/download.py 79.15% <78.68%> (+6.42%) ⬆️
dandi/dandiapi.py 85.71% <85.71%> (ø)
dandi/utils.py 74.80% <94.11%> (+1.36%) ⬆️
dandi/cli/cmd_download.py 85.71% <100.00%> (+1.09%) ⬆️
dandi/cli/cmd_ls.py 68.75% <100.00%> (ø)
dandi/cli/formatter.py 92.85% <100.00%> (ø)
dandi/consts.py 100.00% <100.00%> (ø)
dandi/support/iterators.py 100.00% <100.00%> (ø)
dandi/support/pyout.py 75.40% <100.00%> (+0.40%) ⬆️
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6dfcb6a...20718e2. Read the comment docs.

@yarikoptic
Copy link
Member Author

@satra @mgrauer -- "Perfect is the enemy of good". I a bit "overengineered", but it should be a workable solution. Needs testing.
Give it a try while its "hot" and I can figure out what I have done (hopefully ;)).

For downloads of "released" dandisets dandi/dandi-publish#112 should be addressed first though.

@yarikoptic
Copy link
Member Author

here is some kind of a demo: asciicast

@mgrauer
Copy link
Contributor

mgrauer commented Aug 3, 2020

Thanks for the asciicast, that is some fun viewing 🍿

The UI looks nice, especially the way it updates multiple lines in parallel.

I have a couple points of discussion I'd like to raise for the CLI commands, they don't necessarily need to be addressed on this PR, but at least to start thinking about them.

  1. having the output folder can make it harder to give the user a specific command from the GUI that they can then give to the CLI, because the output folder is specific to the user's setup. i.e. it is easy to tell the user, from the GUI dandi download DANDI:000007 but not dandi download -o testdandi DANDI:000007. Maybe we could put the folder flag at the end or even put a template value in, when we display the command to run from the GUI, e.g. dandi download DANDI:000007 -output <OUTPUT_DIR> ?

  2. there is still some inconsistency in how people download drafts vs released dandisets currently:

dandi download -o testdandi DANDI:000007 for latest release
dandi download -o testdandi 'http://GUI_url_to_draft' for a draft

Could we change this to something like the following (I'm omitting output folder to simplify):

dandi download DANDI:000007 downloads latest published version
dandi download DANDI:000007:0.200714.1807 downloads this specific published version
dandi download DANDI:000007:draft downloads the draft dandiset in its current form

At first I was thinking that maybe dandi download DANDI:000007 should download a draft if there are no published versions, but upon thinking about it more, the draft is a different enough beast from a release (it is both volatile and mutable) that I'd want the user to have to be explicit about dealing with drafts.

@satra
Copy link
Member

satra commented Aug 3, 2020

DANDI:000007:draft

@mgrauer - this is not an acceptable value for the dandi prefix presently in identifiers.org, so we would either have to add that as an allowed regex or force people not to use identifiers.org for drafts. also it would be / not : for version information

dandi download DANDI:000007/0.200714.1807 downloads this specific published version
dandi download DANDI:000007/draft downloads the draft dandiset in its current form

having a persistent identifier for a draft seems a little weird. so i'm hesitant to add that there, but open to thoughts. instead we could provide the following option through the CLI

$ dandi download DANDI:000007
Please select from available versions:
0) Current draft dataset
1) 0.200714.1807
2) ...

and the draft url would always be: https://dandiarchive.org/dandiset/000007/draft. this would also force people to publish a version to use an identifier.

@satra
Copy link
Member

satra commented Aug 3, 2020

also, we would never mint a doi for a dandiset which does not have a published version. i can also add a note to the redirector to specifically point to the draft URL for an unpublished dandiset.

@yarikoptic
Copy link
Member Author

@jwodder I invite you to try this one as well.

@yarikoptic
Copy link
Member Author

  1. e.g. dandi download DANDI:000007 -output <OUTPUT_DIR> ?

that is already supported out of the box since that is how argparse (which I hope click uses underneath) does parsing - it does allow --options to be used at the end of the line. It is very useful for options which could take arbitrary number of values :

$> dandi download DANDI:000007 --output-dir myother
2020-08-05 16:30:24,676 [    INFO] Traversing remote dandisets (000007) recursively
PATH                 SIZE DONE    DONE% CHECKSUM STATUS MESSAGE   
dandiset.yaml                                    done   updated   
Summary:                  0 Bytes                1 done 1 updated 
                          <0.00%                                  
^C
Aborted!

$> ls -l myother
total 4
drwx------ 2 yoh yoh 4096 Aug  5 16:30 000007/

@yarikoptic
Copy link
Member Author

@mgrauer - this is not an acceptable value for the dandi prefix presently in identifiers.org, so we would either have to add that as an allowed regex or force people not to use identifiers.org for drafts. also it would be / not : for version information

dandi download DANDI:000007/0.200714.1807 downloads this specific published version
dandi download DANDI:000007/draft downloads the draft dandiset in its current form

I would also prefer here to use what is supported by/through identifiers.org.

Notes:

So I wondered if it could be DANDI:<ID>[@<VERSION>][/PATH] or @ isn't good for identifiers?

Re "interactive" > $ dandi download DANDI:000007 from @satra -- "not sure". I expressed similar desires for more "interactive" within datalad, but somewhat agree (with Kyle) that for commands which by default are expected to be non-interactive, behavior should be explicitly described (e.g. download latest released version if no version specified), and only with explicit --interactive|-i they should offer choices. Similar to what git does. We aren't git here, but we do expect dandi to be used in CIs etc.

@yarikoptic
Copy link
Member Author

also, we would never mint a doi for a dandiset which does not have a published version. i can also add a note to the redirector to specifically point to the draft URL for an unpublished dandiset.

I would prefer to avoid "if". That would avoid surprises for people who decide to use dandi datasets in their scripts in CI etc, so they could always know what they are downloading.

a. If we decide that it should be most recent published release, then without explicit version specification of "draft" - it should blow.

b. If we decide that it should not blow -- it should always redirect to a draft version. BUT I do not like this option since then we would need some additional latest to point to the latest release. Thus I would go with a..

…f tuple elements

It became obvious after RFing to use a helper function
Common code from girder.py was moved into download.py etc.
Found these changes uncomitted after a long period of time, committing
before progressing forward with development
…e going through individual elements

Probably overdone/ugly and should be easier/better with async. TODO REDO later
…rs work

See referenced in the diff issues about "modified" etc.
Also I removed block of code to fetch metadata for an asset if missing - API
now returns asset records with metadata
@yarikoptic
Copy link
Member Author

as no concerns were expressed upon requested "manual" testing, and we just keep breeding conflicts, I will resolve conflicts and if all good would merge.

* origin/master: (24 commits)
  Remove duplicate "requests" dependency
  get_instance(): Immediately return known instances that lack redirector URLs
  Raise helpful error if /server-info returns invalid version
  Make redirector report http://localhost:8081 as Girder URL
  Test uploading an unregistered dandiset
  Assert a certain code path is unreachable
  Test that lock_dandiset() is called when uploading
  Tests of lock_dandiset()
  Remove unused monkeypatches
  Configure flake8 to run on test code as well
  Detect nonexistent Dandisets when uploading
  ENH: upload - add locking of the dandiset
  Make pytest ignore DeprecationWarnings from responses
  Fetch all commits in test workflow
  Replace some occurrences of known_instances with get_instance()
  Test get_instance()
  CliVersionTooLowError → CliVersionTooOldError
  Move CLI version exceptions to dandi/exceptions.py
  Use semantic_version instead of packaging
  [GH-184] Add a get_instance() function
  ...

 Conflicts:
	dandi/download.py
	dandi/girder.py
	dandi/tests/test_download.py - conflicted with a linting change
	dandi/tests/test_utils.py - both added tests

 No conflicts but needed changes:
    dandi/utils.py -- needed to add dandiapi into returned dandi_instance
…"api"

I left module and class still with dandiapi names
@@ -48,12 +55,16 @@
)
@click.argument("url", nargs=-1)
@map_to_click_exceptions
def download(url, output_dir, existing, jobs=6, develop_debug=False):
def download(url, output_dir, existing, jobs=6, format="pytout"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's a situation where click would leave format unfilled, and if there is, shouldn't the default be "pyout", not "pytout"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh... so we can avoid duplicating default values within kwargs?! let met just strip then them away from download and we could do to the rest in a separate PR. THANK YOU! done in 4798e49

dandi/consts.py Outdated
# Chunk size when iterating a download (and upload) body. Taken from girder-cli
# TODO: should we make them smaller for download than for upload?
# ATM used only in download
MAX_CHUNK_SIZE = os.environ.get("DANDI_MAX_CHUNK_SIZE", 1024 * 1024 * 8) # 64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to cast this to an int, as the value from os.environ is a string.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

d'oh, right,THANKS! done in 09b2196

lgr = get_logger()


class HTTPError(requests.HTTPError):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just import HTTPError instead of making a do-nothing subclass?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to aggregate non-standard exceptions our code might raise within this module. I guess I could have just gone with from requests import HTTPError into this namespace indeed but then linters would not be happy (not used etc), so I somewhat like this explicit subclassing. Any disadvantage?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no real disadvantage other than a tiny bit of overhead. I'd personally prefer to import HTTPError and add a noqa comment to tell flake8 to be quiet.

"""
self._session = session if session else requests.Session()

yield self._session
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be:

try:
    yield self._session
finally:
    self._session.close()
    self._session = None

so that the session gets closed on error.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed needs a fix, thanks! but it should close only if session was not provided (should not close outside session). Fixed in f367313

…ffect)

It did not rebase easily into merge commit so I decided to keep it as is
)

# If success, return the json object. Otherwise throw an exception.
if result.status_code not in (200, 201):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest using "if not result.ok" instead, as that only regards 4xx and 5xx status codes as errors. There are other 2xx codes that aren't errors, plus the 3xx codes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THANKS! done in a45b550

@yarikoptic
Copy link
Member Author

Thank you @jwodder for the review and suggestions! I have now pushed all the changes I have referenced.

return self.get(f"/dandisets/{dandiset_id}/versions/{version}/")

def get_dandiset_assets(
self, dandiset_id, version, location=None, page_size=None, include_metadata=True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

include_metadata doesn't seem to be used, and there's no TODO item for it. Is that intentional?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it would become "parametrized" for in the query for dandi/dandi-publish#78 resolution, but now it always includes it, so indeed there is no need for it. Removed in 515538e

# dandi.cli.formatters are used in cmd_ls to provide switchable
pyout_style = pyouts.get_style(hide_if_missing=False)

rec_fields = ("path", "size", "done", "done%", "checksum", "status", "message")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to be passing tuples of columns to pyout, we should add != 0.6.0 to the dependency in setup.cfg in order to avoid the version that doesn't support them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea! you did fix for 0.6.0 for upload in cc4d37c but there is still ls, and I just feel it would be just easier to ask to avoid 0.6.0. done in 096c8dd

# TODO: yield progress etc
msg = {"done": downloaded}
if size:
if downloaded > size and not warned:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warned never gets set to True. Should that be added here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, set it to True in 5abd195 ... hopefully we never run into this part of code.

yield queue.get(timeout=0.001)
except Empty:
continue
if self.reraise_immediately and self._exc is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be better to move this check to the top of the while loop in case get times out while an exception is raised?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed! done in e52d728

lgr = get_logger()


class HTTPError(requests.HTTPError):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no real disadvantage other than a tiny bit of overhead. I'd personally prefer to import HTTPError and add a noqa comment to tell flake8 to be quiet.

@yarikoptic
Copy link
Member Author

There's no real disadvantage other than a tiny bit of overhead. I'd personally prefer to import HTTPError and add a noqa comment to tell flake8 to be quiet.

I have argued with having our exception in dandi.exceptions but even wasn't there... I removed this subclass for now completely in 20718e2

@yarikoptic
Copy link
Member Author

woohoo -- still green, merging before it ripens.

@yarikoptic yarikoptic merged commit 3db0424 into master Aug 12, 2020
@yarikoptic yarikoptic deleted the enh-download-api branch August 12, 2020 20:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

tqdm: errors in the tests with "AssertionError: attempt to release recursive lock not owned by thread"
4 participants