
Add "resume" functionality to download #189

Closed · wants to merge 12 commits

Conversation


@jwodder jwodder commented Aug 12, 2020

Note that this PR merges into the #134 branch, not into master. (no longer needed)

Closes #182


codecov bot commented Aug 12, 2020

Codecov Report

Merging #189 into master will decrease coverage by 10.85%.
The diff coverage is 57.89%.


@@             Coverage Diff             @@
##           master     #189       +/-   ##
===========================================
- Coverage   81.77%   70.91%   -10.86%     
===========================================
  Files          50       50               
  Lines        4186     4264       +78     
===========================================
- Hits         3423     3024      -399     
- Misses        763     1240      +477     
Flag Coverage Δ
#unittests 70.91% <57.89%> (-10.86%) ⬇️


Impacted Files Coverage Δ
dandi/download.py 80.54% <53.12%> (-2.80%) ⬇️
dandi/tests/fixtures.py 68.18% <58.53%> (-28.31%) ⬇️
dandi/girder.py 65.68% <61.11%> (-16.33%) ⬇️
dandi/dandiapi.py 82.17% <75.00%> (-0.48%) ⬇️
dandi/tests/test_girder.py 20.00% <0.00%> (-80.00%) ⬇️
dandi/tests/test_upload.py 21.12% <0.00%> (-78.88%) ⬇️
dandi/tests/test_register.py 28.57% <0.00%> (-71.43%) ⬇️
dandi/upload.py 3.23% <0.00%> (-70.86%) ⬇️
dandi/support/generatorify.py 0.00% <0.00%> (-64.37%) ⬇️
dandi/register.py 33.33% <0.00%> (-37.04%) ⬇️
... and 6 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d091704...122b6dc.

@yarikoptic yarikoptic (Member) left a comment

Looks awesome!!! I left some "concerns" (which might not be valid) and minor recommendations.

Ideally we should also somehow test URLs from the "publish" API with interruption (also somewhere in the middle of the file) so we know that restarting from the middle works correctly.

We had better not commit such big "test" files directly into git here. Among the fixtures we already have some "fake data file" fixtures, such as simple2_nwb. We just need to add another one that fills up some data structure within the file to make it large. Then another session-wide fixture could upload it to our testing deployment so it can be reused in various tests. I wonder if that fixture could even use (and thus test) the publish API -- we "upload" to drafts, "publish" a version of that dandiset, and thus get URLs to both the draft and the published version... it might, though, require the "redirector" two-girder-URLs PR being merged to get that one running.
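A hedged sketch of how such fixtures might be wired together (large_dandiset, make_large_nwb, and upload_to_test_instance are hypothetical names for illustration, not existing dandi-cli fixtures):

    import pytest

    @pytest.fixture(scope="session")
    def large_dandiset(tmp_path_factory):
        # Generate a sizeable fake NWB file once per test session
        # (a pynwb generation sketch appears further below) ...
        path = tmp_path_factory.mktemp("large") / "large.nwb"
        make_large_nwb(path)  # hypothetical helper, sketched below
        # ... and upload it to the testing deployment so it can be
        # reused across tests (ideally exercising the publish API too).
        return upload_to_test_instance(path)  # hypothetical helper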

dandi/download.py (resolved thread)
    if downloaded:
        blob1 = writer.read(len(block))
        if blob1 == block:
            downloaded += len(block)
Member:

Since this succeeded and we went back two chunks, we would either need to re-feed a new digester here with the previously downloaded data, or maybe just avoid updating the digester below for the re-downloaded block? Without that, it would end up with an incorrect digest, I think -- or am I wrong?
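For illustration, a minimal sketch of the first option (re-feeding a fresh digester with the bytes already verified on disk), assuming a hashlib-style digester, which is not necessarily the digester dandi-cli actually uses:

    import hashlib

    def refeed_digester(path, downloaded, chunk_size=1 << 20):
        # Rebuild the digest from the bytes already on disk before
        # resuming, so the final digest covers the whole file.
        digester = hashlib.sha256()
        remaining = downloaded
        with open(path, "rb") as f:
            while remaining > 0:
                block = f.read(min(chunk_size, remaining))
                if not block:
                    break
                digester.update(block)
                remaining -= len(block)
        return digester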

Member Author:

Commit ba8767f should address this.

Member:

Thanks ... going through the code, I think that should indeed work! But the logic is a bit tricky (at least to me). So I think we just need a good number of tests on some file longer than 3 chunks: 3 runs with the download interrupted on the 1st chunk, then on the 2nd, then on the 3rd. And then some test that fails on multiple chunks (so maybe, if the file were 5 chunks, fail on the 3rd and the 5th), just to see that all aspects of the restart work out nicely.
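One way to express that test matrix, as a hedged sketch (the test and fixture names are hypothetical; the interruption shim itself is sketched later in this thread):

    import pytest

    # Interrupt on the 1st, 2nd, and 3rd chunk, then on two chunks of a
    # 5-chunk file (indices are 0-based), per the suggestion above.
    @pytest.mark.parametrize("fail_on_chunks", [(0,), (1,), (2,), (2, 4)])
    def test_resume_download(fail_on_chunks, large_dandiset, tmp_path):
        # download with a shim raising on each chunk in fail_on_chunks,
        # then assert the resulting file's digest is correct
        ...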

Member Author:

Do you have any suggestions on how to simulate download interruptions? The best thing I've been able to find for this is https://github.com/Shopify/toxiproxy, but it doesn't handle TLS certificates intelligently, and either way I've been unable to actually connect to its proxy in a minimal setup.

Member:

Isn't our docker instance run over straight HTTP without any encryption? I would just try to mock/shim the requests streaming "thing" to abort with some code upon reaching the desired chunk.

Member:

Hm... what then happens if, e.g., the connection is lost? If it's some other exception, let's raise that one?

Member Author:

As I said above, when I tried inducing a lost connection with a test script (by turning off my WiFi), there was no error. However, an SO post from six years ago says that such an event should raise a socket.timeout.

Even if we do determine the proper error to raise, there's still the matter of what to mock and how so that the error gets raised at the right point.

Member:

Hm... I have also failed so far (by disabling/enabling WiFi) to trigger requests to give up and raise some exception (but I didn't wait too long, I guess). I had not realized that requests tries so hard to maintain the stream. But I doubt it would go back and resume if the connection does get interrupted somehow (e.g., jumping from one network to another?).

As for what to mock: I would mock get_download_file_iter, which we have in both girder and dandiapi. The mock shim would obtain the original generator function but, instead of returning it, return one that yields from the original generator until it hits the chunk it wants to raise on (the implementation of get_download_file_iter in dandiapi is easier to grasp -- there is an explicit loop there). For now I would raise the socket.timeout that the SO post talks about.

Related to all this is #198, where "resume" would also be used.
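A minimal sketch of such a shim, assuming a pytest monkeypatch and a client whose get_download_file_iter(file_id, chunk_size) returns a downloader function yielding byte chunks (names and signatures are illustrative, not the exact dandi-cli API):

    import socket

    def shim_download_iter(monkeypatch, client, fail_on_chunk):
        orig = client.get_download_file_iter
        attempts = {"n": 0}

        def shimmed(*args, **kwargs):
            downloader = orig(*args, **kwargs)
            attempts["n"] += 1
            if attempts["n"] > 1:
                return downloader  # let the resumed attempt finish

            def interrupted(**dl_kwargs):
                # Yield from the original downloader until the chosen
                # chunk, then simulate a dropped connection.
                for i, chunk in enumerate(downloader(**dl_kwargs)):
                    if i == fail_on_chunk:
                        raise socket.timeout("simulated connection loss")
                    yield chunk

            return interrupted

        monkeypatch.setattr(client, "get_download_file_iter", shimmed)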

Member Author:

I still can't figure out what URL to pass to download(). The mock no longer interferes with http://localhost:8079/dandiset/{dandi_id}/draft, but now the request to that URL is returning 500, presumably because of the GIRDER_URL issue discovered in #186 that we didn't completely address because editing serve.py wasn't necessary at the time.

Member:

In principle you could test _download_file directly, but I guess that would need too much preparation; then again, you could pass a shimmed downloader without the mock...

And indeed, dandi/redirector#25 was never merged, so the redirector within docker-compose can't talk to girder, and such queries to the local redirector wouldn't work.

But if it is a released version, then URLs like https://github.com/dandi/dandi-cli/blob/master/dandi/download.py#L63 should work (just point to localhost), although I'm not sure if mapping to the API server via server-info would work out.

I will also try to dig deeper later when near a laptop.

dandi/girder.py (outdated, resolved thread)
    raise gcl.HttpError(
        text="Failing to download", url=url, method="GET", status=500
    )
    return orig_sendRestRequest(self, *args, **kwargs)
Member:

Are any of our test files big enough that we could trigger a re-download from somewhere past twice the chunk size?

Member Author:

No. Both of the files in dandi/tests/data/dandifiles are under a megabyte.

Member:

Let's make a fixture producing a sizeable file for local upload/download testing, as I outlined in the review summary?

Base automatically changed from enh-download-api to master August 12, 2020 20:08

jwodder commented Aug 12, 2020

@yarikoptic Is there any particular attribute/structure of an NWB file that should be "filled up"? Is there a way to just insert random bytes?

@yarikoptic (Member):

At first I thought to ask NWB experts what would be the best .nwb file for testing we could produce... but since it is a file we are to generate dynamically anyway, and we just need "size", let's just follow one of the tutorials at https://pynwb.readthedocs.io/en/latest/tutorials/index.html and indeed fill it up with some random junk (or 0s, which would be OK but might be suboptimal for testing download retries ;)).

So, for now, just follow e.g. this tutorial on extracellular recordings (which would also be useful for getting a better understanding of the structures within .nwb, helpful for later work with metadata etc.): https://pynwb.readthedocs.io/en/latest/tutorials/domain/ecephys.html#sphx-glr-tutorials-domain-ecephys-py -- it already has code to populate a "demo" NWB file with random data.
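In that tutorial's spirit, a hedged sketch of generating a junk-filled NWB file of roughly a chosen size with pynwb (the function name and target size are arbitrary; pynwb does not compress by default, so random float64 data should keep the file near its nominal size):

    from datetime import datetime

    import numpy as np
    from dateutil.tz import tzlocal
    from pynwb import NWBHDF5IO, NWBFile, TimeSeries

    def make_large_nwb(path, size_mb=50):
        nwbfile = NWBFile(
            session_description="large fake file for download tests",
            identifier="large-test",
            session_start_time=datetime.now(tzlocal()),
        )
        # float64 samples are 8 bytes each, so this lands near size_mb
        n = size_mb * 1024 * 1024 // 8
        nwbfile.add_acquisition(
            TimeSeries(name="noise", data=np.random.random(n), unit="n/a", rate=1000.0)
        )
        with NWBHDF5IO(str(path), "w") as io:
            io.write(nwbfile)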

@yarikoptic yarikoptic (Member) left a comment

Some fresh(er) feedback


dandi/download.py (outdated, resolved thread)
dandi/download.py (outdated, resolved thread)
@jwodder jwodder force-pushed the gh-182 branch 2 times, most recently from 177c709 to a1c371a Compare August 19, 2020 13:05

jwodder commented Aug 25, 2020

@yarikoptic Using the new redirector image after merging dandi/redirector#25, the URL http://localhost:8079/dandiset/{dandi_id}/draft successfully redirects to http://localhost:8086/#/dandiset/000004/draft, but then that causes the tests to fail on this line with 'NoneType' object has no attribute 'endswith'. This appears to be due to the local-docker-tests instance in known_instances having a None value for its api attribute.

@yarikoptic (Member):

but then that causes the tests to fail on this line with 'NoneType' object has no attribute 'endswith'

I think it should be safe to adjust it to if server and not server.endswith("/"):


jwodder commented Aug 25, 2020

@yarikoptic Now the tests fail on this line with 'NoneType' object has no attribute 'rstrip'.

@yarikoptic (Member):

I do not see any "proper" handling there, but would do something like:

    # TODO: workaround - if no server_url, assume "local-docker-tests"
    server_url = known_instances[
        known_instances_rev[server_url.rstrip("/")] if server_url else "local-docker-tests"
    ].girder


jwodder commented Aug 25, 2020

Current status: test_resume_download is hanging for upwards of ten minutes. Inserting logging & print statements shows that the hang occurs when the original GirderCli.get_download_file_iter(file_id, chunk_size) is called for the second time (specifically, at the call to self.sendRestRequest() inside the method). I am unable to reproduce this in a minimal test script.

@yarikoptic (Member):

Great! I will try to dig into it as well.

@yarikoptic (Member):

BTW, note that although we found that we could pass a Range header to girder, it does not necessarily support it internally... in my case the testing eventually fails (it was indeed stuck somewhere; I didn't dig into it yet) with:

dandi/tests/test_download.py:269: in test_resume_download
    download(large_dandiset, tmp_path)
dandi/download.py:308: in download
    for rec in gen_:
dandi/download.py:452: in download_generator
    for resp in _download_generator:
dandi/download.py:725: in _download_file
    for block in downloader(start_at=downloaded):
dandi/tests/test_download.py:258: in new_downloader
    assert start_at == expected_start
E   assert 0 == 25165824
E     -0
E     +25165824

Maybe we are doomed to implement this feature only for the "dandiapi" portion (not worth spending time on girder ATM, IMHO) and thus to wait until we are able to talk to django?


jwodder commented Aug 26, 2020

@yarikoptic Actually, when I was trying to create an MVCE for the hanging request, I found that Girder didn't seem to honor the "Range" header (despite Girder's response headers indicating that it should), but it did honor an "offset" query parameter mentioned in the interactive documentation, so I changed the GirderCli method to use that in eb7e0d5. The change didn't stop the hanging, though.
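For reference, a hedged sketch of what that GirderCli change might look like (built on girder_client's sendRestRequest; the actual code in eb7e0d5 may differ):

    def get_download_file_iter(self, file_id, chunk_size):
        def downloader(start_at=0):
            # Girder ignored our "Range" header in practice, but honors
            # an "offset" query parameter on the download endpoint.
            params = {"offset": start_at} if start_at else None
            resp = self.sendRestRequest(
                "get",
                f"file/{file_id}/download",
                parameters=params,
                stream=True,
                jsonResp=False,
            )
            for chunk in resp.iter_content(chunk_size=chunk_size):
                yield chunk

        return downloader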

@yarikoptic (Member):

I have just merged #232, so I guess we can progress on this one with a "large file upload/release" fixture and implement the integration testing.


jwodder commented Sep 8, 2020

@yarikoptic How exactly do I do a "release" or otherwise take advantage of the "publish" container? There's no command for that in dandi-cli at the moment.

@yarikoptic (Member):

Just go to the /swagger interface of the running instance; there should be a /publish endpoint -- something like /dandisets/{dandiset_id}/versions/{version}/publish (or the one with {version} among the options to the call; I don't remember). Then add an interface for it to the dandiapi.DandiAPIClient class.
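A hedged sketch of what such a client method could look like, assuming DandiAPIClient has a generic post helper that prefixes the API base URL (the endpoint path and default are guesses pending the swagger questions below):

    def publish_version(self, dandiset_id, version="draft"):
        # Hypothetical method on dandiapi.DandiAPIClient wrapping the
        # dandi-publish endpoint guessed in the comment above.
        return self.post(f"/dandisets/{dandiset_id}/versions/{version}/publish/")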


jwodder commented Sep 8, 2020

Assuming that by /swagger you mean the /api/v1 page of Girder, I see no such endpoint, not even when logged in as admin to an instance connected to dandi-publish using the latest versions of the Docker images.

EDIT: Found it on the dandi-publish container.


jwodder commented Sep 8, 2020

Questions about the endpoint:

  • Is dandiset__pk supposed to be the dandiset identifier, a Girder ID, or something else?
  • What exactly is the "name" parameter in the request body for?
  • What exactly is the "dandiset" parameter in the request body for, seeing as the dandiset is already specified in the request path?
  • Should the "version" parameter in the request body be set when making the request? Should the Python method require a "version" argument?

@yarikoptic (Member):

EDIT: Found it on the dandi-publish container.

just in case: http://publish.dandiarchive.org/swagger/

Is dandiset__pk supposed to be the dandiset identifier, a Girder ID, or something else?

Our dandiset identifier, like anything in the API (which should have no IDs from girder) -- e.g., 000001.

What exactly is the "name" parameter in the request body for?

Worth clearing this up with the @dandi/dandiarchive developers (@brianhelba?). It may be an annotation for the release (the name of the release).

What exactly is the "dandiset" parameter in the request body for, seeing as the dandiset is already specified in the request path?

Yeah, that is an odd one; not sure why it is mandatory, if asked for at all!? Maybe it tolerates an empty {}?

Should the "version" parameter in the request body be set when making the request? Should the Python method require a "version" argument?

It seems to have a default value assigned, so it should be a keyword argument. In tests we might want to set it explicitly, but it should match the pattern.


jwodder commented Sep 8, 2020

@yarikoptic It seems that the best way to perform a "publish" action at the moment is via the POST /dandi/{dandiset_id} endpoint in Girder, which makes the appropriate calls to dandi-publish. However, when I call this endpoint in the test fixture, it fails with:

girder_client.HttpError: HTTP error 400: POST http://localhost:8081/api/v1/dandi/000004
Response text: {"message": "Failed to sync 000004", "type": "rest"}

The logs for the Girder container don't contain anything informative, and the Django container logs contain:

[19:58:26] ERROR    Invalid HTTP_HOST header: 'django:8000'. You may need to add 'django' to ALLOWED_HOSTS.    exception.py:75
           WARNING  Bad Request: /api/dandisets/sync/                                                          log.py:222
[19:58:26] WARNING  "POST /api/dandisets/sync/?folder-id=5f57e25b71c98353d53b4783 HTTP/1.1" 400 75749          basehttp.py:157

I have filed an issue with dandi-publish to support setting ALLOWED_HOSTS via an environment variable: dandi/dandi-publish#148
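The change requested there would presumably amount to something like the following in dandi-publish's Django settings (a sketch assuming a plain settings module and a DJANGO_ALLOWED_HOSTS environment variable; the actual fix may differ):

    import os

    # Read extra allowed hosts (e.g. "django") from the environment
    ALLOWED_HOSTS = os.environ.get("DJANGO_ALLOWED_HOSTS", "localhost").split(",")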

@yarikoptic (Member):

It seems that the best way to perform a "publish" action at the moment is via the POST /dandi/{dandiset_id} endpoint in Girder, which makes the appropriate calls to dandi-publish.

IMHO we should avoid the detour via girder and (maybe picking up the idea from girder's call) provide an adequate record to the DANDI API's /publish/ call.


jwodder commented Sep 9, 2020

@yarikoptic Why should Girder be avoided? Based on conversations in Slack, reimplementing the Girder endpoint would involve two calls to dandi-publish, one or more calls to Girder to get a folder ID, and duplication of the Django credentials given to Girder.

@yarikoptic (Member):

OK, to get us going -- since the DANDI /publish/ API does not yet seem to be usable independently (without interactions with girder), although it could/should have been, let's invoke the girder endpoint for now.
But my thinking is that in the longer run we should exercise DANDI API endpoints and avoid girder endpoints, since the client should aim to talk to the DANDI API, not Girder. To that end I filed dandi/dandi-publish#149, and we will just refactor our code to talk to /publish/ directly after that issue is resolved.

@yarikoptic (Member):

Probably do not spend time messing with girder ATM. @satra is yet to announce it, but next week we will push strongly to get the unified API going, so all efforts will be spent on that; all "girder" will be hidden behind the API, and we will be able to just talk to the API to accomplish what we need.

@yarikoptic yarikoptic added the on-hold-waiting-for-API-service TEMP label for PRs to not worry about until new API service+backend appear label Sep 14, 2020
@yarikoptic (Member):

#247 AFAIK provided an alternative implementation.

@yarikoptic yarikoptic closed this Oct 6, 2020
Labels
on-hold-waiting-for-API-service TEMP label for PRs to not worry about until new API service+backend appear
Development

Successfully merging this pull request may close these issues.

download: add "resume" functionality