Add "resume" functionality to download #189
Conversation
Codecov Report
@@ Coverage Diff @@
## master #189 +/- ##
===========================================
- Coverage 81.77% 70.91% -10.86%
===========================================
Files 50 50
Lines 4186 4264 +78
===========================================
- Hits 3423 3024 -399
- Misses 763 1240 +477
Looks awesome!!! I left some "concerns" (which might not be valid) and minor recommendations.
Ideally we should also somehow test on URLs from the "publish" API with interruption (also somewhere in the middle of the file) so we know that restarting from the middle works correctly.
We better not commit such big "test" files into git directly here. Among fixtures we already have some "fake data file" fixtures such as simple2_nwb. We just need to add another one which would fill up some data structure within it to make it large. Then another session-wide fixture could upload it to our testing deployment so it could be reused in various tests. I wonder if maybe that fixture could even use (and thus test) the publish API -- so we "upload" to drafts, "publish" a version of that dataset, and thus get URLs to both the draft and the published version... it might, though, require the "redirector" two-girder-URLs PR to be merged to get that one running.
if downloaded:
    blob1 = writer.read(len(block))
    if blob1 == block:
        downloaded += len(block)
Since it succeeded and we went back two chunks, either we would need to re-feed a new digester here with the previously downloaded data, or maybe just avoid updating the digester below for the re-downloaded block? Without that, it would end up with an incorrect digest, I think. Or am I wrong?
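A minimal sketch (my own, not the PR's code; hashlib and a 64 KiB chunk size are assumptions) of the re-feeding approach: before hashing any newly downloaded blocks, replay the bytes already on disk through a fresh digester.

import hashlib

def resumed_digester(path, algorithm="sha256", chunk_size=65536):
    # Replay the partially downloaded file through a fresh digester so
    # that blocks fetched after the resume point keep the digest correct.
    digester = hashlib.new(algorithm)
    resumed_at = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            digester.update(block)
            resumed_at += len(block)
    # Blocks downloaded from offset `resumed_at` onward can now be fed
    # into `digester` as usual.
    return digester, resumed_at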
Commit ba8767f should address this.
thanks ... Going through the code, I think that should work indeed! But the logic is a bit tricky (at least to me). So I think we just need a good number of tests on some file longer than 3 chunks: 3 runs with the download interrupted on the 1st chunk, then on the 2nd, then on the 3rd. And then some test that fails on multiple chunks (so maybe, if the file were 5 chunks, fail on the 3rd and the 5th). Just to see that all aspects of restart work out nicely.
Do you have any suggestions on how to simulate download interruptions? The best thing I've been able to find for this is https://github.com/Shopify/toxiproxy, but it doesn't handle TLS certificates intelligently, and in any case I've been unable to actually connect to its proxy in a minimal setup.
Isn't our docker instance run over straight HTTP without any encryption? I would have just tried to mock/shim the requests streaming "thing" so it aborts with some error upon reaching the desired chunk.
Hm... What then happens if, e.g., the connection is lost? If it's some other exception, let's raise that one?
As I said above, when I tried inducing a lost connection with a test script (by turning off my WiFi), there was no error. However, an SO post from six years ago says that such an event should raise a socket.timeout.
Even if we do determine the proper error to raise, there's still the matter of what to mock, and how, so that the error gets raised at the right point.
hm... I have also failed so far (by disabling/enabling WiFi) to trigger requests to give up and raise some exception (but I didn't wait for too long, I guess). I had not realized that requests tries so hard to maintain that stream. But I doubt it would go back and resume if the connection does get interrupted somehow (e.g., jumping from one network to another?).
As for what to mock, I would have mocked get_download_file_iter, which we have in both girder and dandiapi. The mock/shim would obtain the original generator function but, instead of returning that one, would return a generator which yields from the original one until it hits the chunk it wants to raise on (the implementation of get_download_file_iter in dandicli is easier to grasp -- there is an explicit loop there). For now I would raise the socket.timeout the SO post talks about.
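A minimal sketch of that shim (the function name is taken from the discussion; everything else is assumed, not verified against dandi-cli):

import socket

def make_failing_get_download_file_iter(orig_get_download_file_iter, fail_on_chunk):
    # Shim: behaves like the original generator function, but the returned
    # generator raises socket.timeout once it reaches the chosen chunk.
    def get_download_file_iter(*args, **kwargs):
        orig_gen = orig_get_download_file_iter(*args, **kwargs)
        def gen():
            for i, chunk in enumerate(orig_gen):
                if i == fail_on_chunk:
                    raise socket.timeout("simulated connection loss")
                yield chunk
        return gen()
    return get_download_file_iter

With pytest's monkeypatch this wrapper could replace the client's method, and fail_on_chunk could be parametrized over the 1st/2nd/3rd (and multiple) chunks as suggested above.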
Related to all this is #198 where also "resume" would be used.
I still can't figure out what URL to pass to download(). The mock no longer interferes with http://localhost:8079/dandiset/{dandi_id}/draft, but now the request to that URL is returning 500, presumably because of the GIRDER_URL issue discovered in #186, which we didn't completely address because editing serve.py wasn't necessary at the time.
In principle you could test _download_file directly, but I guess it would need too much preparation; alternatively, you could pass a shimmed downloader without a mock...
And indeed, dandi/redirector#25 was never merged, so the redirector within docker-compose can't talk to girder, and such queries to the local redirector wouldn't work.
But if it is a released version, then URLs like https://github.com/dandi/dandi-cli/blob/master/dandi/download.py#L63 should work (just point to localhost), although I am not sure if mapping to the API server via server-info would work out.
I will also try to dig deeper later when near a laptop.
raise gcl.HttpError(
    text="Failing to download", url=url, method="GET", status=500
)
return orig_sendRestRequest(self, *args, **kwargs)
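For context, a sketch of how that fragment might sit inside a monkeypatched sendRestRequest that fails the first request and then defers to the original (only the raise/return lines come from the PR; the wrapper and failure counting are assumptions):

import girder_client as gcl

orig_sendRestRequest = gcl.GirderClient.sendRestRequest
calls = {"count": 0}

def flaky_sendRestRequest(self, *args, **kwargs):
    # Fail only the first request so the download logic has to retry.
    calls["count"] += 1
    if calls["count"] == 1:
        raise gcl.HttpError(
            text="Failing to download", url="<url>", method="GET", status=500
        )
    return orig_sendRestRequest(self, *args, **kwargs)

gcl.GirderClient.sendRestRequest = flaky_sendRestRequest  # or monkeypatch.setattr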
Are any of our test files big enough that we could trigger a redownload somewhere past 2 * chunk size?
No. Both of the files in dandi/tests/data/dandifiles are under a megabyte.
Let's make a fixture producing a sizeable file for local upload/download testing, as I outlined in the review summary?
@yarikoptic Is there any particular attribute/structure of an NWB file that should be "filled up"? Is there a way to just insert random bytes?
At first I thought to ask NWB experts what would be the best .nwb file for testing we could produce... but since it is a file we are going to generate dynamically anyway and we just need "size", let's just follow one of the tutorials at https://pynwb.readthedocs.io/en/latest/tutorials/index.html and indeed fill it up with some random junk (or 0s, which would be OK but might be suboptimal for testing download retries ;)). So, for now, just follow e.g. this tutorial on extracellular recordings (which would also be useful for getting a better understanding of the structures within .nwb, handy for later work with metadata etc.): https://pynwb.readthedocs.io/en/latest/tutorials/domain/ecephys.html#sphx-glr-tutorials-domain-ecephys-py, which already has code to populate that "demo" NWB with random data.
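A minimal sketch of such a fixture (fixture name, target size, and structure are assumptions; it uses a plain random TimeSeries rather than following the ecephys tutorial exactly):

from datetime import datetime

import numpy as np
import pytest
from dateutil.tz import tzlocal
from pynwb import NWBHDF5IO, NWBFile, TimeSeries

@pytest.fixture(scope="session")
def large_nwb_file(tmp_path_factory):
    path = str(tmp_path_factory.mktemp("large_nwb") / "large.nwb")
    nwbfile = NWBFile(
        session_description="large file for download-resume tests",
        identifier="large-test",
        session_start_time=datetime.now(tzlocal()),
    )
    # ~80 MB of random float64 samples: enough to span several download
    # chunks, and random (rather than zeros) so resume bugs are detectable.
    nwbfile.add_acquisition(
        TimeSeries(name="noise", data=np.random.rand(10_000_000),
                   unit="V", rate=30000.0)
    )
    with NWBHDF5IO(path, "w") as io:
        io.write(nwbfile)
    return path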
Some fresh(er) feedback
(force-pushed from 177c709 to a1c371a)
@yarikoptic Using the new redirector image after merging dandi/redirector#25, the URL ...
I think it should be safe to adjust it to ...
@yarikoptic Now the tests fail on this line with ...
I do not see any "proper" handling there, but would do something like:

# TODO: workaround - if no server_url, assume "local-docker-tests"
server_url = known_instances[
    known_instances_rev[server_url.rstrip("/")] if server_url else "local-docker-tests"
].girder
Current status: ...
Great! I will try to dig into it as well.
BTW, note that although we found that we could pass Range within the header to girder, it does not necessarily support it internally... in my case the testing eventually fails (indeed it was stuck somewhere, I didn't dig yet) with ...
Maybe we are doomed to implement this feature only for the "dandiapi" portion (not worth spending time ATM on girder, IMHO) and thus wait until we are able to talk to django?
@yarikoptic Actually, when I was trying to create an MVCE for the hanging request, I found that Girder didn't seem to honor the "Range" header (despite Girder's response headers indicating that it should), but it did honor an "offset" query parameter mentioned in the interactive documentation, so I changed the GirderCli method to use that in eb7e0d5. The change didn't stop the hanging, though.
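For reference, a minimal sketch (URL and byte offset are placeholders) contrasting the two resume styles discussed here, the standard HTTP Range header versus Girder's "offset" query parameter:

import requests

url = "http://localhost:8080/api/v1/file/<file_id>/download"  # placeholder
downloaded = 65536  # bytes already on disk

# Standard HTTP resume: ask the server to start at byte `downloaded`.
r1 = requests.get(url, headers={"Range": f"bytes={downloaded}-"}, stream=True)

# Girder-style resume via the "offset" query parameter.
r2 = requests.get(url, params={"offset": downloaded}, stream=True)

# 206 means the server honored the Range request; 200 means it started over.
print(r1.status_code, r2.status_code)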
I have just merged #232, so I guess we could progress on this one with a "large file upload/release" fixture and by implementing integration testing.
Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
@yarikoptic How exactly do I do a "release" or otherwise take advantage of the "publish" container? There's no command for that in dandi-cli at the moment.
Just go to ...
Assuming that by ... EDIT: Found it on the dandi-publish container.
Questions about the endpoint: ...
just in case: http://publish.dandiarchive.org/swagger/
our dandiset identifier, like anything in the API (which should have no IDs from girder), like ...
worth clearing it up with @dandi/dandiarchive developers (@brianhelba?). Maybe an annotation for the release (name of the release)
yeah, that is an odd one; not sure why it is mandatory, if asked for at all!? Maybe it tolerates an empty ...
It seems to have a default value assigned, so it should be a keyword arg. In tests we might want to set it explicitly, but it should match the pattern.
@yarikoptic It seems that the best way to perform a "publish" action at the moment is via the ...
The logs for the Girder container don't contain anything informative, and the Django container logs contain: ...
I have filed an issue with dandi-publish to support setting ...
IMHO we should avoid the detour via girder and (maybe picking up the idea from girder's call) provide an adequate record to DANDI API's ...
@yarikoptic Why should Girder be avoided? Based on conversations in Slack, reimplementing the Girder endpoint would involve two calls to dandi-publish, one or more calls to Girder to get a folder ID, and duplication of the Django credentials given to Girder.
ok, to get us going - since DANDI ...
probably do not spend time messing with girder ATM. @satra is yet to announce that next week we will push strongly to just get the unified API going, so all efforts will be spent on that; all "girder" will be hidden behind the API, and we would then be able to just talk to the API to accomplish what we need.
#247 AFAIK provided an alternative implementation.
Note that this PR merges into the #134 branch, not into master. (no longer needed)
Closes #182