Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get_content_url(follow_redirects=True) should stop when reaching a "folder" #980

Closed
yarikoptic opened this issue Apr 21, 2022 · 11 comments · Fixed by #1107
Closed

get_content_url(follow_redirects=True) should stop when reaching a "folder" #980

yarikoptic opened this issue Apr 21, 2022 · 11 comments · Fixed by #1107

Comments

@yarikoptic
Copy link
Member

Discovered by @satra .
ATM for a zarr it would try to follow through with the HEAD request for the "folder" and that pukes. I think it should be ok to catch that exception and if URL already ends with / then just stop at that URL. Or may be @jwodder you see a better/more standardized way?

A sample case:

In [17]: import dandi.dandiapi as da

In [18]: dandi = da.DandiAPIClient.for_dandi_instance("dandi")

In [19]: ds = dandi.get_dandiset('000108')

In [20]: asset = ds.get_asset_by_path('sub-MITU01/ses-20210521h17m17s06/micr/sub-MITU01_ses-20210521h17m17s06_sample-178_stain-LEC_run-1_chunk-1_SPIM.ngff')

In [21]: asset.get_content_url('s3')
Out[21]: 'https://dandiarchive.s3.amazonaws.com/zarr/56509720-870c-4f43-ae41-7b75f9590722/'

In [22]: asset.get_content_url('s3', follow_redirects=True)
Error 404 while sending HEAD request to https://dandiarchive.s3.amazonaws.com/zarr/56509720-870c-4f43-ae41-7b75f9590722/: 
---------------------------------------------------------------------------
HTTP404Error                              Traceback (most recent call last)
<ipython-input-22-7673eb38d60d> in <module>
----> 1 asset.get_content_url('s3', follow_redirects=True)

~/proj/dandi/dandi-cli-master/dandi/dandiapi.py in get_content_url(self, regex, follow_redirects, strip_query)
   1251             )
   1252         if follow_redirects is True:
-> 1253             url = self.client.request(
   1254                 "HEAD", url, json_resp=False, allow_redirects=True
   1255             ).url

~/proj/dandi/dandi-cli-master/dandi/dandiapi.py in request(self, method, path, params, data, files, json, headers, json_resp, retry_statuses, **kwargs)
    241                 lgr.error("%s: %s", msg, result.text)
    242             if result.status_code == 404:
--> 243                 raise HTTP404Error(msg, response=result)
    244             else:
    245                 raise requests.HTTPError(msg, response=result)

HTTP404Error: Error 404 while sending HEAD request to https://dandiarchive.s3.amazonaws.com/zarr/56509720-870c-4f43-ae41-7b75f9590722/

In [23]: asset.get_content_url()
Out[23]: 'https://api.dandiarchive.org/api/assets/1838aeed-c773-400e-84b1-be0b8fa63dd9/download/'

In [24]: asset.get_content_url(follow_redirects=True)
Error 404 while sending HEAD request to https://api.dandiarchive.org/api/assets/1838aeed-c773-400e-84b1-be0b8fa63dd9/download/: 
---------------------------------------------------------------------------
HTTP404Error                              Traceback (most recent call last)
<ipython-input-24-b6f368972ab4> in <module>
----> 1 asset.get_content_url(follow_redirects=True)

~/proj/dandi/dandi-cli-master/dandi/dandiapi.py in get_content_url(self, regex, follow_redirects, strip_query)
   1251             )
   1252         if follow_redirects is True:
-> 1253             url = self.client.request(
   1254                 "HEAD", url, json_resp=False, allow_redirects=True
   1255             ).url

~/proj/dandi/dandi-cli-master/dandi/dandiapi.py in request(self, method, path, params, data, files, json, headers, json_resp, retry_statuses, **kwargs)
    241                 lgr.error("%s: %s", msg, result.text)
    242             if result.status_code == 404:
--> 243                 raise HTTP404Error(msg, response=result)
    244             else:
    245                 raise requests.HTTPError(msg, response=result)

HTTP404Error: Error 404 while sending HEAD request to https://api.dandiarchive.org/api/assets/1838aeed-c773-400e-84b1-be0b8fa63dd9/download/

the question is either we should not bother redirecting for any / or only for the urls from the s3.amazonaws.com subdomains since we know that there is no real "folder/" on S3 (although "folder" key could exist)... for now I think we could stop following for any / url

@jwodder
Copy link
Member

jwodder commented Apr 22, 2022

@yarikoptic I don't think it makes much sense to call get_content_url() on a Zarr in the first place. The purpose of the method is to get a URL for downloading the asset, but there is no such URL for a Zarr.

@yarikoptic
Copy link
Member Author

Well, I guess it is just a matter of generic programming to be able to request target content URL regardless of its type

@jwodder
Copy link
Member

jwodder commented Apr 22, 2022

@yarikoptic I'm not clear on what you're saying. Are you still asking for the change described in the original comment or something else?

@yarikoptic
Copy link
Member Author

yarikoptic commented Apr 22, 2022

Yes, I think it would be a good idea for function to not crash - let's just not try following redirects on s3 folders

@jwodder
Copy link
Member

jwodder commented Apr 22, 2022

@yarikoptic So what exactly do you want the new behavior to be? Should it be that, if an HTTP error response is returned for a URL that ends in a slash, the URL is returned instead of erroring?

@jwodder
Copy link
Member

jwodder commented May 3, 2022

@yarikoptic Ping; please respond to last comment.

@yarikoptic
Copy link
Member Author

I think "the URL is returned instead of erroring" would provide us the desired effect here.

@jwodder
Copy link
Member

jwodder commented May 5, 2022

@yarikoptic Note that the code currently just takes a URL from the asset metadata and lets requests follow all the redirects. If this URL does not end in a slash but one of the URLs it is redirected to does (or vice versa), and the final result is an HTTP error, what should happen?

@jwodder
Copy link
Member

jwodder commented Jul 27, 2022

@yarikoptic I still need an answer to my last comment.

@yarikoptic
Copy link
Member Author

just return the last URL for which HEAD request failed. In effect "we follow redirects until we cannot"

yarikoptic added a commit that referenced this issue Aug 24, 2022
get_content_url(): If a HEAD fails, return the failing URL
@github-actions
Copy link

github-actions bot commented Sep 1, 2022

🚀 Issue was released in 0.46.2 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants