Unable to download dataset from command line #4

Open
adamjstewart opened this issue Jun 10, 2021 · 9 comments

@adamjstewart

adamjstewart commented Jun 10, 2021

Hi, I'm working on a torchvision-style dataset that automatically downloads and checksums SEN12MS. I see that the dataset is hosted on https://dataserv.ub.tum.de/s/m1474000. However, when I try to download one of the files, I get an error message:

$ wget 'https://dataserv.ub.tum.de/s/m1474000/download?files=ROIs1158_spring_lc.tar.gz'
--2021-06-10 21:01:24--  https://dataserv.ub.tum.de/s/m1474000/download?files=ROIs1158_spring_lc.tar.gz
Resolving dataserv.ub.tum.de (dataserv.ub.tum.de)... 138.246.224.34, 2001:4ca0:800::8af6:e022
Connecting to dataserv.ub.tum.de (dataserv.ub.tum.de)|138.246.224.34|:443... connected.
ERROR: cannot verify dataserv.ub.tum.de's certificate, issued by ‘CN=DFN-Verein Global Issuing CA,OU=DFN-PKI,O=Verein zur Foerderung eines Deutschen Forschungsnetzes e. V.,C=DE’:
  Unable to locally verify the issuer's authority.
To connect to dataserv.ub.tum.de insecurely, use `--no-check-certificate'.

Clicking on the download button allows me to download through the web browser, but I would like to be able to download from the command line. Is this possible (without disabling security certificate checks)?

@calebrob6

@adamjstewart
Author

Current workaround pointed out by @calebrob6:

$ wget "ftp://m1474000:m1474000@dataserv.ub.tum.de/ROIs1158_spring_lc.tar.gz"

@schmitt-muc
Owner

Sorry for the late reply! I would prefer rsync:
"The data server also offers downloads with rsync (password m1474000):
rsync rsync://m1474000@dataserv.ub.tum.de/m1474000/"

@adamjstewart
Author

Hi @schmitt-muc, when I run that command it doesn't download anything.

I'm trying to write a PyTorch data loader. Torchvision is able to automatically download and checksum datasets from a URL, but the FTP and rsync URLs don't work for this.
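For context, the checksum half of "download and checksum" can be sketched with the standard library alone. This is a minimal illustration, not torchvision's actual code; the helper names `file_md5` and `verify_md5` are hypothetical, and MD5 is assumed only because it is a common choice for dataset integrity checks.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large archives never sit fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected_md5):
    """Return True if the file exists and its MD5 matches the expected digest."""
    return Path(path).is_file() and file_md5(path) == expected_md5.lower()
```

A check like this works however the archive was fetched (HTTPS, FTP, or rsync), so only the download step depends on the server offering a plain URL.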

@schmitt-muc
Owner

I have just checked (running Ubuntu 20.04 LTS from inside Windows 10 Enterprise using WSL2):
Running the command
rsync -chavzP --stats rsync://m1474000@dataserv.ub.tum.de/m1474000/ path/to/your/local/storage/folder
works. Of course you first have to enter the password m1474000, and of course retrieving the incremental file list takes ages, but it should do the job.

@adamjstewart
Author

Yes, that seems to work, although I still can't download the data from Python without calling some system rsync executable. A normal URL would be much nicer for cases where users aren't using rsync.
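For reference, "calling some system rsync executable" from Python might look roughly like the sketch below. It reuses the URL, flags, and password quoted in this thread; the function names are hypothetical, and it assumes an `rsync` binary on the PATH. The password is passed through the `RSYNC_PASSWORD` environment variable, which rsync reads for daemon connections, so no interactive prompt is needed.

```python
import os
import shutil
import subprocess

RSYNC_URL = "rsync://m1474000@dataserv.ub.tum.de/m1474000/"  # quoted in the thread
RSYNC_PASSWORD = "m1474000"  # public password quoted in the thread


def build_rsync_command(dest, url=RSYNC_URL):
    """Build the argument list for the rsync call quoted in the thread."""
    return ["rsync", "-chavzP", "--stats", url, str(dest)]


def run_rsync(dest):
    """Run rsync non-interactively; raises if rsync is missing or fails."""
    if shutil.which("rsync") is None:
        raise FileNotFoundError("no rsync executable found on PATH")
    # rsync reads the daemon password from RSYNC_PASSWORD, avoiding the prompt.
    env = dict(os.environ, RSYNC_PASSWORD=RSYNC_PASSWORD)
    subprocess.run(build_rsync_command(dest), env=env, check=True)
```

This still depends on an external executable, which is exactly the limitation raised above; it only removes the manual password step.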

@schmitt-muc
Owner

Ah, now I understand. I suggest following Caleb Robinson's advice. At least for me wget -r "ftp://m1474000:m1474000@dataserv.ub.tum.de" does the job just fine and downloads the whole package automatically.

@adamjstewart
Author

Yes, that URL works with wget but not with Python's urllib for some reason. Is there a working https:// option?

@schmitt-muc
Owner

I have sent an inquiry to TUM's library, which hosts the data on their media server. The response won't make you too happy: there is definitely no https:// option, since even the .zip file you get when clicking the Download button in the graphical interface is only created on the fly by some internal Nextcloud function. The only suggestion I got was to look into the Python libraries ftplib, wget and urllib2, which can handle FTP downloads.
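As a rough illustration of the ftplib suggestion, a single file could be fetched along these lines. This is a sketch under assumptions, not something verified against the server here: it reuses the FTP URL form from the workaround earlier in the thread, and the helper names `split_ftp_url` and `download_ftp` are hypothetical.

```python
import ftplib
from urllib.parse import unquote, urlsplit


def split_ftp_url(url):
    """Split an ftp:// URL into (host, port, user, password, remote_path)."""
    parts = urlsplit(url)
    if parts.scheme != "ftp":
        raise ValueError(f"not an ftp:// URL: {url}")
    return (
        parts.hostname,
        parts.port or 21,  # default FTP control port
        unquote(parts.username or "anonymous"),
        unquote(parts.password or ""),
        unquote(parts.path.lstrip("/")),
    )


def download_ftp(url, dest, blocksize=1024 * 1024):
    """Download one file over FTP in binary mode and write it to dest."""
    host, port, user, password, remote_path = split_ftp_url(url)
    with ftplib.FTP() as ftp, open(dest, "wb") as f:
        ftp.connect(host, port)
        ftp.login(user, password)
        # RETR streams the remote file; each received block is written to dest.
        ftp.retrbinary(f"RETR {remote_path}", f.write, blocksize=blocksize)
```

With the URL from the workaround above, a call would look like `download_ftp("ftp://m1474000:m1474000@dataserv.ub.tum.de/ROIs1158_spring_lc.tar.gz", "ROIs1158_spring_lc.tar.gz")`.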

@schmitt-muc
Owner

schmitt-muc commented Jul 8, 2021

There also seems to be a mirrored version on Google Cloud Storage, see https://gitlab.com/frontierdevelopmentlab/disaster-prevention/sen12ms: gsutil -m rsync -r gs://fdl_floods_2019_data/SEN12MS.
Not sure whether this is of any help to you, though.
