
Unable to load CelebA dataset. File is not zip file error. #2262

Closed
ajayrfhp opened this issue May 25, 2020 · 25 comments · Fixed by #2321

Comments

@ajayrfhp

🐛 Bug

Unable to download and load celeba dataset into a loader.

To Reproduce

  1. Loading the CelebA dataset with download=True returns an error:
```python
batch_size = 25
train_loader = torch.utils.data.DataLoader(
    datasets.CelebA('../data', split="train", download=True,
                    transform=transforms.Compose([
                        transforms.ToTensor(),
                        transforms.Normalize((0.5,), (0.5,))
                    ])),
    batch_size=batch_size, shuffle=True)
```

Returns

```
/usr/local/lib/python3.6/dist-packages/torchvision/datasets/celeba.py in __init__(self, root, split, target_type, transform, target_transform, download)
     64 
     65         if download:
---> 66             self.download()
     67 
     68         if not self._check_integrity():

/usr/local/lib/python3.6/dist-packages/torchvision/datasets/celeba.py in download(self)
    118             download_file_from_google_drive(file_id, os.path.join(self.root, self.base_folder), filename, md5)
    119 
--> 120         with zipfile.ZipFile(os.path.join(self.root, self.base_folder, "img_align_celeba.zip"), "r") as f:
    121             f.extractall(os.path.join(self.root, self.base_folder))
    122 

/usr/lib/python3.6/zipfile.py in __init__(self, file, mode, compression, allowZip64)
   1129         try:
   1130             if mode == 'r':
-> 1131                 self._RealGetContents()
   1132             elif mode in ('w', 'x'):
   1133                 # set the modified flag so central directory gets written

/usr/lib/python3.6/zipfile.py in _RealGetContents(self)
   1196             raise BadZipFile("File is not a zip file")
   1197         if not endrec:
-> 1198             raise BadZipFile("File is not a zip file")
   1199         if self.debug > 1:
   1200             print(endrec)

BadZipFile: File is not a zip file
```

Environment

  • PyTorch version: 1.5.0+cu101

  • Is debug build: No

  • CUDA used to build PyTorch: 10.1

  • OS: Ubuntu 18.04.3 LTS

  • GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0

  • CMake version: version 3.12.0

  • Python version: 3.6

  • Is CUDA available: Yes
  • CUDA runtime version: 10.1.243
  • GPU models and configuration: GPU 0: Tesla T4
  • Nvidia driver version: 418.67
  • cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5

Versions of relevant libraries:

  • [pip3] numpy==1.18.4
  • [pip3] torch==1.5.0+cu101
  • [pip3] torchsummary==1.5.1
  • [pip3] torchtext==0.3.1
  • [pip3] torchvision==0.6.0+cu101
@ajayrfhp ajayrfhp changed the title Get file is not zip file when trying to download CelebA dataset Unable to load CelebA dataset. File is not zip file error. May 25, 2020
@ezyang ezyang transferred this issue from pytorch/pytorch May 26, 2020
@pmeier
Collaborator

pmeier commented May 27, 2020

This has nothing to do with the loader. We can get the same result with

```python
from torchvision import datasets

dataset = datasets.CelebA(".", split="train", download=True)
```

The underlying problem was reported in #1920: Google Drive has a daily download quota for any file, which seems to be exceeded for the CelebA files. You can see this in the response, which is mindlessly written to every .txt and also every .zip file.

```html
<!DOCTYPE html><html><head><title>Google Drive - Quota exceeded</title>[…]
<p class="uc-error-caption">Sorry, you can&#39;t view or download this file at this time.</p>
<p class="uc-error-subcaption">Too many users have viewed or downloaded this file recently. Please try accessing the file again later. If the file you are trying to access is particularly large or is shared with many people, it may take up to 24 hours to be able to view or download the file. If you still can't access a file after 24 hours, contact your domain administrator.</p>[…]
```

@ajayrfhp The only "solution" we can offer is to tell you to wait and try again, since we have no control over this. You can ask the author of the dataset to host it on a platform that does not have daily quotas. If you do and they go through with your proposal, please inform us so that we can adapt our code.

@fmassa We should check the contents of the response first before we write them to the files and raise a descriptive error message.

@fmassa
Member

fmassa commented May 29, 2020

@pmeier thanks for looking into this!

@fmassa We should check the contents of the response first before we write them to the files and raise a descriptive error message.

Is this something that could be done in the download_from_url function, or would it need to be done on a case-by-case basis?

@pmeier
Collaborator

pmeier commented May 29, 2020

CelebA uses download_file_from_google_drive and I would put the fix before L167:

```python
response = session.get(url, params={'id': file_id}, stream=True)
token = _get_confirm_token(response)

if token:
    params = {'id': file_id, 'confirm': token}
    response = session.get(url, params=params, stream=True)

_save_response_content(response, fpath)
```

Maybe it is as easy as checking response.status_code.

The problem I see is that we would need to wait for a day when the quota is exceeded and fix it right then. Furthermore, I have no idea how to test this.

@ajayrfhp
Author

I see. Thanks, I will download at a later point then.

@fmassa
Member

fmassa commented Jun 1, 2020

@pmeier your fix sounds good to me, but indeed, this might be difficult to test.

@pmeier
Collaborator

pmeier commented Jun 1, 2020

@fmassa I suggest we wait for another issue raising this problem; at least I won't check daily whether the quota is exceeded. If there is another issue for this and I miss it, or you somehow find a day when we can fix this, feel free to tag me in. I'll see what I can do.

@fmassa
Member

fmassa commented Jun 1, 2020

Sounds good, thanks a lot @pmeier !

@jotterbach

Seems this is a known issue, but I wanted to raise it again per @pmeier's comment. I didn't want to open another ticket on this, though.

@pmeier
Collaborator

pmeier commented Aug 7, 2020

@jotterbach This was fixed in e757d52 but didn't make it in the latest release.

@import-antigravity

This is still an issue FYI

@AndrewUlmer

I would just like to add that the authors also provide a Baidu drive on their website from which you can download the data. The dataset is also available on Kaggle.

@sayantanauddy

Can a Dataset class (like this) that downloads the data from Kaggle (using the Kaggle API) be a possible solution?

@FrancescoSaverioZuppichini

same

@rykovv

rykovv commented Sep 29, 2021

Ran into the same problem. In the original Google Drive shared folder the dataset files are placed in different subdirectories (Anno, Eval, Img), whereas celeba.py gives no indication of a download path. This probably causes the error.

@pmeier
Collaborator

pmeier commented Sep 30, 2021

This was fixed in #4109, but the commit is not yet included in a stable release. It will be in the upcoming one.

@marzmesas

This issue still persists. Is there a way to get the dataset and load it just as we would through torchvision.datasets?

@xyjixyjixyji

Problem still exists. (Jun 14)

@gabriben

gabriben commented Jun 20, 2022

The Kaggle alternative worked for me.

@univanxx

univanxx commented Sep 2, 2022

Hello everyone! Based on this discussion, these steps can help (they worked perfectly for me):

  1. Create a directory named celeba and download into it all the files from the CelebA Google Drive mentioned in this file_list
  2. Unzip img_align_celeba.zip inside the ./celeba directory (I'm not sure whether you should delete the zip file after unpacking)
  3. Run the code, making sure to pass download=False:

```python
import torchvision.datasets as dset

img_path = './celeba'
data = dset.CelebA(root=img_path, split="train", target_type='attr', transform=None, download=False)
```

This tutorial worked for me!
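As a sanity check before step 3, one can verify that the layout matches what torchvision expects. A sketch; the file names below are taken from the file_list in torchvision's celeba.py and may differ between versions, so check your installed copy:

```python
import os

# Files torchvision's CelebA class looks for under <root>/celeba
# (names as in celeba.py's file_list; verify against your version).
REQUIRED = [
    "img_align_celeba",  # directory extracted from img_align_celeba.zip
    "list_attr_celeba.txt",
    "identity_CelebA.txt",
    "list_bbox_celeba.txt",
    "list_landmarks_align_celeba.txt",
    "list_eval_partition.txt",
]

def check_celeba_layout(root):
    """Return the names missing under <root>/celeba (empty list = ready)."""
    base = os.path.join(root, "celeba")
    return [name for name in REQUIRED if not os.path.exists(os.path.join(base, name))]
```

Note that torchvision joins root with a celeba subfolder, so root should be the parent of the celeba directory.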

@viomirea

I had the same issue; I had problems with the installation.
I also tried manually downloading the zip from the link used in Python, https://s3-us-west-1.amazonaws.com/udacity-dlnfd/datasets/celeba.zip . I tried downloading it from the web browser and needed to retry a few times, as it kept failing. After downloading the zip from the browser I copied it to ./celeba/

@jS5t3r

jS5t3r commented May 5, 2023

The CelebA loader cannot read CelebA in HQ, right?

@viomirea

viomirea commented May 10, 2023

In my case it seems that there was a problem when using Wi-Fi. After I connected using the LAN I had no timeouts anymore.

> I had the same issue. I had problems and the installation. I tried also manually download the zip from the link used in python https://s3-us-west-1.amazonaws.com/udacity-dlnfd/datasets/celeba.zip . I tried downloading it from the web browser and I needed to retry a few times at it was still failing. After downloading the zip from the browser I copied it to ./celeba/

@ldr7

ldr7 commented Jul 9, 2023

Problem still exists

@ozturkoktay

The problem still exists.

@giulio98

Hello, I have uploaded CelebA to 🤗 Datasets:
eurecom-ds/celeba
