
Unable to load CelebA dataset. File is not zip file error. #2262

Closed
ajayrfhp opened this issue May 25, 2020 · 25 comments · Fixed by #2321

Comments

@ajayrfhp

🐛 Bug

Unable to download and load celeba dataset into a loader.

To Reproduce

  1. Loading the CelebA dataset with download=True returns an error:
```python
batch_size = 25
train_loader = torch.utils.data.DataLoader(
    datasets.CelebA('../data', split="train", download=True,
                    transform=transforms.Compose([
                        transforms.ToTensor(),
                        transforms.Normalize((0.5,), (0.5,))
                    ])),
    batch_size=batch_size, shuffle=True)
```

Returns

```
/usr/local/lib/python3.6/dist-packages/torchvision/datasets/celeba.py in __init__(self, root, split, target_type, transform, target_transform, download)
     64 
     65         if download:
---> 66             self.download()
     67 
     68         if not self._check_integrity():

/usr/local/lib/python3.6/dist-packages/torchvision/datasets/celeba.py in download(self)
    118             download_file_from_google_drive(file_id, os.path.join(self.root, self.base_folder), filename, md5)
    119 
--> 120         with zipfile.ZipFile(os.path.join(self.root, self.base_folder, "img_align_celeba.zip"), "r") as f:
    121             f.extractall(os.path.join(self.root, self.base_folder))
    122 

/usr/lib/python3.6/zipfile.py in __init__(self, file, mode, compression, allowZip64)
   1129         try:
   1130             if mode == 'r':
-> 1131                 self._RealGetContents()
   1132             elif mode in ('w', 'x'):
   1133                 # set the modified flag so central directory gets written

/usr/lib/python3.6/zipfile.py in _RealGetContents(self)
   1196             raise BadZipFile("File is not a zip file")
   1197         if not endrec:
-> 1198             raise BadZipFile("File is not a zip file")
   1199         if self.debug > 1:
   1200             print(endrec)

BadZipFile: File is not a zip file
```

Environment

  • PyTorch version: 1.5.0+cu101

  • Is debug build: No

  • CUDA used to build PyTorch: 10.1

  • OS: Ubuntu 18.04.3 LTS

  • GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0

  • CMake version: version 3.12.0

  • Python version: 3.6

  • Is CUDA available: Yes
  • CUDA runtime version: 10.1.243
  • GPU models and configuration: GPU 0: Tesla T4
  • Nvidia driver version: 418.67
  • cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5

Versions of relevant libraries:

  • [pip3] numpy==1.18.4
  • [pip3] torch==1.5.0+cu101
  • [pip3] torchsummary==1.5.1
  • [pip3] torchtext==0.3.1
  • [pip3] torchvision==0.6.0+cu101
@ajayrfhp ajayrfhp changed the title Get file is not zip file when trying to download CelebA dataset Unable to load CelebA dataset. File is not zip file error. May 25, 2020
@ezyang ezyang transferred this issue from pytorch/pytorch May 26, 2020
@pmeier
Collaborator

pmeier commented May 27, 2020

This has nothing to do with the loader. We can get the same result with

```python
from torchvision import datasets

dataset = datasets.CelebA(".", split="train", download=True)
```

The underlying problem was reported in #1920: Google Drive has a daily download quota for any file, which seems to be exceeded for the CelebA files. You can see this in the response, which is mindlessly written to every .txt and also every .zip file.

```html
<!DOCTYPE html><html><head><title>Google Drive - Quota exceeded</title>[…]
<p class="uc-error-caption">Sorry, you can&#39;t view or download this file at this time.</p>
<p class="uc-error-subcaption">Too many users have viewed or downloaded this file recently. Please try accessing the file again later. If the file you are trying to access is particularly large or is shared with many people, it may take up to 24 hours to be able to view or download the file. If you still can't access a file after 24 hours, contact your domain administrator.</p>[…]
```

@ajayrfhp The only "solution" we can offer is to tell you to wait and try again, since we have no control over this. You can ask the author of the dataset to host it on a platform that does not have daily quotas. If you do and they go through with your proposal, please inform us so that we can adapt our code.

@fmassa We should check the contents of the response first before we write them to the files and raise a descriptive error message.

@fmassa
Member

fmassa commented May 29, 2020

@pmeier thanks for looking into this!

@fmassa We should check the contents of the response first before we write them to the files and raise a descriptive error message.

Is this something that could be done in the download_from_url function, or would it need to be done on a case-by-case basis?

@pmeier
Collaborator

pmeier commented May 29, 2020

CelebA uses download_file_from_google_drive and I would put the fix before L167:

```python
response = session.get(url, params={'id': file_id}, stream=True)
token = _get_confirm_token(response)

if token:
    params = {'id': file_id, 'confirm': token}
    response = session.get(url, params=params, stream=True)

_save_response_content(response, fpath)
```

Maybe it is as easy as checking response.status_code.

The problem I see is that we would need to wait for a day when the quota is exceeded and fix it right then. Furthermore, I have no idea how to test this.

@ajayrfhp
Author

I see. Thanks, I will download at a later point then.

@fmassa
Member

fmassa commented Jun 1, 2020

@pmeier your fix sounds good to me, but indeed, this might be difficult to test.

@pmeier
Collaborator

pmeier commented Jun 1, 2020

@fmassa I suggest we wait for another issue raising this problem; at least I won't check daily whether the quota is exceeded. If there is another issue for this and I miss it, or you somehow find a day when we can fix this, feel free to tag me in. I'll see what I can do.

@fmassa
Member

fmassa commented Jun 1, 2020

Sounds good, thanks a lot @pmeier !

@jotterbach

Seems this is a known issue, but I wanted to raise it again per @pmeier's comment. I didn't want to open another ticket on this, though.

@pmeier
Collaborator

pmeier commented Aug 7, 2020

@jotterbach This was fixed in e757d52 but didn't make it in the latest release.

@import-antigravity

This is still an issue FYI

@AndrewUlmer

I would just like to add that the authors also provide a Baidu drive on their website from which you can download the data. The dataset is also available on Kaggle.

@sayantanauddy

Can a Dataset class (like this) that downloads the data from Kaggle (using the Kaggle API) be a possible solution?

@FrancescoSaverioZuppichini

same

@rykovv

rykovv commented Sep 29, 2021

Ran into the same problem. In the original Google Drive shared folder the dataset files are placed in different subdirectories (Anno, Eval, Img), whereas celeba.py gives no indication of a download path. This probably causes the error.

@pmeier
Collaborator

pmeier commented Sep 30, 2021

This was fixed in #4109, but the commit is not yet included in a stable release. It will be in the upcoming one.

@marzmesas

This issue still persists. Is there a way to get the dataset and load it just as we would through torchvision.datasets?

@xyjixyjixyji

Problem still exists. (Jun 14)

@gabriben

gabriben commented Jun 20, 2022

The Kaggle alternative worked for me.

@univanxx

univanxx commented Sep 2, 2022

Hello everyone! Based on this discussion, these steps can help (they worked perfectly for me):

  1. Create a directory named celeba and download into it all the files from the CelebA Google Drive mentioned in this file_list
  2. Unzip img_align_celeba.zip inside the ./celeba directory (I'm not sure whether you should delete the zip file after unpacking)
  3. Run the code, making sure to pass download=False:

```python
import torchvision.datasets as dset

img_path = './celeba'
data = dset.CelebA(root=img_path, split="train", target_type='attr', transform=None, download=False)
```

This tutorial worked for me!
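As a sanity check before step 3, one can verify that the layout matches what torchvision expects. A sketch; the file names below are taken from the file_list in torchvision's celeba.py and may differ between versions, so check your installed copy:

```python
import os

# Files torchvision's CelebA class looks for under <root>/celeba
# (names as in celeba.py's file_list; verify against your version).
REQUIRED = [
    "img_align_celeba",  # directory extracted from img_align_celeba.zip
    "list_attr_celeba.txt",
    "identity_CelebA.txt",
    "list_bbox_celeba.txt",
    "list_landmarks_align_celeba.txt",
    "list_eval_partition.txt",
]

def check_celeba_layout(root):
    """Return the names missing under <root>/celeba (empty list = ready)."""
    base = os.path.join(root, "celeba")
    return [name for name in REQUIRED if not os.path.exists(os.path.join(base, name))]
```

Note that torchvision joins root with a celeba subfolder, so root should be the parent of the celeba directory.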

@viomirea

I had the same issue; I had problems with the installation.
I also tried manually downloading the zip from the link used in Python, https://s3-us-west-1.amazonaws.com/udacity-dlnfd/datasets/celeba.zip . I tried downloading it from the web browser and needed to retry a few times, as it kept failing. After downloading the zip from the browser I copied it to ./celeba/

@jS5t3r

jS5t3r commented May 5, 2023

The CelebA loader cannot read CelebA in HQ, right?

@viomirea

viomirea commented May 10, 2023

In my case it seems that there was a problem when using Wi-Fi. After I connected using the LAN I had no timeouts anymore.

> I had the same issue. I had problems and the installation. I tried also manually download the zip from the link used in python https://s3-us-west-1.amazonaws.com/udacity-dlnfd/datasets/celeba.zip . I tried downloading it from the web browser and I needed to retry a few times at it was still failing. After downloading the zip from the browser I copied it to ./celeba/

@ldr7

ldr7 commented Jul 9, 2023

Problem still exists

@ozturkoktay

The problem still exists.

@giulio98

Hello, I have uploaded CelebA to 🤗 Datasets:
eurecom-ds/celeba
