"nan" losses issue for some small subset of users #151

Closed
monsieurpooh opened this issue May 7, 2022 · 10 comments

Comments

@monsieurpooh

Example of the last few iterations for a user for whom it's not working:

47it [01:22,  1.60s/it]
48it [01:24,  1.70s/it]
49it [01:26,  1.67s/it]
50it [01:27,  1.64s/it]
                       
50it [01:28,  1.64s/it]
i: 0, loss: nan, losses: nan
i: 50, loss: nan, losses: nan

Example of the last few iterations on my PC:

[e] 48it [00:16,  2.77it/s]
[e] 49it [00:16,  2.80it/s]
[e] 50it [00:16,  2.94it/s]
[e]                        
[e] 50it [00:17,  2.94it/s]
i: 0, loss: 0.92412, losses: 0.92412
i: 50, loss: 0.765271, losses: 0.765271

I have no clue how to even begin debugging this.

@monsieurpooh
Author

We confirmed that on his computer, running the exact same code as mine, the loss always comes out as "nan", whereas mine is a number. I suspect localization issues/assumptions somewhere deep in the Python code or libraries. Does anyone have any idea how to begin debugging this? It doesn't repro on my machine.

@monsieurpooh
Author

By the way, the reason this is an issue is that on the machines where the program fails, the output never becomes any more image-like than the seed image. It stays a blotchy, seed-like image no matter how many iterations are run.

@monsieurpooh
Author

Update: We've narrowed down the problem to something that occurs on the line

iii = perceptor.encode_image(normalize(make_cutouts(out))).float()

"out" variable is regular numbers but "iii" variable is all "nan". Will update more after adding more debugging statements and having him run the debugging again to narrow it down further.

@monsieurpooh
Author

Something happens inside "encode_image". I dug into CLIP/clip/model.py and put in debugging statements. Inside the "forward" method of VisionTransformer there is a series of transformations of the variable "x". x contains "nan" after the line self.conv1(x). Then, magically, it no longer has "nan" after the line x = torch.cat([self.class_embedding.to(x.dtype) + torch.zeros(x.shape[0], 1, x.shape[-1], dtype=x.dtype, device=x.device), x], dim=1). Then, for some inconceivable reason, it contains nan again after the line self.transformer(x). I must reiterate that this is only reproducible on the other user's machine; I can't reproduce it on my end. On my machine (and most people's machines), it never contains nan.
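For what it's worth, a less invasive way to trace this (a sketch, assuming perceptor is the loaded CLIP model, so perceptor.visual is the VisionTransformer) is to register forward hooks on every submodule instead of editing model.py:

import torch

def nan_hook(name):
    def hook(module, inputs, output):
        # Flag any module whose output contains NaN values.
        if isinstance(output, torch.Tensor) and torch.isnan(output).any():
            print(f"nan in output of {name} ({module.__class__.__name__})")
    return hook

# Hook every submodule of CLIP's VisionTransformer, then run encode_image as usual.
for name, module in perceptor.visual.named_modules():
    module.register_forward_hook(nan_hook(name))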

This is very hard to debug; I beg anyone with knowledge of this system to chime in.

@monsieurpooh monsieurpooh changed the title "nan" losses issue for some users in certain countries "nan" losses issue for some small subset of users May 9, 2022
@monsieurpooh
Author

I've now narrowed it down to _conv_forward in torch/nn/modules/conv.py. The line of code is:

F.conv2d(input, weight, bias, self.stride, self.padding, self.dilation, self.groups)

If bias is None, the returned tensor has nan for the users who suffer from this bug. This doesn't happen for everyone; it only happens to about 1% of users.

If bias is not None, there is no nan in either case.
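A minimal standalone repro along those lines would look something like this (the shapes roughly match CLIP ViT-B/32's conv1, but they're arbitrary for the purpose of this test):

import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 224, 224, device="cuda")
weight = torch.randn(768, 3, 32, 32, device="cuda")
bias = torch.zeros(768, device="cuda")

out_no_bias = F.conv2d(x, weight, None, stride=32)     # bias is None: the failing path
out_with_bias = F.conv2d(x, weight, bias, stride=32)   # bias supplied: reportedly fine

print("bias=None -> any nan:", torch.isnan(out_no_bias).any().item())
print("with bias -> any nan:", torch.isnan(out_with_bias).any().item())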

Possibly related is pytorch/pytorch#59439

@monsieurpooh
Author

More updates: calling conv2d with bias enabled didn't solve the issue. Neither did updating to PyTorch 1.11.0.

@monsieurpooh
Author

A possible fix might be to update to a cuDNN version above 8.2.2. Note that, by default, even the absolute latest PyTorch build ships with something like cuDNN 8.2, so after installing PyTorch you have to download cuDNN separately and patch in the DLL files. Will comment later with updates.
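To confirm which cuDNN build PyTorch is actually picking up before and after patching the DLLs, something like this should work:

import torch

print(torch.__version__)                    # PyTorch version
print(torch.backends.cudnn.is_available())  # whether cuDNN can be used at all
print(torch.backends.cudnn.version())       # loaded cuDNN version, e.g. 8200 for 8.2.0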

@monsieurpooh
Author

Updating to a cuDNN version above 8.2.2 fixed the issue!

@nerdyrodent
Owner

Glad you got it sorted!
