"nan" losses issue for some small subset of users #151

Closed
monsieurpooh opened this issue May 7, 2022 · 10 comments

Comments

@monsieurpooh

Example of the last few iterations for a user for whom it's not working:

47it [01:22,  1.60s/it]
48it [01:24,  1.70s/it]
49it [01:26,  1.67s/it]
50it [01:27,  1.64s/it]
                       
50it [01:28,  1.64s/it]
i: 0, loss: nan, losses: nan
i: 50, loss: nan, losses: nan

Example of the last few iterations on my PC:

[e] 48it [00:16,  2.77it/s]
[e] 49it [00:16,  2.80it/s]
[e] 50it [00:16,  2.94it/s]
[e]                        
[e] 50it [00:17,  2.94it/s]
i: 0, loss: 0.92412, losses: 0.92412
i: 50, loss: 0.765271, losses: 0.765271

I have no clue how to even begin debugging this.

@monsieurpooh
Author

We confirmed that on his computer, running the exact same code as mine, the loss always comes out as "nan", whereas mine is a number. I suspect localization issues/assumptions somewhere deep in the Python code or libraries. Does anyone have any idea how to begin debugging this? It doesn't repro on my machine.

@monsieurpooh
Author

By the way, the reason this is an issue is that on the machines where the program fails, the output never becomes any more image-like than the seed image. It stays a blotchy, seed-like image no matter how many iterations are run.

@monsieurpooh
Author

Update: We've narrowed down the problem to something that occurs on the line

iii = perceptor.encode_image(normalize(make_cutouts(out))).float()

"out" variable is regular numbers but "iii" variable is all "nan". Will update more after adding more debugging statements and having him run the debugging again to narrow it down further.

@monsieurpooh
Author

Something happens inside "encode_image". I dug into CLIP/clip/model.py and put in debugging statements. Inside the "forward" method of VisionTransformer there is a series of transformations of the variable "x". x contains "nan" after the line self.conv1(x). Then, magically, it no longer has "nan" after the line x = torch.cat([self.class_embedding.to(x.dtype) + torch.zeros(x.shape[0], 1, x.shape[-1], dtype=x.dtype, device=x.device), x], dim=1). Then, for some inconceivable reason, it contains nan again after the line self.transformer(x). I must reiterate that this is only reproducible on the other user's machine; I can't reproduce it on my end. On my machine (and most people's machines), it never contains nan.
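For what it's worth, a less invasive way to trace this (a sketch, assuming perceptor is the loaded CLIP model, so perceptor.visual is the VisionTransformer) is to register forward hooks on every submodule instead of editing model.py:

import torch

def nan_hook(name):
    def hook(module, inputs, output):
        # Flag any module whose output contains NaN values.
        if isinstance(output, torch.Tensor) and torch.isnan(output).any():
            print(f"nan in output of {name} ({module.__class__.__name__})")
    return hook

# Hook every submodule of CLIP's VisionTransformer, then run encode_image as usual.
for name, module in perceptor.visual.named_modules():
    module.register_forward_hook(nan_hook(name))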

This is very hard to debug; I beg anyone with knowledge of this system to chime in.

@monsieurpooh monsieurpooh changed the title "nan" losses issue for some users in certain countries "nan" losses issue for some small subset of users May 9, 2022
@monsieurpooh
Author

I've now narrowed it down to _conv_forward in torch/nn/modules/conv.py. The line of code is:

F.conv2d(input, weight, bias, self.stride, self.padding, self.dilation, self.groups)

If bias is None, the returned tensor has nan for the users who suffer from this bug. This doesn't happen for everyone; it only happens to about 1% of users.

If bias is not None, there is no nan in either case.
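A minimal standalone repro along those lines would look something like this (the shapes roughly match CLIP ViT-B/32's conv1, but they're arbitrary for the purpose of this test):

import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 224, 224, device="cuda")
weight = torch.randn(768, 3, 32, 32, device="cuda")
bias = torch.zeros(768, device="cuda")

out_no_bias = F.conv2d(x, weight, None, stride=32)     # bias is None: the failing path
out_with_bias = F.conv2d(x, weight, bias, stride=32)   # bias supplied: reportedly fine

print("bias=None -> any nan:", torch.isnan(out_no_bias).any().item())
print("with bias -> any nan:", torch.isnan(out_with_bias).any().item())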

Possibly related is pytorch/pytorch#59439

@monsieurpooh
Author

More updates: calling conv2d with bias enabled didn't solve the issue. Neither did updating to PyTorch 1.11.0.

@monsieurpooh
Author

A possible fix might be to update to a cuDNN version above 8.2.2. Note that, by default, even the absolute latest PyTorch build ships with something like cuDNN 8.2, so after installing PyTorch you have to download cuDNN separately and patch in the DLL files. Will comment later with updates.
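To confirm which cuDNN build PyTorch is actually picking up before and after patching the DLLs, something like this should work:

import torch

print(torch.__version__)                    # PyTorch version
print(torch.backends.cudnn.is_available())  # whether cuDNN can be used at all
print(torch.backends.cudnn.version())       # loaded cuDNN version, e.g. 8200 for 8.2.0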

@monsieurpooh
Author

Updating to a cuDNN version above 8.2.2 fixed the issue!

@nerdyrodent
Owner

Glad you got it sorted!
