"nan" losses issue for some small subset of users #151
We confirmed that on his computer, running the exact same code as mine, the loss is always "nan", whereas mine is a number. I suspect localization issues/assumptions somewhere deep within the Python code or its libraries. Does anyone have any idea how to begin debugging this? It doesn't repro on my machine.
btw, the reason this is an issue is that on the machines where the program fails, the image never gets any more image-like than the seed image. It stays a blotchy seed-like image no matter how many iterations are run.
Update: We've narrowed down the problem to something that occurs on the line
The "out" variable contains regular numbers, but the "iii" variable is all "nan". Will update more after adding further debugging statements and having him run the debugging again to narrow it down.
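For anyone doing the same triage, a minimal helper like the one below can report which of several intermediate tensors first contains NaN. The helper name and the tensor names here are hypothetical, just mirroring the "out"/"iii" variables mentioned above:

```python
import torch

def first_nan(tensors):
    """Return the name of the first tensor containing NaN, or None.

    `tensors` is a dict of {name: tensor}. Hypothetical helper for
    narrowing down where NaN first appears in a pipeline.
    """
    for name, t in tensors.items():
        if torch.isnan(t).any().item():
            return name
    return None

# Example: "out" is clean, "iii" is poisoned with NaN.
out = torch.ones(3)
iii = torch.tensor([1.0, float("nan"), 2.0])
print(first_nan({"out": out, "iii": iii}))  # -> iii
```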
Something happens inside "encode_image". I dug into CLIP/clip/model.py and added debugging statements. Inside the "forward" method of VisionTransformer, there's a series of transformations of the variable "x", and x contains "nan" after one of those lines. This is very hard to debug; I beg anyone with knowledge of this system to chime in.
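Rather than sprinkling print statements through model.py, forward hooks can locate the first module whose output goes NaN. This is a sketch on a toy stand-in model, since loading actual CLIP weights isn't needed to show the technique; the same `add_nan_hooks` call (a hypothetical helper) would work on CLIP's VisionTransformer:

```python
import torch
import torch.nn as nn

def add_nan_hooks(model, offenders):
    """Register forward hooks that append the class name of every
    module whose output contains NaN to `offenders` (hypothetical
    helper; applies to any nn.Module, e.g. CLIP's VisionTransformer)."""
    def hook(module, inputs, output):
        if isinstance(output, torch.Tensor) and torch.isnan(output).any():
            offenders.append(module.__class__.__name__)
    for m in model.modules():
        m.register_forward_hook(hook)

# Toy stand-in for the real vision model.
model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))
offenders = []
add_nan_hooks(model, offenders)
_ = model(torch.full((1, 4), float("nan")))  # poisoned input for demo
print("first NaN after:", offenders[0] if offenders else "none")
# -> first NaN after: Linear
```

Hooks fire in forward-execution order, so the first entry in `offenders` is the earliest layer whose output went bad.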
I've now narrowed it down to _conv_forward in torch/nn/modules/conv.py. The line of code is:
If bias is None, the returned tensor contains nan for the users who hit this bug. This doesn't happen for everyone; it only affects about 1% of users. If bias is not None, there is no nan in either case. Possibly related: pytorch/pytorch#59439
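A minimal probe for the behavior described above: run the same convolution once with `bias=None` and once with an explicit zero bias, then compare. Note this is only a diagnostic sketch; the reported bug is GPU/cuDNN-specific, so on a healthy setup (including CPU, as here) both paths match and neither contains NaN:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)   # dummy input batch
w = torch.randn(4, 3, 3, 3)   # dummy conv weights

# The two code paths in question: no bias vs. an explicit zero bias.
out_no_bias = F.conv2d(x, w, bias=None, padding=1)
out_zero_bias = F.conv2d(x, w, bias=torch.zeros(4), padding=1)

# On an affected machine the first print would reportedly show True.
print("NaN without bias:", torch.isnan(out_no_bias).any().item())
print("max difference:  ", (out_no_bias - out_zero_bias).abs().max().item())
```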
More updates: calling conv2d with a bias didn't solve the issue. Neither did updating to PyTorch 1.11.0.
A possible fix might be to update beyond cuDNN 8.2.2. Note that by default even the absolute latest PyTorch build bundles something like cuDNN 8.2, so after installing PyTorch, download cuDNN separately and patch in the DLL files. Will comment later with updates.
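To check which cuDNN build a given PyTorch install is actually using (before and after patching the DLLs), PyTorch exposes the version as an integer, e.g. 8202 for cuDNN 8.2.2. A quick check, assuming the "newer than 8.2.2" threshold suggested above:

```python
import torch

# Report the cuDNN build this PyTorch install uses.
# Returns None on builds without cuDNN (e.g. CPU-only installs).
v = torch.backends.cudnn.version()
print("cuDNN version:", v)
print("needs upgrade (per the fix above):", v is not None and v <= 8202)
```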
Updating beyond cuDNN 8.2.2 fixed the issue!
Glad you got it sorted!
Example last few iterations of a user for whom it's not working:
Example last few iterations for my PC:
I have no clue how to even begin debugging this