validation_epoch_end needs to return CUDA tensors #2442
Comments
I also have a sklearn-generated metric that is only logged, with no business being on GPU. Hopefully this is not intended behavior?
@justusschock @SkafteNicki mind having a look? 🐰
@Anjum48 this is probably caused by #2434 in combination with your ddp backend (from here: https://pytorch.org/docs/stable/distributed.html#backends), which does not support CPU tensors (are you using the NCCL backend?). @s-rog can you elaborate on your problem? Maybe in a separate issue, since I'm not sure if it is the same problem as here. cc @williamFalcon ^^
Hi @justusschock, I'm using the default PyTorch install from conda, so I believe NCCL is being used (both GPUs are in the same machine). The weird thing is that this only started happening when I upgraded from 0.8.1 to 0.8.4, so I guess something has changed in ddp?
This is not ddp itself; we added an often-requested feature and synced the outputs across ddp nodes. Since you are probably using the nccl backend (which only supports GPU tensors), this is likely what causes the problem. Can you try switching the backend of DDP, maybe to Gloo or MPI?
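The backend difference can be sketched with plain `torch.distributed` (a single-process world for illustration; the address and port are arbitrary). Gloo supports collectives on CPU tensors, which is exactly what a CPU-resident sklearn metric would need; the same `all_reduce` call on a CPU tensor fails under nccl.

```python
import torch
import torch.distributed as dist

# Single-process process group using the gloo backend (CPU-capable).
# Under nccl, the all_reduce below would reject a CPU tensor.
dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:29500",  # arbitrary local address/port
    rank=0,
    world_size=1,
)

t = torch.tensor([1.0, 2.0])  # plain CPU tensor
dist.all_reduce(t, op=dist.ReduceOp.SUM)  # with world_size=1, t is unchanged

dist.destroy_process_group()
```

With more than one process, each rank would contribute its tensor to the sum; the point here is only that gloo accepts CPU tensors at all.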
I added
@justusschock I have the same issue as OP where my
Edit:
FYI, we are also testing against Conda PyTorch distributions.
I was also wondering about this. I noticed my validation loss doubled after upgrading to 0.8.4.
It is synced/reduced back to process 0 and only logged there :) Do you think it would be better to log from each process with the respective rank as a prefix?
@Anjum48 regarding your persisting error:
OK, I tried it in the trainer and got this error (after disabling AMP):
I'm not familiar with gloo, so I don't know what I'm looking at here. Isn't Gloo for CPU-only training?
Fixed on master for now. Yes, currently only rank 0 is used for the val calculation. The problem is that reducing blows up memory and causes these kinds of issues. However, this will be enabled again in an upcoming PR with an object that can do this kind of syncing automatically. We'll need to solve this problem again then, so we still need to identify what the issue is.
@justusschock @williamFalcon mind clarifying? These two statements seem to be contradicting, unless I'm misunderstanding something...
Originally we didn't reduce on val - it was on our todo. Then we added it, but it had weird issues such as this one. So we removed it and will test it more before adding it back in.
I suppose the metrics package fixes/bypasses this issue? I just converted my vanilla sklearn metrics to the PL sklearn interface; will test soon.
Edit:
Same issue using the built-in metric interface, it seems... I'm guessing the metrics need to include a CUDA conversion (for now) to be able to be reduced?
Are you on master? We won't include a CUDA conversion here (we already have a push to self.device), since this also depends on your distributed backend (and you don't always have access to, or want, the computation to run on a GPU).
@justusschock I just updated and tried again, with the same error.
Do you have any CPU tensors there?
Nope, both inputs to the Lightning sklearn metric are CUDA tensors.
I think I know the issue. Currently there is a problem with device propagation in
Any idea, @Borda?
Not sure at this moment, will check it...
Should this be a separate issue, since it's an issue with metrics and not validation_epoch_end specifically?
Just some findings, though. Currently, classification metrics return
@justusschock The problem is simply that when you call .to() on the nn.Module, it does not call .to() on the submodules. What the PyTorch nn.Module does instead is only move the tensors (buffers, parameters) of these submodules; since PyTorch has no such thing as a device property, the device is simply defined by the device the tensors are on.
We could come up with a creative solution that involves going over all submodules, checking whether they are instances of the dtype/device mixin, and calling .to() on those. But this is not a real solution, because if you nest multiple levels of dtype-mixin subclasses and nn.Module in a particular way, it can still lead to the same bug.
EDIT: I think I have a fix that involves using the apply function on the module, and can submit a PR.
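The apply-based fix described above can be sketched as follows. This is a minimal illustration, not Lightning's actual code: `DeviceMixin`, `Metric`, and `Model` are hypothetical stand-ins, and the device is tracked in a plain `_device` attribute that `nn.Module.to()` knows nothing about.

```python
import torch
import torch.nn as nn

class DeviceMixin:
    # Hypothetical stand-in for Lightning's dtype/device mixin: the device
    # lives in a plain attribute, which nn.Module.to() never touches.
    _device = torch.device("cpu")

    @property
    def device(self):
        return self._device

class Metric(DeviceMixin, nn.Module):
    pass

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.metric = Metric()  # nested submodule carrying its own device attribute

    def to(self, device):
        # nn.Module.apply() visits self and every (arbitrarily nested)
        # submodule, so mixin instances at any depth get updated too.
        def set_device(module):
            if isinstance(module, DeviceMixin):
                module._device = torch.device(device)
        self.apply(set_device)
        return super().to(device)

model = Model().to("cpu")
```

Because `apply` recurses through the whole module tree, the multi-level-nesting case that breaks the "iterate over direct submodules" approach is handled for free.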
@Anjum48 the bug should be fixed now, although I was not able to check your use case with precision=16. Let me know if it works or not.
@awaelchli pulled from master and am still experiencing the same issue with FP16.
Yes, but this is probably something different. If you're using AMP, it doesn't convert inputs and outputs once; it monkey-patches every function to convert its inputs and convert its outputs back... Are you on Slack? Maybe it's best to send me a message there, so that we can discuss your issue without spamming everyone else :)
🐛 Bug
I'm not sure if this is expected behaviour or not, but upgrading to the latest version (from 0.8.1) caused my `validation_epoch_end` to break. It appears that a CUDA tensor is now expected for the metric, where before the tensor was device-agnostic. This was using sklearn's `roc_auc_score`; I haven't yet got around to testing PL's new metrics. Feel free to close if this is expected behaviour.
To Reproduce
This is my `validation_epoch_end`. Uncommenting `.to(avg_loss.device)` allows this to run with the dev version of PL. This was run with `ddp` and `precision=16` using Apex AMP. The error message can be seen here: #2411 (comment)
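The workaround mentioned above amounts to wrapping the sklearn result in a tensor and moving it onto the same device as the loss before it is logged. A minimal sketch with hypothetical stand-in values (under ddp with nccl, `avg_loss` would live on a CUDA device; CPU is used here for illustration):

```python
import torch
from sklearn.metrics import roc_auc_score

# Stand-ins for what validation_epoch_end would see (hypothetical values).
avg_loss = torch.tensor(0.3)
targets = torch.tensor([0, 1, 1, 0])
scores = torch.tensor([0.1, 0.9, 0.8, 0.2])

# sklearn returns a plain Python float on CPU; wrap it in a tensor and move
# it onto the same device as the loss so the distributed backend can reduce it.
auc = torch.tensor(roc_auc_score(targets.numpy(), scores.numpy()))
auc = auc.to(avg_loss.device)
```

On a CUDA run, `avg_loss.device` would be the GPU, so the metric tensor ends up where nccl expects it.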
Code sample
Expected behavior
Environment
Please copy and paste the output from our environment collection script (or fill out the checklist below manually). You can get the script and run it with:

- How you installed PyTorch (`conda`, `pip`, source):

Additional context