🐛 Bug

I created a NumpyMetric class for an involved metric that requires numpy operations; however, the metric fails when training on multiple GPUs. After some debugging, this appears to be due to the resulting tensor not being mapped back to the appropriate GPU (or any GPU for that matter).
To Reproduce
Steps to reproduce the behavior:
Define a NumpyMetric class:

```python
class MyNumpyMetric(NumpyMetric):
    def forward(self, y_hat, y):
        # complicated numpy stuff (no calls to .cpu() or .cuda() or .to() or anything like that)
        return metric
```
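As a stand-in for the elided numpy computation, a hypothetical metric body (the function name and the metric itself are my invention, not from the issue) might be:

```python
import numpy as np

# Hypothetical stand-in for the "complicated numpy stuff" above:
# a median absolute error computed entirely with numpy ops.
def median_absolute_error(y_hat: np.ndarray, y: np.ndarray) -> float:
    return float(np.median(np.abs(y_hat - y)))
```

Per the documentation quoted below, the NumpyMetric wrapper is supposed to handle the tensor-to-numpy input/output conversions, so a body like this sees plain arrays.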
Instantiate it in the `__init__` and `validation_step` of my PyTorch Lightning module, e.g.,

```python
class MyNetwork(pl.LightningModule):
    def __init__(self, args):
        # other init stuff
        self.my_metric = MyNumpyMetric('my_metric')

    def validation_step(self, batch, batch_idx):
        # other validation stuff
        # y_hat and y are tensors; no .cpu(), .cuda(), or .to() called on either
        my_metric = self.my_metric(y_hat, y)
        out_dict = dict(val_my_metric=my_metric)
        return out_dict
```
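A plausible mechanism for the failure, sketched outside Lightning (function name and details are my assumption, not from the issue): a tensor that round-trips through numpy comes back on the CPU, because `.numpy()` requires a CPU tensor and `torch.from_numpy` always produces one, so under multi-GPU training the result is never mapped back to the inputs' device:

```python
import numpy as np
import torch

def numpy_roundtrip_metric(y_hat: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # The numpy round-trip forces the data onto the CPU...
    diff = np.abs(y_hat.detach().cpu().numpy() - y.detach().cpu().numpy())
    # ...and torch.from_numpy always returns a CPU tensor.
    return torch.from_numpy(np.asarray(diff.mean()))

result = numpy_roundtrip_metric(torch.tensor([1.0, 3.0]), torch.tensor([1.0, 1.0]))
print(result.device)  # cpu — regardless of where the inputs lived
```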
Expected behavior

I expected no error to occur. The documentation states: "[NumpyMetric] already handles DDP sync and input/output conversions." However, this doesn't appear to be the case in my implementation.
I was able to work around this error by adding a `.to()` call to the metric result in the validation step. I presume, however, that this is not the intended way to use the NumpyMetric class.
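The issue does not show the exact call, but a workaround of that shape might look like the following (the helper name and placement are my assumption):

```python
import torch

# Hypothetical helper: map the (CPU) tensor a numpy-based metric returns
# back onto the device of a reference tensor such as y_hat.
def remap_to_input_device(metric_value: torch.Tensor, reference: torch.Tensor) -> torch.Tensor:
    return metric_value.to(reference.device)

# Inside validation_step this would be used as, e.g.:
#     my_metric = remap_to_input_device(self.my_metric(y_hat, y), y_hat)
```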
FWIW, I briefly looked at the code to see whether I could just submit a PR with the fix (assuming this isn't user error), but it wasn't clear to me where best to look. If you point me in the right direction, I might be able to submit one.
Additional context

PyTorch and PyTorch Lightning were installed with conda (along with all of the other packages).

Hi, good news: I believe I have already fixed this issue in #2657; at least it looks very similar. The fix is not released yet (but will be soon), so if you need it now, install Lightning from the master branch. (Your workaround is also fine.)