NumpyMetric not mapping back to GPU in multi-GPU training #2808

Closed
jcreinhold opened this issue Aug 3, 2020 · 3 comments

jcreinhold commented Aug 3, 2020

🐛 Bug

I created a NumpyMetric class for an involved metric that requires numpy operations; however, the metric fails when training on multiple GPUs. After some debugging, this appears to be due to the resulting tensor not being mapped back to the appropriate GPU (or any GPU for that matter).

To Reproduce

Steps to reproduce the behavior:

  1. Define a NumpyMetric class
class MyNumpyMetric(NumpyMetric):
    def forward(self, y_hat, y):
        # complicated numpy stuff (no calls to .cpu() or .cuda() or .to() or anything like that)
        return metric
  2. Instantiate it in the __init__ of my PyTorch Lightning module and call it in validation_step, e.g.:
class MyNetwork(pl.LightningModule):
    def __init__(self, args):
        super().__init__()
        # other init stuff
        self.my_metric = MyNumpyMetric('my_metric')

    def validation_step(self, batch, batch_idx):
        # other validation stuff
        my_metric = self.my_metric(y_hat, y)  # where y_hat and y are tensors, no .cpu(), .cuda(), .to() called on either
        out_dict = dict(val_my_metric=my_metric)
        return out_dict
  3. Run:
model = MyNetwork(args)
trainer = Trainer(
    benchmark=True,
    check_val_every_n_epoch=1,
    accumulate_grad_batches=1,
    min_epochs=n_epochs,
    max_epochs=n_epochs,
    fast_dev_run=False,
    gpus=2,
    distributed_backend='dp'
)
trainer.fit(model)
  4. See the resulting traceback:
Traceback (most recent call last):
  File "./tiramisu3d.py", line 574, in <module>
    trainer.fit(model)
  File "/iacl/pg20/jacobr/miniconda3/envs/msseg-9.2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 997, in fit
    results = self.dp_train(model)
  File "/iacl/pg20/jacobr/miniconda3/envs/msseg-9.2/lib/python3.8/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 270, in dp_train
    result = self.run_pretrain_routine(model)
  File "/iacl/pg20/jacobr/miniconda3/envs/msseg-9.2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1193, in run_pretrain_routine
    eval_results = self._evaluate(model,
  File "/iacl/pg20/jacobr/miniconda3/envs/msseg-9.2/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 293, in _evaluate
    output = self.evaluation_forward(model, batch, batch_idx, dataloader_idx, test_mode)
  File "/iacl/pg20/jacobr/miniconda3/envs/msseg-9.2/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 444, in evaluation_forward
    output = model(*args)
  File "/iacl/pg20/jacobr/miniconda3/envs/msseg-9.2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/iacl/pg20/jacobr/miniconda3/envs/msseg-9.2/lib/python3.8/site-packages/pytorch_lightning/overrides/data_parallel.py", line 66, in forward
    return self.gather(outputs, self.output_device)
  File "/iacl/pg20/jacobr/miniconda3/envs/msseg-9.2/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/iacl/pg20/jacobr/miniconda3/envs/msseg-9.2/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/iacl/pg20/jacobr/miniconda3/envs/msseg-9.2/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 61, in gather_map
    return type(out)(((k, gather_map([d[k] for d in outputs]))
  File "/iacl/pg20/jacobr/miniconda3/envs/msseg-9.2/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 61, in <genexpr>
    return type(out)(((k, gather_map([d[k] for d in outputs]))
  File "/iacl/pg20/jacobr/miniconda3/envs/msseg-9.2/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/iacl/pg20/jacobr/miniconda3/envs/msseg-9.2/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 54, in forward
    assert all(map(lambda i: i.is_cuda, inputs))

Code sample

I will try to put together a minimal standalone code sample soon.

Expected behavior

I expected no error to occur. The documentation states: "[NumpyMetric] already handles DDP sync and input/output conversions." However, this doesn't appear to be the case in my implementation.
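
For concreteness, here is a minimal sketch of the behavior I expected from that documentation. This is just an illustration, not code from my project: it assumes a single CUDA device is available, that NumpyMetric is importable from pytorch_lightning.metrics.metric in this Lightning version, and it uses a trivial stand-in for the real numpy computation. The point is only that the value returned by the metric should be a tensor on the same device as the inputs.

import torch
from pytorch_lightning.metrics.metric import NumpyMetric  # assumed import path for this version

class ToyNumpyMetric(NumpyMetric):
    def forward(self, y_hat, y):
        # y_hat and y arrive here as numpy arrays; stand-in for the real computation
        return (y_hat - y).mean()

metric = ToyNumpyMetric('my_metric')
y_hat = torch.rand(4, 10, device='cuda:0')
y = torch.rand(4, 10, device='cuda:0')
result = metric(y_hat, y)
# What I expected per the docs; in my runs the result comes back on the CPU,
# which is what later trips the is_cuda assert in the dp gather.
assert torch.is_tensor(result) and result.is_cuda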

Environment

  • CUDA:
    • GPU:
      • Tesla M40 24GB
      • Tesla M40 24GB
    • available: True
    • version: 9.2
  • Packages:
    • numpy: 1.18.5
    • pyTorch_debug: False
    • pyTorch_version: 1.5.1
    • pytorch-lightning: 0.8.5
    • tensorboard: 2.3.0
    • tqdm: 4.48.0
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor: x86_64
    • python: 3.8.3
    • version: #1 SMP Wed Sep 26 11:06:22 UTC 2018

Additional context

PyTorch and PyTorch Lightning were installed with conda (along with all of the other packages).

I was able to work around this error by adding the following .to() call to the validation step:

def validation_step(self, batch, batch_idx):
    # other validation stuff
    my_metric = self.my_metric(y_hat, y)
    my_metric = my_metric.to(y_hat.device)  # manually move the metric result back onto the same device as the model outputs
    out_dict = dict(val_my_metric=my_metric)
    return out_dict

I presume, however, that this is not the intended way to use the NumpyMetric class.

FWIW, I briefly looked at the code to see whether I could just submit a PR with a fix (if this isn't user error), but it wasn't clear to me where the best places to look were. If you point me in the right direction, I'd be happy to give it a shot; a rough sketch of the behavior I have in mind is below.
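
To illustrate (with the caveat that this is a hypothetical wrapper, not a claim about where the real fix belongs in the Lightning codebase), the behavior I'd expect the library to apply looks roughly like this, building on the MyNumpyMetric class defined in step 1:

import numpy as np
import torch

class DeviceRestoringNumpyMetric(MyNumpyMetric):
    """Hypothetical wrapper: run the numpy-based metric, then move the
    result back onto the device of the first tensor input."""

    def __call__(self, *args, **kwargs):
        # remember the device of the first tensor argument (default to CPU)
        device = next(
            (a.device for a in args if isinstance(a, torch.Tensor)),
            torch.device('cpu'),
        )
        out = super().__call__(*args, **kwargs)
        # make sure the result is a tensor, then restore the original device
        if isinstance(out, np.ndarray):
            out = torch.from_numpy(out)
        elif not torch.is_tensor(out):
            out = torch.tensor(out)
        return out.to(device)

If the metric base class did something along these lines, the dict returned from validation_step would already contain CUDA tensors and the dp gather above would not hit the is_cuda assert.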

jcreinhold added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Aug 3, 2020

github-actions bot commented Aug 3, 2020

Hi! Thanks for your contribution, great first issue!


awaelchli commented Aug 3, 2020

Hi, good news: I believe I have already fixed this in #2657, at least it looks very similar. The fix is not released yet (but will be soon), so if you need it now, install Lightning from the master branch. (Your workaround is also fine.)

awaelchli self-assigned this on Aug 3, 2020

Borda commented Aug 4, 2020

@jcreinhold mind trying master? Feel free to reopen if needed 🐰

Borda closed this as completed on Aug 4, 2020