
validation_epoch_end needs to return CUDA tensors #2442

Closed
Anjum48 opened this issue Jul 1, 2020 · 28 comments · Fixed by #2657
Labels
bug (Something isn't working) · help wanted (Open to be worked on)

Comments

@Anjum48

Anjum48 commented Jul 1, 2020

🐛 Bug

I'm not sure if this is expected behaviour or not, but upgrading to the latest version (from 0.8.1) broke my validation_epoch_end. It appears that a CUDA tensor is now expected for the metric, whereas before the tensor was device agnostic.

This was using sklearn's roc_auc_score. I haven't yet got around to testing PL's new metrics.

Feel free to close if this is expected behaviour.

To Reproduce

This is my validation_epoch_end. Uncommenting .to(avg_loss.device) allows this to run with the dev version of PL.

This was run with ddp, precision=16 using Apex AMP.

# assumes: import torch; from sklearn.metrics import roc_auc_score
def validation_epoch_end(self, outputs):
    avg_loss = torch.stack([x["val_loss"] for x in outputs]).mean()
    y_pred = torch.cat([x["y_pred"].squeeze() for x in outputs]).cpu().numpy()
    y_true = torch.cat([x["y_true"].squeeze() for x in outputs]).cpu().numpy()

    # uncommenting .to(avg_loss.device) makes this run on the dev version of PL
    metric = torch.tensor(roc_auc_score(y_true, y_pred))  # .to(avg_loss.device)

    tensorboard_logs = {
        "loss/validation": avg_loss,
        "auc": metric,
    }
    return {"val_loss": avg_loss, "log": tensorboard_logs, "auc": metric}

The error message can be seen here: #2411 (comment)

Code sample

Expected behavior

Environment

Please copy and paste the output from our environment collection script (or fill out the checklist below manually).

You can get the script and run it with:

wget https://github.com/raw/PyTorchLightning/pytorch-lightning/master/tests/collect_env_details.py
# For security purposes, please check the contents of collect_env_details.py before running it.
python collect_env_details.py
  • PyTorch Version (e.g., 1.0): 1.5
  • OS (e.g., Linux): Ubuntu 20.04
  • How you installed PyTorch (conda, pip, source):
  • Build command you used (if compiling from source):
  • Python version: 3.8.2
  • CUDA/cuDNN version: 10.2
  • GPU models and configuration: Dual GPU (ddp)
  • Any other relevant information: Apex AMP

Additional context

@Anjum48 added the bug (Something isn't working) and help wanted (Open to be worked on) labels Jul 1, 2020
@s-rog
Contributor

s-rog commented Jul 1, 2020

I also have a sklearn-generated metric that is only logged and has no business being on the GPU. Hopefully this is not intended behaviour?

@Borda
Member

Borda commented Jul 1, 2020

@justusschock @SkafteNicki mind having a look? 🐰

@justusschock
Member

@Anjum48 this is probably caused by #2434 in combination with your distributed backend (see https://pytorch.org/docs/stable/distributed.html#backends), which does not support CPU tensors (are you using the NCCL backend?)

@s-rog can you elaborate on your problem? Maybe in a separate issue, since I'm not sure this is the same problem.

cc @williamFalcon ^^
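
For context, a minimal sketch of the failure mode being described (assuming a process group already initialized with the NCCL backend): NCCL collectives only accept CUDA tensors, so reducing a CPU metric tensor raises the "Tensors must be CUDA and dense" error seen later in this thread.

import torch
import torch.distributed as dist

# assumes dist.init_process_group(backend="nccl", ...) has already been called
cpu_metric = torch.tensor(0.5)        # e.g. an sklearn-derived score on CPU
# dist.all_reduce(cpu_metric)         # RuntimeError: Tensors must be CUDA and dense
cuda_metric = cpu_metric.cuda()
dist.all_reduce(cuda_metric)          # works: NCCL only handles CUDA tensors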

@Anjum48
Author

Anjum48 commented Jul 2, 2020

Hi @justusschock, I'm using the default PyTorch install from conda, so I believe NCCL is being used (both GPUs are in the same machine). The weird thing is that this only started happening when I upgraded from 0.8.1 to 0.8.4 so I guess something has changed in ddp?

@justusschock
Member

This is not ddp itself; we added an often-requested feature that syncs the outputs across ddp nodes. Since you are probably using the NCCL backend (which only supports GPU tensors), that is likely what causes the problem. Can you try switching the DDP backend, maybe to Gloo or MPI?
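
For reference, the torch.distributed backend is normally chosen when the process group is initialized. A minimal sketch of a Gloo setup (plain torch.distributed, outside of Lightning, assuming the usual MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE environment variables are set):

import os
import torch.distributed as dist

def init_gloo_process_group():
    # Gloo supports CPU tensors in collectives; NCCL does not
    dist.init_process_group(
        backend="gloo",
        rank=int(os.environ["RANK"]),
        world_size=int(os.environ["WORLD_SIZE"]),
    )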

@Anjum48
Author

Anjum48 commented Jul 2, 2020

I added torch.distributed.Backend('gloo') to the top of my training script and am seeing the same error

@s-rog
Contributor

s-rog commented Jul 2, 2020

@justusschock I have the same issue as OP where my validation_epoch_end returns a cpu tensor like:
return {'log':{'some_sklearn_metric':a_cpu_tensor}}

Edit:
does this mean that the previous implementation was only returning/logging the rank 0 output?

@Borda
Member

Borda commented Jul 2, 2020

Hi @justusschock, I'm using the default PyTorch install from conda, so I believe NCCL is being used (both GPUs are in the same machine). The weird thing is that this only started happening when I upgraded from 0.8.1 to 0.8.4 so I guess something has changed in ddp?

FYI, we are also testing against the conda PyTorch distributions.

@Anjum48
Author

Anjum48 commented Jul 2, 2020

does this mean that the previous implementation was only returning/logging the rank 0 output?

I was also wondering this. I noticed my validation loss doubled after upgrading to 0.8.4.

@justusschock
Member

It is synced/reduced back to process 0 and only logged there :) Do you think it would be better to log from each process with the respective rank as a prefix?
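
Roughly, that syncing amounts to an all-reduce across processes so that rank 0 can log the aggregated value; a hedged sketch of the idea (not the actual Lightning implementation):

import torch
import torch.distributed as dist

def mean_across_processes(value: torch.Tensor) -> torch.Tensor:
    # sum the value over all processes, then average; with the NCCL backend
    # `value` must already be a CUDA tensor, which is the crux of this issue
    dist.all_reduce(value, op=dist.ReduceOp.SUM)
    return value / dist.get_world_size()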

@justusschock
Member

@Anjum48 regarding your persisting error:
adding it at the top of your script doesn't help, since the Lightning trainer overwrites this. You have to set it in the trainer: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/trainer.py#L166

@Anjum48
Author

Anjum48 commented Jul 2, 2020

OK, I tried it in the trainer and got this error (after disabling AMP):

Loaded pretrained weights for efficientnet-b0
train.py:141: DtypeWarning: Columns (5) have mixed types.Specify dtype option on import or set low_memory=False.
  train_single_fold(args)
GPU available: True, used: False
TPU available: False, using: 0 TPU cores

  | Name      | Type             | Params
-----------------------------------------------
0 | critereon | CrossEntropyLoss | 0     
1 | metric    | AUROC            | 0     
2 | net       | EfficientNet     | 4 M   
Traceback (most recent call last):
  File "train.py", line 141, in <module>
    train_single_fold(args)
  File "train.py", line 65, in train_single_fold
    trainer.fit(model)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1020, in fit
    self.run_pretrain_routine(model)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1128, in run_pretrain_routine
    self.reset_val_dataloader(ref_model)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/data_loading.py", line 342, in reset_val_dataloader
    self._reset_eval_dataloader(model, 'val')
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/data_loading.py", line 268, in _reset_eval_dataloader
    dataloaders = self.request_dataloader(getattr(model, f'{mode}_dataloader'))
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/pytorch_lightning/trainer/data_loading.py", line 363, in request_dataloader
    dataloader = dataloader_fx()
  File "/home/anjum/PycharmProjects/kaggle/siim_isic_melanoma_classification/engine.py", line 194, in val_dataloader
    sampler = DistributedSampler(dataset)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/torch/utils/data/distributed.py", line 43, in __init__
    num_replicas = dist.get_world_size()
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 582, in get_world_size
    return _get_group_size(group)
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 196, in _get_group_size
    _check_default_pg()
  File "/home/anjum/anaconda3/envs/kaggle/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 186, in _check_default_pg
    assert _default_pg is not None, \
AssertionError: Default process group is not initialized

I'm not familiar with Gloo, so I don't know what I'm looking at here. Isn't Gloo for CPU-only training?

@williamFalcon
Contributor

williamFalcon commented Jul 2, 2020

Fixed on master for now.

Yes, currently only rank 0 is used for the val calculation. The problem is that reducing blows up memory and causes these kinds of issues.

However, this will be enabled again in an upcoming PR with an object that can do this kind of syncing automatically. We'll need to solve this problem again then, so we still need to identify what the issue is.

@s-rog
Contributor

s-rog commented Jul 3, 2020

It is synced/reduced back to process 0 and only logged there :)

Yes, currently only rank 0 is used for val calculation.

@justusschock @williamFalcon mind clarifying? These two statements seem to contradict each other, unless I'm misunderstanding something...

@williamFalcon
Contributor

Originally we didn't reduce on val - it was on our todo list. Then we added it, but it had weird issues such as this one. So we removed it and will test it more before adding it back in.

@s-rog
Contributor

s-rog commented Jul 7, 2020

I suppose the metrics package fixes/bypasses this issue?

I just converted my vanilla sklearn metrics to the PL sklearn interface, will test soon.

Edit:

  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/metrics/metric.py", line 145, in __call__
    return apply_to_collection(self._orig_call(*args, **kwargs), torch.Tensor,
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/metrics/converters.py", line 67, in new_func
    return func_to_apply(result, *dec_args, **dec_kwargs)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/utilities/apply_func.py", line 35, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/metrics/converters.py", line 252, in _sync_ddp_if_available
    async_op=False)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 898, in all_reduce
    work = _default_pg.allreduce([tensor], opts)
RuntimeError: Tensors must be CUDA and dense

Same issue using the built-in metric interface, it seems... I'm guessing the metrics need to include a CUDA conversion (for now) to be able to be reduced?
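
For anyone hitting this in the meantime, a possible user-side workaround (a sketch only, reusing the names from the snippet at the top of this issue, not the library's fix) is to create the metric tensor directly on the same device as the rest of the outputs, so the NCCL all_reduce sees a CUDA tensor:

# hypothetical workaround inside validation_epoch_end
metric = torch.tensor(roc_auc_score(y_true, y_pred), device=avg_loss.device)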

@justusschock
Member

Are you on master?

We won't include a CUDA conversion here (we already push to self.device), since this also depends on your distributed backend (and you don't always have access to, or want, the computation to run on a GPU).

@s-rog
Contributor

s-rog commented Jul 8, 2020

@justusschock I just updated and tried again with the same error

@justusschock
Member

Do you have any CPU tensors there?

@s-rog
Contributor

s-rog commented Jul 9, 2020

Nope, both inputs to the Lightning sklearn metric are CUDA tensors.

@justusschock
Member

justusschock commented Jul 9, 2020

I think I know the issue. Currently there is a problem with device propagation in utilities.device_dtype_mixin which causes the device property to be set only on the outermost instance. E.g. when you assign the metric to your LightningModule, the device of your metric isn't updated. So far I don't know why, though.

Any Idea @Borda?
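
To illustrate the kind of gap being described, here is a toy sketch (not the actual mixin): nn.Module.to() on a parent module recurses via _apply() over the submodules' tensors, so an overridden to() on a child module, and any device attribute it maintains, is never invoked when the parent is moved.

import torch
from torch import nn

class DeviceTrackingModule(nn.Module):
    """Toy stand-in for a module that caches its device in an attribute."""
    def __init__(self):
        super().__init__()
        self.register_buffer("buf", torch.zeros(1))
        self._device = torch.device("cpu")

    def to(self, device, *args, **kwargs):
        # only runs when .to() is called on *this* instance directly
        self._device = torch.device(device)
        return super().to(device, *args, **kwargs)

class Parent(nn.Module):
    def __init__(self):
        super().__init__()
        self.metric = DeviceTrackingModule()

if torch.cuda.is_available():
    parent = Parent().to("cuda")
    print(parent.metric.buf.device)  # cuda:0 -- the buffer was moved
    print(parent.metric._device)     # cpu    -- the cached device went stale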

@Borda
Member

Borda commented Jul 15, 2020

not sure at this moment, will check it...

@Borda Borda reopened this Jul 15, 2020
@s-rog
Contributor

s-rog commented Jul 16, 2020

Should this be a separate issue, as it's an issue with metrics and not validation_epoch_end specifically?

@ydcjeff
Contributor

ydcjeff commented Jul 21, 2020

Just some findings, though:

Currently, classification metrics return a CPU tensor while regression metrics return a CUDA tensor; I haven't tested with the sklearn metrics.

COLAB EXAMPLE

@awaelchli
Member

awaelchli commented Jul 21, 2020

@justusschock The problem is simply that when you call .to on the nn.Module, it does not call .to on the submodules. What the PyTorch nn module does instead is only move the tensors (buffers, parameters) of those submodules; since PyTorch has no such thing as a device property, the device is simply defined by the device the tensors are on.

We could come up with a creative solution that involves going over all submodules, checking whether they are an instance of the dtype/device mixin, and calling .to on those.

But this is not a real solution, because if you nest multiple levels of mixin subclasses and plain nn.Module in a particular way, it can still lead to the same bug.

EDIT: I think I have a fix that involves using the apply function on the module, and can submit a PR.
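
A rough idea of what such an apply-based fix could look like (a hedged sketch using a hypothetical DeviceMixin, not the actual Lightning classes or the eventual PR): whenever .to() is called on a mixin instance, it walks the whole module tree with apply() and refreshes the cached device on every submodule that also uses the mixin.

import torch
from torch import nn

class DeviceMixin(nn.Module):
    """Toy mixin that keeps a cached _device attribute in sync."""
    def __init__(self):
        super().__init__()
        self._device = torch.device("cpu")

    def to(self, device, *args, **kwargs):
        target = torch.device(device)

        def update(module):
            # refresh the cached device on self and every nested mixin
            if isinstance(module, DeviceMixin):
                module._device = target

        self.apply(update)
        return super().to(device, *args, **kwargs)

As long as the outermost module (e.g. the LightningModule) carries the mixin, moving it also refreshes the cached device on every nested mixin such as a metric.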

@awaelchli
Member

@Anjum48 the bug should be fixed now, although I was not able to check your use case with precision=16. Let me know if it works or not.

@s-rog
Contributor

s-rog commented Jul 22, 2020

@awaelchli I pulled from master and am still experiencing the same issue with FP16.

@justusschock
Member

Yes, but this is probably something different. If you're using AMP, it doesn't convert inputs and outputs once; it monkey-patches every function to convert its inputs and convert its outputs back...

Are you on Slack? Maybe it's best to write me a message there, so that we can easily discuss your issue without spamming all the others :)
