
[Bug]: AnomalyScoreThreshold is incompatible with multi-GPU training #1398

Open · 1 task done
Seanny123 opened this issue Oct 10, 2023 · 11 comments · May be fixed by #1413

@Seanny123
Contributor
Describe the bug

When trying to do multi-GPU training of FastFlow by setting strategy: ddp and accelerator: gpu in the config, I get the following error:

Traceback (most recent call last):
  File "/home/sean/combinedpipe/run_anomalib.py", line 40, in <module>
    trainer.fit(model=model, datamodule=datamodule)
  File "/home/sean/sean/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/home/sean/sean/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/sean/sean/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/sean/sean/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
    results = self._run_stage()
  File "/home/sean/sean/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
    self._run_train()
  File "/home/sean/sean/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1214, in _run_train
    self.fit_loop.run()
  File "/home/sean/sean/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/sean/sean/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/sean/sean/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.on_advance_end()
  File "/home/sean/sean/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 250, in on_advance_end
    self._run_validation()
  File "/home/sean/sean/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 308, in _run_validation
    self.val_loop.run()
  File "/home/sean/sean/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 206, in run
    output = self.on_run_end()
  File "/home/sean/sean/lib/python3.10/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 180, in on_run_end
    self._evaluation_epoch_end(self._outputs)
  File "/home/sean/sean/lib/python3.10/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 288, in _evaluation_epoch_end
    self.trainer._call_lightning_module_hook(hook_name, output_or_outputs)
  File "/home/sean/sean/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1356, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/sean/anomalib/src/anomalib/models/components/base/anomaly_module.py", line 145, in validation_epoch_end
    self._compute_adaptive_threshold(outputs)
  File "/home/sean/anomalib/src/anomalib/models/components/base/anomaly_module.py", line 162, in _compute_adaptive_threshold
    self.image_threshold.compute()
  File "/home/sean/sean/lib/python3.10/site-packages/torchmetrics/metric.py", line 529, in wrapped_func
    with self.sync_context(
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/home/sean/sean/lib/python3.10/site-packages/torchmetrics/metric.py", line 500, in sync_context
    self.sync(
  File "/home/sean/sean/lib/python3.10/site-packages/torchmetrics/metric.py", line 452, in sync
    self._sync_dist(dist_sync_fn, process_group=process_group)
  File "/home/sean/sean/lib/python3.10/site-packages/torchmetrics/metric.py", line 364, in _sync_dist
    output_dict = apply_to_collection(
  File "/home/sean/sean/lib/python3.10/site-packages/torchmetrics/utilities/data.py", line 203, in apply_to_collection
    return elem_type({k: apply_to_collection(v, dtype, function, *args, **kwargs) for k, v in data.items()})
  File "/home/sean/sean/lib/python3.10/site-packages/torchmetrics/utilities/data.py", line 203, in <dictcomp>
    return elem_type({k: apply_to_collection(v, dtype, function, *args, **kwargs) for k, v in data.items()})
  File "/home/sean/sean/lib/python3.10/site-packages/torchmetrics/utilities/data.py", line 209, in apply_to_collection
    return elem_type([apply_to_collection(d, dtype, function, *args, **kwargs) for d in data])
  File "/home/sean/sean/lib/python3.10/site-packages/torchmetrics/utilities/data.py", line 209, in <listcomp>
    return elem_type([apply_to_collection(d, dtype, function, *args, **kwargs) for d in data])
  File "/home/sean/sean/lib/python3.10/site-packages/torchmetrics/utilities/data.py", line 199, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/home/sean/sean/lib/python3.10/site-packages/torchmetrics/utilities/distributed.py", line 131, in gather_all_tensors
    torch.distributed.all_gather(local_sizes, local_size, group=group)
  File "/home/sean/sean/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
    return func(*args, **kwargs)
  File "/home/sean/sean/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2450, in all_gather
    work = group.allgather([tensor_list], [tensor])
RuntimeError: Tensors must be CUDA and dense
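
For context, the trainer setup in run_anomalib.py is roughly equivalent to the following (a minimal sketch; model and datamodule are the FastFlow module and the datamodule built from the config, shown here only as placeholders):

import pytorch_lightning as pl

# Minimal sketch of the multi-GPU setup that produces the traceback above.
# `model` is the anomalib FastFlow lightning module and `datamodule` is the
# anomalib datamodule; both are constructed from the config (not shown).
trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    strategy="ddp",
)
trainer.fit(model=model, datamodule=datamodule)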

Getting around the RuntimeError: Tensors must be CUDA and dense error by removing all .cpu() calls in src/anomalib/models/components/base/anomaly_module.py results in image_F1Score being 0.0 during both testing and validation.

Why is AnomalyScoreThreshold incompatible with multi-GPU training, and how could it be modified to be compatible?

Dataset

Other (please specify in the text field below)

Model

FastFlow

Steps to reproduce the behavior

See bug description.

OS information

OS information:

  • OS: Ubuntu 22.04.03
  • Python version: 3.10.12
  • Anomalib version: main on GitHub
  • PyTorch version: 2.0.1
  • CUDA/cuDNN version: 12.2
  • GPU models and configuration: 2x NVIDIA RTX 6000 Ada
  • Any other relevant information: I'm using the hazelnut toy dataset

Expected behavior

I expected to be able to do multi-GPU training using FastFlow and for the F1 score to be non-zero.

Screenshots

No response

Pip/GitHub

GitHub

What version/branch did you use?

main

Configuration YAML

See bug description.

Logs

See bug description.

Code of Conduct

  • I agree to follow this project's Code of Conduct
@wsj20010128

I had the same issue with EfficientAD, but it seems the bug has not been fixed yet.

@max82645235

Yes, I had the same issue with PatchCore when using multi-GPU training.

@ashwinvaidya17
Collaborator

This is something we need to address. We move the metrics to the CPU for computation to avoid OOM on the GPU, but this breaks multi-GPU training. For now, the workaround is to remove all the calls that move outputs and metrics to the CPU. We plan to make this configurable in the future.
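
As a rough illustration of what "configurable" could look like (a sketch only; keep_metrics_on_gpu is a hypothetical option, not an existing anomalib setting):

import torch
from torchmetrics import Metric


def place_metric(metric: Metric, device: torch.device, keep_metrics_on_gpu: bool) -> None:
    # Hypothetical helper: keep the metric state on the training device so that
    # DDP all_gather works, or move it to CPU to reduce GPU memory usage.
    metric.to(device if keep_metrics_on_gpu else torch.device("cpu"))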

@Seanny123
Contributor Author

@ashwinvaidya17 as I note in my original post, removing all the calls that move outputs to the CPU does not resolve the problem:

Getting around the RuntimeError: Tensors must be CUDA and dense error by removing all .cpu() calls in src/anomalib/models/components/base/anomaly_module.py results in image_F1Score being 0.0 during both testing and validation.

@ashwinvaidya17
Collaborator

We also have these lines here https://github.com/openvinotoolkit/anomalib/blob/main/src/anomalib/utils/callbacks/post_processing_configuration.py#L75. Did you try removing these as well?

@Seanny123
Contributor Author

Yes, when I remove these, the metric stops computing: image_F1Score stays at 0.0, which is unhelpful.

@Seanny123
Contributor Author

FYI, a different way to get around the RuntimeError: Tensors must be CUDA and dense is to add:

from typing import Any

import torch
from pytorch_lightning.utilities.distributed import gather_all_tensors
from torchmetrics import PrecisionRecallCurve


def all_gather_on_cuda(tensor: torch.Tensor, *args: Any, **kwargs: Any) -> list[torch.Tensor]:
    """Gather metric state across ranks on CUDA, then move the gathered tensors
    back to the original device (works around all_gather rejecting CPU tensors)."""
    original_device = tensor.device
    return [
        _tensor.to(original_device)
        for _tensor in gather_all_tensors(tensor.to("cuda"), *args, **kwargs)
    ]


class AnomalyScoreThreshold(PrecisionRecallCurve):  # existing class, unchanged below

to anomaly_score_threshold.py.
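
The snippet above does not show how the helper is wired in; presumably it is set as the metric's dist_sync_fn (the torchmetrics hook used to gather metric state across ranks), roughly:

# Assumed wiring (not part of the snippet above): use the CUDA-based gather
# as the metric's dist_sync_fn so state syncing never sees CPU tensors.
image_threshold = AnomalyScoreThreshold(dist_sync_fn=all_gather_on_cuda)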

@pankajmishra000

This is still not fixed; I am still getting the same error. Any suggestions, please?
I already removed all the "to_cpu()" calls and added the above-mentioned lines of code.

@pankajmishra000

    work = group.allgather([tensor_list], [tensor])
RuntimeError: No backend type associated with device type cpu

@max82645235

[Quotes Seanny123's all_gather_on_cuda workaround from the comment above.]

I tried this method; although no errors were reported and results were produced, the F1 and AUC scores became very low.

@Seanny123
Contributor Author

That's interesting, because it worked for me on two datasets, but then I saw a major accuracy drop on a third dataset, so I assumed there was something wrong with my third, new, larger dataset. 🤔
