CUDA out of Memory when training STFPM #15

Closed
ashwinvaidya17 opened this issue Dec 4, 2021 · 3 comments · Fixed by #64
Labels
Bug Something isn't working

Comments

@ashwinvaidya17
Collaborator

Got this error when training STFPM on the pill category. It occurred when training from the benchmarking script.

Error

output = self.trainer.call_hook('validation_step_end', *args, **kwargs)
  File "/home/ashwin/miniconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1230, in call_hook
    output = hook_fx(*args, **kwargs)
  File "/home/ashwin/anomalib/anomalib/core/model/anomaly_module.py", line 105, in validation_step_end
    self.pixel_metrics(val_step_outputs["anomaly_maps"].flatten(), val_step_outputs["mask"].flatten().int())
  File "/home/ashwin/miniconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ashwin/miniconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/collections.py", line 110, in forward
    return {k: m(*args, **m._filter_kwargs(**kwargs)) for k, m in self.items()}
  File "/home/ashwin/miniconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/collections.py", line 110, in <dictcomp>
    return {k: m(*args, **m._filter_kwargs(**kwargs)) for k, m in self.items()}
  File "/home/ashwin/miniconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ashwin/miniconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/metric.py", line 205, in forward
    self._forward_cache = self.compute()
  File "/home/ashwin/miniconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/metric.py", line 367, in wrapped_func
    self._computed = compute(*args, **kwargs)
  File "/home/ashwin/anomalib/anomalib/core/metrics/optimal_f1.py", line 38, in compute
    precision, recall, thresholds = self.precision_recall_curve.compute()
  File "/home/ashwin/miniconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/metric.py", line 367, in wrapped_func
    self._computed = compute(*args, **kwargs)
  File "/home/ashwin/miniconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/classification/precision_recall_curve.py", line 148, in compute
    return _precision_recall_curve_compute(preds, target, self.num_classes, self.pos_label)
  File "/home/ashwin/miniconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/functional/classification/precision_recall_curve.py", line 260, in _precision_recall_curve_compute
    return _precision_recall_curve_compute_single_class(preds, target, pos_label, sample_weights)
  File "/home/ashwin/miniconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/functional/classification/precision_recall_curve.py", line 140, in _precision_recall_curve_compute_single_class
    fps, tps, thresholds = _binary_clf_curve(
  File "/home/ashwin/miniconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/functional/classification/precision_recall_curve.py", line 36, in _binary_clf_curve
    desc_score_indices = torch.argsort(preds, descending=True)
RuntimeError: CUDA out of memory. Tried to allocate 5.82 GiB (GPU 0; 23.70 GiB total capacity; 9.28 GiB already allocated; 2.61 GiB free; 19.41 GiB reserved in total by PyTorch)

Relevant Config

  • Image size = 256
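
For context, here is a minimal sketch of the pattern that runs out of memory: the pixel-level metric accumulates every flattened anomaly map on the GPU, and the precision-recall curve then argsorts the whole accumulation in a single allocation. The batch count and sizes below are made up, and it uses the pre-1.0 torchmetrics API (pos_label) that appears in the traceback.

import torch
from torchmetrics import PrecisionRecallCurve

# Illustrative only: sizes are invented, requires a CUDA device.
metric = PrecisionRecallCurve(pos_label=1).cuda()

for _ in range(100):  # e.g. 100 validation batches
    preds = torch.rand(32 * 256 * 256, device="cuda")               # flattened anomaly maps
    target = torch.randint(0, 2, (32 * 256 * 256,), device="cuda")  # flattened ground-truth masks
    metric.update(preds, target)

# compute() concatenates everything accumulated so far and calls
# torch.argsort on it in one go, which is the allocation that fails.
precision, recall, thresholds = metric.compute()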
ashwinvaidya17 added the Bug (Something isn't working) label Dec 4, 2021
@samet-akcay
Contributor

Looks like the issue that @blakshma is having with the CFlow implementation is quite similar to this one, if not the same. @djdameln is trying to move the metric computation to the CPU, which would hopefully resolve the issue.
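
As a rough illustration, moving the update to the CPU could look something like the sketch below, mirroring the validation_step_end call in the traceback (the method body is illustrative, not the actual change in the linked PR):

def validation_step_end(self, val_step_outputs):
    # Move the flattened per-pixel tensors to host memory before the
    # torchmetrics update, so the argsort inside the precision-recall
    # curve runs in system RAM instead of on the GPU. This assumes
    # self.pixel_metrics itself is also kept on the CPU.
    self.pixel_metrics(
        val_step_outputs["anomaly_maps"].detach().flatten().cpu(),
        val_step_outputs["mask"].detach().flatten().int().cpu(),
    )
    return val_step_outputs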

@blakshma
Contributor

blakshma commented Jan 9, 2022

Should we limit the number of samples used to calculate the pixel-level threshold? This might become a bigger problem if users are training on thousands of images.
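
For illustration, capping the sample count could be as simple as randomly subsampling the flattened pixels before the metric update; the helper name and limit below are made up:

import torch

def subsample_pixels(preds: torch.Tensor, target: torch.Tensor, max_samples: int = 1_000_000):
    """Randomly keep at most max_samples flattened pixels before updating the metric."""
    preds, target = preds.flatten(), target.flatten()
    if preds.numel() > max_samples:
        idx = torch.randperm(preds.numel(), device=preds.device)[:max_samples]
        preds, target = preds[idx], target[idx]
    return preds, target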

@samet-akcay
Contributor

Yeah, this is on our agenda. For now, this PR will move the computation to the CPU.

samet-akcay linked a pull request Jan 24, 2022 that will close this issue