CUDA out of Memory when training STFPM #15

Closed
ashwinvaidya17 opened this issue Dec 4, 2021 · 3 comments · Fixed by #64
Labels
Bug Something isn't working

Comments

@ashwinvaidya17
Collaborator

Got this error when training STFPM on the pill category. It occurred when training from the benchmarking script.

Error

output = self.trainer.call_hook('validation_step_end', *args, **kwargs)
  File "/home/ashwin/miniconda3/envs/anomalib/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1230, in call_hook
    output = hook_fx(*args, **kwargs)
  File "/home/ashwin/anomalib/anomalib/core/model/anomaly_module.py", line 105, in validation_step_end
    self.pixel_metrics(val_step_outputs["anomaly_maps"].flatten(), val_step_outputs["mask"].flatten().int())
  File "/home/ashwin/miniconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ashwin/miniconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/collections.py", line 110, in forward
    return {k: m(*args, **m._filter_kwargs(**kwargs)) for k, m in self.items()}
  File "/home/ashwin/miniconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/collections.py", line 110, in <dictcomp>
    return {k: m(*args, **m._filter_kwargs(**kwargs)) for k, m in self.items()}
  File "/home/ashwin/miniconda3/envs/anomalib/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ashwin/miniconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/metric.py", line 205, in forward
    self._forward_cache = self.compute()
  File "/home/ashwin/miniconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/metric.py", line 367, in wrapped_func
    self._computed = compute(*args, **kwargs)
  File "/home/ashwin/anomalib/anomalib/core/metrics/optimal_f1.py", line 38, in compute
    precision, recall, thresholds = self.precision_recall_curve.compute()
  File "/home/ashwin/miniconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/metric.py", line 367, in wrapped_func
    self._computed = compute(*args, **kwargs)
  File "/home/ashwin/miniconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/classification/precision_recall_curve.py", line 148, in compute
    return _precision_recall_curve_compute(preds, target, self.num_classes, self.pos_label)
  File "/home/ashwin/miniconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/functional/classification/precision_recall_curve.py", line 260, in _precision_recall_curve_compute
    return _precision_recall_curve_compute_single_class(preds, target, pos_label, sample_weights)
  File "/home/ashwin/miniconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/functional/classification/precision_recall_curve.py", line 140, in _precision_recall_curve_compute_single_class
    fps, tps, thresholds = _binary_clf_curve(
  File "/home/ashwin/miniconda3/envs/anomalib/lib/python3.8/site-packages/torchmetrics/functional/classification/precision_recall_curve.py", line 36, in _binary_clf_curve
    desc_score_indices = torch.argsort(preds, descending=True)
RuntimeError: CUDA out of memory. Tried to allocate 5.82 GiB (GPU 0; 23.70 GiB total capacity; 9.28 GiB already allocated; 2.61 GiB free; 19.41 GiB reserved in total by PyTorch)

Relevant Config

  • Image size = 256
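
For context, here is a minimal sketch of the pattern that runs out of memory: the pixel-level metric accumulates every flattened anomaly map on the GPU, and the precision-recall curve then argsorts the whole accumulation in a single allocation. The batch count and sizes below are made up, and it uses the pre-1.0 torchmetrics API (pos_label) that appears in the traceback.

import torch
from torchmetrics import PrecisionRecallCurve

# Illustrative only: sizes are invented, requires a CUDA device.
metric = PrecisionRecallCurve(pos_label=1).cuda()

for _ in range(100):  # e.g. 100 validation batches
    preds = torch.rand(32 * 256 * 256, device="cuda")               # flattened anomaly maps
    target = torch.randint(0, 2, (32 * 256 * 256,), device="cuda")  # flattened ground-truth masks
    metric.update(preds, target)

# compute() concatenates everything accumulated so far and calls
# torch.argsort on it in one go, which is the allocation that fails.
precision, recall, thresholds = metric.compute()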
ashwinvaidya17 added the Bug (Something isn't working) label Dec 4, 2021
@samet-akcay
Contributor

Looks like the issue that @blakshma is having with the CFlow implementation is quite similar to this one, if not the same. @djdameln is trying to move the metric computation to the CPU, which would hopefully resolve the issue.
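
As a rough illustration, moving the update to the CPU could look something like the sketch below, mirroring the validation_step_end call in the traceback (the method body is illustrative, not the actual change in the linked PR):

def validation_step_end(self, val_step_outputs):
    # Move the flattened per-pixel tensors to host memory before the
    # torchmetrics update, so the argsort inside the precision-recall
    # curve runs in system RAM instead of on the GPU. This assumes
    # self.pixel_metrics itself is also kept on the CPU.
    self.pixel_metrics(
        val_step_outputs["anomaly_maps"].detach().flatten().cpu(),
        val_step_outputs["mask"].detach().flatten().int().cpu(),
    )
    return val_step_outputs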

@blakshma
Contributor

blakshma commented Jan 9, 2022

Should we limit the number of samples used to calculate the pixel-level threshold? This might become a bigger problem if users are training on thousands of images.
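
For illustration, capping the sample count could be as simple as randomly subsampling the flattened pixels before the metric update; the helper name and limit below are made up:

import torch

def subsample_pixels(preds: torch.Tensor, target: torch.Tensor, max_samples: int = 1_000_000):
    """Randomly keep at most max_samples flattened pixels before updating the metric."""
    preds, target = preds.flatten(), target.flatten()
    if preds.numel() > max_samples:
        idx = torch.randperm(preds.numel(), device=preds.device)[:max_samples]
        preds, target = preds[idx], target[idx]
    return preds, target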

@samet-akcay
Contributor

Yeah, this is on our agenda. For now, this PR will move the computation to the CPU.

samet-akcay linked a pull request Jan 24, 2022 that will close this issue