
question about AdaptiveThreshold, possible bug? #457

Closed · davidebaltieri31 opened this issue Jul 27, 2022 · 7 comments · Fixed by #527

davidebaltieri31 commented Jul 27, 2022

I'm running anomalib on a custom dataset (around 20k samples) and I've noticed constant crashes after a few epochs/validation steps (`DefaultCPUAllocator: can't allocate memory...`), almost independently of the model I choose.

It seems to be related to AdaptiveThreshold.
Internally it uses PrecisionRecallCurve, and after every validation step new data is added to it, but the old data is never deleted.
Why?
Shouldn't the data from previous validation steps be deleted?

ORippler (Contributor) commented Aug 2, 2022

Since PrecisionRecallCurve works on all data, i.e. it is computed over the complete data by default in torchmetrics, you have to hold all your datapoints in memory. For large datasets, you can look at the binned PR curve, which torchmetrics implements with a constant memory requirement.
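For reference, a minimal sketch of the binned variant, assuming a torchmetrics version of that era (~0.9) which ships `BinnedPrecisionRecallCurve`; batch sizes and the bin count are illustrative:

```python
import torch
from torchmetrics import BinnedPrecisionRecallCurve

# Memory is bounded by the number of bins, not the dataset size: only
# per-threshold TP/FP/FN counters are kept between updates.
pr_curve = BinnedPrecisionRecallCurve(num_classes=1, thresholds=100)

for _ in range(1000):                     # e.g. many validation batches
    preds = torch.rand(512)               # anomaly scores, must lie in [0, 1]
    target = torch.randint(0, 2, (512,))  # binary ground-truth labels
    pr_curve.update(preds, target)

precision, recall, thresholds = pr_curve.compute()
```

Note that the binned variant requires scores normalized to [0, 1] and trades exactness for the fixed bin resolution.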

davidebaltieri31 (Author) commented:

I think my post wasn't clear and I used the term "step" incorrectly.
I understand that PrecisionRecallCurve works on all data, but why does it keep data from previous epochs?
I.e., I run training and validation after each epoch:
Epoch 1 -> validation 1 -> epoch 2 -> validation 2 -> etc.,
which I think is the default setting in the repo config files. My question is: why does AdaptiveThreshold use PrecisionRecallCurve in a way that keeps data from both validation 1 and validation 2? Shouldn't it use data from validation 2 only (i.e. validation of the latest trained model)?

ORippler (Contributor) commented Aug 3, 2022

> I think my post wasn't clear and I used the term "step" incorrectly. I understand that PrecisionRecallCurve works on all data, but why does it keep data from previous epochs? I.e., I run training and validation after each epoch: Epoch 1 -> validation 1 -> epoch 2 -> validation 2 -> etc., which I think is the default setting in the repo config files. My question is: why does AdaptiveThreshold use PrecisionRecallCurve in a way that keeps data from both validation 1 and validation 2? Shouldn't it use data from validation 2 only (i.e. validation of the latest trained model)?

Ahh I see. At a quick glance it seems as if the reset method is missing from AdaptiveThreshold.

Can you try adding:

```python
def reset(self) -> None:
    """Reset the accumulated state of the wrapped precision-recall curve."""
    self.precision_recall_curve.reset()
```

to the class and see whether that fixes the problem? The class can be found here; overall, it should look as follows:

```python
class AdaptiveThreshold(Metric):
    """Optimal F1 Metric.

    Compute the optimal F1 score at the adaptive threshold, based on the F1 metric of the true labels and the
    predicted anomaly scores.
    """

    def __init__(self, default_value: float = 0.5, **kwargs):
        super().__init__(**kwargs)

        self.precision_recall_curve = PrecisionRecallCurve(num_classes=1)
        self.add_state("value", default=torch.tensor(default_value), persistent=True)  # pylint: disable=not-callable
        self.value = torch.tensor(default_value)  # pylint: disable=not-callable

    # pylint: disable=arguments-differ
    def update(self, preds: torch.Tensor, target: torch.Tensor) -> None:  # type: ignore
        """Update the precision-recall curve metric."""
        self.precision_recall_curve.update(preds, target)

    def compute(self) -> torch.Tensor:
        """Compute the threshold that yields the optimal F1 score.

        Compute the F1 scores while varying the threshold. Store the optimal
        threshold as attribute and return the maximum value of the F1 score.

        Returns:
            Value of the F1 score at the optimal threshold.
        """
        precision: torch.Tensor
        recall: torch.Tensor
        thresholds: torch.Tensor

        precision, recall, thresholds = self.precision_recall_curve.compute()
        f1_score = (2 * precision * recall) / (precision + recall + 1e-10)
        if thresholds.dim() == 0:
            # special case where recall is 1.0 even for the highest threshold.
            # In this case 'thresholds' will be scalar.
            self.value = thresholds
        else:
            self.value = thresholds[torch.argmax(f1_score)]
        return self.value

    # ADDED METHOD
    def reset(self) -> None:
        """Reset the accumulated state of the wrapped precision-recall curve."""
        self.precision_recall_curve.reset()
```

Note: the same will also be necessary for OptimalF1.
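Assuming OptimalF1 also stores its curve under a `precision_recall_curve` attribute (an assumption on my part, mirroring AdaptiveThreshold), the addition would look analogous:

```python
from torchmetrics import Metric

class OptimalF1(Metric):
    ...

    def reset(self) -> None:
        # Hypothetical sketch: assumes OptimalF1 wraps a PrecisionRecallCurve
        # under the same attribute name as AdaptiveThreshold.
        self.precision_recall_curve.reset()
```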

davidebaltieri31 (Author) commented:

Hi,
I added the reset code to both, but it's never called. I think there are probably other changes that need to be made, perhaps in anomaly_module or somewhere else in anomalib?

ORippler (Contributor) commented Aug 4, 2022

> Hi, I added the reset code to both, but it's never called. I think there are probably other changes that need to be made, perhaps in anomaly_module or somewhere else in anomalib?

Did you install anomalib with the -e flag, i.e. via `pip install -e .`?

If so, please provide an MWE so that the bug can be localized.
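For illustration, the unbounded accumulation itself can be reproduced with torchmetrics alone; a minimal sketch (batch sizes and epoch counts are arbitrary):

```python
import torch
from torchmetrics import PrecisionRecallCurve

curve = PrecisionRecallCurve(num_classes=1)

for epoch in range(3):
    for _ in range(10):  # one "validation epoch" worth of batches
        curve.update(torch.rand(256), torch.randint(0, 2, (256,)))
    curve.compute()
    # torchmetrics appends every update to the `preds`/`target` state lists,
    # so without an explicit reset() memory grows with every epoch:
    print(f"epoch {epoch}: {sum(p.numel() for p in curve.preds)} stored scores")
    # curve.reset()  # uncomment to clear the state and keep memory constant
```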

jpcbertoldo (Contributor) commented:

Shouldn't this one be closed?

djdameln (Contributor) commented:

I believe that the original issue is not yet resolved. While the reset method has been added to the AdaptiveThreshold metric class, the method is never called. This means that for models that need to be trained for multiple epochs, the data from previous epochs is never cleared, which in turn may lead to out-of-memory errors. It could also lead to inaccurate computation of the adaptive threshold, since the threshold would be computed over predictions from older model states as well. We will investigate this a bit further and work on a fix.
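One possible direction (a sketch only, not the final fix) would be to clear the thresholding metrics at the start of every validation epoch via a PyTorch Lightning callback; the `image_threshold`/`pixel_threshold` attribute names below are assumptions for illustration:

```python
from pytorch_lightning import Callback

class MetricsResetCallback(Callback):
    """Reset accumulated metric state before each validation epoch starts."""

    def on_validation_epoch_start(self, trainer, pl_module) -> None:
        # Attribute names are hypothetical; adapt to whatever attributes of
        # the anomaly module actually hold the adaptive-threshold metrics.
        for name in ("image_threshold", "pixel_threshold"):
            metric = getattr(pl_module, name, None)
            if metric is not None:
                metric.reset()
```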
