Using cached metrics in consecutive test_card runs can be faulty #884
Comments
There seems to be a cache of artifacts. It was introduced to improve runtime (e.g. for metrics that download models), but it is error prone if the artifacts are not immutable. Maybe caching should be enabled explicitly.
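A minimal sketch of that failure mode, using simplified stand-ins rather than unitxt's actual cache or metric classes (all names here are illustrative):

```python
# Simplified stand-in for an artifact cache that hands out the same object
# on repeated lookups (hypothetical names, for illustration only).
_artifact_cache = {}


class ToyMetric:
    def __init__(self):
        self.score_prefix = ""  # mutable state that callers may change


def get_artifact(artifact_id):
    # Cached: every caller receives the *same* ToyMetric instance.
    if artifact_id not in _artifact_cache:
        _artifact_cache[artifact_id] = ToyMetric()
    return _artifact_cache[artifact_id]


first = get_artifact("metrics.accuracy")
first.score_prefix = "my_"   # mutate the cached artifact

second = get_artifact("metrics.accuracy")
print(second.score_prefix)   # prints "my_": the mutation leaked into the "new" fetch
```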
Thanks @antonioan, that is indeed a very interesting behaviour. Can you write a minimal piece of code that replicates it so I can have a look?
@antonioan showed this to me. Basically, you just need to have a state variable in the metric (even a simple count of processed instances). If you run two cards, the counter value from the first card will be reused in the second card as well.
Here is a small piece of code that contains the important, relevant parts of the issue:

```python
from unitxt.metrics import Metric
from unitxt.operators import ArtifactFetcherMixin
from unitxt.test_utils.metrics import apply_metric

predictions = ["A", "B", "C"]
references = [["B", "C"], ["A"], ["B", "C"]]

metric = ArtifactFetcherMixin.get_artifact("metrics.accuracy")
assert isinstance(metric, Metric)
metric.score_prefix = "my_"
outputs = apply_metric(
    metric=metric, predictions=predictions, references=references
)
print(outputs[0]["score"])  # prints a score named my_accuracy, as the prefix was explicitly set

metric = ArtifactFetcherMixin.get_artifact("metrics.accuracy")
assert isinstance(metric, Metric)
outputs = apply_metric(
    metric=metric, predictions=predictions, references=references
)
print(outputs[0]["score"])  # still prints my_accuracy, although the metric was fetched fresh from ArtifactFetcherMixin
```
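If the cache is indeed the culprit, the two `get_artifact` calls above should return the very same Python object. A quick check along these lines (a sketch, assuming the cached-instance behaviour described in this issue) would make that visible:

```python
m1 = ArtifactFetcherMixin.get_artifact("metrics.accuracy")
m2 = ArtifactFetcherMixin.get_artifact("metrics.accuracy")
print(m1 is m2)  # expected to print True if the same cached instance is returned
```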
I noticed that a consecutive `test_card` is using the same instance of the metric used by the previous call (for a different card). This is problematic for me, because the common metric that I'm using (inheriting from `BulkInstanceMetric`) dynamically sets the fields in `self.reduction_map["mean"]` during the execution of `compute()`. So upon running `test_card` on the second card, the metric score reduction fails because it sees a field it doesn't recognize (but the previous card does; it had been added by that card).

It's interesting, though, that the second run does call `prepare()` on a new metric instance, but then runs `compute()` on the same previous one, as was caught by my debug prints.
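A rough sketch of the pattern described above, using a toy class rather than the real `BulkInstanceMetric` API (all names are illustrative), showing why reusing one cached instance across two cards breaks the second reduction:

```python
# Toy stand-in mimicking the pattern described above; it is NOT the real
# BulkInstanceMetric API, just an illustration of the mutable-state problem.
class ToyBulkMetric:
    def __init__(self):
        # fields to aggregate; compute() extends this list dynamically
        self.reduction_map = {"mean": ["score"]}

    def compute(self, card_specific_fields):
        # Dynamically register per-card fields, mutating shared state.
        for field in card_specific_fields:
            if field not in self.reduction_map["mean"]:
                self.reduction_map["mean"].append(field)
        return self.reduction_map["mean"]


# If the artifact cache hands the *same* instance to both test_card runs,
# the second card sees fields registered by the first one:
metric = ToyBulkMetric()
print(metric.compute(["card_a_field"]))  # ['score', 'card_a_field']
print(metric.compute(["card_b_field"]))  # ['score', 'card_a_field', 'card_b_field']
# -> the second card's reduction now encounters 'card_a_field', which it does not recognize
```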