
Add combine to compose multiple evaluations #150

Merged: lvwerra merged 9 commits into main from add-module-composer on Jun 29, 2022

Conversation


@lvwerra lvwerra commented Jun 17, 2022

This PR adds the EvaluationEnsemble class that lets users combine several EvaluationModules into one compact object that behaves like a single module. Here's how it works:

from evaluate import EvaluationEnsemble

clf_metrics = EvaluationEnsemble(["accuracy", "f1", "precision", "recall"])
clf_metrics.compute(predictions=[0,1], references=[1,1])

{
    'accuracy': 0.5, 
    'f1': 0.666, 
    'precision': 1.0,
    'recall': 0.5
}

Under the hood, the add, add_batch, and compute methods iteratively call the method of the same name on each module.
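To make the dispatch concrete, here is a minimal sketch of the pattern (the class and attribute names are illustrative, not necessarily the ones used in this PR):

class CombinedModules:
    # Illustrative wrapper that forwards calls to several evaluation modules.
    def __init__(self, modules):
        self.modules = modules

    def add(self, prediction=None, reference=None, **kwargs):
        for module in self.modules:
            module.add(prediction=prediction, reference=reference, **kwargs)

    def add_batch(self, predictions=None, references=None, **kwargs):
        for module in self.modules:
            module.add_batch(predictions=predictions, references=references, **kwargs)

    def compute(self, **kwargs):
        results = {}
        for module in self.modules:
            # prefix handling for key clashes (described below) is omitted here
            results.update(module.compute(**kwargs))
        return results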

There are a few rules for the keys of the returned dict:

  • if there are name clashes among the score names, the module name is used as a prefix
  • if the module names themselves clash, the prefixes are additionally enumerated
  • one can force the prefix with force_prefix=True at initialization (see the short sketch after this list)
  • for the prefix, the module names are used, except when the modules are passed as a dict, in which case the keys are used
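For illustration, a minimal sketch of force_prefix; the prefixed key format is an assumption derived from the rules above, and the values are taken from the first example:

clf_metrics = EvaluationEnsemble(["accuracy", "f1"], force_prefix=True)
clf_metrics.compute(predictions=[0, 1], references=[1, 1])

# assumed result shape: every key is prefixed with the module name
{
    'accuracy_accuracy': 0.5,
    'f1_f1': 0.666
}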

Here's an example to illustrate forced prefixing with custom names; the dict below is what compute returns for this ensemble:

bleu_metrics = EvaluationEnsemble({
    "bleu_simple": evaluate.load("bleu"),
    "bleu_complex": evaluate.load("bleu"),
})

{
 'bleu_simple_bleu': 0.0,
 'bleu_simple_precisions': [0.75, 0.5, 0.0, 0.0],
 'bleu_simple_brevity_penalty': 1.0,
 'bleu_simple_length_ratio': 1.0,
 'bleu_simple_translation_length': 4,
 'bleu_simple_reference_length': 4,
 'bleu_complex_bleu': 0.0,
 'bleu_complex_precisions': [0.75, 0.5, 0.0, 0.0],
 'bleu_complex_brevity_penalty': 1.0,
 'bleu_complex_length_ratio': 1.0,
 'bleu_complex_translation_length': 4,
 'bleu_complex_reference_length': 4
}

This should be especially useful when the same metric is used with different configs in one ensemble.
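As a sketch of that use case, one could combine the same metric loaded with two different configurations and keep the results apart via dict keys (the glue configs here are just an illustration, not something exercised in this PR):

glue_metrics = EvaluationEnsemble({
    "mrpc": evaluate.load("glue", "mrpc"),
    "sst2": evaluate.load("glue", "sst2"),
})
results = glue_metrics.compute(predictions=[0, 1], references=[1, 1])
# per the clash rules above, any overlapping score names (e.g. accuracy)
# would come back prefixed with the dict keys "mrpc" and "sst2"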

I am not married to the name EvaluationEnsemble so if there are any better suggestions I am happy to hear them.


HuggingFaceDocBuilderDev commented Jun 17, 2022

The documentation is not available anymore as the PR was closed or merged.

@douwekiela
Contributor

Tbh I'm not sold on the name - won't people get confused about this being a different kind of "ensemble"? I'd call it AggregatedEvaluation or EvaluationAggregation?


lhoestq commented Jun 20, 2022

I suggested load_metrics(names: List[str]) and combine_metrics(metrics: List[Metric]) here, if it can help: #8 (comment). Maybe the class can be CombinedMetrics?

Btw I think EvaluationModule is not a good name - I'd like to discuss this more (I missed this suggestion in #40, sorry, and I did not approve #42). In particular, it feels unnatural to me to try to understand what it's supposed to represent, especially when implementing a metric. Maybe Metric can be a subclass of EvaluationModule, if you really need the high-level abstraction for evaluate.load.


lvwerra commented Jun 20, 2022

Yes, I am also not totally happy with the name yet. The reason I didn't go for combine_metrics or load_metrics is that it is again specific to metrics, whereas we have Metrics, Comparisons, and Measurements. We could go for combine, but I feel it might be a bit too general to understand its purpose intuitively; maybe combine_evaluations. I also like AggregatedEvaluation.

Regarding EvaluationModule, I don't have a strong opinion. I feel like only a small fraction of users will create a custom module, and for them the module template will already be populated and what's left to do is described. We can create subclasses of EvaluationModule for Metric, Comparison, and Measurement if you think that would make it more intuitive.


lhoestq commented Jun 23, 2022

Yes, I am also not totally happy with the name yet. The reason I didn't go for combine_metrics or load_metrics is that it is again specific to metrics, whereas we have Metrics, Comparisons, and Measurements. We could go for combine, but I feel it might be a bit too general to understand its purpose intuitively; maybe combine_evaluations. I also like AggregatedEvaluation.

Ok I see! I'm fine with having combine at the top level, to be honest ^^

Regarding EvaluationModule, I don't have a strong opinion. I feel like only a small fraction of users will create a custom module, and for them the module template will already be populated and what's left to do is described. We can create subclasses of EvaluationModule for Metric, Comparison, and Measurement if you think that would make it more intuitive.

The idea is to make it easy for anyone to read a metric/comparison/measurement script. I think it's easier to understand when we use more relevant names. Though it also depends on whether it's compatible with your plans regarding the abstraction/separation of those types of evaluations.

@lvwerra lvwerra mentioned this pull request Jun 24, 2022

lvwerra commented Jun 24, 2022

Following the discussions above, I added a combine function and renamed EvaluationEnsemble to AggregateEvaluation. Behind the scenes, AggregateEvaluation still does the heavy lifting and is the object returned by combine.
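For reference, a minimal sketch of the new entry point in use, mirroring the first example above (assuming combine accepts the same list of module names that EvaluationEnsemble did):

import evaluate

clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])
clf_metrics.compute(predictions=[0, 1], references=[1, 1])

{
    'accuracy': 0.5,
    'f1': 0.666,
    'precision': 1.0,
    'recall': 0.5
}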

The naming of EvaluationModule is addressed separately in #160. Do you think we also need to split AggregateEvaluation into AggregateMetric, AggregateComparison, etc.? The plan is not to expose this to the user directly (by removing it from the __init__), but they would still see it when inspecting the object returned by combine.

Let me know what you think.


@sashavor sashavor left a comment


@douwekiela picked up on all of the typos, otherwise the logic looks good!


@lhoestq lhoestq left a comment


Cool! More comments, mostly coming from the open-source experience I've gained over the past two years. I'm happy to share these thoughts, but please ignore them if they slow you down too much.

Comment on lines +791 to +794
duplicate_names = [
    item for item, count in collections.Counter(self.evaluation_module_names).items() if count > 1
]
duplicate_counter = {name: 0 for name in duplicate_names}
Member

It's also possible to simply raise an error for now, if we need more time to think about how to handle this case.

As a user, it's a bit unclear to me that it would automatically use prefixes when there are name conflicts, but no prefixes if there are no conflicts. This is the kind of black magic we usually try to avoid - though I don't have a strong opinion on this very specific case, since it would be easy to explain in the docs. What do you think?

An alternative could be to ask users to name the metrics:

combine({
    "metric_1": metric_1,
    "metric_2": metric_2
})

Member Author

What do you think about adding a warning when prefixes are added:

f"The key '{key}' occurs in several evaluation results. Prefixing with the key with '{prefix}'."

The user can already use a dict in the current state of the PR (see the last example in the initial comment). What do you think?

Member

Actually I noticed that even if you provide a dictionary, it may not use a prefix. Therefore the logic is always "use a prefix if there is a name conflict", even with a dictionary. Because of that I don't think a warning is necessary, since a warning would imply that the behavior does not really fit the minimal expected behavior.

This needs to be written explicitly in the docstring in this case, though.

@lvwerra lvwerra changed the title Add EvaluationEnsemble Add combine to compose multiple evaluations Jun 28, 2022

@lhoestq lhoestq left a comment


Thank you! It all looks good :)

The described behavior is actually really nice when there are name conflicts; I just added a comment to describe it in the docstring.


expected_results[f"{dummy_metric.name}_{i}_{k}"] = dummy_result[k]
self.assertDictEqual(expected_results, combined_evaluation.compute(predictions=preds, references=refs))

def test_two_modules_with_same_score_name(self):
Member

You could also test the case where the modules have the same module name.

Member Author

I think that is covered in test_duplicate_module, no?

Member Author

I'll merge this now; if that's not the case, I can add more tests in another PR.


@douwekiela douwekiela left a comment


LGTM

lvwerra and others added 2 commits June 29, 2022 17:09
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
@lvwerra lvwerra merged commit 907a0bc into main Jun 29, 2022
@lvwerra lvwerra deleted the add-module-composer branch June 29, 2022 15:45