
Add combine to compose multiple evaluations #150

Merged: lvwerra merged 9 commits into main from add-module-composer on Jun 29, 2022

Conversation


@lvwerra lvwerra commented Jun 17, 2022

This PR adds the EvaluationEnsemble class that lets users combine several EvaluationModules into one compact object that behaves like a single module. Here's how it works:

from evaluate import EvaluationEnsemble

clf_metrics = EvaluationEnsemble(["accuracy", "f1", "precision", "recall"])
clf_metrics.compute(predictions=[0,1], references=[1,1])

{
    'accuracy': 0.5, 
    'f1': 0.666, 
    'precision': 1.0,
    'recall': 0.5
}

Under the hood, the add, add_batch, and compute methods iteratively call the method of the same name on each module.
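To make the dispatch concrete, here is a minimal sketch of the pattern (the class and attribute names are illustrative, not necessarily the ones used in this PR):

class CombinedModules:
    # Illustrative wrapper that forwards calls to several evaluation modules.
    def __init__(self, modules):
        self.modules = modules

    def add(self, prediction=None, reference=None, **kwargs):
        for module in self.modules:
            module.add(prediction=prediction, reference=reference, **kwargs)

    def add_batch(self, predictions=None, references=None, **kwargs):
        for module in self.modules:
            module.add_batch(predictions=predictions, references=references, **kwargs)

    def compute(self, **kwargs):
        results = {}
        for module in self.modules:
            # prefix handling for key clashes (described below) is omitted here
            results.update(module.compute(**kwargs))
        return results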

There are a few rules for the keys of the returned dict:

  • if there are name clashes among the score names, the module name is used as a prefix
  • if the module names themselves clash, the prefixes are additionally enumerated
  • one can force the prefix with force_prefix=True at initialization (see the short sketch after this list)
  • for the prefix, the module names are used, except when the modules are passed as a dict, in which case the keys are used
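For illustration, a minimal sketch of force_prefix; the prefixed key format is an assumption derived from the rules above, and the values are taken from the first example:

clf_metrics = EvaluationEnsemble(["accuracy", "f1"], force_prefix=True)
clf_metrics.compute(predictions=[0, 1], references=[1, 1])

# assumed result shape: every key is prefixed with the module name
{
    'accuracy_accuracy': 0.5,
    'f1_f1': 0.666
}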

Here's an example to illustrate forced prefixing with custom names; the dict below is what compute returns for this ensemble:

bleu_metrics = EvaluationEnsemble({
    "bleu_simple": evaluate.load("bleu"),
    "bleu_complex": evaluate.load("bleu"),
})

{
 'bleu_simple_bleu': 0.0,
 'bleu_simple_precisions': [0.75, 0.5, 0.0, 0.0],
 'bleu_simple_brevity_penalty': 1.0,
 'bleu_simple_length_ratio': 1.0,
 'bleu_simple_translation_length': 4,
 'bleu_simple_reference_length': 4,
 'bleu_complex_bleu': 0.0,
 'bleu_complex_precisions': [0.75, 0.5, 0.0, 0.0],
 'bleu_complex_brevity_penalty': 1.0,
 'bleu_complex_length_ratio': 1.0,
 'bleu_complex_translation_length': 4,
 'bleu_complex_reference_length': 4
}

This should be especially useful when the same metric is used with different configs in one ensemble.
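As a sketch of that use case, one could combine the same metric loaded with two different configurations and keep the results apart via dict keys (the glue configs here are just an illustration, not something exercised in this PR):

glue_metrics = EvaluationEnsemble({
    "mrpc": evaluate.load("glue", "mrpc"),
    "sst2": evaluate.load("glue", "sst2"),
})
results = glue_metrics.compute(predictions=[0, 1], references=[1, 1])
# per the clash rules above, any overlapping score names (e.g. accuracy)
# would come back prefixed with the dict keys "mrpc" and "sst2"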

I am not married to the name EvaluationEnsemble so if there are any better suggestions I am happy to hear them.


HuggingFaceDocBuilderDev commented Jun 17, 2022

The documentation is not available anymore as the PR was closed or merged.

@douwekiela
Contributor

Tbh I'm not sold on the name - won't people get confused about this being a different kind of "ensemble"? I'd call it AggregatedEvaluation or EvaluationAggregation?


lhoestq commented Jun 20, 2022

I suggested load_metrics(names: List[str]) and combine_metrics(metrics: List[Metric]) here, if it can help: #8 (comment). Maybe the class can be CombinedMetrics?

Btw I think EvaluationModule is not a good name - I'd like to discuss this more (I missed this suggestion in #40, sorry, and I did not approve #42). In particular, it feels unnatural to me to try to understand what it's supposed to represent, especially when implementing a metric. Maybe Metric can be a subclass of EvaluationModule, if you really need the high-level abstraction for evaluate.load.


lvwerra commented Jun 20, 2022

Yes, I am also not totally happy with the name yet. The reason I didn't go for combine_metrics or load_metrics is that it is again specific to metrics, whereas we have Metrics, Comparisons, and Measurements. We could go for combine, but I feel it might be a bit too general to understand its purpose intuitively; maybe combine_evaluations. I also like AggregatedEvaluation.

Regarding EvaluationModule, I don't have a strong opinion. I feel like only a small fraction of users will create a custom module, and for them the module template will already be populated and what's left to do is described. We can create subclasses of EvaluationModule for Metric, Comparison, and Measurement if you think that would make it more intuitive.


lhoestq commented Jun 23, 2022

Yes, I am also not totally happy with the name yet. The reason I didn't go for combine_metrics or load_metrics is that it is again specific to metrics, whereas we have Metrics, Comparisons, and Measurements. We could go for combine, but I feel it might be a bit too general to understand its purpose intuitively; maybe combine_evaluations. I also like AggregatedEvaluation.

Ok I see! I'm fine with having combine at the top level, to be honest ^^

Regarding EvaluationModule, I don't have a strong opinion. I feel like only a small fraction of users will create a custom module, and for them the module template will already be populated and what's left to do is described. We can create subclasses of EvaluationModule for Metric, Comparison, and Measurement if you think that would make it more intuitive.

The idea is to make it easy for anyone to read a metric/comparison/measurement script. I think it's easier to understand when we use more relevant names. Though it also depends on whether it's compatible with your plans regarding the abstraction/separation of those types of evaluations.

@lvwerra lvwerra mentioned this pull request Jun 24, 2022

lvwerra commented Jun 24, 2022

Following the discussions above, I added a combine function and renamed EvaluationEnsemble to AggregateEvaluation. Behind the scenes, AggregateEvaluation still does the heavy lifting and is the object returned by combine.
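For reference, a minimal sketch of the new entry point in use, mirroring the first example above (assuming combine accepts the same list of module names that EvaluationEnsemble did):

import evaluate

clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])
clf_metrics.compute(predictions=[0, 1], references=[1, 1])

{
    'accuracy': 0.5,
    'f1': 0.666,
    'precision': 1.0,
    'recall': 0.5
}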

The naming of EvaluationModule is addressed separately in #160. Do you think we also need to split AggregateEvaluation into AggregateMetric, AggregateComparison, etc.? The plan is not to expose this to the user directly (by removing it from the __init__), but they would still see it when inspecting the object returned by combine.

Let me know what you think.


@sashavor sashavor left a comment


@douwekiela picked up on all of the typos, otherwise the logic looks good!


@lhoestq lhoestq left a comment


Cool! More comments, mostly coming from the open-source experience I've gained over the past two years. I'm happy to share these thoughts, but please ignore them if they slow you down too much.

Comment on lines +791 to +794
duplicate_names = [
    item for item, count in collections.Counter(self.evaluation_module_names).items() if count > 1
]
duplicate_counter = {name: 0 for name in duplicate_names}
Member

It's also possible to simply raise an error for now, if we need more time to think about how to handle this case.

As a user, it's a bit unclear to me that it would automatically use prefixes when there are name conflicts, but no prefixes if there are no conflicts. This is the kind of black magic we usually try to avoid - though I don't have a strong opinion on this very specific case, since it would be easy to explain in the docs. What do you think?

An alternative could be to ask users to name the metrics:

combine({
    "metric_1": metric_1,
    "metric_2": metric_2
})

Member Author

What do you think about adding a warning when prefixes are added:

f"The key '{key}' occurs in several evaluation results. Prefixing with the key with '{prefix}'."

The user can already use a dict in the current state of the PR (see the last example in the initial comment). What do you think?

Member

Actually I noticed that even if you provide a dictionary, it may not use a prefix. Therefore the logic is always "use a prefix if there is a name conflict", even with a dictionary. Because of that I don't think a warning is necessary, since a warning would imply that the behavior does not really fit the minimal expected behavior.

This needs to be written explicitly in the docstring in this case, though.

@lvwerra lvwerra changed the title Add EvaluationEnsemble Add combine to compose multiple evaluations Jun 28, 2022

@lhoestq lhoestq left a comment


Thank you! It all looks good :)

The described behavior is actually really nice when there are name conflicts; I just added a comment to describe it in the docstring.


expected_results[f"{dummy_metric.name}_{i}_{k}"] = dummy_result[k]
self.assertDictEqual(expected_results, combined_evaluation.compute(predictions=preds, references=refs))

def test_two_modules_with_same_score_name(self):
Member

You could also test the case where the modules have the same module name.

Member Author

I think that is covered in test_duplicate_module, no?

Member Author

I'll merge this now; if that's not the case, I can add more tests in another PR.


@douwekiela douwekiela left a comment


LGTM

lvwerra and others added 2 commits June 29, 2022 17:09
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
@lvwerra lvwerra merged commit 907a0bc into main Jun 29, 2022
@lvwerra lvwerra deleted the add-module-composer branch June 29, 2022 15:45