Docs for EvaluationSuite #340

Merged · 21 commits · Dec 9, 2022
2 changes: 2 additions & 0 deletions docs/source/_toctree.yml
@@ -17,6 +17,8 @@
title: Using the evaluator
- local: custom_evaluator
title: Using the evaluator with custom pipelines
- local: evaluation_suite
title: Creating an EvaluationSuite
- sections:
- local: transformers_integrations
title: Transformers
61 changes: 60 additions & 1 deletion docs/source/a_quick_tour.mdx
@@ -182,7 +182,6 @@ This solution allows 🤗 Evaluate to perform distributed predictions, which is

Often one wants to evaluate not only a single metric but a range of different metrics capturing different aspects of a model. For example, for classification it is usually a good idea to compute F1-score, recall, and precision in addition to accuracy to get a better picture of model performance. Naturally, you can load a bunch of metrics and call them sequentially. However, a more convenient way is to use the [`~evaluate.combine`] function to bundle them together:


```python
>>> clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])
```
@@ -319,3 +318,63 @@ Which lets you visually compare the 4 models and choose the optimal one for you,
<div class="flex justify-center">
<img src="https://huggingface.co/datasets/evaluate/media/resolve/main/example_viz.png" width="400"/>
</div>

## Running evaluation on a suite of tasks

It can be useful to evaluate models on a variety of different tasks to understand their downstream performance. The [EvaluationSuite](evaluation_suite) enables evaluation of models on a collection of tasks. Tasks can be constructed as ([evaluator](base_evaluator), dataset, metric) tuples and passed to an [EvaluationSuite](evaluation_suite) stored on the Hugging Face Hub as a Space, or locally as a Python script. See the [evaluator documentation](base_evaluator) for a list of currently supported tasks.

`EvaluationSuite` scripts can be defined as follows, and support Python code for data preprocessing.

```python
import evaluate
from evaluate.evaluation_suite import SubTask

class Suite(evaluate.EvaluationSuite):

    def __init__(self, name):
        super().__init__(name)

        self.suite = [
            SubTask(
                task_type="text-classification",
                data="imdb",
                split="test[:1]",
                args_for_task={
                    "metric": "accuracy",
                    "input_column": "text",
                    "label_column": "label",
                    "label_mapping": {
                        "LABEL_0": 0.0,
                        "LABEL_1": 1.0
                    }
                }
            ),
            SubTask(
                task_type="text-classification",
                data="sst2",
                split="test[:1]",
                args_for_task={
                    "metric": "accuracy",
                    "input_column": "sentence",
                    "label_column": "label",
                    "label_mapping": {
                        "LABEL_0": 0.0,
                        "LABEL_1": 1.0
                    }
                }
            )
        ]
```

Evaluation can be run by loading the `EvaluationSuite` and calling the `run()` method with a model or pipeline.

```python
>>> from evaluate import EvaluationSuite
>>> suite = EvaluationSuite.load('mathemakitten/sentiment-evaluation-suite')
>>> results = suite.run("huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli")
```
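
The `run()` call returns one dictionary of results per task, so you can also inspect individual values programmatically. A minimal sketch, assuming the result keys shown in the table below:

```python
# Each entry in `results` corresponds to one SubTask; the keys used here
# ("task_name", "accuracy") match the columns of the table below.
for task_result in results:
    print(f"{task_result['task_name']}: accuracy = {task_result['accuracy']}")
```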

| accuracy | total_time_in_seconds | samples_per_second | latency_in_seconds | task_name |
|------------:|---------------------:|--------------------------:|:----------------|:-----------|
| 0.3 | 4.62804 | 2.16074 | 0.462804 | imdb |
| 0 | 0.686388 | 14.569 | 0.0686388 | sst2 |
1 change: 1 addition & 0 deletions docs/source/base_evaluator.mdx
@@ -13,6 +13,7 @@ Currently supported tasks are:
- `"translation"`: will use the [`TranslationEvaluator`].
- `"automatic-speech-recognition"`: will use the [`AutomaticSpeechRecognitionEvaluator`].

To run an `Evaluator` with several tasks in a single call, use the [EvaluationSuite](evaluation_suite), which runs evaluations on a collection of `SubTask`s.

Each task has its own set of requirements for the dataset format and pipeline output; make sure to check them out for your custom use case. Let's have a look at some of them and see how you can use the evaluator to evaluate single or multiple models, datasets, and metrics at the same time.
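
For comparison with the `EvaluationSuite` approach above, a single-task evaluator call might look like the following sketch; the model checkpoint, dataset slice, and label mapping are illustrative placeholders:

```python
from evaluate import evaluator

# Evaluate one (model, dataset, metric) combination with the
# text-classification evaluator; adjust label_mapping to match the
# labels your model actually outputs.
task_evaluator = evaluator("text-classification")
results = task_evaluator.compute(
    model_or_pipeline="lvwerra/distilbert-imdb",  # illustrative checkpoint
    data="imdb",
    split="test[:100]",
    metric="accuracy",
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
)
print(results)
```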

74 changes: 74 additions & 0 deletions docs/source/evaluation_suite.mdx
@@ -0,0 +1,74 @@
# Creating an EvaluationSuite
Review comment (Member):
I think it would be good to give instructions on how to add a new Suite on the Hub (for people who don't know how). For metrics there is a small CLI using Cookiecutter where you specify the name of the metric and it creates the Space, clones it locally, and adds template files. Similar to the metric modules we could also add:

- README: document the goal and limitations of the suite
- app.py: a simple Gradio demo or so that parses the Suite.py/README and displays useful information

If you want I could help with that. I think reducing the friction to create a new suite to the minimum will maximise adoption. Happy to do it in a follow-up PR, but I think it would be great to have it with the release/announcement. What do you think?

Reply (Contributor, Author):
Sure, sounds good, having a README and/or template seems broadly useful!

I've added some more instructions in evaluation_suite.mdx as well.


mathemakitten marked this conversation as resolved.
It can be useful to evaluate models on a variety of different tasks to understand their downstream performance. Assessing the model on several types of tasks can reveal gaps in performance along some axis. For example, when training a language model, it is often useful to measure perplexity on an in-domain corpus, but also to concurrently evaluate on tasks which test for general language capabilities like natural language entailment or question-answering, or tasks designed to probe the model along fairness and bias dimensions.

The `EvaluationSuite` provides a way to compose any number of ([evaluator](base_evaluator), dataset, metric) tuples, each defined as a `SubTask`, to evaluate a model on a collection of evaluation tasks. See the [evaluator documentation](base_evaluator) for a list of currently supported tasks.

A new `EvaluationSuite` is made up of a list of `SubTask` classes, each defining an evaluation task. The Python file containing the definition can be uploaded to a Space on the Hugging Face Hub so it can be shared with the community or saved/loaded locally as a Python script.

Some datasets require additional preprocessing before passing them to an `Evaluator`. You can set a `data_preprocessor` for each `SubTask` which is applied via a `map` operation using the `datasets` library. Keyword arguments for the `Evaluator` can be passed down through the `args_for_task` attribute.
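
For example, a single `SubTask` with its own `data_preprocessor` could look like the sketch below; the lower-casing function, the IMDB slice, and the `"text"` column name are illustrative and depend on the dataset being evaluated:

```python
from evaluate.evaluation_suite import SubTask

# Sketch of a SubTask with a per-task preprocessor; the preprocessing
# function is applied to the dataset via `map` before evaluation.
task = SubTask(
    task_type="text-classification",
    data="imdb",
    split="test[:10]",
    data_preprocessor=lambda x: {"text": x["text"].lower()},
    args_for_task={
        "metric": "accuracy",
        "input_column": "text",
        "label_column": "label",
        "label_mapping": {"LABEL_0": 0.0, "LABEL_1": 1.0},
    },
)
```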

To create a new `EvaluationSuite`, create a [new Space](https://huggingface.co/new-space) with a `.py` file whose name matches the name of the Space, add the template below to that file, and fill in the attributes for a new task.

The mandatory attributes for a new `SubTask` are `task_type` and `data`.
1. [`task_type`] maps to the tasks currently supported by the Evaluator.
2. [`data`] can be an instantiated Hugging Face dataset object or the name of a dataset.
3. [`subset`] and [`split`] can be used to define which name and split of the dataset should be used for evaluation.
4. [`args_for_task`] should be a dictionary with kwargs to be passed to the Evaluator.

```python
import evaluate
from evaluate.evaluation_suite import SubTask

class Suite(evaluate.EvaluationSuite):

    def __init__(self, name):
        super().__init__(name)
        self.preprocessor = lambda x: {"text": x["text"].lower()}
        self.suite = [
            SubTask(
                task_type="text-classification",
                # Review comment (Member): Can you list the available task
                # types maybe? Or redirect to their docs?
                # Reply (Contributor, Author): I've added a link to the
                # supported tasks on the Evaluator docs so we don't have to
                # maintain the list in two places!
data="glue",
subset="sst2",
split="validation[:10]",
args_for_task={
"metric": "accuracy",
"input_column": "sentence",
"label_column": "label",
"label_mapping": {
"LABEL_0": 0.0,
"LABEL_1": 1.0
}
}
),
SubTask(
task_type="text-classification",
data="glue",
subset="rte",
split="validation[:10]",
args_for_task={
"metric": "accuracy",
"input_column": "sentence1",
"second_input_column": "sentence2",
"label_column": "label",
"label_mapping": {
"LABEL_0": 0,
"LABEL_1": 1
}
}
)
]
```

An `EvaluationSuite` can be loaded by name from the Hugging Face Hub, or locally by providing a path, and run with the `run(model_or_pipeline)` method. The evaluation results are returned along with their task names and information about the time it took to obtain predictions through the pipeline. These can be easily displayed with a `pandas.DataFrame`:

```python
>>> from evaluate import EvaluationSuite
>>> suite = EvaluationSuite.load('mathemakitten/glue-evaluation-suite')
>>> results = suite.run("gpt2")
```
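
For instance, the table below could be produced with a couple of lines of `pandas`, assuming `results` is the list of per-task dictionaries returned by `run()`:

```python
import pandas as pd

# Each element of `results` is a dict of metrics plus the task name,
# so the list can be passed directly to the DataFrame constructor.
df = pd.DataFrame(results)
print(df)
```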

Review comment (Member): Would do the same here and remove the table from the codeblock so it's actually rendered as a nice table.

| accuracy | total_time_in_seconds | samples_per_second | latency_in_seconds | task_name |
|---------:|----------------------:|-------------------:|-------------------:|:----------|
|      0.5 |              0.740811 |            13.4987 |          0.0740811 | glue/sst2 |
|      0.4 |               1.67552 |             5.9683 |           0.167552 | glue/rte  |