Docs for EvaluationSuite #340

Merged · 21 commits · Dec 9, 2022
2 changes: 2 additions & 0 deletions docs/source/_toctree.yml
@@ -17,6 +17,8 @@
title: Using the evaluator
- local: custom_evaluator
title: Using the evaluator with custom pipelines
- local: evaluation_suite
title: Creating an EvaluationSuite
- sections:
- local: transformers_integrations
title: Transformers
61 changes: 60 additions & 1 deletion docs/source/a_quick_tour.mdx
@@ -182,7 +182,6 @@ This solution allows 🤗 Evaluate to perform distributed predictions, which is

Often one wants to evaluate not only a single metric but a range of different metrics capturing different aspects of a model. For example, for classification it is usually a good idea to compute F1-score, recall, and precision in addition to accuracy to get a better picture of model performance. Naturally, you can load a bunch of metrics and call them sequentially. However, a more convenient way is to use the [`~evaluate.combine`] function to bundle them together:


```python
>>> clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])
```
@@ -319,3 +318,63 @@ Which lets you visually compare the 4 models and choose the optimal one for you,
<div class="flex justify-center">
<img src="https://huggingface.co/datasets/evaluate/media/resolve/main/example_viz.png" width="400"/>
</div>

## Running evaluation on a suite of tasks

It can be useful to evaluate models on a variety of different tasks to understand their downstream performance. The [EvaluationSuite](evaluation_suite) enables evaluation of models on a collection of tasks. Tasks can be constructed as ([evaluator](base_evaluator), dataset, metric) tuples and passed to an [EvaluationSuite](evaluation_suite) stored on the Hugging Face Hub as a Space, or locally as a Python script. See the [evaluator documentation](base_evaluator) for a list of currently supported tasks.

`EvaluationSuite` scripts can be defined as follows, and support Python code for data preprocessing.

```python
import evaluate
from evaluate.evaluation_suite import SubTask

class Suite(evaluate.EvaluationSuite):

    def __init__(self, name):
        super().__init__(name)

        self.suite = [
            SubTask(
                task_type="text-classification",
                data="imdb",
                split="test[:1]",
                args_for_task={
                    "metric": "accuracy",
                    "input_column": "text",
                    "label_column": "label",
                    "label_mapping": {
                        "LABEL_0": 0.0,
                        "LABEL_1": 1.0
                    }
                }
            ),
            SubTask(
                task_type="text-classification",
                data="sst2",
                split="test[:1]",
                args_for_task={
                    "metric": "accuracy",
                    "input_column": "sentence",
                    "label_column": "label",
                    "label_mapping": {
                        "LABEL_0": 0.0,
                        "LABEL_1": 1.0
                    }
                }
            )
        ]
```

Evaluation can be run by loading the `EvaluationSuite` and calling the `run()` method with a model or pipeline.

```python
>>> from evaluate import EvaluationSuite
>>> suite = EvaluationSuite.load('mathemakitten/sentiment-evaluation-suite')
>>> results = suite.run("huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli")
```
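
The `run()` call returns one dictionary of results per task, so you can also inspect individual values programmatically. A minimal sketch, assuming the result keys shown in the table below:

```python
# Each entry in `results` corresponds to one SubTask; the keys used here
# ("task_name", "accuracy") match the columns of the table below.
for task_result in results:
    print(f"{task_result['task_name']}: accuracy = {task_result['accuracy']}")
```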

| accuracy | total_time_in_seconds | samples_per_second | latency_in_seconds | task_name |
|------------:|---------------------:|--------------------------:|:----------------|:-----------|
| 0.3 | 4.62804 | 2.16074 | 0.462804 | imdb |
| 0 | 0.686388 | 14.569 | 0.0686388 | sst2 |
1 change: 1 addition & 0 deletions docs/source/base_evaluator.mdx
@@ -13,6 +13,7 @@ Currently supported tasks are:
- `"translation"`: will use the [`TranslationEvaluator`].
- `"automatic-speech-recognition"`: will use the [`AutomaticSpeechRecognitionEvaluator`].

To run an `Evaluator` with several tasks in a single call, use the [EvaluationSuite](evaluation_suite), which runs evaluations on a collection of `SubTask`s.

Each task has its own set of requirements for the dataset format and pipeline output; make sure to check them out for your custom use case. Let's have a look at some of them and see how you can use the evaluator to evaluate single or multiple models, datasets, and metrics at the same time.
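
For comparison with the `EvaluationSuite` approach above, a single-task evaluator call might look like the following sketch; the model checkpoint, dataset slice, and label mapping are illustrative placeholders:

```python
from evaluate import evaluator

# Evaluate one (model, dataset, metric) combination with the
# text-classification evaluator; adjust label_mapping to match the
# labels your model actually outputs.
task_evaluator = evaluator("text-classification")
results = task_evaluator.compute(
    model_or_pipeline="lvwerra/distilbert-imdb",  # illustrative checkpoint
    data="imdb",
    split="test[:100]",
    metric="accuracy",
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
)
print(results)
```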

74 changes: 74 additions & 0 deletions docs/source/evaluation_suite.mdx
@@ -0,0 +1,74 @@
# Creating an EvaluationSuite
Review comment (Member):
I think it would be good to give instructions on how to add a new Suite on the Hub (for people who don't know how). For metrics there is a small CLI using Cookiecutter where you specify the name of the metric and it creates the Space, clones it locally, and adds template files. Similar to the metric modules we could also add:

- README: document the goal and limitations of the suite
- app.py: a simple Gradio demo or so that parses the Suite.py/README and displays useful information

If you want I could help with that. I think reducing the friction to create a new suite to the minimum will maximise adoption. Happy to do it in a follow-up PR, but I think it would be great to have it with the release/announcement. What do you think?

Reply (Contributor, Author):
Sure, sounds good, having a README and/or template seems broadly useful!

I've added some more instructions in evaluation_suite.mdx as well.


mathemakitten marked this conversation as resolved.
It can be useful to evaluate models on a variety of different tasks to understand their downstream performance. Assessing the model on several types of tasks can reveal gaps in performance along some axis. For example, when training a language model, it is often useful to measure perplexity on an in-domain corpus, but also to concurrently evaluate on tasks which test for general language capabilities like natural language entailment or question-answering, or tasks designed to probe the model along fairness and bias dimensions.

The `EvaluationSuite` provides a way to compose any number of ([evaluator](base_evaluator), dataset, metric) tuples, each defined as a `SubTask`, to evaluate a model on a collection of evaluation tasks. See the [evaluator documentation](base_evaluator) for a list of currently supported tasks.

A new `EvaluationSuite` is made up of a list of `SubTask` classes, each defining an evaluation task. The Python file containing the definition can be uploaded to a Space on the Hugging Face Hub so it can be shared with the community or saved/loaded locally as a Python script.

Some datasets require additional preprocessing before passing them to an `Evaluator`. You can set a `data_preprocessor` for each `SubTask` which is applied via a `map` operation using the `datasets` library. Keyword arguments for the `Evaluator` can be passed down through the `args_for_task` attribute.
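
For example, a single `SubTask` with its own `data_preprocessor` could look like the sketch below; the lower-casing function, the IMDB slice, and the `"text"` column name are illustrative and depend on the dataset being evaluated:

```python
from evaluate.evaluation_suite import SubTask

# Sketch of a SubTask with a per-task preprocessor; the preprocessing
# function is applied to the dataset via `map` before evaluation.
task = SubTask(
    task_type="text-classification",
    data="imdb",
    split="test[:10]",
    data_preprocessor=lambda x: {"text": x["text"].lower()},
    args_for_task={
        "metric": "accuracy",
        "input_column": "text",
        "label_column": "label",
        "label_mapping": {"LABEL_0": 0.0, "LABEL_1": 1.0},
    },
)
```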

To create a new `EvaluationSuite`, create a [new Space](https://huggingface.co/new-space) with a `.py` file whose name matches the name of the Space, add the template below to that file, and fill in the attributes for a new task.

The mandatory attributes for a new `SubTask` are `task_type` and `data`.
1. [`task_type`] maps to the tasks currently supported by the Evaluator.
2. [`data`] can be an instantiated Hugging Face dataset object or the name of a dataset.
3. [`subset`] and [`split`] can be used to define which name and split of the dataset should be used for evaluation.
4. [`args_for_task`] should be a dictionary with kwargs to be passed to the Evaluator.

```python
import evaluate
from evaluate.evaluation_suite import SubTask

class Suite(evaluate.EvaluationSuite):

    def __init__(self, name):
        super().__init__(name)
        self.preprocessor = lambda x: {"text": x["text"].lower()}
        self.suite = [
            SubTask(
                task_type="text-classification",
                # Review comment (Member): Can you list the available task
                # types maybe? Or redirect to their docs?
                # Reply (Contributor, Author): I've added a link to the
                # supported tasks on the Evaluator docs so we don't have to
                # maintain the list in two places!
data="glue",
subset="sst2",
split="validation[:10]",
args_for_task={
"metric": "accuracy",
"input_column": "sentence",
"label_column": "label",
"label_mapping": {
"LABEL_0": 0.0,
"LABEL_1": 1.0
}
}
),
SubTask(
task_type="text-classification",
data="glue",
subset="rte",
split="validation[:10]",
args_for_task={
"metric": "accuracy",
"input_column": "sentence1",
"second_input_column": "sentence2",
"label_column": "label",
"label_mapping": {
"LABEL_0": 0,
"LABEL_1": 1
}
}
)
]
```

An `EvaluationSuite` can be loaded by name from the Hugging Face Hub, or locally by providing a path, and run with the `run(model_or_pipeline)` method. The evaluation results are returned along with their task names and information about the time it took to obtain predictions through the pipeline. These can be easily displayed with a `pandas.DataFrame`:

```python
>>> from evaluate import EvaluationSuite
>>> suite = EvaluationSuite.load('mathemakitten/glue-evaluation-suite')
>>> results = suite.run("gpt2")
```
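
For instance, the table below could be produced with a couple of lines of `pandas`, assuming `results` is the list of per-task dictionaries returned by `run()`:

```python
import pandas as pd

# Each element of `results` is a dict of metrics plus the task name,
# so the list can be passed directly to the DataFrame constructor.
df = pd.DataFrame(results)
print(df)
```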

Review comment (Member): Would do the same here and remove the table from the codeblock so it's actually rendered as a nice table.

| accuracy | total_time_in_seconds | samples_per_second | latency_in_seconds | task_name |
|---------:|----------------------:|-------------------:|-------------------:|:----------|
|      0.5 |              0.740811 |            13.4987 |          0.0740811 | glue/sst2 |
|      0.4 |               1.67552 |             5.9683 |           0.167552 | glue/rte  |