mathemakitten committed Nov 29, 2022
1 parent fb89329 commit 7da77ed
Showing 2 changed files with 90 additions and 53 deletions.
101 changes: 60 additions & 41 deletions docs/source/a_quick_tour.mdx
@@ -199,47 +199,6 @@ The `combine` function accepts both the list of names of the metrics as well as
}
```

## Running evaluation on a suite of tasks

It can be useful to evaluate models on a variety of different tasks to understand their downstream performance. The [EvaluationSuite](evaluation_suite) enables evaluation of models on a collection of tasks. Tasks can be constructed as ([evaluator](base_evaluator), dataset, metric) tuples and passed to an [EvaluationSuite](evaluation_suite) stored on the Hugging Face Hub as a Space. See the [evaluator documentation](base_evaluator) for a list of currently supported tasks.

An `EvaluationSuite` script can be defined as follows, and supports Python code for data preprocessing.

```python
import evaluate
from evaluate.evaluation_suite import SubTask

class Suite(evaluate.EvaluationSuite):

def __init__(self, name):
super().__init__(name)
self.preprocessor = lambda x: {"text": x["text"].lower()}
self.suite = [
SubTask(
task_type="text-classification",
data="glue",
subset="cola",
split="test[:10]",
args_for_task={
"metric": "accuracy",
"input_column": "sentence",
"label_column": "label",
"label_mapping": {
"LABEL_0": 0.0,
"LABEL_1": 1.0
}
}
)]
```

Evaluation can be run by loading the `EvaluationSuite` and calling the `run()` method with a model or pipeline.

```python
from evaluate import EvaluationSuite
suite = EvaluationSuite.load('mathemakitten/glue-evaluation-suite')
results = suite.run("gpt2")
```

## Save and push to the Hub

Saving and sharing evaluation results is an important step. We provide the [`evaluate.save`] function to easily save metrics results. You can either pass a specific filename or a directory. In the latter case, the results are saved in a file with an automatically created file name. Besides the directory or file name, the function takes any key-value pairs as inputs and stores them in a JSON file.
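
For instance, a minimal sketch of saving a metric result (the `./results/` directory and the extra key-value pairs below are illustrative):

```python
import evaluate

# Compute a metric, then persist the result to JSON with evaluate.save.
accuracy = evaluate.load("accuracy")
result = accuracy.compute(references=[0, 1, 0, 1], predictions=[1, 0, 0, 1])

# Passing a directory (assumed to exist) lets evaluate.save pick the file name;
# any extra keyword arguments are stored in the JSON alongside the metric values.
evaluate.save("./results/", experiment="demo run", **result)
```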
@@ -332,3 +291,63 @@ Calculating the value of the metric alone is often not enough to know if a model
```

The evaluator expects a `"text"` and `"label"` column for the data input. If your dataset differs, you can provide the columns with the keyword arguments `input_column="text"` and `label_column="label"`. Currently only `"text-classification"` is supported, with more tasks being added in the future.
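
For example, a minimal sketch of overriding the column names (the dataset, model checkpoint, and label mapping below are illustrative):

```python
from datasets import load_dataset
from evaluate import evaluator

# Evaluate a text-classification model, mapping dataset columns explicitly.
eval_dataset = load_dataset("imdb", split="test[:100]")

task_evaluator = evaluator("text-classification")
results = task_evaluator.compute(
    model_or_pipeline="distilbert-base-uncased-finetuned-sst-2-english",
    data=eval_dataset,
    metric="accuracy",
    input_column="text",    # set this if your text column has a different name
    label_column="label",   # set this if your label column has a different name
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
)
print(results)
```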

## Running evaluation on a suite of tasks

It can be useful to evaluate models on a variety of different tasks to understand their downstream performance. The [EvaluationSuite](evaluation_suite) enables evaluation of models on a collection of tasks. Tasks can be constructed as ([evaluator](base_evaluator), dataset, metric) tuples and passed to an [EvaluationSuite](evaluation_suite) stored on the Hugging Face Hub as a Space, or locally as a Python script. See the [evaluator documentation](base_evaluator) for a list of currently supported tasks.

An `EvaluationSuite` script can be defined as follows, and supports Python code for data preprocessing.

```python
import evaluate
from evaluate.evaluation_suite import SubTask

class Suite(evaluate.EvaluationSuite):

def __init__(self, name):
super().__init__(name)

self.suite = [
SubTask(
task_type="text-classification",
data="imdb",
split="test[:1]",
args_for_task={
"metric": "accuracy",
"input_column": "text",
"label_column": "label",
"label_mapping": {
"LABEL_0": 0.0,
"LABEL_1": 1.0
}
}
),
SubTask(
task_type="text-classification",
data="sst2",
split="test[:1]",
args_for_task={
"metric": "accuracy",
"input_column": "sentence",
"label_column": "label",
"label_mapping": {
"LABEL_0": 0.0,
"LABEL_1": 1.0
}
}
)
]
```

Evaluation can be run by loading the `EvaluationSuite` and calling the `run()` method with a model or pipeline.

```python
>>> from evaluate import EvaluationSuite
>>> suite = EvaluationSuite.load('mathemakitten/sentiment-evaluation-suite')
>>> results = suite.run("huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli")
| accuracy | total_time_in_seconds | samples_per_second | latency_in_seconds | task_name |
|------------:|---------------------:|--------------------------:|:----------------|:-----------|
| 0.3 | 4.62804 | 2.16074 | 0.462804 | imdb |
| 0 | 0.686388 | 14.569 | 0.0686388 | sst2 |
```
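
Since `run()` accepts a model or a pipeline, an already-instantiated `transformers` pipeline can be passed as well; a small sketch, assuming the suite loaded above (the checkpoint name is illustrative):

```python
from transformers import pipeline

# Any text-classification pipeline (or a checkpoint name) can be passed to run().
pipe = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
results = suite.run(pipe)
```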
42 changes: 30 additions & 12 deletions docs/source/evaluation_suite.mdx
@@ -1,8 +1,10 @@
# Creating an EvaluationSuite

The `EvaluationSuite` provides a way to compose any number of ([evaluator](base_evaluator), dataset, metric) tuples to evaluate a model on a collection of several evaluation tasks. See the [evaluator documentation](base_evaluator) for a list of currently supported tasks.
It can be useful to evaluate models on a variety of different tasks to understand their downstream performance. Assessing a model on several types of tasks can reveal gaps in performance along certain axes. For example, when training a language model, it is often useful to measure perplexity on an in-domain corpus while also evaluating on tasks that test for general language capabilities, such as natural language entailment or question answering, or on tasks designed to probe the model along fairness and bias dimensions.

A new `EvaluationSuite` is made up of a list of `SubTask` classes, each defining an evaluation task. The Python file containing the definition can be uploaded to a Space on the Hugging Face Hub so it can be shared with the community or saved/loaded locally.
The `EvaluationSuite` provides a way to compose any number of ([evaluator](base_evaluator), dataset, metric) tuples, each defined as a `SubTask`, to evaluate a model on a collection of several evaluation tasks. See the [evaluator documentation](base_evaluator) for a list of currently supported tasks.

A new `EvaluationSuite` is made up of a list of `SubTask` classes, each defining an evaluation task. The Python file containing the definition can be uploaded to a Space on the Hugging Face Hub so it can be shared with the community or saved/loaded locally as a Python script.

Some datasets require additional preprocessing before they are passed to an `Evaluator`. You can set a `data_preprocessor` for each `SubTask`, which is applied to the dataset via a `map` operation from the `datasets` library. Keyword arguments for the `Evaluator` can be passed down through the `args_for_task` attribute.

@@ -19,8 +21,8 @@ class Suite(evaluate.EvaluationSuite):
SubTask(
task_type="text-classification",
data="glue",
subset="cola",
split="test[:10]",
subset="sst2",
split="validation[:10]",
args_for_task={
"metric": "accuracy",
"input_column": "sentence",
@@ -30,19 +32,35 @@ class Suite(evaluate.EvaluationSuite):
"LABEL_1": 1.0
}
}
),
SubTask(
task_type="text-classification",
data="glue",
subset="rte",
split="validation[:10]",
args_for_task={
"metric": "accuracy",
"input_column": "sentence1",
"second_input_column": "sentence2",
"label_column": "label",
"label_mapping": {
"LABEL_0": 0,
"LABEL_1": 1
}
}
)
]
```
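
As mentioned above, a `data_preprocessor` can be attached to an individual `SubTask`; a minimal sketch (the lowercasing function and the dataset slice are illustrative):

```python
from evaluate.evaluation_suite import SubTask

# The preprocessor is applied to the dataset via `map` before evaluation.
task = SubTask(
    task_type="text-classification",
    data="glue",
    subset="sst2",
    split="validation[:10]",
    data_preprocessor=lambda x: {"sentence": x["sentence"].lower()},
    args_for_task={
        "metric": "accuracy",
        "input_column": "sentence",
        "label_column": "label",
        "label_mapping": {"LABEL_0": 0.0, "LABEL_1": 1.0},
    },
)
```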

An `EvaluationSuite` can be loaded by name from the Hugging Face Hub, or locally by providing a path, and run with the `run(model_or_pipeline)` method. The evaluation results are returned along with their task names and information about the time it took to obtain predictions through the pipeline. These can be easily displayed with a `pandas.DataFrame`.

```python
import pandas as pd
from evaluate import EvaluationSuite

suite = EvaluationSuite.load('mathemakitten/glue-evaluation-suite')
results = suite.run("gpt2")
```

```python
>>> from evaluate import EvaluationSuite
>>> suite = EvaluationSuite.load('mathemakitten/glue-evaluation-suite')
>>> results = suite.run("gpt2")
results = [{'accuracy': 0.0, 'total_time_in_seconds': 0.6330130019999842, 'samples_per_second': 15.797463825237905, 'latency_in_seconds': 0.06330130019999843, 'task_name': 'glue/cola', 'data_preprocessor': None}, {'accuracy': 0.5, 'total_time_in_seconds': 0.7627554609999834, 'samples_per_second': 13.110361723126644, 'latency_in_seconds': 0.07627554609999834, 'task_name': 'glue/sst2', 'data_preprocessor': None}]
print(pd.DataFrame(results))
| accuracy | total_time_in_seconds | samples_per_second | latency_in_seconds | task_name |
|-----------:|------------------------:|---------------------:|---------------------:|:------------|
| 0.5 | 0.740811 | 13.4987 | 0.0740811 | glue/sst2 |
| 0.4 | 1.67552 | 5.9683 | 0.167552 | glue/rte |
```
