Docs for EvaluationSuite #340
# Creating an EvaluationSuite
It can be useful to evaluate a model on a variety of different tasks to understand its downstream performance. Assessing the model on several types of tasks can reveal gaps in performance along particular axes. For example, when training a language model, it is often useful to measure perplexity on an in-domain corpus, but also to concurrently evaluate on tasks that test for general language capabilities such as natural language entailment or question answering, as well as on tasks designed to probe the model along fairness and bias dimensions.

The `EvaluationSuite` provides a way to compose any number of ([evaluator](base_evaluator), dataset, metric) tuples as `SubTask`s in order to evaluate a model on a collection of several evaluation tasks. See the [evaluator documentation](base_evaluator) for a list of currently supported tasks.

A new `EvaluationSuite` is made up of a list of `SubTask` classes, each defining an evaluation task. The Python file containing the definition can be uploaded to a Space on the Hugging Face Hub so it can be shared with the community, or saved and loaded locally as a Python script.
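
For example, a suite defined in a local Python file can be loaded from its path instead of from a Hub Space (a minimal sketch; the file name is illustrative):

```python
from evaluate import EvaluationSuite

# Load a locally saved suite definition from its path
# (hypothetical file name; a Hub Space ID such as "user/suite-name" also works).
suite = EvaluationSuite.load("path/to/my_suite.py")
```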
Some datasets require additional preprocessing before they are passed to an `Evaluator`. You can set a `data_preprocessor` for each `SubTask`, which is applied via a `map` operation using the `datasets` library. Keyword arguments for the `Evaluator` can be passed down through the `args_for_task` attribute.
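
For instance, a preprocessor is a callable that receives a single dataset example and returns the columns to update, just like a function passed to `datasets.Dataset.map` (a minimal sketch; the column name is illustrative):

```python
# A data preprocessor receives one example (a dict of columns) and returns
# the columns to update, exactly like a function passed to `datasets.map`.
def lowercase_text(example):
    return {"text": example["text"].lower()}
```

Such a callable is attached to a `SubTask` through its `data_preprocessor` attribute.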
To create a new `EvaluationSuite`, create a [new Space](https://huggingface.co/new-space), add the template below to a `.py` file that matches the name of the Space, and fill in the attributes for a new task.
The mandatory attributes for a new `SubTask` are `task_type` and `data`.

1. [`task_type`] maps to the tasks currently supported by the Evaluator.
2. [`data`] can be an instantiated Hugging Face dataset object or the name of a dataset.
3. [`subset`] and [`split`] can be used to define which configuration and split of the dataset should be used for evaluation.
4. [`args_for_task`] should be a dictionary with kwargs to be passed to the Evaluator.

```python
import evaluate
from evaluate.evaluation_suite import SubTask


class Suite(evaluate.EvaluationSuite):

    def __init__(self, name):
        super().__init__(name)
        # Optional preprocessing function; it can be set on a SubTask through
        # its `data_preprocessor` attribute and is applied with `datasets.map`.
        self.preprocessor = lambda x: {"text": x["text"].lower()}
        self.suite = [
            SubTask(
                task_type="text-classification",
                data="glue",
                subset="sst2",
                split="validation[:10]",
                args_for_task={
                    "metric": "accuracy",
                    "input_column": "sentence",
                    "label_column": "label",
                    "label_mapping": {
                        "LABEL_0": 0.0,
                        "LABEL_1": 1.0
                    }
                }
            ),
            SubTask(
                task_type="text-classification",
                data="glue",
                subset="rte",
                split="validation[:10]",
                args_for_task={
                    "metric": "accuracy",
                    "input_column": "sentence1",
                    "second_input_column": "sentence2",
                    "label_column": "label",
                    "label_mapping": {
                        "LABEL_0": 0,
                        "LABEL_1": 1
                    }
                }
            )
        ]
```
An `EvaluationSuite` can be loaded by name from the Hugging Face Hub, or locally by providing a path, and run with the `run(model_or_pipeline)` method. The evaluation results are returned along with their task names and information about the time it took to obtain predictions through the pipeline. These can be easily displayed with a `pandas.DataFrame`:

```python
>>> from evaluate import EvaluationSuite
>>> suite = EvaluationSuite.load('mathemakitten/glue-evaluation-suite')
>>> results = suite.run("gpt2")
```
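
The results can then be gathered into a single table (a minimal sketch, assuming each entry of `results` is a flat dictionary of metric and timing values):

```python
>>> import pandas as pd
>>> # Collect the per-task result dictionaries into one table
>>> pd.DataFrame(results)
```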
| accuracy | total_time_in_seconds | samples_per_second | latency_in_seconds | task_name |
|---------:|----------------------:|-------------------:|-------------------:|:----------|
|      0.5 |              0.740811 |            13.4987 |          0.0740811 | glue/sst2 |
|      0.4 |               1.67552 |             5.9683 |           0.167552 | glue/rte  |

I think it would be good to give instructions on how to add a new Suite on the Hub (for people who don't know how). For metrics there is a small CLI using Cookiecutter where you specify the name of the metric and it creates the Space, clones it locally, and adds template files. Similar to the metric modules, we could also add such a tool for creating suites.

If you want, I could help with that. I think reducing the friction to create a new suite to the minimum will maximise adoption. Happy to do it in a follow-up PR, but I think it would be great to have it with the release/announcement. What do you think?

Sure, sounds good, having a README and/or template seems broadly useful! I've added some more instructions in evaluation_suite.mdx as well.