Evaluation suite #337
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Awesome! This looks great to me, and having suites be subclasses of `EvaluationSuite` is nice because then it'll let power users extend functionality without having to impact the Evaluate library. (e.g. I can see situations where someone might want to both map and filter as part of the preprocessing)
I'm hyped to use it!
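A rough sketch of that map-plus-filter idea, using plain lists of dicts in place of a real `datasets.Dataset` (the field names are made up for illustration):

```python
# Hypothetical preprocessing step that both maps (normalizes a field) and
# filters (drops empty examples), as a power user might do by overriding
# the suite's preprocessor. Plain dicts stand in for a datasets.Dataset.
def preprocess(rows):
    # map: lowercase the text field
    mapped = [{**row, "text": row["text"].lower()} for row in rows]
    # filter: keep only non-empty examples
    return [row for row in mapped if row["text"].strip()]

rows = [{"text": "Hello World", "label": 1},
        {"text": "   ", "label": 0}]
print(preprocess(rows))  # only the first example survives
```

With the real library, the same map and filter steps would live in the subclass's preprocessing hook rather than a free function.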
Thanks @mathemakitten for working on this - this is looking great!
On the loading API side, I think the overloading of `load` in `evaluate.load` and `evaluate.evaluation_suite.load` is a bit confusing. What do you think about adding `load` as a class method and going for the following API:

```python
from evaluate import EvaluationSuite

suite = EvaluationSuite.load("...")
```

Then I think we can also integrate `setup()` into `__init__()` so we have fewer redundant methods?
Finally, what do you think about adding a `__repr__()` or `save_metadata()` method where one can display the suite in a nice format and save the tasks as JSON or YAML for reproducibility? Can also be in a follow-up PR.
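A hedged sketch of what those two methods could look like, using stand-in `Suite`/`Task` classes rather than the real `EvaluationSuite` (all names here are illustrative):

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Task:
    task_type: str
    data: str
    split: str = "test"


class Suite:
    """Stand-in for EvaluationSuite, to illustrate the suggested methods."""

    def __init__(self, name, tasks):
        self.name = name
        self.tasks = tasks

    def __repr__(self):
        # display the suite in a readable format
        lines = [f"Suite(name={self.name!r}, tasks:"]
        lines += [f"  - {t.task_type} on {t.data} [{t.split}]" for t in self.tasks]
        return "\n".join(lines + [")"])

    def save_metadata(self, path):
        # serialize the task list to JSON for reproducibility
        with open(path, "w") as f:
            json.dump({"name": self.name,
                       "tasks": [asdict(t) for t in self.tasks]}, f, indent=2)


suite = Suite("glue-demo", [Task("text-classification", "glue/qnli")])
print(suite)
```

A YAML variant would only change the serialization call; the reproducibility win is the same either way.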
Awesome!
Don't forget to add some tests and docs ;)
```python
@dataclass
class SubTask:
```
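For context, a self-contained sketch of roughly what such a dataclass could hold; the field names below are inferred from the PR discussion and should be checked against the final code:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SubTask:
    # e.g. "text-classification"; selects which evaluator to run
    task_type: str
    # dataset repo id on the Hub, e.g. "glue"
    data: Optional[str] = None
    # dataset config, e.g. "qnli"
    subset: Optional[str] = None
    split: Optional[str] = None
    # optional callable applied to the dataset before evaluation
    data_preprocessor: Optional[Callable] = None
    # extra keyword arguments forwarded to the evaluator
    args_for_task: Optional[dict] = None
```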
Did you end up with a better name for this class?
Thanks for reminding me! From the other PR thread, `EvaluationJob`, `EvaluationData`, or `DataSource` were suggested.
I assume that you're referring to the tasks defined at https://huggingface.co/tasks — in which case, maybe SubTask is actually quite fitting? For example, the tasks listed under the Text Classification task umbrella are NLI, QNLI, MNLI, QQP — which right now correspond 1:1 to the SubTask object in EvaluationSuite (see an example of GLUE as an EvaluationSuite defined here).
A flip through NLP + CV lit suggests that "Task" is really the canonical name for this type of thing — see reference to NLP tasks in Dynabench, SuperGLUE, the Eleuther LM Harness, and computer vision as well like in GRIT. I'd love to keep convention with the field as much as possible so it's obvious to newcomers what the atomic unit of an EvaluationSuite should be. WDYT?
Thanks @lvwerra @lhoestq @NimaBoscarino for all the thoughtful comments! The test case hosted under the evaluate org includes testing the data preprocessing. Let me know if you have other suggestions for things we should be checking. @NimaBoscarino has been battle-testing this with the new bias/fairness metrics and said that having the data preprocessing in code made it easy to override, so thank you all for helping to hash out the API for this feature :)
Thanks @mathemakitten this looks great! I think we can extend the tests a bit and make a dedicated test file, e.g. `test_evaluation_suite.py`. Things I would test in addition:
- does it work with and without a preprocessor (also check the expected result)
- I think it's good to load the one suite on the Hub, but for the rest you can define local test suites. Like in the evaluator, you can also use a dummy pipeline to avoid loading actual models (the CI is getting slow enough already :( )
- maybe there are some failure cases we should check? E.g. should it throw an error if the `suite` is an empty list?
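Those three bullets could translate into something like the following. This is a self-contained sketch with a stand-in `DummySuite` class; the real tests would exercise `evaluate.EvaluationSuite` with a dummy pipeline instead:

```python
from unittest import TestCase


class DummySuite:
    """Minimal stand-in mirroring the behaviors the bullets describe."""

    def __init__(self, suite, preprocessor=None):
        if not suite:
            # failure case: an empty task list should be rejected
            raise ValueError("suite must contain at least one task")
        self.suite = suite
        self.preprocessor = preprocessor

    def run(self, pipeline):
        data = self.suite
        if self.preprocessor is not None:
            data = [self.preprocessor(x) for x in data]
        return [pipeline(x) for x in data]


class TestEvaluationSuite(TestCase):
    def test_without_preprocessor(self):
        # a trivial "pipeline" (str.upper) avoids loading an actual model
        self.assertEqual(DummySuite(["a", "b"]).run(str.upper), ["A", "B"])

    def test_with_preprocessor(self):
        suite = DummySuite(["a", "b"], preprocessor=lambda x: x * 2)
        self.assertEqual(suite.run(str.upper), ["AA", "BB"])

    def test_empty_suite_raises(self):
        with self.assertRaises(ValueError):
            DummySuite([])
```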
In `tests/test_evaluation_suite.py` (outdated):

```python
class TestEvaluationSuite(TestCase):
    def test_suite(self):
        suite = EvaluationSuite.load("evaluate/evaluation-suite-ci")
```
Would it make sense to just mock `evaluator` for this? Or for a separate test? That way we can test a bunch of different EvaluationSuite configs without actually running inference, which would be useful for TDD.
Good idea! I'd like to make sure we have at least one test which actually tests loading from a script on the Hub like this, but mocking for the rest to save compute/time.
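A sketch of the mocking idea with `unittest.mock`; `run_suite` and its signature are made up for illustration, not the actual evaluate internals:

```python
from unittest import mock


def run_suite(tasks, evaluator_fn):
    # each task goes through the (possibly mocked) evaluator
    return {task: evaluator_fn(task) for task in tasks}


# mock the evaluator so no inference actually runs
fake_evaluator = mock.Mock(return_value={"accuracy": 1.0})
results = run_suite(["qnli", "sst2"], fake_evaluator)
print(results)                    # every task gets the mocked metrics
print(fake_evaluator.call_count)  # one call per task
```

The same pattern (patching whatever the suite calls per task) lets many configs be checked in milliseconds, with the single real Hub-loading test kept as an integration check.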
Looks great! Just a few minor nits, then it's ready to go :)
LGTM! Can you add some documentation on this? Maybe with a dedicated page in the docs? Oh, actually just saw that you opened #340 - never mind.
See here for an example of how to use it, including how to run GLUE. Similarly, a simpler sentiment analysis benchmark exists at `mathemakitten/sentiment-evaluation-suite`.
Supersedes #302.