Evaluation suite #337
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Awesome! This looks great to me, and having suites be subclasses of `EvaluationSuite` is nice because then it'll let power users extend functionality without having to impact the Evaluate library. (e.g. I can see situations where someone might want to both map and filter as part of the preprocessing)
I'm hyped to use it!
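A rough sketch of that map-plus-filter idea, using plain lists of dicts in place of a real `datasets.Dataset` (the field names are made up for illustration):

```python
# Hypothetical preprocessing step that both maps (normalizes a field) and
# filters (drops empty examples), as a power user might do by overriding
# the suite's preprocessor. Plain dicts stand in for a datasets.Dataset.
def preprocess(rows):
    # map: lowercase the text field
    mapped = [{**row, "text": row["text"].lower()} for row in rows]
    # filter: keep only non-empty examples
    return [row for row in mapped if row["text"].strip()]

rows = [{"text": "Hello World", "label": 1},
        {"text": "   ", "label": 0}]
print(preprocess(rows))  # only the first example survives
```

With the real library, the same map and filter steps would live in the subclass's preprocessing hook rather than a free function.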
Thanks @mathemakitten for working on this - this is looking great!
On the loading API side, I think the overloading of `load` in `evaluate.load` and `evaluate.evaluation_suite.load` is a bit confusing. What do you think about adding `load` as a class method and going for the following API:

```python
from evaluate import EvaluationSuite

suite = EvaluationSuite.load("...")
```

Then I think we can also integrate `setup()` into `__init__()` so we have fewer redundant methods?
Finally, what do you think about adding a `__repr__()` or `save_metadata()` method where one can display the suite in a nice format and save the tasks as JSON or YAML for reproducibility? Can also be in a follow-up PR.
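A hedged sketch of what those two methods could look like, using stand-in `Suite`/`Task` classes rather than the real `EvaluationSuite` (all names here are illustrative):

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Task:
    task_type: str
    data: str
    split: str = "test"


class Suite:
    """Stand-in for EvaluationSuite, to illustrate the suggested methods."""

    def __init__(self, name, tasks):
        self.name = name
        self.tasks = tasks

    def __repr__(self):
        # display the suite in a readable format
        lines = [f"Suite(name={self.name!r}, tasks:"]
        lines += [f"  - {t.task_type} on {t.data} [{t.split}]" for t in self.tasks]
        return "\n".join(lines + [")"])

    def save_metadata(self, path):
        # serialize the task list to JSON for reproducibility
        with open(path, "w") as f:
            json.dump({"name": self.name,
                       "tasks": [asdict(t) for t in self.tasks]}, f, indent=2)


suite = Suite("glue-demo", [Task("text-classification", "glue/qnli")])
print(suite)
```

A YAML variant would only change the serialization call; the reproducibility win is the same either way.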
Awesome!
Don't forget to add some tests and docs ;)
```python
@dataclass
class SubTask:
```
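For context, a self-contained sketch of roughly what such a dataclass could hold; the field names below are inferred from the PR discussion and should be checked against the final code:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SubTask:
    # e.g. "text-classification"; selects which evaluator to run
    task_type: str
    # dataset repo id on the Hub, e.g. "glue"
    data: Optional[str] = None
    # dataset config, e.g. "qnli"
    subset: Optional[str] = None
    split: Optional[str] = None
    # optional callable applied to the dataset before evaluation
    data_preprocessor: Optional[Callable] = None
    # extra keyword arguments forwarded to the evaluator
    args_for_task: Optional[dict] = None
```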
Did you end up with a better name for this class?
Thanks for reminding me! From the other PR thread, `EvaluationJob`, `EvaluationData`, or `DataSource` were suggested.
I assume that you're referring to the tasks defined at https://huggingface.co/tasks — in which case, maybe SubTask is actually quite fitting? For example, the tasks listed under the Text Classification task umbrella are NLI, QNLI, MNLI, QQP — which right now correspond 1:1 to the SubTask object in EvaluationSuite (see an example of GLUE as an EvaluationSuite defined here).
A flip through NLP + CV lit suggests that "Task" is really the canonical name for this type of thing — see reference to NLP tasks in Dynabench, SuperGLUE, the Eleuther LM Harness, and computer vision as well like in GRIT. I'd love to keep convention with the field as much as possible so it's obvious to newcomers what the atomic unit of an EvaluationSuite should be. WDYT?
Thanks @lvwerra @lhoestq @NimaBoscarino for all the thoughtful comments! The test case hosted under the evaluate org includes testing the data preprocessing. Let me know if you have other suggestions for things we should be checking. @NimaBoscarino has been battle-testing this with the new bias/fairness metrics and said that having the data preprocessing in code made it easy to override, so thank you all for helping to hash out the API for this feature :)
Thanks @mathemakitten this looks great! I think we can extend the tests a bit and make a dedicated test file, e.g. `test_evaluation_suite.py`. Things I would test in addition:
- does it work with and without a preprocessor (also check the expected result)
- I think it's good to load the one suite on the Hub, but for the rest you can define local test suites. Like in the evaluator, you can also use a dummy pipeline to avoid loading actual models (the CI is getting slow enough already :( )
- maybe there are some failure cases we should check? E.g. should it throw an error if the `suite` is an empty list?
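Those three bullets could translate into something like the following. This is a self-contained sketch with a stand-in `DummySuite` class; the real tests would exercise `evaluate.EvaluationSuite` with a dummy pipeline instead:

```python
from unittest import TestCase


class DummySuite:
    """Minimal stand-in mirroring the behaviors the bullets describe."""

    def __init__(self, suite, preprocessor=None):
        if not suite:
            # failure case: an empty task list should be rejected
            raise ValueError("suite must contain at least one task")
        self.suite = suite
        self.preprocessor = preprocessor

    def run(self, pipeline):
        data = self.suite
        if self.preprocessor is not None:
            data = [self.preprocessor(x) for x in data]
        return [pipeline(x) for x in data]


class TestEvaluationSuite(TestCase):
    def test_without_preprocessor(self):
        # a trivial "pipeline" (str.upper) avoids loading an actual model
        self.assertEqual(DummySuite(["a", "b"]).run(str.upper), ["A", "B"])

    def test_with_preprocessor(self):
        suite = DummySuite(["a", "b"], preprocessor=lambda x: x * 2)
        self.assertEqual(suite.run(str.upper), ["AA", "BB"])

    def test_empty_suite_raises(self):
        with self.assertRaises(ValueError):
            DummySuite([])
```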
In `tests/test_evaluation_suite.py` (outdated):

```python
class TestEvaluationSuite(TestCase):
    def test_suite(self):
        suite = EvaluationSuite.load("evaluate/evaluation-suite-ci")
```
Would it make sense to just mock `evaluator` for this? Or for a separate test? That way we can test a bunch of different EvaluationSuite configs without actually running inference, which would be useful for TDD.
Good idea! I'd like to make sure we have at least one test which actually tests loading from a script on the Hub like this, but mocking for the rest to save compute/time.
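A sketch of the mocking idea with `unittest.mock`; `run_suite` and its signature are made up for illustration, not the actual evaluate internals:

```python
from unittest import mock


def run_suite(tasks, evaluator_fn):
    # each task goes through the (possibly mocked) evaluator
    return {task: evaluator_fn(task) for task in tasks}


# mock the evaluator so no inference actually runs
fake_evaluator = mock.Mock(return_value={"accuracy": 1.0})
results = run_suite(["qnli", "sst2"], fake_evaluator)
print(results)                    # every task gets the mocked metrics
print(fake_evaluator.call_count)  # one call per task
```

The same pattern (patching whatever the suite calls per task) lets many configs be checked in milliseconds, with the single real Hub-loading test kept as an integration check.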
Looks great! Just a few minor nits, then it's ready to go :)
LGTM! Can you add some documentation on this? Maybe with a dedicated page in the docs? Oh, actually just saw that you opened #340 - never mind.
See here for an example of how to use it, including how to run GLUE. Similarly, a simpler sentiment analysis benchmark exists at `mathemakitten/sentiment-evaluation-suite`.
Supersedes #302.