
Evaluation suite #337

Merged

mathemakitten merged 13 commits into main from hn-evaluation-suite-v2 on Nov 16, 2022

Conversation

@mathemakitten (Contributor) commented on Nov 1, 2022

See here for an example of how to use it.

Run GLUE with:

from evaluate import EvaluationSuite
suite = EvaluationSuite.load('mathemakitten/glue-suite-v2')
results = suite.run("gpt2")

Similarly, a simpler sentiment analysis benchmark exists at mathemakitten/sentiment-evaluation-suite.

Supersedes #302.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@NimaBoscarino (Contributor) left a comment


Awesome! This looks great to me, and having suites be subclasses of EvaluationSuite is nice because then it'll let power users extend functionality without having to impact the Evaluate library. (e.g. I can see situations where someone might want to both map and filter as part of the preprocessing)

I'm hyped to use it!
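For context, a minimal sketch of such a subclass, assuming the SubTask fields discussed further down in this thread (task_type, data, split, data_preprocessor, args_for_task); the dataset, columns, and label mapping below are illustrative only:

import evaluate
from evaluate.evaluation_suite import SubTask

class Suite(evaluate.EvaluationSuite):
    # Illustrative suite subclass: the data_preprocessor lower-cases each
    # example before evaluation (the "map" step mentioned above).
    def __init__(self, name):
        super().__init__(name)
        self.preprocessor = lambda x: {"text": x["text"].lower()}
        self.suite = [
            SubTask(
                task_type="text-classification",
                data="imdb",                      # illustrative dataset
                split="test[:10]",
                data_preprocessor=self.preprocessor,
                args_for_task={
                    "metric": "accuracy",
                    "input_column": "text",
                    "label_column": "label",
                    "label_mapping": {"LABEL_0": 0.0, "LABEL_1": 1.0},
                },
            )
        ]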

@lvwerra (Member) left a comment


Thanks @mathemakitten for working on this - this is looking great!

On the loading API side, I think there is some confusing overloading of load between evaluate.load and evaluate.evaluation_suite.load. What do you think about adding load as a class method and going for the following API:

from evaluate import EvaluationSuite
suite = EvaluationSuite.load("...")

Then I think we can also integrate setup() into __init__() so we have fewer redundant methods?

Finally, what do you think about adding a __repr__() or save_metadata() method where one can display the suite in a nice format and save the tasks as JSON or YAML for reproducibility? This can also be in a follow-up PR.
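A purely hypothetical sketch of what that could look like (not code from this PR), assuming the suite keeps its SubTask dataclasses in self.suite:

import json
from dataclasses import asdict

class SuiteMetadataMixin:
    # Hypothetical sketch of the suggestion above: serialize the tasks for
    # reproducibility and give the suite a readable repr.

    def save_metadata(self, path):
        metadata = {
            "suite": type(self).__name__,
            "tasks": [asdict(task) for task in self.suite],
        }
        with open(path, "w") as f:
            # default=str keeps non-serializable fields (e.g. a preprocessor
            # callable) from breaking the dump.
            json.dump(metadata, f, indent=2, default=str)

    def __repr__(self):
        tasks = "\n".join(f"  - {t.task_type}: {t.data} ({t.split})" for t in self.suite)
        return f"{type(self).__name__} with {len(self.suite)} tasks:\n{tasks}"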

@lvwerra requested a review from lhoestq on November 2, 2022
@lhoestq (Member) left a comment


Awesome !
Don't forget to add some tests and docs ;)



@dataclass
class SubTask:
Member comment on the quoted snippet:

Did you end up with a better name for this class ?

@mathemakitten (Contributor, Author) replied:

Thanks for reminding me! From the other PR thread, EvaluationJob, EvaluationData or DataSource were suggested.

I assume that you're referring to the tasks defined at https://huggingface.co/tasks — in which case, maybe SubTask is actually quite fitting? For example, the tasks listed under the Text Classification task umbrella are NLI, QNLI, MNLI, QQP — which right now correspond 1:1 to the SubTask object in EvaluationSuite (see an example of GLUE as an EvaluationSuite defined here).

A flip through NLP + CV lit suggests that "Task" is really the canonical name for this type of thing — see references to NLP tasks in Dynabench, SuperGLUE, and the Eleuther LM Harness, and in computer vision as well, such as GRIT. I'd love to keep convention with the field as much as possible so it's obvious to newcomers what the atomic unit of an EvaluationSuite should be. WDYT?
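For readers following along, SubTask is roughly a dataclass that bundles one dataset with the arguments its evaluator needs. A sketch, with field names taken from the GLUE suite example linked above (they may differ slightly from what was finally merged):

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class SubTask:
    # One evaluation job inside an EvaluationSuite: the task type plus the
    # dataset, split, optional preprocessing, and evaluator arguments.
    task_type: str                                # e.g. "text-classification"
    data: str                                     # dataset name on the Hub
    subset: Optional[str] = None                  # e.g. "qnli" for GLUE
    split: Optional[str] = None                   # e.g. "validation[:100]"
    data_preprocessor: Optional[Callable] = None  # applied to the dataset via map
    args_for_task: Optional[dict] = None          # metric, input/label columns, label mapping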

@mathemakitten (Contributor, Author) commented
Thanks @lvwerra @lhoestq @NimaBoscarino for all the thoughtful comments! The test case hosted under the evaluate org includes testing the data preprocessing. Let me know if you have other suggestions for things we should be checking.

@NimaBoscarino has been battle-testing this with the new bias/fairness metrics and said that having the data preprocessing in code made it easy to override, so thank you all for helping to hash out the API for this feature :)

@lvwerra (Member) left a comment


Thanks @mathemakitten this looks great! I think we can extend the tests a bit and make a dedicated test file e.g. test_evaluation_suite.py. Things I would test in addition:

  • Does it work with and without a preprocessor? Also check the expected results.
  • I think it's good to load the one suite hosted on the Hub, but for the rest you can define local test suites. Like in the evaluator, you can also use a dummy pipeline to avoid loading actual models (the CI is getting slow enough already :( ).
  • Maybe there are some failure cases we should check? E.g. should it throw an error if the suite is an empty list?


class TestEvaluationSuite(TestCase):
    def test_suite(self):
        suite = EvaluationSuite.load("evaluate/evaluation-suite-ci")
Contributor comment on the quoted test:

Would it make sense to just mock evaluator for this? Or for a separate test? That way we can test a bunch of different EvaluationSuite configs without actually running inferences, which would be useful for TDD.

@mathemakitten (Contributor, Author) replied:

Good idea! I'd like to make sure we have at least one test which actually tests loading from a script on the Hub like this, but mocking for the rest to save compute/time.
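A sketch of what such a mocked test could look like, assuming suite.run accepts any pipeline-like callable with a task attribute; the dummy pipeline and assertions are illustrative, not the tests that were merged:

from unittest import TestCase

from evaluate import EvaluationSuite

class DummyTextClassificationPipeline:
    # Minimal stand-in for a transformers pipeline: predicts a fixed label,
    # so no model has to be downloaded or run on CI.
    task = "text-classification"

    def __call__(self, inputs, **kwargs):
        return [{"label": "LABEL_1", "score": 1.0} for _ in inputs]

class TestEvaluationSuiteMocked(TestCase):
    def test_suite_with_dummy_pipeline(self):
        # The suite definition is still loaded from the Hub ...
        suite = EvaluationSuite.load("evaluate/evaluation-suite-ci")
        # ... but inference is mocked, so the test stays fast.
        results = suite.run(DummyTextClassificationPipeline())
        self.assertEqual(len(results), len(suite.suite))
        for result in results:
            self.assertIsInstance(result, dict)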


@lvwerra (Member) left a comment


Looks great! Just a few minor nits, then it's ready to go :)

@lvwerra requested a review from lhoestq on November 15, 2022
@lhoestq (Member) left a comment


LGTM ! Can you add some documentation on this ? Maybe with a dedicated page in the docs ?

@lhoestq (Member) commented on Nov 16, 2022

Oh, actually just saw that you opened #340 - never mind

@mathemakitten merged commit a12836b into main on Nov 16, 2022
@mathemakitten deleted the hn-evaluation-suite-v2 branch on November 16, 2022