
Evaluator perf #178

Merged - merged 20 commits into main from evaluator-perf on Jul 19, 2022

Conversation

@ola13 (Contributor) commented Jul 6, 2022

This PR adds perf metrics to evaluate - if the user includes `"perf"` in the Evaluator strategy, we return two additional metrics (a rough sketch of the computation follows the list):

  • latency - total evaluation runtime in seconds
  • throughput - average number of samples per second
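A minimal sketch of how these two values can be derived from wall-clock timestamps and the sample count (the helper name mirrors the `_compute_time_perf` method referenced later in the review; treat the exact signature as illustrative):

from time import perf_counter, sleep


def compute_time_perf(start_time: float, end_time: float, num_samples: int) -> dict:
    # latency: total evaluation runtime in seconds
    # throughput: average number of samples processed per second
    latency = end_time - start_time
    return {"latency": latency, "throughput": num_samples / latency}


start = perf_counter()
sleep(0.1)  # stands in for running the pipeline over the dataset
end = perf_counter()
print(compute_time_perf(start, end, num_samples=3))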

@ola13 requested review from lvwerra and sashavor July 6, 2022 13:18
@ola13 linked an issue Jul 6, 2022 that may be closed by this pull request
@ola13 requested a review from lewtun July 6, 2022 13:20
@HuggingFaceDocBuilderDev commented Jul 6, 2022

The documentation is not available anymore as the PR was closed or merged.

specifies the evaluation strategy. Possible values are:
- `"simple"` - we evaluate the metric and return the scores.
- `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each
of the returned metric keys, using `scipy`'s `bootstrap` method
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
- `"perf"` - on top of computing the metric scores, we calculate the performance metrics: `latency` and `throughput`.
@fxmarty (Contributor) commented Jul 6, 2022

Maybe we could be more explicit and state that the time perfs measure evaluating the metric ("evaluation" on its own could be vague)?

I think measuring the pipeline's time perfs may be interesting as well, but it is more "unfair" to the underlying model, as the pipeline adds a bit of overhead.

@lvwerra (Member) left a comment

Hi @ola13

Thanks for working on this! I have two main comments:

  • I think we should be measuring the performance of the pipeline inference. @lewtun and @philschmid requested this feature to be able to quickly compare a few models (clearly this might not be the most optimized setup with the pipeline, but it should give a ballpark figure). I don't think the metric computation is particularly interesting.
  • I think we can return that additional information by default so we don't need an additional strategy just for that.

What do you think?

@fxmarty (Contributor) commented Jul 7, 2022

@lvwerra Yes, I agree that evaluating pipeline time metrics is probably more interesting. However, this is hardware/load dependent, and we may have to specify which batch size and sequence length we want to evaluate on. I am not sure we would want this evaluation to run by default, as the number of (batch size, sequence length) combinations may be a bit large.

Borrowing from the tune repo, I implemented something with optuna (rather ugly for the input names, if you ask me) here: https://github.com/huggingface/optimum/blob/3e95b355ed0469ffeea13109312cd8d9933a70c7/optimum/onnxruntime/runs/__init__.py#L103-L130 and https://github.com/huggingface/optimum/blob/main/optimum/runs_base.py#L182-L266

@lvwerra (Member) commented Jul 7, 2022

I agree that this would be necessary to do a thorough performance analysis of the model. But since the focus is on the evaluation I would for example not change the sequence length since that's likely to make the task performance worse.

I'll let @lewtun and @philschmid say what they had in mind, but I think just having this rudimentary measurement helps, e.g. when you evaluate 10 models on the same dataset you will see which ones are faster/slower. Maybe this is also a setup closer to production, with varying sequence lengths. If one wanted a more standardized performance evaluation, they could still pass a dummy dataset with a fixed sequence length and explore different batch sizes.

I think we should treat the pipeline as a black box and let the user modify/improve it (e.g. change the default batch size) outside the evaluator.
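As a rough illustration of this black-box approach (a sketch only - the model, dataset, and every argument besides `model_or_pipeline` are examples, not prescriptions): the user tunes the transformers pipeline themselves, e.g. via `batch_size`, and hands the finished object to the evaluator.

from datasets import load_dataset
from evaluate import evaluator
from transformers import pipeline

# User-configured pipeline; the batch size is chosen outside the evaluator.
pipe = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    batch_size=32,
)

task_evaluator = evaluator("text-classification")
results = task_evaluator.compute(
    model_or_pipeline=pipe,  # the evaluator treats this as a black box
    data=load_dataset("imdb", split="test[:100]"),
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
)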

@ola13 (Contributor, Author) commented Jul 7, 2022

> I think we should be measuring the performance of the pipeline inference

Oh yeah, that for sure. That's what I started doing pre-refactor and then got confused 🤦‍♀️

> return that additional information by default

My only concern is that it makes the output look cluttered, so my intuition would be to leave it optional.

> I think we should treat the pipeline as a black box and let the user modify/improve it (e.g. change the default batch size) outside the evaluator.

I agree with @lvwerra here, but let me know if you feel strongly about it, @fxmarty.

@lvwerra (Member) commented Jul 7, 2022

> My only concern is that it makes the output look cluttered, so my intuition would be to leave it optional.

I agree that this might happen, but I think generally it's a nice-to-have, and if a lot of people complain we can still add a flag later. That way everybody will see the feature when they first use the evaluator, which is nice. Otherwise it might become a hidden treasure.

@lewtun (Member) commented Jul 7, 2022

Thanks a lot for working on this @ola13 🚀 !

What @lvwerra says about treating the pipeline as a black box is close to what I had in mind with this feature.

Ideally, we want to get a rough idea of how long it takes to feed the inputs through the pipeline (this line is the one to time), and computing these metrics by default makes sense IMO (similar to how the transformers.Trainer reports runtime metrics by default).

Happy to review this more closely when you've refactored the code a bit :)

@lvwerra (Member) left a comment

Looks pretty good overall. Left a few comments.

The main point is figuring out a general way to wrap the performance measure. It is not great if we need to add it for every task. We can fix this in a subsequent PR but we should do it soon for new evaluators.

pred_label = [max(pred, key=lambda x: x["score"])["label"] for pred in predictions]
predictions = [label_mapping[pred] if label_mapping is not None else pred for pred in pred_label]

result = Evaluator._compute_time_perf(start_time, end_time, len(predictions))
(Member) commented

Since we inherit, we can just do this, right?

Suggested change:
- result = Evaluator._compute_time_perf(start_time, end_time, len(predictions))
+ result = self._compute_time_perf(start_time, end_time, len(predictions))

Comment on lines 133 to 135
start_time = perf_counter()
predictions = pipe(data[input_column], truncation=True)
end_time = perf_counter()
(Member) commented

We should probably refactor this such that the pipe call is wrapped in the base class and we don't need to implement the perf logic for every task.

In the parent class we would have something like this:

def compute_predictions_with_perf(self):
    start_time = perf_counter()
    predictions = self.compute_predictions()
    end_time = perf_counter()
    return predictions, self._compute_time_perf(start_time, end_time, len(predictions))

And in the task specific class we define the logic for the task:

def compute_predictions(self):
    return self.pipe(self.data[self.input_column], truncation=True)

Not 100% sure that's exactly the way to do it but I think this would make it easier for future additions. We don't need to tackle this in this PR necessarily but we should come up with a scalable solution next.

cc @fxmarty who's been working on other evaluators.

What do you think?

(Contributor, Author) commented

Looking at it again, I think what would make the most sense right now is to only have compute_predictions() defined in the child classes, while keeping the whole compute logic in the base class without splitting it into compute and core_compute - I'll go ahead and try it out and will ping you to check it out after.

@fxmarty (Contributor) commented Jul 8, 2022

I have no strong opinion on this, but I think a context manager could make sense here. Something like:

with MeasureTime() as timing:
    # compute predictions here, not necessarily in a separate method?

results = {"latency": timing.latency, "throughput": timing.throughput}

Otherwise, what we currently have, or your suggestion, Leandro, is fine too.

I am not sure about having compute() in the parent class - how would custom cases be treated (like https://github.com/fxmarty/evaluate/blob/540956fb081cf909d29df02cb70dbab7db828b37/src/evaluate/evaluator/question_answering.py#L172-L178 )? Redefine compute() in the child class?

The PR looks good otherwise!
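A minimal sketch of the context-manager idea from the comment above (MeasureTime and its latency/throughput attributes are hypothetical, not an existing evaluate API; passing the sample count to the constructor is one assumption for how throughput could be obtained):

from time import perf_counter


class MeasureTime:
    # Hypothetical helper: times the enclosed block and exposes latency/throughput.
    def __init__(self, num_samples: int):
        self.num_samples = num_samples

    def __enter__(self):
        self._start = perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.latency = perf_counter() - self._start
        self.throughput = self.num_samples / self.latency
        return False  # do not suppress exceptions


with MeasureTime(num_samples=3) as timing:
    predictions = ["positive", "negative", "negative"]  # stand-in for pipe(...)

results = {"latency": timing.latency, "throughput": timing.throughput}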

Comment on lines 165 to 177
model = AutoModelForSequenceClassification.from_pretrained(self.default_model)
tokenizer = AutoTokenizer.from_pretrained(self.default_model)
(Member) commented

Besides the actual pipe-loading tests, I would use the dummy pipeline to make sure the tests run fast. Otherwise we load distilbert n times, which is not great.

(Contributor, Author) commented

Sounds good, I'll send the fix in a separate PR since it's not directly related to this one (unless you think I should just deal with it here?)

@ola13 requested a review from lvwerra July 8, 2022 15:43
A `Dict`. The keys represent metric keys calculated for the `metric` specified in the function arguments. For the
`"simple"` strategy, the value is the metric score. For the `"bootstrap"` strategy, the value is a `Dict`
containing the score, the confidence interval and the standard error calculated for each metric key.
Examples:
(Contributor) commented

I am not sure about aggregating examples here, or in fact about the unified compute() documentation. For example, the defaults may differ depending on the subclass (TextClassificationEvaluator, TokenClassificationEvaluator). We may have different arguments for different subclasses' compute(), for example https://github.com/fxmarty/evaluate/blob/540956fb081cf909d29df02cb70dbab7db828b37/src/evaluate/evaluator/question_answering.py#L92-L93 .

Maybe we could do something like:

@add_start_docstrings(COMPUTE_ARGS)
@add_start_docstrings(BOOTSTRAP_ARGS)
def compute(
    self,
    subclass_specific_arg1,
    subclass_specific_arg2,
    **kwargs,
) -> Tuple[Dict[str, float], Any]:
    """
    subclass-specific doc here
    """
    super().compute(...)  # or redefine compute() here if necessary; how subclass-specific args pass to the parent's compute() is not defined

I feel like the parent's compute() will rapidly become irrelevant for specific subclasses as we add them, but I'm not sure. For question answering it would not fit, I think.

@lvwerra (Member) commented Jul 12, 2022

Thanks @ola13 for working on this and @fxmarty for raising valid points! I think it's true that our current predictions+references paradigm breaks down for some tasks, since some pipelines will have more inputs (e.g. QA will have questions and contexts) and some metrics take different inputs (e.g. perplexity takes only predictions, and some translation metrics also take the source).

I tried to draw a diagram to mark a clearer boundary between the things that we should do in the task-specific class vs. the base class.
[diagram: split of responsibilities between the task-specific class and the base class]

The main point is that if we keep the objects that go into the pipeline/metric as dicts, we can be more flexible. E.g. the pipeline call in the base class would essentially be (note that the names don't need to be exactly this - I just put in the first thing that came to mind):

def pipeline_call(self, pipeline_inputs):
    start = perf_counter()
    predictions = self.pipeline(**pipeline_inputs)
    end = perf_counter()
    self.result.update(self._compute_perf(...))
    return predictions

I think the compute method could stay in the base class, but we could make it private, and it would essentially be:

def _compute(self, ...):

    # args housekeeping here

    self.metric = self.prepare_metric(metric)
    self.pipeline = self.prepare_pipeline(pipeline)
    pipeline_inputs, metric_inputs = self.prepare_data(data, ...)
    
    pipeline_outputs = self.pipeline_call(pipeline_inputs)
    pipeline_outputs = self.process(pipeline_outputs)  # task-specific post-processing
    
    metric_inputs.update(pipeline_outputs)
    results = self.metric_call(metric_inputs)
    
    return results

That would allow for some customization in the task class's compute if necessary, and if none is necessary this is simply:

def compute(self, ...):
    return self._compute(...)

What do you think about that proposal, @ola13 and @fxmarty? Note that the actual functions may be a bit more complex, but I tried to reflect just the main logic.

@fxmarty (Contributor) commented Jul 13, 2022

Sounds good to me! I am just curious about the args to put here in the base _compute():

pipeline_inputs, metric_inputs = self.prepare_data(data, ...)

Something like

def compute(self, ..., question_column, answers_column):
    self.column_names = [question_column, answers_column]
    self._compute(..., self.column_names, label_column)

and

def _compute(self, ...):
    ...
    pipeline_inputs, metric_inputs = self.prepare_data(data, label_column, column_names)

? Since we are just checking that they are in the dataset? But this design leans more towards a prepare_data() in the parent class, so I'm not sure.

@lvwerra (Member) commented Jul 13, 2022

That's a good point - since the args of compute might change slightly depending on the task, it might be easiest to just keep the compute method in the task class.

@ola13 (Contributor, Author) commented Jul 14, 2022

First I love the graph <3 thanks for putting in the effort @lvwerra

Now, my intuitions about polymorphism come from C++, so if there are major differences about how that should work in Python, please let me know.

But with that in mind, I do think that putting compute in the task classes would just add unnecessary code duplication. I think it should be implemented in the base class, and we can override other methods in the task classes as necessary; thanks to Python's polymorphism with abstract classes, instances of the child (task) classes will work as expected.

With that in mind, we can have compute in the base class and override methods such as prepare_data in task classes if necessary. That's what I tried to do in my latest revision of this PR, and I tested that it works as expected. The difference between what @lvwerra is proposing and what I did is having a unified pipeline_call instead of separate compute_references and compute_predictions, which I agree is more general. The extra necessary arguments could be dealt with via kwargs. This would be relevant in the particular case of question answering, where we can have the prepare_data implementation overridden in the task class as suggested here #179 (comment).

Let me know what you think!
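For illustration, a minimal sketch of the inheritance pattern being discussed - compute lives in the base class and task classes override only the hooks they need (all names here are illustrative, not the actual evaluate implementation):

from time import perf_counter


class BaseEvaluatorSketch:
    def prepare_data(self, data, **kwargs):
        # Default preparation; task classes may override this.
        return {"inputs": data["text"]}, {"references": data["label"]}

    def call_pipeline(self, pipe, pipe_inputs):
        # Timing lives here, so every task gets perf metrics for free.
        start = perf_counter()
        predictions = pipe(**pipe_inputs)
        latency = perf_counter() - start
        return predictions, {"latency": latency, "throughput": len(predictions) / latency}

    def compute(self, pipe, data, metric, **kwargs):
        pipe_inputs, metric_inputs = self.prepare_data(data, **kwargs)
        predictions, perf = self.call_pipeline(pipe, pipe_inputs)
        metric_inputs["predictions"] = predictions
        results = metric.compute(**metric_inputs)
        results.update(perf)
        return results


class QuestionAnsweringEvaluatorSketch(BaseEvaluatorSketch):
    def prepare_data(self, data, question_column="question", context_column="context", **kwargs):
        # QA needs more pipeline inputs than the default text/label pair.
        pipe_inputs = {"question": data[question_column], "context": data[context_column]}
        metric_inputs = {"references": data["answers"]}
        return pipe_inputs, metric_inputs

With this split, compute is written once; the QA case only swaps out prepare_data, which is the kind of override mentioned above for #179.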

@lvwerra (Member) commented Jul 14, 2022

Thanks for your thoughts @ola13! From a rigorous perspective, what you say makes sense, and I don't have a strong opinion about moving compute to the parent class either: I have a feeling we might adapt it a bit for most tasks (e.g. image classification requires a feature extractor instead of a tokenizer) and QA requires some special argument parsing.

I am happy to put the compute method of text-classification in the base class as a default that may be useful to other tasks, but I want to avoid spending a lot of time engineering a very generic solution that fits all. What do you think?

@ola13 (Contributor, Author) commented Jul 14, 2022

Sounds good! Do you want to do that as part of #185, and then I can adapt this PR?

@ola13 requested review from lvwerra and removed request for lvwerra July 18, 2022 10:35
@lvwerra (Member) left a comment

Thanks a lot for the changes! Just some minor comments, I think we are almost good to go! 🚀

Comment on lines 156 to 158
start_time = perf_counter()
predictions = self.call_pipeline(pipe, pipe_inputs)
end_time = perf_counter()
(Member) commented

What do you think about moving the timing inside the call_pipeline method? That would require less logic if one needs to rewrite the compute method.

Suggested change:
- start_time = perf_counter()
- predictions = self.call_pipeline(pipe, pipe_inputs)
- end_time = perf_counter()
+ predictions, perf_stats = self.call_pipeline(pipe, pipe_inputs)

(Contributor, Author) commented

Sounds good! I was thinking about the case where call_pipeline would be overridden, but I guess it makes more sense to have the pipe perf metrics computed along with calling the pipeline.

        self.task = "text-classification"

    def __call__(self, text, **kwargs):
        sleep(2)
(Member) commented

Do we really need a second class? We could just add a flag to the main dummy class at init so we can enable sleep. I am also in favour of sleep(0.1) or something to make sure we don't slow down the tests too much.
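A small sketch of the single-dummy-class idea (the class name, the sleep_time flag, and the returned labels are placeholders for illustration):

from time import sleep


class DummyTextClassificationPipeline:
    def __init__(self, sleep_time: float = 0.0):
        self.task = "text-classification"
        self.sleep_time = sleep_time

    def __call__(self, text, **kwargs):
        if self.sleep_time:
            sleep(self.sleep_time)  # simulate latency only when requested
        # Placeholder predictions; the real test dummy returns task-appropriate labels.
        return [{"label": "POSITIVE", "score": 1.0} for _ in text]

The perf test could then construct DummyTextClassificationPipeline(sleep_time=0.1) while every other test keeps the default and stays fast.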

data = Dataset.from_dict({"label": [1, 0, 0], "text": ["great movie", "great movie", "horrible movie"]})

results = self.evaluator.compute(
    model_or_pipeline=self.model,
(Member) commented

Shouldn't this test the perf dummy instead of the model?

random_state=0,
)
self.assertAlmostEqual(results["accuracy"], 0.666666, 5)
self.assertTrue("latency" in results)
(Member) commented

Since the dummy has a sleep - should we test that? Otherwise we don't need to sleep, right?

@@ -160,6 +168,48 @@ def test_bootstrap(self):
self.assertAlmostEqual(results["accuracy"]["confidence_interval"][1], 0.68326, 5)
self.assertAlmostEqual(results["accuracy"]["standard_error"], 0.24595, 5)

def test_perf(self):
data = Dataset.from_dict({"label": [1, 0, 0], "text": ["great movie", "great movie", "horrible movie"]})
(Member) commented

why can't we use self.data here?

self.assertEqual(results["accuracy"], 0)


class TestEvaluatorPerf(TestCase):
(Member) commented

This seems a bit redundant with test_perf/test_bootstrap_and_perf? I think we can just keep those tests as part of the main test suite and don't need an extra test case for now. Wdyt? Then we don't need to define everything again.

@ola13 requested a review from lvwerra July 18, 2022 19:38
@lvwerra (Member) left a comment

LGTM! 🚀

@lvwerra merged commit 6ef285c into main Jul 19, 2022
@lvwerra deleted the evaluator-perf branch July 19, 2022 09:22
Successfully merging this pull request may close these issues.

Add runtime metrics to Evaluator