
Evaluator perf #178

Merged - merged 20 commits into main from evaluator-perf on Jul 19, 2022

Conversation

@ola13 (Contributor) commented Jul 6, 2022

This PR adds perf metrics to evaluate - if the user includes `"perf"` in the Evaluator strategy, we return two additional metrics (a rough sketch of the computation follows the list):

  • latency - total evaluation runtime in seconds
  • throughput - average number of samples per second
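A minimal sketch of how these two values can be derived from wall-clock timestamps and the sample count (the helper name mirrors the `_compute_time_perf` method referenced later in the review; treat the exact signature as illustrative):

from time import perf_counter, sleep


def compute_time_perf(start_time: float, end_time: float, num_samples: int) -> dict:
    # latency: total evaluation runtime in seconds
    # throughput: average number of samples processed per second
    latency = end_time - start_time
    return {"latency": latency, "throughput": num_samples / latency}


start = perf_counter()
sleep(0.1)  # stands in for running the pipeline over the dataset
end = perf_counter()
print(compute_time_perf(start, end, num_samples=3))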

@ola13 requested review from lvwerra and sashavor July 6, 2022 13:18
@ola13 linked an issue Jul 6, 2022 that may be closed by this pull request
@ola13 requested a review from lewtun July 6, 2022 13:20
@HuggingFaceDocBuilderDev commented Jul 6, 2022

The documentation is not available anymore as the PR was closed or merged.

specifies the evaluation strategy. Possible values are:
- `"simple"` - we evaluate the metric and return the scores.
- `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each
of the returned metric keys, using `scipy`'s `bootstrap` method
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
- `"perf"` - on top of computing the metric scores, we calculate the performance metrics: `latency` and `throughput`.
@fxmarty (Contributor) commented Jul 6, 2022

Maybe we could be more explicit and state that the time perfs measure evaluating the metric ("evaluation" on its own could be vague)?

I think measuring the pipeline's time perfs may be interesting as well, but it is more "unfair" to the underlying model, as the pipeline adds a bit of overhead.

@lvwerra (Member) left a comment

Hi @ola13

Thanks for working on this! I have two main comments:

  • I think we should be measuring the performance of the pipeline inference. @lewtun and @philschmid requested this feature to be able to quickly compare a few models (clearly this might not be the most optimized setup with the pipeline, but it should give a ballpark figure). I don't think the metric computation is particularly interesting.
  • I think we can return that additional information by default so we don't need an additional strategy just for that.

What do you think?

@fxmarty (Contributor) commented Jul 7, 2022

@lvwerra Yes, I agree that evaluating pipeline time metrics is probably more interesting. However, this is hardware/load dependent, and we may have to specify which batch size and sequence length we want to evaluate on. I am not sure we would want this evaluation to run by default, as the number of (batch size, sequence length) combinations may be a bit large.

Borrowing from the tune repo, I implemented something with optuna (rather ugly for the input names, if you ask me) here: https://github.com/huggingface/optimum/blob/3e95b355ed0469ffeea13109312cd8d9933a70c7/optimum/onnxruntime/runs/__init__.py#L103-L130 and https://github.com/huggingface/optimum/blob/main/optimum/runs_base.py#L182-L266

@lvwerra (Member) commented Jul 7, 2022

I agree that this would be necessary to do a thorough performance analysis of the model. But since the focus is on the evaluation I would for example not change the sequence length since that's likely to make the task performance worse.

I'll let @lewtun and @philschmid say what they had in mind, but I think just having this rudimentary measurement helps, e.g. when you evaluate 10 models on the same dataset you will see which ones are faster/slower. Maybe this is also a setup closer to production, with varying sequence lengths. If one wanted a more standardized performance evaluation, they could still pass a dummy dataset with a fixed sequence length and explore different batch sizes.

I think we should treat the pipeline as a black box and let the user modify/improve it (e.g. change the default batch size) outside the evaluator.
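As a rough illustration of this black-box approach (a sketch only - the model, dataset, and every argument besides `model_or_pipeline` are examples, not prescriptions): the user tunes the transformers pipeline themselves, e.g. via `batch_size`, and hands the finished object to the evaluator.

from datasets import load_dataset
from evaluate import evaluator
from transformers import pipeline

# User-configured pipeline; the batch size is chosen outside the evaluator.
pipe = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    batch_size=32,
)

task_evaluator = evaluator("text-classification")
results = task_evaluator.compute(
    model_or_pipeline=pipe,  # the evaluator treats this as a black box
    data=load_dataset("imdb", split="test[:100]"),
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
)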

@ola13 (Contributor, Author) commented Jul 7, 2022

> I think we should be measuring the performance of the pipeline inference

Oh yeah, that for sure. That's what I started doing pre-refactor and then got confused 🤦‍♀️

> return that additional information by default

My only concern is that it makes the output look cluttered, so my intuition would be to leave it optional.

> I think we should treat the pipeline as a black box and let the user modify/improve it (e.g. change the default batch size) outside the evaluator.

I agree with @lvwerra here, but let me know if you feel strongly about it, @fxmarty.

@lvwerra (Member) commented Jul 7, 2022

> My only concern is that it makes the output look cluttered, so my intuition would be to leave it optional.

I agree that this might happen, but I think generally it's a nice-to-have, and if a lot of people complain we can still add a flag later. That way everybody will see the feature when they first use the evaluator, which is nice. Otherwise it might become a hidden treasure.

@lewtun (Member) commented Jul 7, 2022

Thanks a lot for working on this @ola13 🚀 !

What @lvwerra says about treating the pipeline as a black box is close to what I had in mind with this feature.

Ideally, we want to get a rough idea of how long it takes to feed the inputs through the pipeline (this line is the one to time), and computing these metrics by default makes sense IMO (similar to how the transformers.Trainer reports runtime metrics by default).

Happy to review this more closely when you've refactored the code a bit :)

@lvwerra (Member) left a comment

Looks pretty good overall. Left a few comments.

The main point is figuring out a general way to wrap the performance measure. It is not great if we need to add it for every task. We can fix this in a subsequent PR but we should do it soon for new evaluators.

pred_label = [max(pred, key=lambda x: x["score"])["label"] for pred in predictions]
predictions = [label_mapping[pred] if label_mapping is not None else pred for pred in pred_label]

result = Evaluator._compute_time_perf(start_time, end_time, len(predictions))
(Member) commented

Since we inherit, we can just do this, right?

Suggested change:
- result = Evaluator._compute_time_perf(start_time, end_time, len(predictions))
+ result = self._compute_time_perf(start_time, end_time, len(predictions))

Comment on lines 133 to 135
start_time = perf_counter()
predictions = pipe(data[input_column], truncation=True)
end_time = perf_counter()
(Member) commented

We should probably refactor this such that the pipe call is wrapped in the base class and we don't need to implement the perf logic for every task.

In the parent class we would have something like this:

def compute_predictions_with_perf(self):
    start_time = perf_counter()
    predictions = self.compute_predictions()
    end_time = perf_counter()
    return predictions, self._compute_time_perf(start_time, end_time, len(predictions))

And in the task specific class we define the logic for the task:

def compute_predictions(self):
    return self.pipe(self.data[self.input_column], truncation=True)

Not 100% sure that's exactly the way to do it but I think this would make it easier for future additions. We don't need to tackle this in this PR necessarily but we should come up with a scalable solution next.

cc @fxmarty who's been working on other evaluators.

What do you think?

(Contributor, Author) commented

Looking at it again, I think what would make the most sense right now is to only have compute_predictions() defined in the child classes, while keeping the whole compute logic in the base class without splitting it into compute and core_compute - I'll go ahead and try it out and will ping you to check it out after.

@fxmarty (Contributor) commented Jul 8, 2022

I have no strong opinion on this, but I think a context manager could make sense here. Something like:

with MeasureTime() as timing:
    # compute predictions here, not necessarily in a separate method?

results = {"latency": timing.latency, "throughput": timing.throughput}

Otherwise, what we currently have, or your suggestion, Leandro, is fine too.

I am not sure about having compute() in the parent class - how would custom cases be treated (like https://github.com/fxmarty/evaluate/blob/540956fb081cf909d29df02cb70dbab7db828b37/src/evaluate/evaluator/question_answering.py#L172-L178 )? Redefine compute() in the child class?

The PR looks good otherwise!
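A minimal sketch of the context-manager idea from the comment above (MeasureTime and its latency/throughput attributes are hypothetical, not an existing evaluate API; passing the sample count to the constructor is one assumption for how throughput could be obtained):

from time import perf_counter


class MeasureTime:
    # Hypothetical helper: times the enclosed block and exposes latency/throughput.
    def __init__(self, num_samples: int):
        self.num_samples = num_samples

    def __enter__(self):
        self._start = perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.latency = perf_counter() - self._start
        self.throughput = self.num_samples / self.latency
        return False  # do not suppress exceptions


with MeasureTime(num_samples=3) as timing:
    predictions = ["positive", "negative", "negative"]  # stand-in for pipe(...)

results = {"latency": timing.latency, "throughput": timing.throughput}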

Comment on lines 165 to 177
model = AutoModelForSequenceClassification.from_pretrained(self.default_model)
tokenizer = AutoTokenizer.from_pretrained(self.default_model)
(Member) commented

Besides the actual pipe-loading tests, I would use the dummy pipeline to make sure the tests run fast. Otherwise we load distilbert n times, which is not great.

(Contributor, Author) commented

Sounds good, I'll send the fix in a separate PR since it's not directly related to this one (unless you think I should just deal with it here?)

@ola13 requested a review from lvwerra July 8, 2022 15:43
A `Dict`. The keys represent metric keys calculated for the `metric` specified in the function arguments. For the
`"simple"` strategy, the value is the metric score. For the `"bootstrap"` strategy, the value is a `Dict`
containing the score, the confidence interval and the standard error calculated for each metric key.
Examples:
(Contributor) commented

I am not sure about aggregating examples here, or in fact about the unified compute() documentation. For example, the defaults may differ depending on the subclass (TextClassificationEvaluator, TokenClassificationEvaluator). We may have different arguments for different subclasses' compute(), for example https://github.com/fxmarty/evaluate/blob/540956fb081cf909d29df02cb70dbab7db828b37/src/evaluate/evaluator/question_answering.py#L92-L93 .

Maybe we could do something like:

@add_start_docstrings(COMPUTE_ARGS)
@add_start_docstrings(BOOTSTRAP_ARGS)
def compute(
    self,
    subclass_specific_arg1,
    subclass_specific_arg2,
    **kwargs,
) -> Tuple[Dict[str, float], Any]:
    """
    subclass-specific doc here
    """
    super().compute(...)  # or redefine compute() here if necessary; how subclass-specific args pass to the parent's compute() is not defined

I feel like the parent's compute() will rapidly become irrelevant for specific subclasses as we add them, but I'm not sure. For question answering it would not fit, I think.

@lvwerra (Member) commented Jul 12, 2022

Thanks @ola13 for working on this and @fxmarty for raising valid points! I think it's true that our current predictions+references paradigm breaks down for some tasks, since some pipelines will have more inputs (e.g. QA will have questions and contexts) and some metrics take different inputs (e.g. perplexity takes only predictions, and some translation metrics also take the source).

I tried to draw a diagram to mark a clearer boundary between the things that we should do in the task-specific class vs. the base class.
[diagram: split of responsibilities between the task-specific class and the base class]

The main point is that if we keep the objects that go into the pipeline/metric as dicts, we can be more flexible. E.g. the pipeline call in the base class would essentially be (note that the names don't need to be exactly this - I just put in the first thing that came to mind):

def pipeline_call(self, pipeline_inputs):
    start = perf_counter()
    predictions = self.pipeline(**pipeline_inputs)
    end = perf_counter()
    self.result.update(self._compute_perf(...))
    return predictions

I think the compute method could stay in the base class, but we could make it private, and it would essentially be:

def _compute(self, ...):

    # args housekeeping here

    self.metric = self.prepare_metric(metric)
    self.pipeline = self.prepare_pipeline(pipeline)
    pipeline_inputs, metric_inputs = self.prepare_data(data, ...)
    
    pipeline_outputs = self.pipeline_call(pipeline_inputs)
    pipeline_outputs = self.process(pipeline_outputs)  # task-specific post-processing
    
    metric_inputs.update(pipeline_outputs)
    results = self.metric_call(metric_inputs)
    
    return results

That would allow for some customization in the task class's compute if necessary, and if none is necessary this is simply:

def compute(self, ...):
    return self._compute(...)

What do you think about that proposal, @ola13 and @fxmarty? Note that the actual functions may be a bit more complex, but I tried to reflect just the main logic.

@fxmarty (Contributor) commented Jul 13, 2022

Sounds good to me! I am just curious about the args to put here in the base _compute():

pipeline_inputs, metric_inputs = self.prepare_data(data, ...)

Something like

def compute(self, ..., question_column, answers_column):
    self.column_names = [question_column, answers_column]
    self._compute(..., self.column_names, label_column)

and

def _compute(self, ...):
    ...
    pipeline_inputs, metric_inputs = self.prepare_data(data, label_column, column_names)

? Since we are just checking that they are in the dataset? But this design leans more towards a prepare_data() in the parent class, so I'm not sure.

@lvwerra (Member) commented Jul 13, 2022

That's a good point - since the args of compute might change slightly depending on the task, it might be easiest to just keep the compute method in the task class.

@ola13 (Contributor, Author) commented Jul 14, 2022

First I love the graph <3 thanks for putting in the effort @lvwerra

Now, my intuitions about polymorphism come from C++, so if there are major differences about how that should work in Python, please let me know.

But with that in mind, I do think that putting compute in the task classes would just add unnecessary code duplication. I think it should be implemented in the base class, and we can override other methods in the task classes as necessary; thanks to Python's polymorphism with abstract classes, instances of the child (task) classes will work as expected.

With that in mind, we can have compute in the base class and override methods such as prepare_data in task classes if necessary. That's what I tried to do in my latest revision of this PR, and I tested that it works as expected. The difference between what @lvwerra is proposing and what I did is having a unified pipeline_call instead of separate compute_references and compute_predictions, which I agree is more general. The extra necessary arguments could be dealt with via kwargs. This would be relevant in the particular case of question answering, where we can have the prepare_data implementation overridden in the task class as suggested here #179 (comment).

Let me know what you think!
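For illustration, a minimal sketch of the inheritance pattern being discussed - compute lives in the base class and task classes override only the hooks they need (all names here are illustrative, not the actual evaluate implementation):

from time import perf_counter


class BaseEvaluatorSketch:
    def prepare_data(self, data, **kwargs):
        # Default preparation; task classes may override this.
        return {"inputs": data["text"]}, {"references": data["label"]}

    def call_pipeline(self, pipe, pipe_inputs):
        # Timing lives here, so every task gets perf metrics for free.
        start = perf_counter()
        predictions = pipe(**pipe_inputs)
        latency = perf_counter() - start
        return predictions, {"latency": latency, "throughput": len(predictions) / latency}

    def compute(self, pipe, data, metric, **kwargs):
        pipe_inputs, metric_inputs = self.prepare_data(data, **kwargs)
        predictions, perf = self.call_pipeline(pipe, pipe_inputs)
        metric_inputs["predictions"] = predictions
        results = metric.compute(**metric_inputs)
        results.update(perf)
        return results


class QuestionAnsweringEvaluatorSketch(BaseEvaluatorSketch):
    def prepare_data(self, data, question_column="question", context_column="context", **kwargs):
        # QA needs more pipeline inputs than the default text/label pair.
        pipe_inputs = {"question": data[question_column], "context": data[context_column]}
        metric_inputs = {"references": data["answers"]}
        return pipe_inputs, metric_inputs

With this split, compute is written once; the QA case only swaps out prepare_data, which is the kind of override mentioned above for #179.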

@lvwerra (Member) commented Jul 14, 2022

Thanks for your thoughts @ola13! From a rigorous perspective, what you say makes sense, and I don't have a strong opinion about moving compute to the parent class either: I have a feeling we might adapt it a bit for most tasks (e.g. image classification requires a feature extractor instead of a tokenizer) and QA requires some special argument parsing.

I am happy to put the compute method of text-classification in the base class as a default that may be useful to other tasks, but I want to avoid spending a lot of time engineering a very generic solution that fits all. What do you think?

@ola13 (Contributor, Author) commented Jul 14, 2022

Sounds good! Do you want to do that as part of #185, and then I can adapt this PR?

@ola13 requested review from lvwerra and removed request for lvwerra July 18, 2022 10:35
@lvwerra (Member) left a comment

Thanks a lot for the changes! Just some minor comments, I think we are almost good to go! 🚀

Comment on lines 156 to 158
start_time = perf_counter()
predictions = self.call_pipeline(pipe, pipe_inputs)
end_time = perf_counter()
(Member) commented

What do you think about moving the timing inside the call_pipeline method? That would require less logic if one needs to rewrite the compute method.

Suggested change:
- start_time = perf_counter()
- predictions = self.call_pipeline(pipe, pipe_inputs)
- end_time = perf_counter()
+ predictions, perf_stats = self.call_pipeline(pipe, pipe_inputs)

(Contributor, Author) commented

Sounds good! I was thinking about the case where call_pipeline would be overridden, but I guess it makes more sense to have the pipe perf metrics computed along with calling the pipeline.

        self.task = "text-classification"

    def __call__(self, text, **kwargs):
        sleep(2)
(Member) commented

Do we really need a second class? We could just add a flag to the main dummy class at init so we can enable sleep. I am also in favour of sleep(0.1) or something to make sure we don't slow down the tests too much.
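A small sketch of the single-dummy-class idea (the class name, the sleep_time flag, and the returned labels are placeholders for illustration):

from time import sleep


class DummyTextClassificationPipeline:
    def __init__(self, sleep_time: float = 0.0):
        self.task = "text-classification"
        self.sleep_time = sleep_time

    def __call__(self, text, **kwargs):
        if self.sleep_time:
            sleep(self.sleep_time)  # simulate latency only when requested
        # Placeholder predictions; the real test dummy returns task-appropriate labels.
        return [{"label": "POSITIVE", "score": 1.0} for _ in text]

The perf test could then construct DummyTextClassificationPipeline(sleep_time=0.1) while every other test keeps the default and stays fast.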

data = Dataset.from_dict({"label": [1, 0, 0], "text": ["great movie", "great movie", "horrible movie"]})

results = self.evaluator.compute(
    model_or_pipeline=self.model,
(Member) commented

Shouldn't this test the perf dummy instead of the model?

random_state=0,
)
self.assertAlmostEqual(results["accuracy"], 0.666666, 5)
self.assertTrue("latency" in results)
(Member) commented

Since the dummy has a sleep - should we test that? Otherwise we don't need to sleep, right?

@@ -160,6 +168,48 @@ def test_bootstrap(self):
self.assertAlmostEqual(results["accuracy"]["confidence_interval"][1], 0.68326, 5)
self.assertAlmostEqual(results["accuracy"]["standard_error"], 0.24595, 5)

def test_perf(self):
data = Dataset.from_dict({"label": [1, 0, 0], "text": ["great movie", "great movie", "horrible movie"]})
(Member) commented

why can't we use self.data here?

self.assertEqual(results["accuracy"], 0)


class TestEvaluatorPerf(TestCase):
(Member) commented

This seems a bit redundant with test_perf/test_bootstrap_and_perf? I think we can just keep those tests as part of the main test suite and don't need an extra test case for now. Wdyt? Then we don't need to define everything again.

@ola13 requested a review from lvwerra July 18, 2022 19:38
@lvwerra (Member) left a comment

LGTM! 🚀

@lvwerra merged commit 6ef285c into main Jul 19, 2022
@lvwerra deleted the evaluator-perf branch July 19, 2022 09:22
Successfully merging this pull request may close these issues.

Add runtime metrics to Evaluator