Evaluator perf #178
Conversation
The documentation is not available anymore as the PR was closed or merged.
specifies the evaluation strategy. Possible values are:
- `"simple"` - we evaluate the metric and return the scores.
- `"bootstrap"` - on top of computing the metric scores, we calculate the confidence interval for each
  of the returned metric keys, using `scipy`'s `bootstrap` method
  https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
- `"perf"` - on top of computing the metric scores, we calculate the performance metrics -
Maybe we could be more explicit and state that the time perfs measure the metric evaluation (just "evaluation" could be vague)?
I think evaluating the pipeline time perfs may be interesting as well but it is more "unfair" to the underlying model as the pipeline adds a bit of overhead.
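For context, a rough sketch of how a user would pick one of these strategies (the import path, evaluator class, and argument names are assumed from elsewhere in this PR and may differ in the final API):

from datasets import load_dataset
from evaluate.evaluator import TextClassificationEvaluator  # import path assumed

task_evaluator = TextClassificationEvaluator()
data = load_dataset("imdb", split="test").select(range(100))

results = task_evaluator.compute(
    model_or_pipeline="distilbert-base-uncased-finetuned-sst-2-english",
    data=data,
    metric="accuracy",
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
    strategy="bootstrap",  # or "simple" / "perf", per the docstring above
    random_state=0,
)
print(results)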
Hi @ola13
Thanks for working on this! I have two main comments on this:
- I think we should be measuring the performance of the pipeline inference. @lewtun and @philschmid requested this feature to be able to quickly compare a few models (clearly this might not be the most optimized with the `pipeline`, but it should give a ballpark figure). I don't think the metric computation is particularly interesting.
- I think we can return that additional information by default, so we don't need an additional strategy just for that.
What do you think?
@lvwerra Yes, I agree that evaluating pipeline time metrics is probably more interesting. However, this is hardware/load dependent, and we may have to specify which batch size and sequence length we want to evaluate on. I am not sure we would want this evaluation to run by default, as the number of (batch size, sequence length) combinations may be a bit large. Borrowing from the tune repo, I implemented something (rather ugly for the input names if you ask me) with optuna here: https://github.com/huggingface/optimum/blob/3e95b355ed0469ffeea13109312cd8d9933a70c7/optimum/onnxruntime/runs/__init__.py#L103-L130 and https://github.com/huggingface/optimum/blob/main/optimum/runs_base.py#L182-L266
I agree that this would be necessary to do a thorough performance analysis of the model. But since the focus is on the evaluation, I would for example not change the sequence length, since that's likely to make the task performance worse. I'll let @lewtun and @philschmid say what they had in mind, but I think just having this rudimentary measurement helps, e.g. when you evaluate 10 models on the same dataset you will see which ones are faster/slower. Maybe this is also a more realistic setup, closer to production, with different sequence lengths. If one wanted a more standardized performance evaluation they could still pass a dummy dataset with fixed sequence length and explore different batch sizes. I think we should treat the pipeline as a black box and let the user modify/improve it (e.g. change the default batch size) outside the evaluator.
Oh yeah, that's for sure. That's what I started doing pre-refactor and then got confused 🤦♀️
My only concern is that it makes the output look cluttered, so my intuition would be to leave it optional.
I agree with @lvwerra here, but let me know if you feel strongly about it @fxmarty
I agree that this might happen, but I think generally it's a nice-to-have, and if a lot of people complain we can still add a flag later. That way everybody will see the feature when they first use the evaluator, which is nice. Otherwise it might become a hidden treasure.
Thanks a lot for working on this @ola13 🚀 ! What @lvwerra says about treating the pipeline as a black box is close to what I had in mind with this feature. Ideally, we want to get a rough idea of how long it takes to feed the inputs through the pipeline (this line is the one to time), and computing them by default makes sense IMO (similar to how the
Happy to review this more closely when you've refactored the code a bit :)
Looks pretty good overall. Left a few comments.
The main point is figuring out a general way to wrap the performance measure. It is not great if we need to add it for every task. We can fix this in a subsequent PR but we should do it soon for new evaluators.
pred_label = [max(pred, key=lambda x: x["score"])["label"] for pred in predictions]
predictions = [label_mapping[pred] if label_mapping is not None else pred for pred in pred_label]

result = Evaluator._compute_time_perf(start_time, end_time, len(predictions))
Since we inherit, we can just do this, right?
- result = Evaluator._compute_time_perf(start_time, end_time, len(predictions))
+ result = self._compute_time_perf(start_time, end_time, len(predictions))
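For reference, a minimal sketch of what such a `_compute_time_perf` helper on the `Evaluator` base class could compute, using the latency/throughput definitions from this PR's description (the actual implementation may differ):

from time import perf_counter  # the caller obtains start_time/end_time via perf_counter()

def _compute_time_perf(self, start_time: float, end_time: float, num_samples: int) -> dict:
    # method on the Evaluator base class (sketch)
    # total wall-clock runtime of the pipeline call, in seconds
    latency = end_time - start_time
    # average number of samples processed per second
    throughput = num_samples / latency
    return {"latency": latency, "throughput": throughput}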
start_time = perf_counter()
predictions = pipe(data[input_column], truncation=True)
end_time = perf_counter()
We should probably refactor this such that the pipe call is wrapped in the base class and we don't need to implement the perf logic for every task.
In the parent class we would have something like this:
def compute_predictions_with_perf(self):
    start_time = perf_counter()
    predictions = self.compute_predictions()
    end_time = perf_counter()
    return predictions, self._compute_time_perf(start_time, end_time, len(predictions))
And in the task specific class we define the logic for the task:
def compute_predictions(self):
    return self.pipe(self.data[self.input_column], truncation=True)
Not 100% sure that's exactly the way to do it but I think this would make it easier for future additions. We don't need to tackle this in this PR necessarily but we should come up with a scalable solution next.
cc @fxmarty who's been working on other evaluators.
What do you think?
Looking at it again, I think what would make most sense right now is to only have `compute_predictions()` defined in child classes, while storing the whole `compute` logic in the base class without splitting into `compute` and `core_compute` - I'll go ahead and try it out, and will ping you to check it out after.
I have no strong opinion on this, but I think a context manager could make sense here. Something like:
with MeasureTime() as timing:
    # compute predictions here, not necessarily in a separate method?

results = {"latency": timing.latency, "throughput": timing.throughput}
Otherwise, what we currently have, or your suggestion Leandro, is fine as well.
I am not sure about having `compute()` in the parent class: how would custom cases be treated (like https://github.com/fxmarty/evaluate/blob/540956fb081cf909d29df02cb70dbab7db828b37/src/evaluate/evaluator/question_answering.py#L172-L178)? Redefine `compute()` in the child class?
The PR looks good otherwise!
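For illustration, a minimal sketch of such a context manager (hypothetical; here the caller passes the sample count up front so that both attributes are available once the block exits):

from time import perf_counter

class MeasureTime:
    # Hypothetical sketch of the suggested context manager: times the enclosed
    # block and derives latency (seconds) and throughput (samples/second).
    def __init__(self, num_samples: int):
        self.num_samples = num_samples
        self.latency = None
        self.throughput = None

    def __enter__(self):
        self._start = perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.latency = perf_counter() - self._start
        self.throughput = self.num_samples / self.latency
        return False  # never suppress exceptions

Usage would then be `with MeasureTime(len(data)) as timing:` followed by reading `timing.latency` and `timing.throughput`, close to the snippet above.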
tests/test_evaluator.py (Outdated)
model = AutoModelForSequenceClassification.from_pretrained(self.default_model)
tokenizer = AutoTokenizer.from_pretrained(self.default_model)
Besides the actual pipe loading tests, I would use the dummy pipeline to make sure the tests run fast. Otherwise we load distilbert n times, which is not great.
sounds good, I'll send the fix in a separate PR since it's not directly related to this one (unless you think I should just deal with it here?)
src/evaluate/evaluator/base.py (Outdated)
A `Dict`. The keys represent metric keys calculated for the `metric` specified in function arguments. For the
`"simple"` strategy, the value is the metric score. For the `"bootstrap"` strategy, the value is a `Dict`
containing the score, the confidence interval and the standard error calculated for each metric key.
Examples:
I am not sure about aggregating examples here, and more generally about the unified `compute()` documentation. For example, the default may be different depending on the subclass (`TextClassificationEvaluator`, `TokenClassificationEvaluator`). We may have different arguments for different subclass `compute()` methods, for example https://github.com/fxmarty/evaluate/blob/540956fb081cf909d29df02cb70dbab7db828b37/src/evaluate/evaluator/question_answering.py#L92-L93.
Maybe we could do something like:
@add_start_docstrings(COMPUTE_ARGS)
@add_start_docstrings(BOOTSTRAP_ARGS)
def compute(
    self,
    subclass_specific_arg1,
    subclass_specific_arg2,
    **kwargs,
) -> Tuple[Dict[str, float], Any]:
    """
    subclass-specific doc here
    """
    super().compute(...)  # or redefine compute here if necessary, if passing subclass-specific args to the parent's compute() is not defined
I feel like the parent's `compute()` will rapidly become irrelevant for specific subclasses as we add them, but I'm not sure. For question-answering it would not fit, I think.
Thanks @ola13 for working on this and @fxmarty for raising valid points! I think it's true that our current approach has its limits, so I tried to draw a diagram to get a clearer boundary of the things that we should do in the specific class vs. the base class.

The main point is that if we keep the objects that go into the pipeline/metric generic, the base class can wrap the pipeline call (and thus the timing) in one place:

def pipeline_call(self, pipeline_inputs):
    start = perf_counter()
    predictions = self.pipeline(**pipeline_inputs)
    end = perf_counter()
    self.result.update(self._compute_perf(...))
    return predictions

I think the `_compute` logic in the base class could then look something like this:

def _compute(self, ...):
    # args housekeeping here
    self.metric = self.prepare_metric(metric)
    self.pipeline = self.prepare_pipeline(pipeline)
    pipeline_inputs, metric_inputs = self.prepare_data(data, ...)
    pipeline_outputs = self.pipeline_call(pipeline_inputs)
    pipeline_outputs = self.process(pipeline_outputs)
    metric_inputs.update(pipeline_outputs)
    results = self.metric_call(metric_inputs)
    return results

That would allow for some customization if necessary in the subclass's `compute`:

def compute(self, ...):
    return self._compute(...)

What do you think about that proposal @ola13 and @fxmarty? Note that the functions are maybe a bit more complex, but I tried to reflect just the main logic.
Sounds good to me! I am just curious about the args to put here in the base `_compute()`. Something like

def compute(self, ..., question_column, answers_column):
    self.column_names = ["question_column", "answers_column"]
    self._compute(..., column_names, label_column)

and

def _compute(self, ...):
    ...
    pipeline_inputs, metric_inputs = self.prepare_data(data, label_column, column_names)

? As we are just checking that they are in the dataset? But this design is more with a
That's a good point. Since the args of compute might change slightly depending on the task, it might be easiest to just keep the
First, I love the graph <3 thanks for putting in the effort @lvwerra!
Now, my intuitions about polymorphism come from C++, so if there are major differences in how that should work in Python, please let me know. But with that in mind, I do think that putting
With that in mind we can have
Let me know what you think!
Thanks for your thoughts @ola13! From a rigorous perspective, what you say makes sense. I don't have a strong opinion about moving the
I am happy to put the
sounds good! Do you want to do that as part of #185 and then I can adapt this PR?
Thanks a lot for the changes! Just some minor comments, I think we are almost good to go! 🚀
src/evaluate/evaluator/base.py (Outdated)
start_time = perf_counter()
predictions = self.call_pipeline(pipe, pipe_inputs)
end_time = perf_counter()
What do you think about moving the timing inside the `call_pipeline` method? That would require less logic if one needs to rewrite the `compute` method.
- start_time = perf_counter()
- predictions = self.call_pipeline(pipe, pipe_inputs)
- end_time = perf_counter()
+ predictions, perf_stats = self.call_pipeline(pipe, pipe_inputs)
Sounds good! I was thinking about the case where `call_pipeline` would be overwritten, but I guess it makes more sense to have the pipe perf metrics along with calling the pipeline.
tests/test_evaluator.py (Outdated)
self.task = "text-classification" | ||
|
||
def __call__(self, text, **kwargs): | ||
sleep(2) |
Do we really need a second class? We could just add a flag to the main dummy class at init so we can enable sleep. I am also in favour of `sleep(0.1)` or something to make sure we don't slow down the tests too much.
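As a possible shape for that suggestion, a hypothetical single dummy pipeline with an optional sleep flag (class name, constructor argument, and exact output format are assumptions; the output here just mirrors the label/score post-processing quoted earlier):

from time import sleep

class DummyTextClassificationPipeline:
    # Hypothetical: one dummy class, with sleep enabled only for perf tests.
    def __init__(self, sleep_time: float = 0.0):
        self.task = "text-classification"
        self.sleep_time = sleep_time

    def __call__(self, text, **kwargs):
        if self.sleep_time:
            sleep(self.sleep_time)
        inputs = [text] if isinstance(text, str) else text
        # one list of {"label", "score"} candidates per input sample (assumed format)
        return [[{"label": "POSITIVE", "score": 1.0}] for _ in inputs]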
tests/test_evaluator.py (Outdated)
data = Dataset.from_dict({"label": [1, 0, 0], "text": ["great movie", "great movie", "horrible movie"]})

results = self.evaluator.compute(
    model_or_pipeline=self.model,
Shouldn't this test the perf dummy instead of the model?
tests/test_evaluator.py (Outdated)
    random_state=0,
)
self.assertAlmostEqual(results["accuracy"], 0.666666, 5)
self.assertTrue("latency" in results)
Since the dummy has a `sleep` - should we test that? Otherwise we don't need to sleep, right?
tests/test_evaluator.py (Outdated)
@@ -160,6 +168,48 @@ def test_bootstrap(self):
        self.assertAlmostEqual(results["accuracy"]["confidence_interval"][1], 0.68326, 5)
        self.assertAlmostEqual(results["accuracy"]["standard_error"], 0.24595, 5)

    def test_perf(self):
        data = Dataset.from_dict({"label": [1, 0, 0], "text": ["great movie", "great movie", "horrible movie"]})
Why can't we use `self.data` here?
tests/test_evaluator.py (Outdated)
self.assertEqual(results["accuracy"], 0) | ||
|
||
|
||
class TestEvaluatorPerf(TestCase): |
This seems a bit redundant with `test_perf`/`test_bootstrap_and_perf`? I think we can just keep those tests as part of the main test suite and don't need an extra test case for now. Wdyt? Then we don't need to define everything again.
LGTM! 🚀
Adding perf metrics to evaluate - if the user includes `perf` in the `Evaluator` strategy, we return two additional metrics:
- `latency` - total evaluation runtime in seconds
- `throughput` - average number of samples per second
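For illustration, a results dict with the perf metrics included might then look roughly like this (made-up values):

results = {
    "accuracy": 0.666666,  # metric score, as before
    "latency": 1.27,       # total evaluation runtime in seconds
    "throughput": 78.6,    # average number of samples per second
}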