Add worker_queue_request, batch_queue and worker_processes list metrics #826
Conversation
Hey @alvarorsant,

Thanks for picking this effort back up 👍

I've added a couple of comments below. Overall, I feel we should keep these changes small and focus just on tracking the request queues.

Besides that and the review comments, could you also add a couple of tests to ensure the new metrics get exposed correctly? You can model these off the tests under `tests.metrics`, like the following one from `MLServer/tests/metrics/test_rest.py` (lines 19 to 45 in `f5297d4`):
```python
async def test_rest_metrics(
    metrics_client: MetricsClient,
    rest_client: RESTClient,
    inference_request: InferenceRequest,
    sum_model: MLModel,
):
    await rest_client.wait_until_ready()

    metric_name = "rest_server_requests"

    # Get metrics for the REST server before sending any requests
    metrics = await metrics_client.metrics()
    rest_server_requests = find_metric(metrics, metric_name)
    assert rest_server_requests is None

    expected_handled = 5
    await asyncio.gather(
        *[
            rest_client.infer(sum_model.name, inference_request)
            for _ in range(expected_handled)
        ]
    )

    metrics = await metrics_client.metrics()
    rest_server_requests = find_metric(metrics, metric_name)
    assert rest_server_requests is not None
    assert len(rest_server_requests.samples) == 1
    assert rest_server_requests.samples[0].value == expected_handled
```
```diff
-        self._batching_task = schedule_with_callback(
-            self._batcher(), self._batching_task_callback
-        )
+        self._batching_task = asyncio.create_task(self._batcher())
```
What was the reason for replacing the `schedule_with_callback` call? Under the hood, `schedule_with_callback` does the same thing; it's essentially a helper to avoid duplicating these two lines.
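For context, here is a minimal sketch of what a helper like `schedule_with_callback` presumably wraps, assuming (as the comment above implies) that it pairs task creation with a done-callback; the signature is illustrative, not copied from MLServer's source:

```python
import asyncio
from typing import Any, Callable, Coroutine


# Hypothetical sketch: the helper bundles the two lines that would
# otherwise be duplicated at every call site.
def schedule_with_callback(
    coro: Coroutine, cb: Callable[[asyncio.Task], Any]
) -> asyncio.Task:
    # Schedule the coroutine on the running event loop...
    task = asyncio.create_task(coro)
    # ...and register a callback that runs once the task finishes
    # (e.g. to surface exceptions raised in the background task).
    task.add_done_callback(cb)
    return task
```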
I didn't touch that part; it must be a matter of my branch needing an update against master.
Is there any chance this was a leftover from the previous PR? Either way, could you try to rebase from `master` to see if it goes away?
```python
def _workers_processes_monitor(self, loop: AbstractEventLoop):
    process_request_count.observe(float(len(asyncio.all_tasks(loop))))
```
This `loop` would only represent the main process' loop, so it wouldn't capture the workers' queued asyncio tasks.

For this first iteration, I'd keep it simple and focus just on tracking the queue, as in the sketch below.
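As a rough illustration of what "tracking the queue" could look like, here is a hedged sketch using a `prometheus_client` Gauge; the metric name and the helper are assumptions for illustration, not the actual names in this PR:

```python
from prometheus_client import Gauge

# Assumed metric name, for illustration only
parallel_request_queue_size = Gauge(
    "parallel_request_queue_size",
    "Number of inference requests currently waiting in the workers' queue",
)


def sample_queue_size(queue) -> None:
    # Called whenever a request is enqueued or dequeued, so the gauge
    # reflects the current backlog rather than the main loop's
    # asyncio tasks.
    parallel_request_queue_size.set(queue.qsize())
```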
So I will focus only on the queued requests in this PR, and I'll add a couple of tests.
Regarding the tests, I noticed you mount an aiohttp server in a fixture. However, I'd need to test the metrics inside the code in the `mlserver` folder, and I don't know how to fit real metrics into a mocked server.
Wouldn't it be the same as in the current metrics tests? Those spin up an actual MLServer instance, send a couple of requests, and then scrape the metrics endpoint to compare the metrics before and after (asserting that they've increased).
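For instance, a test for the new queue metric could follow the same before/after pattern as `test_rest_metrics` above, reusing the same fixtures; the metric name and the exact assertions here are assumptions, not the final ones for this PR:

```python
import asyncio


async def test_queue_request_metric(
    metrics_client: MetricsClient,
    rest_client: RESTClient,
    inference_request: InferenceRequest,
    sum_model: MLModel,
):
    await rest_client.wait_until_ready()
    metric_name = "batch_request_queue"  # assumed metric name

    # Send a handful of requests so the queue metric gets sampled
    expected_handled = 5
    await asyncio.gather(
        *[
            rest_client.infer(sum_model.name, inference_request)
            for _ in range(expected_handled)
        ]
    )

    # Scrape the metrics endpoint and assert the new metric is exposed
    metrics = await metrics_client.metrics()
    queue_metric = find_metric(metrics, metric_name)
    assert queue_metric is not None
```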
Closing in favour of #860.