
Add worker_queue_request, batch_queue and worker_processes list metrics #826

Closed

Conversation

@alvarorsant (Contributor):

Add worker_queue_request, batch_queue and worker_processes_list metrics

@adriangonz (Contributor) left a comment:

Hey @alvarorsant,

Thanks for picking this effort back up 👍

I've added a couple of comments below. Overall, I feel we should keep these changes small and focus just on tracking the request queues.

On top of the review comments, could you also add a couple of tests to ensure the new metrics get exposed correctly? You can model these on the tests under tests.metrics, like the following one:

```python
async def test_rest_metrics(
    metrics_client: MetricsClient,
    rest_client: RESTClient,
    inference_request: InferenceRequest,
    sum_model: MLModel,
):
    await rest_client.wait_until_ready()
    metric_name = "rest_server_requests"

    # Get metrics for the REST server before sending any requests
    metrics = await metrics_client.metrics()
    rest_server_requests = find_metric(metrics, metric_name)
    assert rest_server_requests is None

    expected_handled = 5
    await asyncio.gather(
        *[
            rest_client.infer(sum_model.name, inference_request)
            for _ in range(expected_handled)
        ]
    )

    metrics = await metrics_client.metrics()
    rest_server_requests = find_metric(metrics, metric_name)
    assert rest_server_requests is not None
    assert len(rest_server_requests.samples) == 1
    assert rest_server_requests.samples[0].value == expected_handled
```
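
For reference, a helper like find_metric only needs to scan the parsed metric families by name. A minimal sketch, assuming the metrics client returns prometheus_client Metric objects (the actual helper in MLServer's test utilities may differ):

```python
from typing import List, Optional

from prometheus_client.core import Metric


def find_metric(metrics: List[Metric], metric_name: str) -> Optional[Metric]:
    # Return the first metric family with a matching name, or None if absent
    for metric in metrics:
        if metric.name == metric_name:
            return metric
    return None
```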

```diff
-        self._batching_task = schedule_with_callback(
-            self._batcher(), self._batching_task_callback
-        )
+        self._batching_task = asyncio.create_task(self._batcher())
```
@adriangonz (Contributor):

What was the reason for replacing the schedule_with_callback call? Under the hood, schedule_with_callback does essentially the same thing; it's a helper to avoid duplicating these two lines.
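
For context, a minimal sketch of what a helper like schedule_with_callback boils down to (the actual signature in mlserver.utils may differ):

```python
import asyncio
from typing import Any, Callable, Coroutine


def schedule_with_callback(
    coro: Coroutine[Any, Any, Any],
    cb: Callable[[asyncio.Task], None],
) -> asyncio.Task:
    # Wrap the coroutine in a task and register the callback to fire on completion
    task = asyncio.create_task(coro)
    task.add_done_callback(cb)
    return task
```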

@alvarorsant (Contributor, Author):

I didn't touch that part. It must be a matter of the branch not being up to date with master.

@adriangonz (Contributor):

Is there any chance this was a leftover from the previous PR? Either way, could you try to rebase from master to see if it goes away?

Comment on lines +119 to +120
```python
def _workers_processes_monitor(self, loop: AbstractEventLoop):
    process_request_count.observe(float(len(asyncio.all_tasks(loop))))
```
@adriangonz (Contributor):

This loop would only represent the main process's event loop, so it wouldn't capture the workers' queued asyncio tasks.

For this first iteration, I'd keep it simple and focus just on tracking the queue.
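
For illustration, tracking the queue directly can be as simple as updating a gauge whenever the queue changes. A sketch using prometheus_client, where the metric name and the enqueue_request helper are illustrative rather than MLServer's actual API:

```python
import asyncio

from prometheus_client import Gauge

# Illustrative metric name; not necessarily what this PR ends up exposing
batch_queue_size = Gauge(
    "batch_queue_requests", "Number of requests waiting in the batch queue"
)


async def enqueue_request(queue: asyncio.Queue, request) -> None:
    # Update the gauge alongside every queue mutation
    await queue.put(request)
    batch_queue_size.set(queue.qsize())
```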

@alvarorsant (Contributor, Author):

So I will focus only on the queued requests in this PR. I will also add a couple of tests.

@alvarorsant (Contributor, Author):

Regarding the tests, I noticed you mount an aiohttp server in a fixture. However, I'd need to test the metrics inside the code in the mlserver folder, and I don't know how to fit real metrics into a mocked server.

@adriangonz (Contributor):

Wouldn't it be the same as in the current metrics tests? Those spin up an actual MLServer instance, send a couple of requests, and then scrape the metrics endpoint to compare the metrics before and after (and assert that they've increased).
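
In other words, such a test only needs to scrape the /metrics endpoint before and after sending traffic. A minimal sketch using aiohttp and prometheus_client's text parser, where the URL and port are assumptions rather than MLServer's actual defaults:

```python
import aiohttp
from prometheus_client.parser import text_string_to_metric_families


async def scrape_metrics(url: str = "http://localhost:8082/metrics") -> list:
    # Fetch the Prometheus exposition text and parse it into metric families
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            text = await response.text()
    return list(text_string_to_metric_families(text))
```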

@adriangonz (Contributor):

Closing in favour of #860.

@adriangonz closed this on Dec 7, 2022.