Health Monitor continues to log after errors #2523

cunnie · 2024-05-21T23:01:35Z

When health_monitor is sending metrics to a backend such as Graphite and a network error occurs (e.g. `Errno::EPIPE`), it stops sending metrics until health_monitor is restarted.

This is a regression introduced when the EventMachine gem was replaced with the Async gem.

This commit fixes the regression: when a network error occurs, health_monitor now attempts to re-establish the connection and continue sending data.

The attempt to re-establish the connection follows the same retry & backoff logic as when establishing the initial connection.
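
For readers who have not opened the diff, the behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class `ResilientSender`, the `connector` callable, the `sleeper` hook, the error list, and the retry count and delay values are all hypothetical, and resending the failed payload once after reconnecting is an assumption of the sketch.

```ruby
# Hypothetical sketch only; names and defaults are NOT taken from health_monitor.
class ResilientSender
  # Errors treated as "the connection is gone" (an assumed list).
  NETWORK_ERRORS = [Errno::EPIPE, Errno::ECONNRESET, Errno::ECONNREFUSED, IOError].freeze

  # connector: callable returning an object that responds to #write and #close
  # max_retries, base_delay: illustrative values, not the real configuration
  # sleeper: injectable so tests do not actually sleep
  def initialize(connector, max_retries: 5, base_delay: 0.1, sleeper: ->(s) { sleep(s) })
    @connector = connector
    @max_retries = max_retries
    @base_delay = base_delay
    @sleeper = sleeper
    @connection = connect_with_backoff
  end

  # Write data; on a network error, reconnect and try the same payload once more.
  def send_data(data)
    @connection.write(data)
  rescue *NETWORK_ERRORS
    close_old_and_open_new_connection
    @connection.write(data)
  end

  private

  # Plays the role the commit message attributes to `unbind`.
  def close_old_and_open_new_connection
    begin
      @connection&.close
    rescue StandardError
      nil # the old socket is already broken; ignore errors while closing it
    end
    @connection = connect_with_backoff
  end

  # Same retry and backoff path for the initial connection and for reconnects.
  def connect_with_backoff
    attempts = 0
    begin
      @connector.call
    rescue *NETWORK_ERRORS
      attempts += 1
      raise if attempts > @max_retries
      @sleeper.call(@base_delay * (2**(attempts - 1))) # exponential backoff
      retry
    end
  end
end
```

Routing both the initial connection and the reconnect through `connect_with_backoff` is what keeps the two paths on the same retry and backoff logic. For a TCP backend, usage might look like `ResilientSender.new(-> { TCPSocket.new(host, port) })` after `require "socket"`, with `host` and `port` supplied by the caller.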

Note: we feel the method `unbind` is poorly named; a more accurate name would be `close_old_and_open_new_connection`.

[fixes #2522]

[#187636407]

What is this change about?

See commit message.

Please provide contextual information.

See #2522.

What tests have you run against this PR?

be rake spec:unit:parallel

How should this change be described in bosh release notes?

Health Monitor attempts to re-establish plugins' network connections when disconnected.

Does this PR introduce a breaking change?

No.

Tag your pair, your PM, and/or team!

@mingxiao @cunnie

Signed-off-by: Brian Cunnie <brian.cunnie@broadcom.com>

klakin-pivotal

The change seems reasonable to me.

However, after some back-channel communication with the authors, I must note that neither I nor they know exactly how EventMachine handled failures, so (while it seems like a reasonable change) we're not sure that this is a workalike replacement.

cunnie requested a review from klakin-pivotal May 21, 2024 23:02

klakin-pivotal approved these changes May 21, 2024


cunnie merged commit 82ef523 into main May 21, 2024
4 checks passed

ystros deleted the health-monitor-resilient-logs branch May 29, 2024 14:15
