Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Health Monitor continues to log after errors #2523

Merged
merged 1 commit into from
May 21, 2024

Conversation

cunnie
Copy link
Member

@cunnie cunnie commented May 21, 2024

When health_monitor is sending metrics to, for example, graphite, and there is a network error (e.g. Errno::EPIPE), it stops sending metrics until health_monitor is restarted.

This was a regression due to replacing the EventMachine gem with the Async gem.

This commit fixes the regression by, when a network error occurs, attempting to re-establish the connection and continue sending data.

The attempt to re-establish the connection follows the same retry & backoff logic as when establishing the initial connection.

Note: we feel the method unbind is poorly named; it should be named close_old_and_open_new_connection.

[fixes #2522]

[#187636407]

What is this change about?

See commit message.

Please provide contextual information.

See #2522.

What tests have you run against this PR?

be rake spec:unit:parallel

How should this change be described in bosh release notes?

Health Monitor attempts to re-establish plugins' network connections when disconnected.

Does this PR introduce a breaking change?

No.

Tag your pair, your PM, and/or team!

@mingxiao @cunnie

When health_monitor is sending metrics to, for example, graphite, and
there is a network error (e.g. `Errno::EPIPE`), it stops sending metrics
until health_monitor is restarted.

This was a regression due to replacing the `EventMachine` gem with the `Async`
gem.

This commit fixes the regression by, when a network error occurs,
attempting to re-establish the connection and continue sending data.

The attempt to re-establish the connection follows the same retry &
backoff logic as when establishing the initial connection.

Note: we feel the method `unbind` is poorly named; it should be named
`close_old_and_open_new_connection`.

[fixes #2522]

[#187636407]

Signed-off-by: Brian Cunnie <brian.cunnie@broadcom.com>
Copy link
Contributor

@klakin-pivotal klakin-pivotal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change seems reasonable to me.

However, after some back-channel communication with the authors, I must note that neither I nor they know exactly how EventMachine handled failures, so (while it seems like a reasonable change) we're not sure that this is a workalike replacement.

@cunnie cunnie merged commit 82ef523 into main May 21, 2024
4 checks passed
@ystros ystros deleted the health-monitor-resilient-logs branch May 29, 2024 14:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging this pull request may close these issues.

Health_Monitor stop sending logs
3 participants