Health_Monitor stop sending logs #2522
When health_monitor is sending metrics to, for example, graphite, and a network error occurs (e.g. `Errno::EPIPE`), it stops sending metrics until health_monitor is restarted. This is a regression introduced when the `EventMachine` gem was replaced with the `Async` gem. This commit fixes the regression: when a network error occurs, health_monitor attempts to re-establish the connection and continue sending data. The reconnection attempt follows the same retry & backoff logic as establishing the initial connection.

Note: we feel the method `unbind` is poorly named; it should be named `close_old_and_open_new_connection`.

[fixes #2522]
[#187636407]

Signed-off-by: Brian Cunnie <brian.cunnie@broadcom.com>
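The retry & backoff reconnection described above can be sketched as follows. This is a hypothetical illustration, not the actual bosh-monitor code: the method names (`connect_with_backoff`, `send_data_with_reconnect`), the retry limit, and the delay constants are all assumptions.

```ruby
require "socket"

MAX_RETRIES = 5
BASE_DELAY  = 1 # seconds

# Establish a TCP connection, retrying with exponential backoff on failure.
def connect_with_backoff(host, port)
  attempt = 0
  begin
    TCPSocket.new(host, port)
  rescue SystemCallError => e
    attempt += 1
    raise if attempt > MAX_RETRIES
    delay = BASE_DELAY * (2**(attempt - 1)) # 1s, 2s, 4s, ...
    warn "connect to #{host}:#{port} failed (#{e.class}), retrying in #{delay}s"
    sleep delay
    retry
  end
end

# Write a payload; on a mid-stream network error, close the dead socket,
# reconnect using the same backoff logic as the initial connection, and
# resend. Returns the (possibly new) socket.
def send_data_with_reconnect(sock, host, port, payload)
  sock.write(payload)
  sock
rescue Errno::EPIPE, Errno::ECONNRESET
  sock.close rescue nil
  sock = connect_with_backoff(host, port)
  sock.write(payload)
  sock
end
```

The key design point is that a failed `write` no longer kills the sending task: the error is rescued, and the connection is rebuilt instead of being abandoned.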
@benjaminguttmann-avtq, does v280.0.24 fix the issue for you? We had some tests using
Just kidding: it seems 280.0.24 is even more broken, because a change in the async gem means it no longer requires async-io. We're adding a new require to fix that, and hopefully everything works then.
Describe the bug
Health_Monitor stops sending out metrics to graphite. After a restart of health_monitor it starts working again.
To Reproduce
There is no clear path to reproduce right now.
Expected behavior
I expect that health_monitor sends metrics to the configured graphite plugin.
Logs
```
{"time":"2024-05-16T10:46:51+00:00","severity":"warn","oid":761205800,"pid":6,"subject":"Async::Task","message":["Task may have ended with unhandled exception.","Broken pipe"],"error":{"kind":"Errno::EPIPE","message":"Broken pipe","stack":"
/var/vcap/data/packages/director-ruby-3.2/635824a05f62896ff4d7c9159343a670568968b3/lib/ruby/3.2.0/socket.rb:460:in `__write_nonblock'
/var/vcap/data/packages/director-ruby-3.2/635824a05f62896ff4d7c9159343a670568968b3/lib/ruby/3.2.0/socket.rb:460:in `write_nonblock'
/var/vcap/data/packages/health_monitor/27e7911210889905f3292fbc6097cd9d827b24ca/gem_home/ruby/3.2.0/gems/async-io-1.42.1/lib/async/io/generic.rb:201:in `async_send'
/var/vcap/data/packages/health_monitor/27e7911210889905f3292fbc6097cd9d827b24ca/gem_home/ruby/3.2.0/gems/async-io-1.42.1/lib/async/io/generic.rb:47:in `block in wrap_blocking_method'
/var/vcap/data/packages/health_monitor/27e7911210889905f3292fbc6097cd9d827b24ca/gem_home/ruby/3.2.0/gems/async-io-1.42.1/lib/async/io/generic.rb:141:in `write'
/var/vcap/data/packages/health_monitor/27e7911210889905f3292fbc6097cd9d827b24ca/gem_home/ruby/3.2.0/gems/bosh-monitor-0.0.0/lib/bosh/monitor/protocols/tcp_connection.rb:20:in `send_data'
/var/vcap/data/packages/health_monitor/27e7911210889905f3292fbc6097cd9d827b24ca/gem_home/ruby/3.2.0/gems/bosh-monitor-0.0.0/lib/bosh/monitor/protocols/graphite_connection.rb:11:in `send_metric'
/var/vcap/data/packages/health_monitor/27e7911210889905f3292fbc6097cd9d827b24ca/gem_home/ruby/3.2.0/gems/bosh-monitor-0.0.0/lib/bosh/monitor/plugins/graphite.rb:39:in `block (2 levels) in process'
/var/vcap/data/packages/health_monitor/27e7911210889905f3292fbc6097cd9d827b24ca/gem_home/ruby/3.2.0/gems/async-2.10.2/lib/async/semaphore.rb:68:in `block in async'
/var/vcap/data/packages/health_monitor/27e7911210889905f3292fbc6097cd9d827b24ca/gem_home/ruby/3.2.0/gems/async-2.10.2/lib/async/task.rb:163:in `block in run'
/var/vcap/data/packages/health_monitor/27e7911210889905f3292fbc6097cd9d827b24ca/gem_home/ruby/3.2.0/gems/async-2.10.2/lib/async/task.rb:376:in `block in schedule'
"}}
```
Versions (please complete the following information):
Deployment info:
If possible, share your (redacted) manifest and any ops files used to deploy
We are using bosh-deployment to deploy the director.
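For context on the `Errno::EPIPE` in the logs above: Ruby raises it when a process writes to a TCP socket whose peer has already closed the connection. A minimal, self-contained reproduction (this is an illustration on loopback, not health_monitor code; the metric line is a made-up graphite-style payload):

```ruby
require "socket"

server = TCPServer.new("127.0.0.1", 0)
client = TCPSocket.new("127.0.0.1", server.addr[1])
server.accept.close # the peer closes the connection, as a broken graphite link would

error = nil
begin
  # The first write often succeeds (the kernel buffers it); a later write
  # fails once the closed connection (RST) has been observed.
  5.times do
    client.write("stats.some.metric 1 1715857133\n")
    sleep 0.05
  end
rescue Errno::EPIPE, Errno::ECONNRESET => e
  error = e
end
puts error.class
```

Before the fix, an exception like this propagated out of the sending task, which is why metrics stopped flowing until the process was restarted.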
Additional context
There is an ongoing Slack discussion about the problem:
https://cloudfoundry.slack.com/archives/C02HPPYQ2/p1715857133644269