Health_Monitor stop sending logs #2522

Closed
benjaminguttmann-avtq opened this issue May 17, 2024 · 2 comments · Fixed by #2523

@benjaminguttmann-avtq

Describe the bug
Health_Monitor stops sending metrics to Graphite. After health_monitor is restarted, it starts working again.

To Reproduce
There is no clear path to reproduce right now.

Expected behavior
I expect that health_monitor sends metrics to the configured graphite plugin.

Logs

{"time":"2024-05-16T10:46:51+00:00","severity":"warn","oid":761205800,"pid":6,"subject":"Async::Task","message":["Task may have ended with unhandled exception.","Broken pipe"],"error":{"kind":"Errno::EPIPE","message":"Broken pipe","stack":"/var/vcap/data/packages/director-ruby-3.2/635824a05f62896ff4d7c9159343a670568968b3/lib/ruby/3.2.0/socket.rb:460:in __write_nonblock'\n/var/vcap/data/packages/director-ruby-3.2/635824a05f62896ff4d7c9159343a670568968b3/lib/ruby/3.2.0/socket.rb:460:in write_nonblock'\n/var/vcap/data/packages/health_monitor/27e7911210889905f3292fbc6097cd9d827b24ca/gem_home/ruby/3.2.0/gems/async-io-1.42.1/lib/async/io/generic.rb:201:in async_send'\n/var/vcap/data/packages/health_monitor/27e7911210889905f3292fbc6097cd9d827b24ca/gem_home/ruby/3.2.0/gems/async-io-1.42.1/lib/async/io/generic.rb:47:in block in wrap_blocking_method'\n/var/vcap/data/packages/health_monitor/27e7911210889905f3292fbc6097cd9d827b24ca/gem_home/ruby/3.2.0/gems/async-io-1.42.1/lib/async/io/generic.rb:141:in write'\n/var/vcap/data/packages/health_monitor/27e7911210889905f3292fbc6097cd9d827b24ca/gem_home/ruby/3.2.0/gems/bosh-monitor-0.0.0/lib/bosh/monitor/protocols/tcp_connection.rb:20:in send_data'\n/var/vcap/data/packages/health_monitor/27e7911210889905f3292fbc6097cd9d827b24ca/gem_home/ruby/3.2.0/gems/bosh-monitor-0.0.0/lib/bosh/monitor/protocols/graphite_connection.rb:11:in send_metric'\n/var/vcap/data/packages/health_monitor/27e7911210889905f3292fbc6097cd9d827b24ca/gem_home/ruby/3.2.0/gems/bosh-monitor-0.0.0/lib/bosh/monitor/plugins/graphite.rb:39:in block (2 levels) in process'\n/var/vcap/data/packages/health_monitor/27e7911210889905f3292fbc6097cd9d827b24ca/gem_home/ruby/3.2.0/gems/async-2.10.2/lib/async/semaphore.rb:68:in block in async'\n/var/vcap/data/packages/health_monitor/27e7911210889905f3292fbc6097cd9d827b24ca/gem_home/ruby/3.2.0/gems/async-2.10.2/lib/async/task.rb:163:in block in run'\n/var/vcap/data/packages/health_monitor/27e7911210889905f3292fbc6097cd9d827b24ca/gem_home/ruby/3.2.0/gems/async-2.10.2/lib/async/task.rb:376:in block in schedule'\n"}}

Versions (please complete the following information):

  • Infrastructure: AWS
  • BOSH version: 280.0.23
  • BOSH CLI version: 7.4.1
  • Stemcell version: ubuntu-jammy/1.423

Deployment info:
If possible, share your (redacted) manifest and any ops files used to deploy
We are using bosh-deployment to deploy the director.

Additional context
There is an ongoing Slack discussion about the problem:

https://cloudfoundry.slack.com/archives/C02HPPYQ2/p1715857133644269

cunnie pushed a commit that referenced this issue May 21, 2024
When health_monitor is sending metrics to, for example, graphite, and
there is a network error (e.g. `Errno::EPIPE`), it stops sending metrics
until health_monitor is restarted.

This was a regression due to replacing the `EventMachine` gem with the `Async`
gem.

This commit fixes the regression: when a network error occurs, health_monitor
attempts to re-establish the connection and continues sending data.

The attempt to re-establish the connection follows the same retry &
backoff logic as when establishing the initial connection.

Note: we feel the method `unbind` is poorly named; it should be named
`close_old_and_open_new_connection`.

[fixes #2522]

[#187636407]

Signed-off-by: Brian Cunnie <brian.cunnie@broadcom.com>
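
The reconnect behavior described in the commit message above can be illustrated with a minimal sketch. This is not the bosh-monitor code: it uses plain blocking `TCPSocket` from the Ruby standard library rather than the `Async` gem, and the class name `ReconnectingSender` and the parameters `max_retries` and `base_delay` are hypothetical names chosen for illustration. It only shows the general idea of closing the old connection, opening a new one with retry and exponential backoff, and then resending.

```ruby
require "socket"

# Hypothetical illustration of "reconnect on network error, then keep sending".
# Not the bosh-monitor implementation; names and defaults are assumptions.
class ReconnectingSender
  NETWORK_ERRORS = [Errno::EPIPE, Errno::ECONNRESET, Errno::ECONNREFUSED, IOError].freeze

  def initialize(host, port, max_retries: 5, base_delay: 0.1)
    @host = host
    @port = port
    @max_retries = max_retries
    @base_delay = base_delay
    @socket = nil
  end

  # Writes data; on a network error, reconnects and retries the write once.
  def send_data(data)
    connect if @socket.nil?
    begin
      @socket.write(data)
    rescue *NETWORK_ERRORS
      reconnect
      @socket.write(data)
    end
  end

  def close
    @socket&.close
    @socket = nil
  end

  private

  # Close the old connection and open a new one.
  def reconnect
    close
    connect
  end

  # Opens the connection, retrying with exponential backoff.
  # The same path is used for the initial connection and for reconnects.
  def connect
    attempts = 0
    begin
      @socket = TCPSocket.new(@host, @port)
    rescue *NETWORK_ERRORS
      attempts += 1
      raise if attempts > @max_retries
      sleep(@base_delay * (2**(attempts - 1)))
      retry
    end
  end
end

# Usage (assumes something is listening on the given host and port):
#   sender = ReconnectingSender.new("127.0.0.1", 2003)
#   sender.send_data("example.metric 1 1715857133\n")
```

One caveat of this simplified approach: with TCP, a write issued after the peer has gone away can appear to succeed before a later write fails, so some data may still be lost even with reconnection in place.
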
@jpalermo
Member

@benjaminguttmann-avtq, does v280.0.24 fix the issue for you?

We had some tests that used nc to reproduce the problem and verify the fix, but for some reason those tests no longer work at all with 280.0.24. We think this might be related to a bump in the async gems, but we aren't totally sure.
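
For readers who want to see the failure mode from the original log outside of health_monitor, here is a rough Ruby stand-in for a listener that hangs up. It is not the nc-based test mentioned above, which is not shown in this thread; the helper name `provoke_write_error` and its parameters are hypothetical. A local server accepts one connection and closes it immediately, and the client keeps writing until the write fails.

```ruby
require "socket"

# Hypothetical helper: returns the exception class raised once writes to a
# peer that has already closed the connection start failing, or nil if no
# write failed within max_writes attempts.
def provoke_write_error(max_writes: 50, pause: 0.05)
  server = TCPServer.new("127.0.0.1", 0) # port 0 lets the OS pick a free port
  port = server.addr[1]
  acceptor = Thread.new { server.accept.close } # peer accepts, then hangs up

  client = TCPSocket.new("127.0.0.1", port)
  acceptor.join
  max_writes.times do
    # Graphite plaintext format: "<metric path> <value> <timestamp>\n"
    client.write("example.metric 1 0\n")
    sleep(pause) # give the peer's reset time to arrive before the next write
  end
  nil
rescue Errno::EPIPE, Errno::ECONNRESET => e
  e.class
ensure
  client&.close
  server&.close
end

# Usage:
#   puts provoke_write_error.inspect
```

Depending on platform and timing, the failure may surface as `Errno::EPIPE` (as in the log above) or as `Errno::ECONNRESET`, which is why the sketch rescues both.
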

@jpalermo
Member

Just kidding. It seems like 280.0.24 is even more broken because of a change in the async gem, which no longer requires async-io. We're adding a new require to fix that, and hopefully everything will be fixed then.
