Report on slow/stalled channel traffic #2175

Closed
8 tasks
greg-szabo opened this issue May 3, 2022 · 0 comments · Fixed by #2250
Labels: E: osmosis (External: related to Osmosis); I: CLI (Internal: related to the relayer's CLI); I: logic (Internal: related to the relaying logic); I: telemetry (Internal: related to Telemetry & metrics); O: usability (Objective: cause to improve the user experience (UX) and ease using the product)


Summary

User story: I need to better understand when a channel is being relayed properly. For monitoring and alerting purposes, I need to be able to query the oldest sequence number that is still in the queue for a specific channel and find out how old (date) the packet is.

Problem Definition

We need to better monitor whether a channel is being relayed properly. Out-of-band monitoring has the benefit of not relying on the technology that does the relaying, but it has the disadvantage that it has to describe the infrastructure and application setup yet again from scratch. For example, the Hermes config details the relationship among networks so well that if I say "channel-0" on the Osmosis network, everyone (including a program) understands exactly what that means (which endpoint represents it, which wallet I can use to manage it, etc.).

Implementing this (and subsequent monitoring-related) feature in Hermes takes advantage of the already existing configuration and library knowledge of endpoints. (Writing curl scripts to poll endpoint health is not fun, especially on gRPC.)

Including this and similar requests makes Hermes "ready with batteries" for production use, including monitoring assets. (Well, Prometheus endpoints or HTTP API calls or somesuch. The operator still needs to gather the data somewhere and present it, say, using Grafana.)

The disadvantage of this kind of feature is that it opens up a topic that is not strictly about IBC as a protocol but more about "IBC as a product used in servers". Personally, I think it shows the maturity of a project, but others might have differing opinions. This request is fairly specific, which might be good (when everyone needs it) or not so good (when it only serves one specific use case of an operator).

Proposal

There is a monitoring bot on Discord that essentially does something similar. The goal is to find out if a channel has "stuck" packets: we define "stuck" packets as packets that haven't been relayed for 5 minutes.
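To make the "stuck" definition concrete, here is a minimal sketch of the check I have in mind. The function and parameter names are hypothetical, and it assumes the submission time of the oldest unrelayed packet is already known; only the 5-minute threshold comes from the definition above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Threshold from the definition above: not relayed for 5 minutes.
STUCK_AFTER = timedelta(minutes=5)


def is_stuck(
    oldest_pending_since: Optional[datetime],
    now: datetime,
    threshold: timedelta = STUCK_AFTER,
) -> bool:
    """Return True when the oldest pending packet has waited at least `threshold`.

    `oldest_pending_since` is the (assumed known) submission time of the oldest
    packet still in the queue, or None when the queue is empty.
    """
    if oldest_pending_since is None:
        # Nothing is waiting, so the channel cannot be stuck.
        return False
    return now - oldest_pending_since >= threshold


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(is_stuck(now - timedelta(minutes=7), now))
```

The threshold is a parameter because the operator may want something other than 5 minutes.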

One implementation idea:
One or more Prometheus metrics for each channel configured in Hermes, displaying the oldest sequence number still in the queue on that channel as well as the submission date associated with that sequence number. (This needs an extra query to the channel.)
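As a rough illustration of what a scrape could return, the sketch below renders two gauges per channel in the Prometheus text exposition format. The metric names, label names, and the `ChannelBacklog` type are placeholders I made up for this example, not existing Hermes metrics.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ChannelBacklog:
    """Hypothetical per-channel snapshot; field names are illustrative."""
    chain: str
    channel: str
    oldest_sequence: int   # oldest sequence number still in the queue
    oldest_timestamp: int  # its submission time, in Unix seconds


def render_metrics(backlogs: Iterable[ChannelBacklog]) -> str:
    """Render the snapshots as Prometheus text exposition lines."""
    items = list(backlogs)
    # Placeholder metric names mapped to the field each one reports.
    metrics = (
        ("oldest_pending_sequence", "oldest_sequence"),
        ("oldest_pending_timestamp_seconds", "oldest_timestamp"),
    )
    lines = []
    for name, field in metrics:
        # Samples of one metric are kept together under its TYPE line.
        lines.append(f"# TYPE {name} gauge")
        for b in items:
            labels = f'chain="{b.chain}",channel="{b.channel}"'
            lines.append(f"{name}{{{labels}}} {getattr(b, field)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metrics([ChannelBacklog("osmosis-1", "channel-0", 42, 1651579200)]))
```

Exposing the timestamp as a gauge lets the monitoring side compute the packet's age itself, so Hermes would not need to know the alerting threshold.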

Any monitoring tool could pick this up and alert on it every 5 minutes (or at whatever interval the operator configures).

Alternatively, if this doesn't fit the Prometheus metrics specs, it could be an HTTP web API call that responds with the data in a JSON object. Somehow, I feel Prometheus should fit here, but we're open to other implementations (even a CLI command, if necessary). The one implementation that doesn't work for us is plugging this data into the log file. The data has to be independently queryable, mostly separated from the current operational state of Hermes. (As mentioned in the disadvantages.)
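If the JSON route is taken, the response could look something like the sketch below. The field names and the function are hypothetical; this only illustrates the shape of the data, not an existing Hermes endpoint.

```python
import json
from typing import Optional


def stuck_state_response(
    chain: str,
    channel: str,
    oldest_sequence: Optional[int],
    oldest_timestamp: Optional[int],
) -> str:
    """Build a JSON body describing the oldest unrelayed packet on a channel.

    All field names are illustrative. Both `oldest_*` values are None
    (JSON null) when nothing is waiting in the queue.
    """
    body = {
        "chain": chain,
        "channel": channel,
        "oldest_sequence": oldest_sequence,
        "oldest_timestamp": oldest_timestamp,  # Unix seconds
    }
    return json.dumps(body, sort_keys=True)


if __name__ == "__main__":
    print(stuck_state_response("osmosis-1", "channel-0", 42, 1651579200))
```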

Acceptance Criteria

  • There is a way to query the "stuck state" of a channel.
  • Alternatively: the backlog in a channel is exposed in telemetry, highlighting the timestamp of the oldest unrelayed packet.

For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate milestone (priority) applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@adizere adizere added this to the v1.0.0 milestone May 10, 2022
@adizere adizere added I: CLI Internal: related to the relayer's CLI I: logic Internal: related to the relaying logic O: usability Objective: cause to improve the user experience (UX) and ease using the product E: osmosis External: related to Osmosis labels May 10, 2022
@adizere adizere added the P-high label May 27, 2022
@adizere adizere self-assigned this May 27, 2022
@adizere adizere assigned ljoss17 and unassigned adizere Jun 28, 2022
@romac romac added the I: telemetry Internal: related to Telemetry & metrics label Jul 11, 2022
Status: Closed
4 participants