Report on slow/stalled channel traffic #2175

greg-szabo · 2022-05-03T23:43:42Z

Summary

User story: I need to better understand when a channel is being relayed properly. For monitoring and alerting purposes, I need to be able to query the oldest sequence number that is still in the queue for a specific channel and find out how old (date) the packet is.

Problem Definition

We need to better monitor whether a channel is being relayed properly. Out-of-band monitoring has the benefit of not relying on the technology that does the relaying, but it has the disadvantage that the infrastructure and application setup has to be described again from scratch. For example, the Hermes config details the relationship among networks so well that if I say "channel-0" on the Osmosis network, everyone (including a program) understands exactly what that means (which endpoint represents it, what wallet I can use to manage it, etc.).

Implementing this (and subsequent monitoring-related) feature in Hermes takes advantage of the existing configuration and library knowledge of endpoints. (Writing curl scripts to poll endpoint health is not fun, especially over gRPC.)

Including this and similar requests makes Hermes "ready with batteries" for production use, including monitoring assets. (Well, Prometheus endpoints or HTTP API calls or somesuch. The operator still needs to gather the data somewhere and present it, say, using Grafana.)

A disadvantage of this kind of feature is that it opens up a topic that is not strictly about IBC as a protocol but more about "IBC as a product used in servers". Personally, I think it shows the maturity of a project, but others might have differing opinions. This request is fairly specific, which might be good (if everyone needs it) or not so good (if it only serves one operator's specific use case).

Proposal

There is a monitoring bot on Discord that essentially does something similar. The goal is to find out if a channel has "stuck" packets: we define "stuck" packets as packets that haven't been relayed for 5 minutes.

One implementation idea:
One or more Prometheus metrics for each channel configured in Hermes, showing the oldest sequence number still in the queue on that channel as well as the submission date associated with that sequence number. (This needs an extra query to the channel.)

Any monitoring tool could pick this up and alert on it every 5 minutes (or at whatever interval the operator configures).
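To make the idea concrete, here is a minimal sketch of how the per-channel data and the 5-minute "stuck" rule could fit together. The metric names (`oldest_pending_sequence`, `oldest_pending_timestamp_seconds`), the `OldestPending` type, and the label names are assumptions made for illustration, not existing Hermes metrics.

```python
from dataclasses import dataclass
from typing import Optional

STUCK_THRESHOLD_SECS = 5 * 60  # "stuck" as defined above: not relayed for 5 minutes


@dataclass(frozen=True)
class OldestPending:
    """Hypothetical per-channel sample: oldest queued sequence and its submission time."""
    chain: str
    channel: str
    sequence: int
    submitted_at: float  # Unix seconds


def is_stuck(sample: Optional[OldestPending], now: float,
             threshold: float = STUCK_THRESHOLD_SECS) -> bool:
    """Return True when the oldest pending packet has waited at least `threshold` seconds."""
    if sample is None:  # empty queue: nothing is waiting to be relayed
        return False
    return now - sample.submitted_at >= threshold


def render_metrics(sample: OldestPending) -> str:
    """Render the sample as two gauges in the Prometheus text exposition format."""
    labels = f'chain="{sample.chain}",channel="{sample.channel}"'
    return (
        f"oldest_pending_sequence{{{labels}}} {sample.sequence}\n"
        f"oldest_pending_timestamp_seconds{{{labels}}} {sample.submitted_at:.0f}\n"
    )
```

With gauges like these, the threshold stays on the alerting side, so each operator can pick a different value without any change to the relayer.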

Alternatively, if this doesn't fit the Prometheus metrics specs, it could be an HTTP web API call that responds with the data in a JSON object. Somehow, I feel Prometheus should fit here, but we're open to other implementations (even a CLI command, if necessary). The one implementation that doesn't work for us is plugging this data into the log file. The data has to be independently queryable, mostly separated from the current operational state of Hermes. (As mentioned in the disadvantages.)
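For the JSON alternative, the response could be as small as the sketch below. The field names (`channel`, `oldest_sequence`, `submitted_at`) and the function itself are hypothetical; no such Hermes endpoint is implied.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def stuck_state_response(channel: str, oldest_sequence: Optional[int],
                         submitted_at: Optional[datetime]) -> str:
    """Build a hypothetical JSON body describing the oldest queued packet on a channel."""
    body = {
        "channel": channel,
        # Both fields are null when the queue is empty.
        "oldest_sequence": oldest_sequence,
        "submitted_at": (
            submitted_at.astimezone(timezone.utc).isoformat() if submitted_at else None
        ),
    }
    return json.dumps(body, sort_keys=True)
```

Returning the raw timestamp rather than a precomputed "stuck" flag leaves the threshold decision to whoever polls the endpoint.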

Acceptance Criteria

There is a way to query the "stuck state" of a channel.
Alternatively: the backlog in a channel is exposed in telemetry, highlighting the timestamp of the oldest unrelayed packet.
- Constraint: Do not introduce additional query pressure
- cf. "Hermes devs <> ops" notes from 24 May 2022.
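One way to respect the query-pressure constraint is to derive the backlog from events the relayer already observes rather than polling the chain. The sketch below is an illustration under that assumption: the class, its method names, and the event hooks are hypothetical, and it treats the lowest pending sequence number as the oldest packet.

```python
from typing import Dict, Optional, Tuple


class ChannelBacklog:
    """Tracks unrelayed packets for one channel using only events already observed."""

    def __init__(self) -> None:
        # sequence number -> time the packet was first seen (Unix seconds)
        self._pending: Dict[int, float] = {}

    def on_send_packet(self, sequence: int, seen_at: float) -> None:
        # Keep the earliest time in case the same event is observed twice.
        self._pending.setdefault(sequence, seen_at)

    def on_acknowledged(self, sequence: int) -> None:
        # Removing an unknown sequence is a no-op.
        self._pending.pop(sequence, None)

    def oldest(self) -> Optional[Tuple[int, float]]:
        """Return (sequence, first_seen) for the lowest pending sequence, or None if empty."""
        if not self._pending:
            return None
        seq = min(self._pending)
        return seq, self._pending[seq]

    def size(self) -> int:
        return len(self._pending)
```

The values returned by `oldest()` and `size()` are what a telemetry layer would export; no extra chain query is issued to produce them.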

For Admin Use

Not duplicate issue
Appropriate labels applied
Appropriate milestone (priority) applied
Appropriate contributors tagged
Contributor assigned/self-assigned

adizere added this to the v1.0.0 milestone May 10, 2022

adizere added the I: CLI (Internal: related to the relayer's CLI), I: logic (Internal: related to the relaying logic), O: usability (Objective: cause to improve the user experience (UX) and ease using the product), and E: osmosis (External: related to Osmosis) labels May 10, 2022

adizere added the P-high label May 27, 2022

adizere self-assigned this May 27, 2022

adizere mentioned this issue Jun 13, 2022

Add new metrics to help operators figure out whether there are any stuck packets in a channel #2250

Merged

10 tasks

adizere assigned ljoss17 and unassigned adizere Jun 28, 2022

romac closed this as completed in #2250 Jun 29, 2022

romac added the I: telemetry (Internal: related to Telemetry & metrics) label Jul 11, 2022

ljoss17 mentioned this issue Jul 20, 2022

Improve metrics descriptions and naming #2409

Merged

6 tasks
