Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new metrics to help operators figure out whether there any stuck packets in a channel #2250

Merged
merged 24 commits into from
Jun 29, 2022

Conversation

adizere
Copy link
Member

@adizere adizere commented May 31, 2022

Closes: #2175

Description

TODOs:


PR author checklist:

  • Added changelog entry, using unclog.
  • Added tests: integration (for Hermes) or unit/mock tests (for modules).
  • Linked to GitHub issue.
  • Updated code comments and documentation (e.g., docs/).

Reviewer checklist:

  • Reviewed Files changed in the GitHub PR explorer.
  • Manually tested (in case integration/unit/mock tests are absent).

relayer/src/worker/wallet.rs Outdated Show resolved Hide resolved
@informalsystems informalsystems deleted a comment from adizere Jun 1, 2022
@romac romac marked this pull request as ready for review June 1, 2022 15:33
@romac romac marked this pull request as draft June 1, 2022 15:34
@romac
Copy link
Member

romac commented Jun 24, 2022

@adizere Anything I can help with here?

@adizere
Copy link
Member Author

adizere commented Jun 24, 2022

@adizere Anything I can help with here?

Actually @ljoss17 is leading this effort, he has some interesting WIP in his local copy of this branch.

@adizere
Copy link
Member Author

adizere commented Jun 28, 2022

I ran some tests and noticed we might have a possibly duplicate metric:

# HELP send_packet_count Number of SendPacket relayed.
# TYPE send_packet_count counter
send_packet_count{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer"} 1101
send_packet_count{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 1610

which records data similarly to:

# HELP ibc_receive_packets Number of receive packets relayed per channel
# TYPE ibc_receive_packets counter
ibc_receive_packets{src_chain="ibc-0",src_channel="channel-0",src_port="transfer"} 1101
ibc_receive_packets{src_chain="ibc-1",src_channel="channel-0",src_port="transfer"} 1610

Looking in the code, it seems they are not the same, however:

  1. send_packet_count is updated from the supervisor, using events directly received from the websocket.
    • this means that this metrics counts all send packet events that a chain emits (counting the activity of all relayers)
    • the metrics is also updated from packet clearing logic
  2. ibc_receive_packets is updated from the packet worker
    • so this metrics counts the send packet events that the current Hermes instance has relayed (not from all relayers active on that chain)

I think we introduced send_packet_count specifically for reporting on slow channels (#2175), but now that we have the oldest_sequence metric, if that counts accurately, then we can probably remove send_packet_count. Or should we keep it for some other reason?

@ljoss17
Copy link
Contributor

ljoss17 commented Jun 28, 2022

I ran some tests and noticed we might have a possibly duplicate metric:

# HELP send_packet_count Number of SendPacket relayed.
# TYPE send_packet_count counter
send_packet_count{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer"} 1101
send_packet_count{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 1610

which records data similarly to:

# HELP ibc_receive_packets Number of receive packets relayed per channel
# TYPE ibc_receive_packets counter
ibc_receive_packets{src_chain="ibc-0",src_channel="channel-0",src_port="transfer"} 1101
ibc_receive_packets{src_chain="ibc-1",src_channel="channel-0",src_port="transfer"} 1610

Looking in the code, it seems they are not the same, however:

  1. send_packet_count is updated from the supervisor, using events directly received from the websocket.

    • this means that this metrics counts all send packet events that a chain emits (counting the activity of all relayers)
    • the metrics is also updated from packet clearing logic
  2. ibc_receive_packets is updated from the packet worker

    • so this metrics counts the send packet events that the current Hermes instance has relayed (not from all relayers active on that chain)

I think we introduced send_packet_count specifically for reporting on slow channels (#2175), but now that we have the oldest_sequence metric, if that counts accurately, then we can probably remove send_packet_count. Or should we keep it for some other reason?

The send_packet_count could be useful if there are pending packets. The oldest_sequence will only give the oldest pending packet, while the send_packet_count combined with the acknowledgement_count will give how many packets are pending.
If this information doesn't seem very useful, then I don't see any reason to keep the send_packet and acknowledgement_count.

@ljoss17 ljoss17 marked this pull request as ready for review June 28, 2022 13:01
@romac romac self-assigned this Jun 29, 2022
@romac romac changed the title Metrics improvements Added new metrics to help operators figure out whether there any stuck packets in a channel Jun 29, 2022
@romac romac changed the title Added new metrics to help operators figure out whether there any stuck packets in a channel Add new metrics to help operators figure out whether there any stuck packets in a channel Jun 29, 2022
@romac romac merged commit b382eb3 into master Jun 29, 2022
@romac romac deleted the metrics_improvs branch June 29, 2022 14:09
hu55a1n1 pushed a commit to hu55a1n1/hermes that referenced this pull request Sep 13, 2022
…packets in a channel (informalsystems#2250)

* Various fixes & todos

* Removed unused import

* Report wallet amount divided by 10^6 (eg. uatom as atom) to work around Prometheus only supporting u64

* Undo GM modifications

* Ditto

* WS-based packet counting, part 1

* Added telemetry to record number of send packets and acks

* Added metrics for oldest pending packet sequence number

* Cargo fmt

* Added comments and updated new metrics descriptions

* Renamed some metrics and improved packet number history recording

* Updated retrieving the min sequence number without cloning the DashSet in the telemetry

* post-merge

* Made tx_latency counts more accurate.

* Undo changes to config.toml

* Undo changes to config.toml part 2

* Renamed some telemetry methods and added timestamp for oldest pending packet

* Updated Hermes guide with new telemetry metrics, added changelog entry and improved description of metrics

Co-authored-by: Romain Ruetschi <romain@informal.systems>
Co-authored-by: Luca Joss <luca@informal.systems>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Report on slow/stalled channel traffic
3 participants