Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect handling of the metrics when a packet is relayed by another relayer #2467

Closed
5 tasks
Tracked by #2457
AlianBenabdallah opened this issue Jul 26, 2022 · 0 comments · Fixed by #2457
Closed
5 tasks
Tracked by #2457
Assignees
Labels
A: bug Admin: something isn't working A: help-wanted Admin: extra attention is needed, good for seniors I: telemetry Internal: related to Telemetry & metrics

Comments

@AlianBenabdallah
Copy link
Contributor

AlianBenabdallah commented Jul 26, 2022

Summary of Bug

Periodically, the packet worker pops operational data from the queues and checks whether they were already submitted by another relayer (see this function). From the code, it looks like the metrics are not updated in this scenario.

Below, I try to create a scenario where a second relayer instance relays the packet and the metrics of the first relayer become inconsistent.

Version

hermes 1.0.0-rc.0+6c23af53

Steps to Reproduce

You should run this on branch ali/fix_oldest_sequence (or master after merge) because the process used to generate pending packets has a high probability of failure and, if it fails, will generate a timeout. This branch handles correctly the oldest_* metrics when a timeout occurs. If you are not using this branch, a timeout will freeze the metrics.

  • Set up 2 chains and create channels between them
  • Run hermes start
  • Create a pending packet : hermes tx ft-transfer --dst-chain ibc-0 --src-chain ibc-1 --src-port transfer --src-channel channel-0 --amount 1000 --timeout-seconds 10 should work most of the times.
  • Wait 30s.
  • Ensure that a packet is indeed pending with hermes query packet pending --chain ibc-1 --port transfer --channel channel-0

You should see an output similar to this :

Success: Summary {
    src: PendingPackets {
        unreceived_packets: [
            Sequence(
                18,
            ),
            Sequence(
                19,
            ),
        ],
        unreceived_acks: [],
    },
    dst: PendingPackets {
        unreceived_packets: [],
        unreceived_acks: [],
    },
}
  • Ensure that the oldest_* are frozen at a non zero value. If it's not the case then no packet is pending or the code is not running on the correct branch.
    You should see non zero metrics for these two fields :
# HELP oldest_sequence Sequence number of the oldest pending SendPacket event.
# TYPE oldest_sequence gauge
oldest_sequence{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 18
# HELP oldest_timestamp Local timestamp of the oldest pending SendPacket event
# TYPE oldest_timestamp gauge
oldest_timestamp{chain="ibc-1",channel="channel-0",counterparty="ibc-0",port="transfer"} 1658849386
  • From another terminal, clear the packets with hermes clear packets --chain ibc-1 --port transfer --channel channel-0
  • Ensure that the metrics are still frozen. You should see non zero values for these two metrics.
  • Ensure that no packet is pending with hermes query packet pending --chain ibc-1 --port transfer --channel channel-0.
    You should see this output :
Success: Summary {
    src: PendingPackets {
        unreceived_packets: [],
        unreceived_acks: [],
    },
    dst: PendingPackets {
        unreceived_packets: [],
        unreceived_acks: [],
    },
}

Acceptance Criteria

Metrics are correctly updated when relayed by another relayer.


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate milestone (priority) applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@AlianBenabdallah AlianBenabdallah added I: telemetry Internal: related to Telemetry & metrics A: bug Admin: something isn't working A: help-wanted Admin: extra attention is needed, good for seniors labels Jul 26, 2022
@adizere adizere self-assigned this Jul 27, 2022
@adizere adizere linked a pull request Jul 28, 2022 that will close this issue
11 tasks
adizere added a commit that referenced this issue Jul 28, 2022
AlianBenabdallah added a commit that referenced this issue Jul 28, 2022
* clears telemetry sequence history on timeout

* change description to incorporate the fact that oldest can also be waiting for a timeout

* Delete json_encoder.rs

* update guide

* replace match by if let and replace e by event

* fmt

* changelog entry

* Rename 2429-fix-oldest-sequence-timeout.md to 2451-fix-oldest-sequence-timeout.md

* Update 2451-fix-oldest-sequence-timeout.md

* Update 2451-fix-oldest-sequence-timeout.md

* Renamed oldest_* metrics to backlog_*, refactored assoc. methods

* Fix for #2467 w/ Ali + refactor

Co-authored-by: Adi Seredinschi <adi@informal.systems>
hu55a1n1 pushed a commit to hu55a1n1/hermes that referenced this issue Sep 13, 2022
* clears telemetry sequence history on timeout

* change description to incorporate the fact that oldest can also be waiting for a timeout

* Delete json_encoder.rs

* update guide

* replace match by if let and replace e by event

* fmt

* changelog entry

* Rename 2429-fix-oldest-sequence-timeout.md to 2451-fix-oldest-sequence-timeout.md

* Update 2451-fix-oldest-sequence-timeout.md

* Update 2451-fix-oldest-sequence-timeout.md

* Renamed oldest_* metrics to backlog_*, refactored assoc. methods

* Fix for informalsystems#2467 w/ Ali + refactor

Co-authored-by: Adi Seredinschi <adi@informal.systems>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: bug Admin: something isn't working A: help-wanted Admin: extra attention is needed, good for seniors I: telemetry Internal: related to Telemetry & metrics
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants