Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve metrics descriptions and naming #2409

Merged
merged 11 commits into from
Jul 21, 2022

Conversation

ljoss17
Copy link
Contributor

@ljoss17 ljoss17 commented Jul 15, 2022

Closes: #2408

Description

This PR applies the following improvements to some telemetry metrics:

  • Renamed the following metric: cleared_count to cleared_send_packet_count
  • Added new metric: cleared_acknowledgment_count
  • Improve help message for:
    • ibc_receive_packets
    • ibc_acknowledgment_packets
    • ibc_timeout_packets
    • send_packet_count
    • acknowledgement_count
    • cleared_send_packet_count
    • oldest_sequence
  • Added histogram buckets for:
    • tx_latency_submitted
    • tx_latency_confirmed
  • Changed default value of tx_confirmation to false for config.toml

Improvements suggested by operators:

  • for oldest sequence number & timestamp, once that packet is relayed, the metric should be updated with a default (0) value so that it no longer signals any oldest sn/timestamp

PR author checklist:

  • Added changelog entry, using unclog.
  • Added tests: integration (for Hermes) or unit/mock tests (for modules).
  • Linked to GitHub issue.
  • Updated code comments and documentation (e.g., docs/).

Reviewer checklist:

  • Reviewed Files changed in the GitHub PR explorer.
  • Manually tested (in case integration/unit/mock tests are absent).

telemetry/src/state.rs Outdated Show resolved Hide resolved
| `send_packet_count` | Number of SendPacket events processed | `u64` Counter |
| `acknowledgement_count` | Number of WriteAcknowledgement events processed  | `u64` Counter |
| `cleared_send_packet_count`  | Number of SendPacket events processed during the initial and periodic clearing | `u64` Counter  |
| `cleared_acknowledgment_count` | Number of WriteAcknowledgement events processed during the initial and periodic clearing | `u64` Counter  |
Copy link
Collaborator

@ancazamfir ancazamfir Jul 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you try to get this counter in the example below? I tried and was not able to get it. Here is what I tried:

  • send 20 packets to ibc-0
    hermes tx ft-transfer --sender-chain ibc-0 --receiver-chain ibc-1 --sender-channel channel-0 --sender-port transfer --amount 90 --timeout-height-offset 3000 --number-msgs 20 --key-name user2
  • check pending (shows 20 unreceived_packets)
    hermes query packet pending --chain ibc-0 --channel channel-0 --port transfer
  • receive the 20 packets on ibc-1
    hermes tx packet-recv --receiver-chain ibc-1 --sender-chain ibc-0 --sender-port transfer --sender-channel channel-0
  • check pending (shows 20 unreceived_acks)
    hermes query packet pending --chain ibc-0 --channel channel-0 --port transfer
  • send 10 more packets to ibc-0
    hermes tx ft-transfer --sender-chain ibc-0 --receiver-chain ibc-1 --sender-channel channel-0 --sender-port transfer --amount 90 --timeout-height-offset 3000 --number-msgs 10 --key-name user2
  • check pending (shows 10 unreceived_packets and 20 unreceived_acks)
    hermes query packet pending --chain ibc-0 --channel channel-0 --port transfer
  • start hermes
    hermes start
  • wait a bit for hermes to finish clearing and then check metrics. I get these (cleared_acknowledgment_count is missing):
# HELP acknowledgement_count Number of WriteAcknowledgement events processed
# TYPE acknowledgement_count counter
acknowledgement_count{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer"} 10

# HELP cleared_send_packet_count Number of SendPacket events processed during the initial and periodic clearing
# TYPE cleared_send_packet_count counter
cleared_send_packet_count{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer"} 10

# HELP msg_num How many messages Hermes submitted to the chain, per chain
# TYPE msg_num counter
msg_num{chain="ibc-0"} 32
msg_num{chain="ibc-1"} 11

# HELP send_packet_count Number of SendPacket events processed
# TYPE send_packet_count counter
send_packet_count{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer"} 10 

For me also the latency metrics are missing for one chain (ibc-1)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed I am not getting the cleared_acknowledgment_count as expected. I will investigate this issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was an issue in where the cleared_acknowledgment_count was counted. I pushed a fix and running the example you gave now correctly has the following metrics:

acknowledgement_count{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer"} 10
cleared_acknowledgment_count{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer"} 20
send_packet_count{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer"} 10
cleared_send_packet_count{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer"} 10

There is indeed an issue recording latencies with the clearing:

  • clearing unreceived_acks doesn't record any latency
  • clearing unreceived_packets only record latencies for the sender chain
  • sending packets when Hermes is running correctly record the latencies

I will check and see how this can be corrected

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

excellent, thanks for looking into this!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me also the latency metrics are missing for one chain (ibc-1)

The latency uses the tracking id in order to record the submit and confirmed time. The tracking id of packets cleared wasn’t being recorded, thus no latency was measured. In addition, we were using a static string for the tracking id when clearing packets.
In order to have the latencies even when clearing packets, I added an additional enum to the TrackingId, which still allows to differentiate cleared packets by the tracking id as they have a prefix "cleared/“ to a UUID.

By repeating the example completely I correctly have 2 latencies submitted and confirmed from ibc-0 to ibc-1, the acks and recv pending. And 1 latency submitted and confirmed from ibc-1 to ibc-0 which should be the ack from the pending recv.
And by repeating the first 4 steps, in order to have only pending acks, the metrics show 1 latency submitted and confirmed from ibc-0 to ibc-1.

What do you think about this solution ? The changes are in the commit : a75810d

Copy link
Collaborator

@ancazamfir ancazamfir Jul 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

back to the counters, for the example I gave I get now:

acknowledgement_count{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer"} 10
cleared_acknowledgment_count{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer"} 20

send_packet_count{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer"} 10
cleared_send_packet_count{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer"} 10

msg_num{chain="ibc-0"} 32
msg_num{chain="ibc-1"} 11

should acknowledgement_count be 30 if it is the total? Meaning there were 20 pending acks plus the 10 that came in live from the processing of the 10 send_packet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea behind these four metrics is to give information on two separate interests:

  1. Has Hermes handled all the WriteAcknowledgment events related to the SendPacket events it handled. This is shown by send_packet_count and acknowledgment_count.
  2. Of all the SendPacket and WriteAcknowledgement events handled, how many were from packet clearing. This is shown by cleared_send_packet_count and cleared_acknowledgement_count.

So in my opinion the acknowledgment_count should always mirror the send_packet_count if everything is working correctly.
If there were pending WriteAcknowledgment events, and acknowledgment_count is not dissociated from cleared_acknowledgement_count, then send_packet_count will not mirror acknowledgment_count, since the SendPacket event have been handled before Hermes started.
And for the cleared_send_packet_count it is the opposite, if the pending SendPacket events are not included in the send_packet_count it will differ from acknowledgment_count since these SendPacket events will have a corresponding WriteAcknowledgment event.
I can see the potential confusion these metrics can cause, if this solution doesn't seem to fit the requirements I can change it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me this is confusing. Also send_packet_count will not mirror acknowledgment_count if any packets have timed out. IMO we should be consistent for send and ack metrics. But if we keep things the way they are we need to explain very clearly what the metrics mean in the help strings. The name of the counters would also need to reflect the difference (send_packet_count is a total while the acknowledgment_count counts the live events only). Something like:

live_acknowledgement_count{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer"} 10
cleared_acknowledgment_count{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer"} 20

total_send_packet_count{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer"} 10
cleared_send_packet_count{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer"} 10

We could also have live_send_packet_count instead of total.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I hadn't taken into account the timeouts. The points raised regarding these metrics make me realise that these metrics seem more confusing than helpful.

The addition of these metrics was originally to spot slow/stalled traffic #2175, but in my opinion the oldest_sequence and oldest_timestamp should be enough to track this.

Furthermore these metrics, send_packet_count, acknowledgment_count, cleared_send_packet_count and cleared_acknowledgement_count seem to aim at corner cases rather than general cases, they work well if there are no timeouts, and/or pending packets.

The more points were raised the more I had trouble convincing myself of the usefulness of these metrics. I am wondering if it is a good idea to keep them. What do you think ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep something, they are helpful as there is no other hint on possible packets relayed when tx_confirmation is false. It's the best we can do to say we processed x, y type events.
What is not completely clear is how to reduce the confusion. As I said, if we keep what we have, we can clarify by having good help and metric names. Since these are already in the code base I would say let's do it the best we can in this PR. Once we experiment more we can change later along the other lines I mentioned or something else or even remove them.

telemetry/src/state.rs Outdated Show resolved Hide resolved
config.toml Outdated Show resolved Hide resolved
guide/src/telemetry.md Outdated Show resolved Hide resolved
…and updated telemetry output example in Hermes guide
@adizere
Copy link
Member

adizere commented Jul 20, 2022

Changed default value of tx_confirmation to false for config.toml

I'm curious, what is the rationale for making this default false? Some operators are running with this enabled. Some are disabling it for optimization reasons.

@ancazamfir
Copy link
Collaborator

Changed default value of tx_confirmation to false for config.toml

I'm curious, what is the rationale for making this default false? Some operators are running with this enabled. Some are disabling it for optimization reasons.

It involves more complex processing and logic. The config also says "experimental". And for some reason I thought most operators run with it disabled. Is there any way to gauge this?

@adizere
Copy link
Member

adizere commented Jul 21, 2022

Is there any way to gauge this?

I'm not sure what you mean. Do you mean measure/analytics what % of operators have tx_confirmation = true?

We discussed yesterday with operators and here are 3 takeaways, also based on my observations so far:

  1. Even after a channel is cleared, oldest_* metrics can remain to the same value (i.e., not reset to 0).

  2. Some operators do a discovery phase to understand the Prometheus schema, but Hermes upon starting up does not populate most of the metric fields

    • the most important fields that we should initially populate with some zero value would be the oldest_* ones and the ones related to activity (msg_num and send_packet_count for example).
  3. I think we should clarify what the oldest_timestamp is, it seems this field is a local timestamp to the Hermes process, not an on-chain packet timestamp (when the packet was created), which is not clear from the telemetry help message, specifically:

    # HELP oldest_timestamp The timestamp of the oldest sequence number in seconds
    # TYPE oldest_timestamp gauge
    oldest_timestamp{chain="ibc-0",channel="channel-3",counterparty="ibc-1",port="transfer"} 0

@ancazamfir
Copy link
Collaborator

I'm not sure what you mean. Do you mean measure/analytics what % of operators have tx_confirmation = true?

Yes.

I agree with all of your other 3 points @adizere. I was also confused about the lack of "discovery".

Copy link
Member

@adizere adizere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great improvements in here! Did some tests and looks clearly better than what is currently in master. Will pick up the work in a follow-up issue + PR.

@adizere adizere merged commit 53f3873 into master Jul 21, 2022
@adizere adizere deleted the luca_joss/improve_metrics_descriptions branch July 21, 2022 14:43
hu55a1n1 pushed a commit to hu55a1n1/hermes that referenced this pull request Sep 13, 2022
* Improved naming and description of some telemetry metrics. Changed configuration 'tx_confirmation' default to be false

* Added changelog entry

* Updated guide with new telemetry changes

* Renamed clear packet metrics

* Updated some metric descriptions and conditional display

* Fixed metric used to count cleared acknowledgments, updated latencies histogram buckets.

* Clippy fix

* Added new enum to TrackindId in order to track latency of packets cleared

* Fixes in comments

* Improved changelog entry, fixed description of wallet balance metric and updated telemetry output example in Hermes guide
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Some issues and wrinkles with telemetry metrics
4 participants