
Hermes retrying mechanism in tx confirmation logic should be regenerating messages #1792

Closed

adizere opened this issue Jan 19, 2022 · 6 comments · Fixed by #1951
Comments

@adizere
Member

adizere commented Jan 19, 2022

Crate

ibc-relayer

Summary of Bug

When Hermes observes that a transaction hash remains unconfirmed and tx_confirmation = true, it proceeds to re-submit that transaction.

The resubmit step happens here:

https://github.com/informalsystems/ibc-rs/blob/37d54d4d851d3d7af394845e05b17b4d0e66afd7/relayer/src/link/pending.rs#L158-L165

That step simply calls the relay_from_operational_data method:

https://github.com/informalsystems/ibc-rs/blob/2757031e029c5456f7cfe483bcca0e34ba2d5ef4/relayer/src/link/relay_path.rs#L1279

This is problematic, however, because relay_from_operational_data blindly takes the messages that Hermes originally generated and calls broadcast_tx_sync with them, but these messages may comprise RecvPacket messages that are no longer relevant (because they might have timed out in the meantime). If Hermes fails to resubmit successfully, it retries indefinitely.
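
For illustration, here is a rough sketch of that flow in simplified pseudocode (this is not the actual pending.rs code; confirmation_timed_out and pending_queue are placeholder names):

if confirmation_timed_out(&pending) {
  // The messages from the original operational data are re-broadcast as-is
  // via relay_from_operational_data, even if some RecvPacket messages have
  // since timed out on the destination chain.
  let res = relay_path.relay_from_operational_data(pending.original_od.clone());
  if res.is_err() {
    // On failure the entry goes back into the pending queue,
    // so the same stale messages get retried indefinitely.
    pending_queue.push_back(pending);
  }
}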

Version

v0.10

Steps to Reproduce

It is unclear how to reproduce this; the main source of this bug report is a log, h/t @faddat:
https://gist.github.com/adizere/e49c5083b3a3bae2d4a03735e5a8196a

As these logs reveal, the timeout in the confirmation logic triggers, Hermes tries to resubmit, and the resubmission fails because the packets have timed out. The relevant snippet is this one:

2022-01-19T12:16:07.964933Z ERROR ThreadId(82) packet{src_chain=osmosis-1 src_port=transfer src_channel=channel-0 dst_chain=cosmoshub-4}: [cosmoshub-4:transfer/channel-141 -> osmosis-1] timed out while confirming TxHashes: count=1; 22B9C8BBA689E35BA8C9514AD991F5820047468FF4A54602F482FBC14182B6D0
2022-01-19T12:16:07.967472Z  WARN ThreadId(82) packet{src_chain=osmosis-1 src_port=transfer src_channel=channel-0 dst_chain=cosmoshub-4}:relay{odata=4hG-9d5Z37 ->Destination @1-2860135; len=1}: [osmosis-1 -> cosmoshub-4:07-tendermint-259] resolving trusted height from the full list of consensus state heights for target height 1-2860136; this may take a while
2022-01-19T12:16:13.252574Z  INFO ThreadId(82) packet{src_chain=osmosis-1 src_port=transfer src_channel=channel-0 dst_chain=cosmoshub-4}:relay{odata=4hG-9d5Z37 ->Destination @1-2860135; len=1}: assembled batch of 2 message(s)
2022-01-19T12:16:13.255616Z ERROR ThreadId(80) send_tx{id=cosmoshub-4}:estimate_gas: failed to simulate tx. propagating error to caller: gRPC call failed with status: status: Unknown, message: "failed to execute message; message index: 1: receive packet verification failed: block timestamp >= packet timeout timestamp (2022-01-19 12:15:59.033970382 +0000 UTC >= 2022-01-19 18:31:23.475 +0700 +07): packet timeout", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "x-cosmos-block-height": "9123461"} }
2022-01-19T12:16:13.255662Z ERROR ThreadId(82) packet{src_chain=osmosis-1 src_port=transfer src_channel=channel-0 dst_chain=cosmoshub-4}: error processing pending events in destination chain: failed with underlying error: gRPC call failed with status: status: Unknown, message: "failed to execute message; message index: 1: receive packet verification failed: block timestamp >= packet timeout timestamp (2022-01-19 12:15:59.033970382 +0000 UTC >= 2022-01-19 18:31:23.475 +0700 +07): packet timeout", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "x-cosmos-block-height": "9123461"} }

Acceptance Criteria

There are two separate problems here:

  1. Hermes should not retry indefinitely to resubmit messages that are no longer relevant (e.g., timed-out packets); it should just drop them instead of calling push_back on the pending queue.
  2. Hermes should not blindly resubmit messages that may be stale, but should use the original events in the operational data to regenerate (potentially new) messages that it can subsequently use to broadcast_tx_sync.

I think solving (2) will imply solving (1), but I am not 100% sure.
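
As a rough sketch of what (1) could look like (placeholder names, not the actual code; is_packet_timeout is a hypothetical helper):

match resubmit(pending.original_od.clone()) {
  Ok(_) => {
    // let the tx confirmation loop pick up the new hashes
  }
  Err(e) if is_packet_timeout(&e) => {
    // the packets are no longer relevant: drop the entry
    // instead of calling push_back on the pending queue
  }
  Err(_) => pending_queue.push_back(pending),
}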


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate milestone (priority) applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@adizere adizere added A: bug Admin: something isn't working A: low-priority Admin: low priority / non urgent issue, expect longer wait time for PR reviews labels Jan 19, 2022
@adizere adizere added this to the v0.11.0 milestone Jan 19, 2022
@mort1k

mort1k commented Jan 24, 2022

Hi,
I have the same issue; the relayer loops on a timed-out transaction:

2022-01-24T21:07:11.663416Z  INFO ThreadId(1442) [comdex-1:transfer/channel-1 -> osmosis-1] assembled batch of 2 message(s)
2022-01-24T21:07:11.882081Z ERROR ThreadId(19) [osmosis-1] estimate_gas: failed to simulate tx. propagating error to caller: gRPC call failed with status: status: Unknown, message: "failed to execute message; message index: 1: receive packet verification failed: block height >= packet timeout height (1-2931060 >= 1-2916822): packet timeout", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "x-cosmos-block-height": "2931060"} }
2022-01-24T21:07:11.883771Z ERROR ThreadId(1442) error processing pending events in destination chain: failed with underlying error: gRPC call failed with status: status: Unknown, message: "failed to execute message; message index: 1: receive packet verification failed: block height >= packet timeout height (1-2931060 >= 1-2916822): packet timeout", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "x-cosmos-block-height": "2931060"} }
2022-01-24T21:07:11.887622Z ERROR ThreadId(1441) [osmosis-1:transfer/channel-87 -> comdex-1] timed out while confirming TxHashes: count=1; CAC91B611FAB9ED70C87500AEDF3103EB2E3F4CB810F1F67BEB63B54291DB3FB
2022-01-24T21:07:11.887747Z  INFO ThreadId(1441) [comdex-1:transfer/channel-1 -> osmosis-1] relay op. data of 1 msgs(s) to Destination (height 1-903612), delayed by: 93509.375939661s [try 1/5]
2022-01-24T21:07:11.887833Z  INFO ThreadId(1441) [comdex-1:transfer/channel-1 -> osmosis-1] prepending Destination client update @ height 1-903612
2022-01-24T21:07:11.987169Z  WARN ThreadId(1441) [comdex-1 -> osmosis-1:07-tendermint-1587] resolving trusted height from the full list of consensus state heights for target height 1-903612; this may take a while

@adizere adizere added P-medium and removed A: low-priority Admin: low priority / non urgent issue, expect longer wait time for PR reviews labels Jan 25, 2022
@adizere adizere modified the milestones: v0.12.0, v0.12.1 Feb 14, 2022
@adizere adizere assigned seanchen1991 and unassigned ancazamfir Feb 22, 2022
@adizere adizere added the I: logic Internal: related to the relaying logic label Feb 24, 2022
@seanchen1991
Contributor

Without knowing how to reproduce this issue in the logs, it's unclear whether any attempts at addressing sub-problems 1 or 2 will actually fix the issue. @faddat or @mort1k, do either of you have more information on the flow that you used to observe this issue?

@adizere
Member Author

adizere commented Feb 25, 2022

One potential way to reproduce this is to set a very aggressive timeout, e.g. 2 seconds, as the confirmation mechanism parameter. This way, the tx confirmation mechanism in Hermes will time out before it finds confirmations on the chain, and it will consequently resubmit the same packets again.

Steps:

  1. set TIMEOUT = 1 sec (see the sketch after these steps)
  2. have two chains running locally and a channel-0 between them
  3. hermes start in one terminal
  4. hermes tx raw ft-transfer .. channel-0 in a second terminal
  5. now we should see re-submissions of the same tx
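
For step 1, a minimal sketch of what I mean, assuming the confirmation timeout is the TIMEOUT constant used by the confirmation logic in relayer/src/link/pending.rs (check the constant's actual name and value in your checkout):

use core::time::Duration;

// aggressively short confirmation timeout, for reproducing the resubmission loop only
const TIMEOUT: Duration = Duration::from_secs(1);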

I went through the above steps and got some interesting results: Hermes resubmitted the same packet 4 times in a row.

Logs
2022-02-25T10:17:51.012988Z  INFO ThreadId(01) Hermes has started

 ---- original RecvPacket sent
2022-02-25T10:25:13.412590Z  INFO ThreadId(317) packet_cmd{src_chain=ibc-0 src_port=transfer src_channel=channel-1 dst_chain=ibc-1}:relay{odata=UdafpEuor5 ->Destination @0-9251; len=1}: assembled batch of 2 message(s)
2022-02-25T10:25:13.416096Z  INFO ThreadId(18) send_tx{id=ibc-1}: refresh: retrieved account sequence=11 number=1
2022-02-25T10:25:13.434490Z  INFO ThreadId(317) packet_cmd{src_chain=ibc-0 src_port=transfer src_channel=channel-1 dst_chain=ibc-1}:relay{odata=UdafpEuor5 ->Destination @0-9251; len=1}: [Async~>ibc-1] response(s): 1; Ok:28348BF565621931E1B6A6AF27CB1029F9380F4B265C8AFD7F35AD03F4CCB66D
2022-02-25T10:25:13.434614Z  INFO ThreadId(317) packet_cmd{src_chain=ibc-0 src_port=transfer src_channel=channel-1 dst_chain=ibc-1}:relay{odata=UdafpEuor5 ->Destination @0-9251; len=1}: success

 ---- resubmit (1)
2022-02-25T10:25:15.169760Z ERROR ThreadId(318) packet{src_chain=ibc-0 src_port=transfer src_channel=channel-1 dst_chain=ibc-1}: timed out while confirming TxHashes: count=1; 28348BF565621931E1B6A6AF27CB1029F9380F4B265C8AFD7F35AD03F4CCB66D
2022-02-25T10:25:15.215220Z  INFO ThreadId(318) packet{src_chain=ibc-0 src_port=transfer src_channel=channel-1 dst_chain=ibc-1}:relay{odata=UdafpEuor5 ->Destination @0-9251; len=1}: assembled batch of 2 message(s)
2022-02-25T10:25:15.228404Z  INFO ThreadId(318) packet{src_chain=ibc-0 src_port=transfer src_channel=channel-1 dst_chain=ibc-1}:relay{odata=UdafpEuor5 ->Destination @0-9251; len=1}: [Async~>ibc-1] response(s): 1; Ok:F1558D730A62E0E886ED0FCD0B9CBCBA12911484A0650FECADD7C018029E7B4A
2022-02-25T10:25:15.228477Z  INFO ThreadId(318) packet{src_chain=ibc-0 src_port=transfer src_channel=channel-1 dst_chain=ibc-1}:relay{odata=UdafpEuor5 ->Destination @0-9251; len=1}: success

 ---- resubmit (2)
2022-02-25T10:25:16.236751Z ERROR ThreadId(318) packet{src_chain=ibc-0 src_port=transfer src_channel=channel-1 dst_chain=ibc-1}: timed out while confirming TxHashes: count=1; F1558D730A62E0E886ED0FCD0B9CBCBA12911484A0650FECADD7C018029E7B4A
2022-02-25T10:25:16.278959Z  INFO ThreadId(318) packet{src_chain=ibc-0 src_port=transfer src_channel=channel-1 dst_chain=ibc-1}:relay{odata=UdafpEuor5 ->Destination @0-9251; len=1}: assembled batch of 2 message(s)
2022-02-25T10:25:16.291499Z  INFO ThreadId(318) packet{src_chain=ibc-0 src_port=transfer src_channel=channel-1 dst_chain=ibc-1}:relay{odata=UdafpEuor5 ->Destination @0-9251; len=1}: [Async~>ibc-1] response(s): 1; Ok:690318C27D1C7DFF9523873957584985E676F53A0A9C68E60A2A35FBD13E62C7
2022-02-25T10:25:16.291586Z  INFO ThreadId(318) packet{src_chain=ibc-0 src_port=transfer src_channel=channel-1 dst_chain=ibc-1}:relay{odata=UdafpEuor5 ->Destination @0-9251; len=1}: success

 ---- resubmit (3)
2022-02-25T10:25:17.298742Z ERROR ThreadId(318) packet{src_chain=ibc-0 src_port=transfer src_channel=channel-1 dst_chain=ibc-1}: timed out while confirming TxHashes: count=1; 690318C27D1C7DFF9523873957584985E676F53A0A9C68E60A2A35FBD13E62C7
2022-02-25T10:25:17.341111Z  INFO ThreadId(318) packet{src_chain=ibc-0 src_port=transfer src_channel=channel-1 dst_chain=ibc-1}:relay{odata=UdafpEuor5 ->Destination @0-9251; len=1}: assembled batch of 2 message(s)
2022-02-25T10:25:17.353080Z  INFO ThreadId(318) packet{src_chain=ibc-0 src_port=transfer src_channel=channel-1 dst_chain=ibc-1}:relay{odata=UdafpEuor5 ->Destination @0-9251; len=1}: [Async~>ibc-1] response(s): 1; Ok:08DA6BD1F4F93DE507359FE915AB7C09BB0C386A24528AFD18C671ED7F6A8AF3
2022-02-25T10:25:17.353156Z  INFO ThreadId(318) packet{src_chain=ibc-0 src_port=transfer src_channel=channel-1 dst_chain=ibc-1}:relay{odata=UdafpEuor5 ->Destination @0-9251; len=1}: success

 ---- resubmit (4)
2022-02-25T10:25:18.359299Z ERROR ThreadId(318) packet{src_chain=ibc-0 src_port=transfer src_channel=channel-1 dst_chain=ibc-1}: timed out while confirming TxHashes: count=1; 08DA6BD1F4F93DE507359FE915AB7C09BB0C386A24528AFD18C671ED7F6A8AF3
2022-02-25T10:25:18.401260Z  INFO ThreadId(318) packet{src_chain=ibc-0 src_port=transfer src_channel=channel-1 dst_chain=ibc-1}:relay{odata=UdafpEuor5 ->Destination @0-9251; len=1}: assembled batch of 2 message(s)
2022-02-25T10:25:18.453188Z  INFO ThreadId(318) packet{src_chain=ibc-0 src_port=transfer src_channel=channel-1 dst_chain=ibc-1}:relay{odata=UdafpEuor5 ->Destination @0-9251; len=1}: [Async~>ibc-1] response(s): 1; Ok:0947D91C50511A8395A6F0F5FD8F3E529ECE977AC8580AE92023DEE015365A49

I'm actually starting to think we should eliminate the resubmission step altogether -- or at least when periodic packet clearing is enabled. If clear_interval > 0 then Hermes will periodically clear pending packets anyway, so the two mechanisms (packet clearing and resubmission) overlap.

@seanchen1991
Contributor

If clear_interval > 0 then Hermes will anyway periodically clear pending packets

Does this mean we want to do something like

// only resubmit if `clear_interval` is set to 0
if clear_interval == 0 {
  let resubmit_res = resubmit(pending.original_od.clone());
  ...
}

@adizere
Member Author

adizere commented Feb 25, 2022

Yes, along those lines.

But that will not be sufficient to fix the issue entirely. There will remain the case where we do resubmit, and in that case we should (as in the issue description):

Hermes should not blindly resubmit messages that may be stale, but should use the original events in the operational data to regenerate (potentially new) messages that it can subsequently use to broadcast_tx_sync.

So something like:

// only resubmit if `clear_interval` is set to 0
if clear_interval == 0 {
  let new_od = regenerate_operational_data(pending.original_od);
  let resubmit_res = resubmit(new_od);
  ...
}

@seanchen1991
Contributor

I think this fix is going to require some larger architectural changes than we might have initially thought. regenerate_operational_data is a method on RelayPath, and introducing that type as a parameter in the requisite methods seems like a messy way to go about this refactor. I'm not quite sure how to better handle this; I would appreciate anyone's thoughts and/or feedback on how to address this issue more cleanly.
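
For concreteness, one possibility (a purely hypothetical sketch, not the actual ibc-rs API, and not necessarily any cleaner) would be to have the caller, which owns the RelayPath, pass the regeneration step into the pending-tx processing as a closure, so that pending.rs does not need to know about RelayPath at all:

// hypothetical shape: the caller supplies the regeneration logic as a closure
fn process_pending<F>(pending: PendingTx, regenerate: F) -> Result<(), LinkError>
where
  F: Fn(&OperationalData) -> Result<OperationalData, LinkError>,
{
  let new_od = regenerate(&pending.original_od)?;
  resubmit(new_od)
}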
