Regression bug: Hermes is unable to re-establish monitor connection after node goes down #1026

adizere · 2021-06-01T13:58:41Z

Crate

ibc-relayer primarily

Summary of Bug

The fault-tolerance mechanism introduced in #895 (and the follow-up PR #903) ensures that
Hermes can cope with a full node's websocket endpoint becoming unreachable, and continues
to function unaffected once the endpoint is reachable again.
This mechanism no longer works: we introduced a regression, possibly with the batching
feature, as Romain suggested.

Version

e4a6543

Steps to Reproduce

1. run scripts/dev-env with two chains, e.g., ibc-0 and ibc-1
2. run hermes create channel ibc-0 ibc-1 --port-a transfer --port-b transfer and wait for it to finish
3. run hermes start in one terminal, make sure your logging level is at least debug
4. kill one of the gaia instances and watch the hermes output
   - kill -9 GAIA_PID
   - some error should appear in the log
5. so far so good
6. run scripts/dev-env again with the same configuration as before
7. run hermes create channel ibc-0 ibc-1 --port-a transfer --port-b transfer again
8. run hermes tx raw ft-transfer ibc-1 ibc-0 transfer channel-0 9999 -o 1000

The problem is that Hermes (the one we started at step 3) should pick up the connection to the two gaia instances and relay the packet. But this does not happen. Instead, Hermes does connect via websocket to the chains, but it does not receive any events from either of the two chains.
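The symptom described above (the socket is connected, but no events arrive) is what one would expect if subscriptions are tied to the connection they were created on and are not re-created after a reconnect. That reading is consistent with the title of #1027, but it is an assumption here, not a confirmed diagnosis. The following is a minimal sketch of the idea; `Connection`, `EventMonitor`, and all their methods are hypothetical names for illustration, not the ibc-relayer API.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a websocket connection to a full node.
/// Subscriptions live on the connection, so a fresh connection has none.
#[derive(Debug, Default)]
pub struct Connection {
    subscriptions: BTreeSet<String>,
}

impl Connection {
    pub fn subscribe(&mut self, query: &str) {
        self.subscriptions.insert(query.to_string());
    }

    /// An event is delivered only if this connection holds a matching subscription.
    pub fn delivers(&self, query: &str) -> bool {
        self.subscriptions.contains(query)
    }
}

/// Hypothetical event monitor that remembers which queries it wants.
#[derive(Debug)]
pub struct EventMonitor {
    queries: Vec<String>,
    conn: Connection,
}

impl EventMonitor {
    pub fn new(queries: &[&str]) -> Self {
        let mut monitor = EventMonitor {
            queries: queries.iter().map(|q| q.to_string()).collect(),
            conn: Connection::default(),
        };
        monitor.resubscribe();
        monitor
    }

    /// Re-create every remembered subscription on the current connection.
    fn resubscribe(&mut self) {
        for query in &self.queries {
            self.conn.subscribe(query);
        }
    }

    /// Buggy variant: replaces the connection but forgets the subscriptions.
    pub fn reconnect_without_resubscribe(&mut self) {
        self.conn = Connection::default();
    }

    /// Fixed variant: replaces the connection, then subscribes again.
    pub fn reconnect(&mut self) {
        self.conn = Connection::default();
        self.resubscribe();
    }

    pub fn receives(&self, query: &str) -> bool {
        self.conn.delivers(query)
    }
}
```

In this model, calling `reconnect_without_resubscribe()` leaves `receives(...)` returning `false` for every query even though a connection exists, which mirrors the behaviour reported here; `reconnect()` restores delivery.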

What should happen

With version 20d8fff of Hermes and the same recipe as above, at steps 7 and 8 we would see Hermes workers starting and doing active work (relaying). For example, the output should be:

Jun 01 16:55:31.073 DEBUG ibc_relayer::supervisor: chain ibc-0 sent events:
	UpdateClientEv(h: 0-27, cs_h: 07-tendermint-0(1-16))
 for object Client(Client { dst_chain_id: ChainId { id: "ibc-0", version: 0 }, dst_client_id: ClientId("07-tendermint-0"), src_chain_id: ChainId { id: "ibc-1", version: 1 } })

Note that 20d8fff requires the command hermes start-multi at step 3 (instead of hermes start), and the configuration .toml file should have strategy = 'naive'.

Acceptance Criteria

Hermes should behave again as it did at version 20d8fff

For Admin Use

Not duplicate issue
Appropriate labels applied
Appropriate milestone (priority) applied
Appropriate contributors tagged
Contributor assigned/self-assigned


ancazamfir · 2021-06-02T10:00:40Z

A few thoughts (writing them here, but we'll need separate issues):

yes, we should fix the reconnect as a first step (Enable logging in tests and fix error log flooding from mock chain runtime #1017)
since it is possible that events have been missed, workers should "reset", e.g. unipath should clear packets, client should recheck updates, etc.
it is also possible that events that would have spawned new workers were missed
we need to analyze how to handle websocket vs rpc vs grpc failures. These services can be configured to use different nodes, but we know that for the passive relayer the events are vital. So maybe just stopping all workers involved and rescanning state at reconnect time would be ok (similar to remove+add chain)
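The "stop all workers involved and rescan state at reconnect time" idea could look roughly like the sketch below. It is only an illustration under stated assumptions: `WorkerKey`, `Supervisor`, `on_reconnect`, and the `scanned` input (standing in for the result of a fresh query of on-chain state) are hypothetical and do not reflect the actual ibc-relayer supervisor.

```rust
use std::collections::HashMap;

/// Hypothetical key identifying what a worker relays for.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct WorkerKey {
    pub src_chain: String,
    pub dst_chain: String,
    /// Free-form object label, e.g. "packet:channel-0/transfer".
    pub object: String,
}

impl WorkerKey {
    pub fn new(src: &str, dst: &str, object: &str) -> Self {
        WorkerKey {
            src_chain: src.to_string(),
            dst_chain: dst.to_string(),
            object: object.to_string(),
        }
    }

    pub fn involves(&self, chain: &str) -> bool {
        self.src_chain == chain || self.dst_chain == chain
    }
}

/// Hypothetical supervisor; each worker records the generation it was spawned in.
#[derive(Debug, Default)]
pub struct Supervisor {
    workers: HashMap<WorkerKey, u64>,
    generation: u64,
}

impl Supervisor {
    pub fn spawn(&mut self, key: WorkerKey) {
        self.workers.insert(key, self.generation);
    }

    /// On reconnect to `chain`: stop every worker that involves it, then
    /// respawn from a fresh scan of state, similar in spirit to removing
    /// and re-adding the chain. Returns how many workers were stopped.
    pub fn on_reconnect(&mut self, chain: &str, scanned: Vec<WorkerKey>) -> usize {
        let before = self.workers.len();
        self.workers.retain(|key, _| !key.involves(chain));
        let stopped = before - self.workers.len();

        self.generation += 1;
        for key in scanned.into_iter().filter(|k| k.involves(chain)) {
            self.spawn(key);
        }
        stopped
    }

    pub fn generation_of(&self, key: &WorkerKey) -> Option<u64> {
        self.workers.get(key).copied()
    }
}
```

Workers for unrelated chains keep their original generation, while anything touching the reconnected chain is either gone (stale channels or clients) or respawned with a new generation. This trades some redundant work for not having to reason about exactly which events were missed.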

Testing

the test with dev-env is ok for testing some basic things, but it is not enough.
this test simulates chains stopping and restarting (e.g. chain upgrade with reset height). Normally this should happen with a higher version number (so for hermes these are new chains, e.g. the ibc-0 node will be upgraded and restarted as ibc-1)
if the gaiad restart happens fast enough, we are left with some workers from the old instance for channels/clients that don't exist anymore.
instead we should use gaia manager to set up a test environment for these scenarios (we would need two full nodes: one to bring down, used for events/queries etc. by hermes start, and one to be used by another hermes instance that makes IBC state changes)

Other issues I noticed while debugging this:

logs look very inconsistent (see below); we should enforce a consistent span on all logs.
- DEBUG ibc_relayer::worker::client: client 'ibc-1 -> ibc-0:07-tendermint-0'
- DEBUG ibc_relayer::foreign_client: [ibc-0 -> ibc-1:07-tendermint-0]
- INFO ibc_relayer::worker: [channel-0/transfer:ibc-0->ibc-1]
packet workers retry and exit, client workers do not
channel worker is not yet integrated with the telemetry
channel worker does not have an outer retry loop (it retries for individual commands)
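As an illustration of the first point, one way to get uniform log prefixes is to route every worker's label through a single `Display` implementation. The sketch below is hypothetical: the `Subject` type, the `log_line` helper, and the bracketed format are made up for this example and are not the format Hermes uses.

```rust
use std::fmt;

/// Hypothetical description of what a log line is about.
pub enum Subject {
    Client {
        src: String,
        dst: String,
        client_id: String,
    },
    Channel {
        src: String,
        dst: String,
        channel_id: String,
        port_id: String,
    },
}

impl fmt::Display for Subject {
    /// Every variant renders as "[kind src -> dst:identifier]".
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Subject::Client { src, dst, client_id } => {
                write!(f, "[client {} -> {}:{}]", src, dst, client_id)
            }
            Subject::Channel { src, dst, channel_id, port_id } => {
                write!(f, "[channel {} -> {}:{}/{}]", src, dst, channel_id, port_id)
            }
        }
    }
}

/// Build a log line with the same layout regardless of which worker emits it.
pub fn log_line(level: &str, target: &str, subject: &Subject, message: &str) -> String {
    format!("{} {}: {} {}", level, target, subject, message)
}
```

With a single formatting path, the client worker, foreign client, and channel worker lines listed above would differ only in the `kind` and identifier, not in quoting, bracket style, or arrow spacing.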

romac · 2021-06-02T11:49:56Z

Further work on this tracked in #1035

* Added details about the help command in the guide
* Bump version to 0.4.0
* Update guide to account for `start-multi` being promoted to `start`
* Fix changelog
* Document telemetry section of config file
* Fixup documentation for global section of configuration file
* Document type of each config option
* Remove unused config default method
* Guide update for the query clients method.
* Typo fix
* Re-add Cargo.lock for proto-compiler crate
* Document addition of `host` param to telemetry config
* Document telemetry service
* Update changelog with telemetry
* Add changelog entry for #1026
* Channel worker updates
* Add missing files
* Update feature matrix
* Update mdbook to v0.4.7
* Update mdbook to v0.4.9
* Add cosmos-sdk versions supported
* Highlight compat info
* Write summary of 0.4.0 release

Co-authored-by: Romain Ruetschi <romain@informal.systems>
Co-authored-by: Anca Zamfir <zamfiranca@gmail.com>


adizere added the A: bug Admin: something isn't working label Jun 1, 2021

adizere added this to the 05.2021 milestone Jun 1, 2021

adizere assigned romac Jun 1, 2021

romac mentioned this issue Jun 1, 2021

Restart the event monitor event loop to reinitialize the subscriptions Stream after a restart #1027

Merged

5 tasks

adizere mentioned this issue Jun 2, 2021

Enable logging in tests and fix error log flooding from mock chain runtime #1017

Merged

5 tasks

ancazamfir closed this as completed in #1027 Jun 2, 2021

romac mentioned this issue Jun 2, 2021

Improve resilience to nodes going down #1035

Closed

13 tasks

romac added a commit that referenced this issue Jun 2, 2021

Add changelog entry for #1026

2eac3c7

This was referenced Sep 10, 2021

Test scenarios for multi-network IBC #897

Closed

Chain upgrade requirements for Hermes (meta issue) #1209

Open
