Improve resilience to nodes going down #1035

Closed
8 of 13 tasks
romac opened this issue Jun 2, 2021 · 2 comments
Labels
  • A: help-wanted (Admin: extra attention is needed, good for seniors)
  • I: guide (Internal: issues with the Hermes guide)
  • I: logic (Internal: related to the relaying logic)
  • I: rpc (Internal: related to (g)RPC)
  • O: new-feature (Objective: cause to add a new feature or support)
  • O: usability (Objective: cause to improve the user experience (UX) and ease using the product)
Milestone

Comments

@romac
Member

romac commented Jun 2, 2021

From @ancazamfir's comment:

A few thoughts on Hermes resilience to nodes going down:

  • since events may have been missed, workers should "reset": e.g. unipath should clear its packets, client workers should recheck updates, etc.
  • events that would have spawned new workers may also have been missed
  • we need to analyze how to handle websocket vs. RPC vs. gRPC failures. These services can be configured to use different nodes, but we know that events are vital for the passive relayer. So maybe just stopping all affected workers and rescanning state at reconnect time would be ok (similar to removing and re-adding a chain)
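
A minimal sketch of the "stop all workers and rescan at reconnect" idea from the last bullet. This is not the Hermes implementation: `Object`, `Kind`, `Worker`, `Supervisor` and `on_reconnect` are hypothetical names, and the `scanned` argument stands in for whatever a real state scan over RPC/gRPC would return.

```rust
use std::collections::BTreeMap;

/// Hypothetical kind of relayed object.
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum Kind {
    Client,
    Channel,
}

/// Hypothetical identifier of the object a worker is responsible for.
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct Object {
    pub chain: String,
    pub kind: Kind,
    pub id: String,
}

/// Hypothetical worker state: work derived from events seen so far.
#[derive(Debug, Default)]
pub struct Worker {
    pub pending: Vec<String>,
}

/// Hypothetical supervisor owning one worker per object.
#[derive(Debug, Default)]
pub struct Supervisor {
    pub workers: BTreeMap<Object, Worker>,
}

impl Supervisor {
    /// Called when the event source for `chain` reconnects.
    /// Events may have been missed while disconnected, so every worker
    /// for that chain is dropped and recreated from freshly scanned state.
    pub fn on_reconnect(&mut self, chain: &str, scanned: Vec<Object>) {
        // Stop all workers tied to the chain; their pending work may be stale.
        self.workers.retain(|object, _| object.chain != chain);

        // Respawn one fresh worker per scanned object. This also covers
        // objects whose "spawn" event was missed during the outage.
        for object in scanned {
            self.workers.entry(object).or_default();
        }
    }
}
```

Workers for other chains are left untouched here. Whether a websocket failure alone should trigger this, or only a combined failure of the configured services, is exactly the analysis the bullet above asks for.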

Testing

  • the test with dev-env is fine for testing some basic things, but it is not enough.
  • this test simulates chains stopping and restarting (e.g. a chain upgrade with reset height). Normally this should happen with a higher version number, so for Hermes these are new chains (e.g. the ibc-0 node is upgraded and restarted as ibc-1)
  • ?? if the gaiad restart happens fast enough, we are left with some workers from the old instance for channels/clients that don't exist anymore.
  • instead we should use gaia manager to set up a test environment for these scenarios (we would need two full nodes: one to bring down, used by hermes start for events, queries, etc., and one used by another Hermes instance that makes IBC state changes):

Other issues I noticed while debugging #1026:

  • logs are formatted very inconsistently (see the examples below); we should enforce a consistent span on all logs.
    • DEBUG ibc_relayer::worker::client: client 'ibc-1 -> ibc-0:07-tendermint-0'
    • DEBUG ibc_relayer::foreign_client: [ibc-0 -> ibc-1:07-tendermint-0]
    • INFO ibc_relayer::worker: [channel-0/transfer:ibc-0->ibc-1]
  • packet workers retry and exit, client workers do not
  • channel worker is not yet integrated with telemetry
  • channel worker does not have an outer retry loop (it retries for individual commands)
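
To make the last two retry points concrete, here is a rough sketch of an outer retry loop wrapped around a whole worker step, as opposed to retrying individual commands. It is an illustration only: `StepError` and `run_with_outer_retry` are hypothetical names, not Hermes APIs, and a real loop would also sleep or back off between attempts.

```rust
/// Hypothetical error type for one full worker step.
#[derive(Debug, PartialEq, Eq)]
pub enum StepError {
    /// Transient failure, e.g. the node is temporarily unreachable.
    Retryable(String),
    /// Permanent failure, e.g. the channel no longer exists.
    Fatal(String),
}

/// Runs `step` up to `max_attempts` times.
/// Returns the 1-based attempt number on which the step succeeded.
/// A fatal error stops immediately; retryable errors are retried until the
/// attempts run out, after which the last retryable error is returned.
pub fn run_with_outer_retry<F>(max_attempts: u32, mut step: F) -> Result<u32, StepError>
where
    F: FnMut(u32) -> Result<(), StepError>,
{
    // Only returned as-is if the loop below never runs.
    let mut last = StepError::Fatal("max_attempts was zero".to_string());

    for attempt in 1..=max_attempts {
        match step(attempt) {
            Ok(()) => return Ok(attempt),
            // Give up at once: retrying cannot fix a permanent failure.
            Err(StepError::Fatal(msg)) => return Err(StepError::Fatal(msg)),
            // Remember the transient error and try the whole step again.
            Err(retryable) => last = retryable,
        }
    }

    Err(last)
}
```

With a shared helper along these lines, packet, client and channel workers could follow the same policy (retry a bounded number of times, then exit) instead of each behaving differently.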

For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate milestone (priority) applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@romac added the O: new-feature, A: help-wanted, I: logic, O: usability, I: guide, and I: rpc labels on Jun 2, 2021
@romac changed the title from "Improve resilience of relayer" to "Improve resilience to nodes going down" on Jun 2, 2021
@adizere added this to the 08.2021 milestone on Aug 3, 2021
@adizere
Member

adizere commented Aug 3, 2021

Some of these issues were handled (e.g., #1205, #1138).

@adizere changed the milestone from 08.2021 to 09.2021 on Sep 6, 2021
@adizere
Member

adizere commented Sep 10, 2021

Closing as all issues have either been handled or are tracked separately in more appropriate meta-issues.

@adizere closed this as completed on Sep 10, 2021
2 participants