Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consolidation of ICS3 TLA+ spec #108

Closed
4 tasks done
adizere opened this issue Jun 18, 2020 · 5 comments · Fixed by #109
Closed
4 tasks done

Consolidation of ICS3 TLA+ spec #108

adizere opened this issue Jun 18, 2020 · 5 comments · Fixed by #109
Assignees
Milestone

Comments

@adizere
Copy link
Member

adizere commented Jun 18, 2020

Summary

This repository should maintain a single version of the ICS3 TLA+ spec.

Problem Definition

We were planning to maintain two versions of the ICS3 TLA+ spec: (1) a default version of the spec, plus (2) a version that exhibits a liveness problem.

The motivation for the first version was to model (and guide the implementation of) a correct connection handshake protocol. The motivation for the second version was to provide the context for certain protocol changes surrounding UpdateClient.

Having two versions, however, puts additional maintenance burden and may be unnecessary.

Proposal

Current state of the work on ICS3 TLA+ spec:

Future, ideal state [! to be clarified]:

We'd like to keep a single version of the ICS3 TLA+ spec. This version should:

  • provide the context for the liveness problem, hence it will deadlock,
  • include the store.latestHeight + 1 fix highlighted here.
  • model check & document results with heights >= 6.

For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@adizere adizere added this to the 0.6-6mo milestone Jun 18, 2020
@adizere adizere self-assigned this Jun 18, 2020
@ebuchman
Copy link
Member

provide the context for the liveness problem, hence it will deadlock,

What does this mean?

Also, in my understanding, this UpdateClient relayer race isn't a problem for the connection itself, it's a problem for a particular relayer algorithm - ie. the existence of the race is a low-level detail of how a relayer might be implemented that the Connection shouldn't care about. I would think we'd want a representation of the Connection that doesn't depend on such low level details of individual relayers, but rather on global properties of the relayer network, that wouldn't even be able to see the existence of such races, so long as properties like "the client is eventually updated, but we don't know with which height" are upheld.

That said I haven't reviewed the TLA+ spec enough yet.

@adizere
Copy link
Member Author

adizere commented Jun 22, 2020

Both versions of the ICS3 (Connection) TLA+ specs assume a high-level relayer abstraction that works like this, roughly speaking:

  1. some module pushes a message m to its outgoing buffer (here m may include a proof that the module created);
  2. the relayer (called environment) pops m;
  3. the relayer may update the destination chain if there is a proof inside m;
  4. then the relayer submits m to the input buffer of the destination module;
  5. the destination module delivers m.

In ICS3 current specs, there are no formal properties that the relayer should enforce (those properties are something we're still trying to agree on..) except for the behavior described above.

It's true that the bug is not in ICS3, but it's a bug of the "global properties of the relayer network" indeed. It's not clear in which ICS exactly this bug belongs.

What does this mean?

To "provide the context for the liveness problem" means that the disclosure log can reference this spec (i.e., this spec has enough details to describe and recreate the liveness problem; the disclosure log can quote it, so to speak). Since there is no ICS where this bug officially belongs, I classified it under a "bug that causes ICS3 liveness problems". More details here: #83.

@ebuchman
Copy link
Member

ebuchman commented Jun 22, 2020

It's true that the bug is not in ICS3, but it's a bug of the "global properties of the relayer network" indeed. It's not clear in which ICS exactly this bug belongs.

My understanding is that it's not even a bug of the global relayer network, but just of an arbitrary individual relayer. The race doesn't effect the high level properties around ensuring the packets from an outbound queue make it into an egress queue. It just impacts which relayer ends up responsible for it and how. So I would think it's actually a problem with ICS18, or with an assumption that a particular relayer will always succeed in submitting an UpdateClient, and we just need to be clear that a relayer needs to be aware this race can happen and work around it in the straight forward way (ie. an UpdateClient will succeed, but which relayer it comes from can't be guaranteed).

@milosevic
Copy link
Contributor

Agree with Bucky. The ClientUpdate liveness problem is a relayer issue, not connection handshake problem. The connection handshake protocol should assume (model) that relayer(s) should ensure that for every valid message m with a proof p, the corresponding p.height is installed on the destination before message m is sent. Modelling this could potentially be simplified by abstracting away completely client update mechanism, i.e., assuming that messages sent by modules are always valid, i.e., we don't need to verify proofs.

@adizere
Copy link
Member Author

adizere commented Jun 23, 2020

This PR is not intended to provide the full context of the ClientUpdate liveness problem.
Instead, this spec models the ICS3 protocol as it was designed to be -- that is, comprising this bug. There is now a flag in the spec, called Concurrency which:

  • if TRUE, then the environment can non-deterministically advance the height or update the light client of a chain.
    This configuration simulates a liveness problem caused by the way UpdateClient is used by relayers, and will lead the model to stutter. See more details in the disclosure log.
    To be clear: the stuttering is not caused by a bug in the ICS3 protocol itself; this model simply captures the original (faulty) algorithms surrounding the ICS3 protocol.
  • if FALSE, then the model should check correctly. - if FALSE, then the model should check correctly.

@adizere adizere linked a pull request Jul 3, 2020 that will close this issue
6 tasks
@adizere adizere closed this as completed Jul 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants