
Propagate light node sync errors to RPC caller #1783

Merged: 1 commit into Conflux-Chain:master on Aug 17, 2020

Conversation

@Thegaram (Contributor) commented Aug 17, 2020

Overview

Currently, when the light node sync layer encounters an error (e.g. an invalid proof), it does not report the error to the RPC caller but silently retries. The caller will eventually time out, but there is no way to tell what caused the timeout.

This PR allows the sync layer to propagate errors to the RPC layer. Note that even if an RPC call fails, the sync layer will keep retrying the query; this way, a repeated RPC call can succeed with reduced latency.

Architecture

Currently, light node modules use two separate runtimes.

The sync layer runs on top of our mio event loop, similar to full nodes. The query modules (RPC, QueryService), on the other hand, run on a tokio futures executor. In terms of query workflow, an RPC request notifies the sync layer to retrieve the corresponding item. When the item has been received and verified, all futures waiting for it are notified (woken).
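To make the workflow concrete, here is a minimal sketch of the RPC side bridging the two runtimes. This is not the repo's actual API: `request` and the `FutureItem::new` signature are assumptions, and `FutureItem` itself is introduced below.

```rust
// Hedged sketch of an RPC handler; `request` and `FutureItem::new`
// are hypothetical names/signatures, not the actual API in this PR.
async fn transaction_by_hash(
    query: &QueryService,
    hash: H256,
) -> Result<SignedTransaction, ClonableError> {
    // Notify the sync layer (running on the mio event loop) to retrieve
    // the transaction from peers.
    query.tx_handler.request(hash);

    // Await the shared cache entry on the tokio executor. The sync layer
    // wakes this future once the item is received and verified, or once
    // verification fails.
    FutureItem::new(hash, query.tx_handler.verified.clone()).await
}
```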

This wake-up notification is done through a shared-memory cache. For instance, the transactions sync handler has a field:

```rust
verified: Arc<RwLock<LruCache<H256, PendingItem<SignedTransaction, ClonableError>>>>,
```

Originally, PendingItem only had two states: Pending and Ready. This PR extends it with a third state, Error. The corresponding future, FutureItem, will now resolve to Result<T, Err> instead of just T.
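A minimal, self-contained sketch of the extended mechanism (simplified to a single cache slot instead of the LruCache above; the exact names and bounds in the repo may differ):

```rust
use std::{
    future::Future,
    pin::Pin,
    sync::{Arc, Mutex},
    task::{Context, Poll, Waker},
};

// The cache entry: Pending while the sync layer retrieves the item,
// then Ready on success or (new in this PR) Error on failure.
enum PendingItem<T, E> {
    Pending(Vec<Waker>), // futures waiting for this item
    Ready(T),
    Error(E),
}

// A future handed to RPC handlers, backed by the shared cache slot.
struct FutureItem<T, E> {
    slot: Arc<Mutex<PendingItem<T, E>>>,
}

impl<T: Clone, E: Clone> Future for FutureItem<T, E> {
    // Before this PR, the future resolved to plain `T`.
    type Output = Result<T, E>;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        match &mut *self.slot.lock().unwrap() {
            PendingItem::Pending(wakers) => {
                // Register ourselves; the sync layer calls wake() on these
                // wakers when it fills the slot.
                wakers.push(cx.waker().clone());
                Poll::Pending
            }
            PendingItem::Ready(item) => Poll::Ready(Ok(item.clone())),
            PendingItem::Error(e) => Poll::Ready(Err(e.clone())),
        }
    }
}
```

Note that both `T` and `E` must be `Clone`, since the same cached value or error may be handed to multiple waiting futures; this is exactly why the error type needs to be clonable (see below).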

Implementation challenges

error-chain errors are neither Clone [1] [2] nor Sync [3] [4]. Unfortunately, we need both. We need Clone because the same error is consumed by the sync handler (to disconnect the peer) and by one or more RPC handlers (to report the error or retry). We need Sync because, with a tokio executor, futures can be scheduled on different threads at different points in time.

As a hacky workaround, I use a wrapper when I need clonable errors:

```rust
pub struct ClonableError(Arc<Mutex<Error>>);

impl Into<Error> for ClonableError { ... }
impl From<Error> for ClonableError { ... }
```

... and when I need to use it as a normal Error, I wrap it into a special error kind:

```rust
ClonableErrorWrapper(error: ClonableError) {
    description("Clonable error"),
    display("{:?}", error.0.lock().to_string()),
}
```
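For completeness, a sketch of how the elided conversions might look, assuming `Error` and `ErrorKind` are the types generated by `error_chain!` and `ErrorKind::ClonableErrorWrapper` is the variant defined above. The `Arc` makes cloning cheap, and the `Mutex` restores `Sync` (a `Mutex<T>` is `Sync` whenever `T` is `Send`):

```rust
use std::sync::{Arc, Mutex};

#[derive(Clone)]
pub struct ClonableError(Arc<Mutex<Error>>);

impl From<Error> for ClonableError {
    fn from(e: Error) -> Self {
        // Share the original error behind Arc<Mutex<..>> so it can be
        // consumed by the sync handler and several RPC handlers.
        ClonableError(Arc::new(Mutex::new(e)))
    }
}

impl Into<Error> for ClonableError {
    fn into(self) -> Error {
        // Re-enter the error-chain hierarchy through the wrapper kind,
        // so callers can treat the shared error like any other Error.
        ErrorKind::ClonableErrorWrapper(self).into()
    }
}
```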

The other changes in this PR mostly add more detail to various error types, plus some refactoring.



@sparkmiw (Contributor) left a comment

:lgtm:

Reviewed 18 of 18 files at r1.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @peilun-conflux and @yangzhe1990)

@Thegaram Thegaram merged commit 34a5db8 into Conflux-Chain:master Aug 17, 2020
@Thegaram Thegaram deleted the light-node-error-refactor branch August 17, 2020 14:52