
Propagate light node sync errors to RPC caller #1783

Merged: 1 commit into Conflux-Chain:master on Aug 17, 2020

Conversation

@Thegaram (Contributor) commented Aug 17, 2020

Overview

Currently, when the light node sync layer encounters an error (e.g. an invalid proof), it does not report the error to the RPC caller but silently retries. The caller will eventually time out, but there is no way to tell what caused the timeout.

This PR allows the sync layer to propagate errors to the RPC layer. Note that even if an RPC call fails, the sync layer will keep retrying the query; this way, a repeated RPC call can succeed with reduced latency.

Architecture

Currently, light node modules use two separate runtimes.

The sync layer runs on top of our mio event loop, similar to full nodes. The query modules (RPC, QueryService), on the other hand, run on a tokio futures executor. In terms of query workflow, an RPC request notifies the sync layer to retrieve the corresponding item. When the item has been received and verified, all futures waiting for it are notified (woken).
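To make the workflow concrete, here is a minimal sketch of the RPC side bridging the two runtimes. This is not the repo's actual API: `request` and the `FutureItem::new` signature are assumptions, and `FutureItem` itself is introduced below.

```rust
// Hedged sketch of an RPC handler; `request` and `FutureItem::new`
// are hypothetical names/signatures, not the actual API in this PR.
async fn transaction_by_hash(
    query: &QueryService,
    hash: H256,
) -> Result<SignedTransaction, ClonableError> {
    // Notify the sync layer (running on the mio event loop) to retrieve
    // the transaction from peers.
    query.tx_handler.request(hash);

    // Await the shared cache entry on the tokio executor. The sync layer
    // wakes this future once the item is received and verified, or once
    // verification fails.
    FutureItem::new(hash, query.tx_handler.verified.clone()).await
}
```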

This wake-up notification is done through a shared-memory cache. For instance, the transactions sync handler has a field:

```rust
verified: Arc<RwLock<LruCache<H256, PendingItem<SignedTransaction, ClonableError>>>>,
```

Originally, PendingItem only had two states: Pending and Ready. This PR extends it with a third state, Error. The corresponding future, FutureItem, will now resolve to Result<T, Err> instead of just T.
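A minimal, self-contained sketch of the extended mechanism (simplified to a single cache slot instead of the LruCache above; the exact names and bounds in the repo may differ):

```rust
use std::{
    future::Future,
    pin::Pin,
    sync::{Arc, Mutex},
    task::{Context, Poll, Waker},
};

// The cache entry: Pending while the sync layer retrieves the item,
// then Ready on success or (new in this PR) Error on failure.
enum PendingItem<T, E> {
    Pending(Vec<Waker>), // futures waiting for this item
    Ready(T),
    Error(E),
}

// A future handed to RPC handlers, backed by the shared cache slot.
struct FutureItem<T, E> {
    slot: Arc<Mutex<PendingItem<T, E>>>,
}

impl<T: Clone, E: Clone> Future for FutureItem<T, E> {
    // Before this PR, the future resolved to plain `T`.
    type Output = Result<T, E>;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        match &mut *self.slot.lock().unwrap() {
            PendingItem::Pending(wakers) => {
                // Register ourselves; the sync layer calls wake() on these
                // wakers when it fills the slot.
                wakers.push(cx.waker().clone());
                Poll::Pending
            }
            PendingItem::Ready(item) => Poll::Ready(Ok(item.clone())),
            PendingItem::Error(e) => Poll::Ready(Err(e.clone())),
        }
    }
}
```

Note that both `T` and `E` must be `Clone`, since the same cached value or error may be handed to multiple waiting futures; this is exactly why the error type needs to be clonable (see below).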

Implementation challenges

error-chain errors are neither Clone [1] [2] nor Sync [3] [4]. Unfortunately, we need both. We need Clone because the same error is consumed by the sync handler (to disconnect the peer) and by one or more RPC handlers (to report the error or retry). We need Sync because, with a tokio executor, futures can be scheduled on different threads at different points in time.

As a hacky workaround, I use a wrapper when I need clonable errors:

```rust
pub struct ClonableError(Arc<Mutex<Error>>);

impl Into<Error> for ClonableError { ... }
impl From<Error> for ClonableError { ... }
```

... and when I need to use it as a normal Error, I wrap it into a special error kind:

```rust
ClonableErrorWrapper(error: ClonableError) {
    description("Clonable error"),
    display("{:?}", error.0.lock().to_string()),
}
```
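For completeness, a sketch of how the elided conversions might look, assuming `Error` and `ErrorKind` are the types generated by `error_chain!` and `ErrorKind::ClonableErrorWrapper` is the variant defined above. The `Arc` makes cloning cheap, and the `Mutex` restores `Sync` (a `Mutex<T>` is `Sync` whenever `T` is `Send`):

```rust
use std::sync::{Arc, Mutex};

#[derive(Clone)]
pub struct ClonableError(Arc<Mutex<Error>>);

impl From<Error> for ClonableError {
    fn from(e: Error) -> Self {
        // Share the original error behind Arc<Mutex<..>> so it can be
        // consumed by the sync handler and several RPC handlers.
        ClonableError(Arc::new(Mutex::new(e)))
    }
}

impl Into<Error> for ClonableError {
    fn into(self) -> Error {
        // Re-enter the error-chain hierarchy through the wrapper kind,
        // so callers can treat the shared error like any other Error.
        ErrorKind::ClonableErrorWrapper(self).into()
    }
}
```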

The other changes in this PR mostly add more detail to various error types, plus some refactoring.



@sparkmiw (Contributor) left a comment

:lgtm:

Reviewed 18 of 18 files at r1.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @peilun-conflux and @yangzhe1990)

@Thegaram Thegaram merged commit 34a5db8 into Conflux-Chain:master Aug 17, 2020
@Thegaram Thegaram deleted the light-node-error-refactor branch August 17, 2020 14:52