
Fixes the net_dynamic_hb test #372

Merged · 9 commits · Jan 3, 2019

Conversation

vkomenda (Contributor):

Fixes #369.

The fix mostly amounts to improving how removed nodes are handled in VirtualNet.

The sender queue algorithm now flags itself once it has been removed. The user can check the is_removed flag to decide when to shut down the node.
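For illustration, a minimal self-contained sketch of that pattern (toy types and method signatures, not the real hbbft API):

```rust
// Toy model of the shutdown pattern: the algorithm sets a flag once it
// observes its own removal, and the caller polls the flag to shut down.
struct SenderQueue {
    is_removed: bool,
}

impl SenderQueue {
    /// Handles one message; sets the flag if the message proves our removal.
    fn handle_message(&mut self, removes_us: bool) {
        if removes_us {
            self.is_removed = true;
        }
    }

    fn is_removed(&self) -> bool {
        self.is_removed
    }
}

fn main() {
    let mut sq = SenderQueue { is_removed: false };
    sq.handle_message(true);
    if sq.is_removed() {
        // The user is now free to drop the instance or restart the node.
        println!("node removed; shutting down");
    }
}
```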

vkomenda requested review from mbr and afck on December 27, 2018.
```rust
if let Some(q) = self.outgoing_queue.get(id) {
    if q.keys().any(|epoch| epoch <= last_epoch) {
        if id == self.our_id() {
            // TODO: Schedule our own shutdown?
```
afck (Collaborator):

I think it's fine the way it is; the user can now drop the object or restart.

```diff
@@ -121,7 +131,8 @@ fn do_drop_and_readd(cfg: TestConfig) {
     // We generate a list of transactions we want to propose, for each node. All nodes will
     // propose a number between 0..total_txs, chosen randomly.
-    let mut queues: collections::BTreeMap<_, Vec<usize>> = net
+    let mut queues: BTreeMap<_, Vec<usize>> = state
+        .get_net_mut()
```
afck (Collaborator):

Why is get_net_mut required? (Also below.)

```rust
if node_id != pivot_node_id
    && awaiting_addition_input.contains(&node_id)
    && state.shutdown_epoch.is_some()
    && era + hb_epoch == state.shutdown_epoch.unwrap()
```
afck (Collaborator):

That restriction shouldn't be necessary, I think? We can vote for re-adding in any epoch.

vkomenda (Author):

It fails without the validators waiting for this epoch. Do you think this is incorrect?

afck (Collaborator):

Not sure. At least it shouldn't fail with >= instead of ==, I think?

vkomenda (Author):

That's right. It doesn't fail with >=, but it does fail without the condition altogether, that is, if the re-add input arrives before the node is removed. The condition could be replaced with a query of the pivot node's is_removed property, but that would technically be cheating: a validator should be able to decide without knowing the pivot node's local state.

afck (Collaborator):

Right, I'd just use >= then. 👍
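For concreteness, the relaxed check could look like this. This is a sketch only: the wrapper function and its parameter types are invented for illustration, while the variable names mirror the quoted test code.

```rust
use std::collections::BTreeSet;

/// Returns true once `node_id` should input the re-add vote: any epoch at
/// or after the recorded shutdown epoch qualifies, not only exact equality.
fn ready_to_readd(
    node_id: usize,
    pivot_node_id: usize,
    awaiting_addition_input: &BTreeSet<usize>,
    shutdown_epoch: Option<u64>,
    era: u64,
    hb_epoch: u64,
) -> bool {
    node_id != pivot_node_id
        && awaiting_addition_input.contains(&node_id)
        // `map_or` also replaces the `is_some()` + `unwrap()` pair.
        && shutdown_epoch.map_or(false, |se| era + hb_epoch >= se)
}

fn main() {
    let awaiting: BTreeSet<usize> = [2, 3].into_iter().collect();
    assert!(ready_to_readd(2, 0, &awaiting, Some(10), 8, 3)); // 11 >= 10
    assert!(!ready_to_readd(2, 0, &awaiting, Some(10), 8, 1)); // 9 < 10
}
```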

mbr (Contributor) left a comment:

I still need to review the bulk of the actual test changes; submitting now so as not to tread on @afck's work.

I am a bit concerned about the filter that removes messages addressed to removed nodes; it seems to strip almost all of the usefulness from the error that gets thrown when messages are sent to non-existent nodes.

```rust
    I: Iterator<Item = N>,
{
    if !self.is_removed {
        // TODO: return an error?
```
mbr (Contributor):

We should definitely return an error if this is a possible occurrence. Otherwise, maybe turn this into an assert!?

It's not a case of an idempotent operation being okay if called twice, though, right? is_removed won't be affected by this function at all?

vkomenda (Author):

is_removed is reset to false by this function as a result of calling SenderQueueBuilder::build, so calling it twice in a row is an error.

At the moment the SenderQueue error type is the same as the underlying DistAlgorithm error type. Adding an error variant to dynamic_honey_badger::error::Error just for the sender queue doesn't look optimal; I may have to introduce a wrapper for errors in sender_queue.

mbr (Contributor):

That sounds like an unfortunate situation. The way you describe it, a proper error type for the sender queue is probably correct though.
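A wrapper of that kind might look roughly like the following sketch. It assumes the sender queue wants its own failure variants alongside the wrapped algorithm's error; the actual type added to src/sender_queue/error.rs in this PR may differ.

```rust
use std::fmt;

/// Sketch of a sender queue error type that wraps the underlying
/// algorithm's error while leaving room for queue-specific failures.
#[derive(Debug)]
enum Error<E: fmt::Debug> {
    /// An error raised by the wrapped DistAlgorithm.
    Algo(E),
    /// The sender queue was used again after the node had been removed.
    AlreadyRemoved,
}

impl<E: fmt::Debug> From<E> for Error<E> {
    fn from(err: E) -> Self {
        Error::Algo(err)
    }
}
```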

(Resolved review threads on tests/net/err.rs and tests/net/mod.rs.)
vkomenda (Author):

I'm finished responding to comments. It turned out that some new functions in net_dynamic_hb weren't required; I refactored those out.

> I am a bit concerned about the filter that removes messages addressed to removed nodes; it seems to strip almost all of the usefulness from the error that gets thrown when messages are sent to non-existent nodes.

Only messages to removed nodes are filtered out. If a node disappears without having been removed via the provided remove_node method, a message to it still triggers an error in crank.
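A self-contained sketch of that dispatch rule, with toy types and integer IDs standing in for the real VirtualNet:

```rust
use std::collections::BTreeSet;

#[derive(Debug, PartialEq)]
enum Delivery {
    Delivered,
    Dropped,
}

struct VirtualNet {
    nodes: BTreeSet<usize>,
    removed: BTreeSet<usize>,
}

impl VirtualNet {
    /// Messages to nodes removed via `remove_node` are silently dropped;
    /// messages to IDs that were never known still fail the crank.
    fn dispatch(&self, dest: usize) -> Result<Delivery, String> {
        if self.removed.contains(&dest) {
            Ok(Delivery::Dropped) // stale message of an old "lifetime"
        } else if self.nodes.contains(&dest) {
            Ok(Delivery::Delivered)
        } else {
            Err(format!("message to non-existent node {}", dest))
        }
    }

    /// Moves a node into the removed set, as the test's remove_node does.
    fn remove_node(&mut self, id: usize) {
        if self.nodes.remove(&id) {
            self.removed.insert(id);
        }
    }
}

fn main() {
    let mut net = VirtualNet {
        nodes: [0, 1, 2].into_iter().collect(),
        removed: BTreeSet::new(),
    };
    net.remove_node(2);
    assert_eq!(net.dispatch(2), Ok(Delivery::Dropped)); // silently dropped
    assert!(net.dispatch(9).is_err()); // never existed: still an error
}
```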

(Resolved review thread on src/sender_queue/error.rs.)

```diff
 if node_id != pivot_node_id
     && awaiting_addition_input.contains(&node_id)
     && state.shutdown_epoch.is_some()
     && era + hb_epoch == state.shutdown_epoch.unwrap()
 {
     // Now we can add the node again. Public keys will be reused.
-    let step = state
-        .get_net_mut()
+    let _ = state
```
afck (Collaborator):

This is still a Step, isn't it? Let's assert! that its output is empty instead of dropping it right away.
(We don't actually expect the step to be completely empty, because it will contain a SignedVote message, right?)
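The suggested assertion could look like this sketch, with a toy Step that has only output and messages fields (the real hbbft Step also carries fault logs):

```rust
/// Toy stand-in for the algorithm's `Step`.
struct Step {
    output: Vec<u64>,
    messages: Vec<&'static str>,
}

fn main() {
    let step = Step {
        output: vec![],
        messages: vec!["SignedVote"], // the one message we do expect
    };
    // Assert on the part that must be empty instead of `let _ = ...`:
    assert!(step.output.is_empty(), "re-add input must not yield output");
    assert!(!step.messages.is_empty()); // the vote should be on its way
}
```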

mbr (Contributor) commented on Dec 30, 2018:

> Now I started thinking that the messages should be removed earlier, at dispatch time. That way we can guarantee that no messages to the removed node arrive at it when it is finally re-added. That would have been an undesirable side effect.

I think it's problematic to add these automatic removal features to the simulated networking code, as this seems a bit like fixing the tests instead of fixing the code. We do have to decide what we want VirtualNet to be; at the moment it is meant to be stricter than an actual network.

In a real networking environment, packets destined for non-existent nodes would either be silently dropped or be handled by an upper layer that provides the encrypted, reliable peer connections. Inside VirtualNet, any stray packet is considered an error, though, because we aim to implement the algorithms in such a way that these guaranteed-useless messages are never generated in the first place.

Now there are some cases in which messages into the void are unavoidable, because messages are delivered and acted upon in essentially random order. This should be a special case, though, which could be handled by simply adding a flag to VirtualNet that, when set, makes it okay for messages to non-existent nodes to be discarded. The flag could be turned on and off for short intervals. Keep in mind that we would probably still need to purge the queue of "stale" messages meant for the old node if we are reusing the node ID, because in a real-life situation those messages would simply be dropped by lower-level layers due to failing signatures/connections.

At that point we have to question the merits of reusing the ID, though. Furthermore, as @afck pointed out, due to changes in the algorithm model itself, the test would nowadays probably be better served by removing a (random) subset of validators and adding a random number of new ones, instead of just a single one.

To summarize: I think adding all these special filters to make the code work for this specific test adds a lot of complexity and bears little value. The only remaining value would be catching algorithms that generate packets for IDs that were never valid. We should either drop the error-on-nonexistent-ID feature altogether, or add a way to turn it off or suspend it.
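A self-contained sketch of such a toggle; the type and flag names are invented for illustration:

```rust
/// Toy VirtualNet with a switch that temporarily downgrades "message to a
/// non-existent node" from a hard error to a silent drop.
struct VirtualNet {
    allow_unknown_destinations: bool,
}

impl VirtualNet {
    fn set_allow_unknown_destinations(&mut self, allow: bool) {
        self.allow_unknown_destinations = allow;
    }

    fn on_unknown_destination(&self, dest: usize) -> Result<(), String> {
        if self.allow_unknown_destinations {
            Ok(()) // tolerated: discard the message during this interval
        } else {
            Err(format!("message addressed to non-existent node {}", dest))
        }
    }
}

fn main() {
    let mut net = VirtualNet { allow_unknown_destinations: false };
    net.set_allow_unknown_destinations(true); // around the removal window
    assert!(net.on_unknown_destination(7).is_ok());
    net.set_allow_unknown_destinations(false); // strict mode again
    assert!(net.on_unknown_destination(7).is_err());
}
```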

afck (Collaborator) commented on Dec 30, 2018:

I agree with all of @mbr's points. ➕
But let's try to find a way to avoid turning this PR into a major tests/net extension, in addition to the test fix.

vkomenda (Author):

These are all very thoughtful comments, thanks @mbr and @afck.

> I think it's problematic to add these automatic removal features to the simulated networking code, as this seems a bit like fixing the tests instead of fixing the code.

Fixing the node removal feature in VirtualNet is probably the most important part of this PR. Without it, the test was struggling to fight off ghost messages addressed to the removed node that were still sitting in the VirtualNet queue. I would stress that this feature, based on a set of removed nodes, is more selective than a flag would have been: it only lets DynamicHoneyBadger work with messages of the matching "lifetime", and it does that without a major overhaul.

afck (Collaborator) left a comment:

I'm happy with merging this now. But let's keep the above points in mind. (Maybe we should create issues for them.)

Successfully merging this pull request may close: Fix sender queue messaging when a validator is removed and then rejoined (#369).