
Fixes the net_dynamic_hb test #372

Merged · 9 commits · Jan 3, 2019

Conversation

vkomenda (Contributor):

Fixes #369.

The fix mostly amounts to improving how removed nodes are handled in VirtualNet.

The sender queue algorithm now flags itself once it has been removed. The user can check the is_removed flag to decide when to shut down the node.
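For illustration, a minimal self-contained sketch of that pattern (toy types and method signatures, not the real hbbft API):

```rust
// Toy model of the shutdown pattern: the algorithm sets a flag once it
// observes its own removal, and the caller polls the flag to shut down.
struct SenderQueue {
    is_removed: bool,
}

impl SenderQueue {
    /// Handles one message; sets the flag if the message proves our removal.
    fn handle_message(&mut self, removes_us: bool) {
        if removes_us {
            self.is_removed = true;
        }
    }

    fn is_removed(&self) -> bool {
        self.is_removed
    }
}

fn main() {
    let mut sq = SenderQueue { is_removed: false };
    sq.handle_message(true);
    if sq.is_removed() {
        // The user is now free to drop the instance or restart the node.
        println!("node removed; shutting down");
    }
}
```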

vkomenda requested review from mbr and afck on December 27, 2018.
```rust
if let Some(q) = self.outgoing_queue.get(id) {
    if q.keys().any(|epoch| epoch <= last_epoch) {
        if id == self.our_id() {
            // TODO: Schedule our own shutdown?
```
afck (Collaborator):

I think it's fine the way it is; the user can now drop the object or restart.

```diff
@@ -121,7 +131,8 @@ fn do_drop_and_readd(cfg: TestConfig) {
     // We generate a list of transactions we want to propose, for each node. All nodes will
     // propose a number between 0..total_txs, chosen randomly.
-    let mut queues: collections::BTreeMap<_, Vec<usize>> = net
+    let mut queues: BTreeMap<_, Vec<usize>> = state
+        .get_net_mut()
```
afck (Collaborator):

Why is get_net_mut required? (Also below.)

```rust
if node_id != pivot_node_id
    && awaiting_addition_input.contains(&node_id)
    && state.shutdown_epoch.is_some()
    && era + hb_epoch == state.shutdown_epoch.unwrap()
```
afck (Collaborator):

That restriction shouldn't be necessary, I think? We can vote for re-adding in any epoch.

vkomenda (Author):

It fails without the validators waiting for this epoch. Do you think this is incorrect?

afck (Collaborator):

Not sure. At least it shouldn't fail with >= instead of ==, I think?

vkomenda (Author):

That's right. It doesn't fail with >=, but it does fail without the condition altogether, that is, if the re-add input arrives before the node is removed. The condition could be replaced with a query of the pivot node's is_removed property, but that would technically be cheating: a validator should be able to decide without knowing the pivot node's local state.

afck (Collaborator):

Right, I'd just use >= then. 👍
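For concreteness, the relaxed check could look like this. This is a sketch only: the wrapper function and its parameter types are invented for illustration, while the variable names mirror the quoted test code.

```rust
use std::collections::BTreeSet;

/// Returns true once `node_id` should input the re-add vote: any epoch at
/// or after the recorded shutdown epoch qualifies, not only exact equality.
fn ready_to_readd(
    node_id: usize,
    pivot_node_id: usize,
    awaiting_addition_input: &BTreeSet<usize>,
    shutdown_epoch: Option<u64>,
    era: u64,
    hb_epoch: u64,
) -> bool {
    node_id != pivot_node_id
        && awaiting_addition_input.contains(&node_id)
        // `map_or` also replaces the `is_some()` + `unwrap()` pair.
        && shutdown_epoch.map_or(false, |se| era + hb_epoch >= se)
}

fn main() {
    let awaiting: BTreeSet<usize> = [2, 3].into_iter().collect();
    assert!(ready_to_readd(2, 0, &awaiting, Some(10), 8, 3)); // 11 >= 10
    assert!(!ready_to_readd(2, 0, &awaiting, Some(10), 8, 1)); // 9 < 10
}
```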

mbr (Contributor) left a comment:

I still need to review the bulk of the actual test changes; submitting now so as not to tread on @afck's work.

I am a bit concerned about the filter that removes messages addressed to removed nodes; it seems to strip almost all of the usefulness from the error that gets thrown when messages are sent to non-existent nodes.

```rust
    I: Iterator<Item = N>,
{
    if !self.is_removed {
        // TODO: return an error?
```
mbr (Contributor):

We should definitely return an error if this is a possible occurrence. Otherwise, maybe turn this into an assert!?

It's not a case of an idempotent operation being okay if called twice, though, right? is_removed won't be affected by this function at all?

vkomenda (Author):

is_removed is reset to false by this function as a result of calling SenderQueueBuilder::build, so calling it twice in a row is an error.

At the moment the SenderQueue error type is the same as the underlying DistAlgorithm error type. Adding an error variant to dynamic_honey_badger::error::Error just for the sender queue doesn't look optimal; I may have to introduce a wrapper for errors in sender_queue.

mbr (Contributor):

That sounds like an unfortunate situation. The way you describe it, a proper error type for the sender queue is probably correct though.
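A wrapper of that kind might look roughly like the following sketch. It assumes the sender queue wants its own failure variants alongside the wrapped algorithm's error; the actual type added to src/sender_queue/error.rs in this PR may differ.

```rust
use std::fmt;

/// Sketch of a sender queue error type that wraps the underlying
/// algorithm's error while leaving room for queue-specific failures.
#[derive(Debug)]
enum Error<E: fmt::Debug> {
    /// An error raised by the wrapped DistAlgorithm.
    Algo(E),
    /// The sender queue was used again after the node had been removed.
    AlreadyRemoved,
}

impl<E: fmt::Debug> From<E> for Error<E> {
    fn from(err: E) -> Self {
        Error::Algo(err)
    }
}
```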

(Resolved review threads on tests/net/err.rs and tests/net/mod.rs.)
vkomenda (Author):

I'm finished responding to comments. It turned out that some new functions in net_dynamic_hb weren't required; I refactored those out.

> I am a bit concerned about the filter that removes messages addressed to removed nodes; it seems to strip almost all of the usefulness from the error that gets thrown when messages are sent to non-existent nodes.

Only messages to removed nodes are filtered out. If a node disappears without having been removed via the provided remove_node method, a message to it still triggers an error in crank.
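A self-contained sketch of that dispatch rule, with toy types and integer IDs standing in for the real VirtualNet:

```rust
use std::collections::BTreeSet;

#[derive(Debug, PartialEq)]
enum Delivery {
    Delivered,
    Dropped,
}

struct VirtualNet {
    nodes: BTreeSet<usize>,
    removed: BTreeSet<usize>,
}

impl VirtualNet {
    /// Messages to nodes removed via `remove_node` are silently dropped;
    /// messages to IDs that were never known still fail the crank.
    fn dispatch(&self, dest: usize) -> Result<Delivery, String> {
        if self.removed.contains(&dest) {
            Ok(Delivery::Dropped) // stale message of an old "lifetime"
        } else if self.nodes.contains(&dest) {
            Ok(Delivery::Delivered)
        } else {
            Err(format!("message to non-existent node {}", dest))
        }
    }

    /// Moves a node into the removed set, as the test's remove_node does.
    fn remove_node(&mut self, id: usize) {
        if self.nodes.remove(&id) {
            self.removed.insert(id);
        }
    }
}

fn main() {
    let mut net = VirtualNet {
        nodes: [0, 1, 2].into_iter().collect(),
        removed: BTreeSet::new(),
    };
    net.remove_node(2);
    assert_eq!(net.dispatch(2), Ok(Delivery::Dropped)); // silently dropped
    assert!(net.dispatch(9).is_err()); // never existed: still an error
}
```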

(Resolved review thread on src/sender_queue/error.rs.)

```diff
 if node_id != pivot_node_id
     && awaiting_addition_input.contains(&node_id)
     && state.shutdown_epoch.is_some()
     && era + hb_epoch == state.shutdown_epoch.unwrap()
 {
     // Now we can add the node again. Public keys will be reused.
-    let step = state
-        .get_net_mut()
+    let _ = state
```
afck (Collaborator):

This is still a Step, isn't it? Let's assert! that its output is empty instead of dropping it right away.
(We don't actually expect the step to be completely empty, because it will contain a SignedVote message, right?)
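The suggested assertion could look like this sketch, with a toy Step that has only output and messages fields (the real hbbft Step also carries fault logs):

```rust
/// Toy stand-in for the algorithm's `Step`.
struct Step {
    output: Vec<u64>,
    messages: Vec<&'static str>,
}

fn main() {
    let step = Step {
        output: vec![],
        messages: vec!["SignedVote"], // the one message we do expect
    };
    // Assert on the part that must be empty instead of `let _ = ...`:
    assert!(step.output.is_empty(), "re-add input must not yield output");
    assert!(!step.messages.is_empty()); // the vote should be on its way
}
```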

mbr (Contributor) commented on Dec 30, 2018:

> Now I started thinking that the messages should be removed earlier, at dispatch time. That way we can guarantee that no messages to the removed node arrive at it when it is finally re-added. That would have been an undesirable side effect.

I think it's problematic to add these automatic removal features to the simulated networking code, as this seems a bit like fixing the tests instead of fixing the code. We do have to decide what we want VirtualNet to be; at the moment it is meant to be stricter than an actual network.

In a real networking environment, packets destined for non-existent nodes would either be silently dropped or be handled by an upper layer that provides the encrypted, reliable peer connections. Inside VirtualNet, any stray packet is considered an error, though, because we aim to implement the algorithms in such a way that these guaranteed-useless messages are never generated in the first place.

Now there are some cases in which messages into the void are unavoidable, because messages are delivered and acted upon in essentially random order. This should be a special case, though, which could be handled by simply adding a flag to VirtualNet that, when set, makes it okay for messages to non-existent nodes to be discarded. The flag could be turned on and off for short intervals. Keep in mind that we would probably still need to purge the queue of "stale" messages meant for the old node if we are reusing the node ID, because in a real-life situation those messages would simply be dropped by lower-level layers due to failing signatures/connections.

At that point we have to question the merits of reusing the ID, though. Furthermore, as @afck pointed out, due to changes in the algorithm model itself, the test would nowadays probably be better served by removing a (random) subset of validators and adding a random number of new ones, instead of just a single one.

To summarize: I think adding all these special filters to make the code work for this specific test adds a lot of complexity and bears little value. The only remaining value would be catching algorithms that generate packets for IDs that were never valid. We should either drop the error-on-nonexistent-ID feature altogether, or add a way to turn it off or suspend it.
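A self-contained sketch of such a toggle; the type and flag names are invented for illustration:

```rust
/// Toy VirtualNet with a switch that temporarily downgrades "message to a
/// non-existent node" from a hard error to a silent drop.
struct VirtualNet {
    allow_unknown_destinations: bool,
}

impl VirtualNet {
    fn set_allow_unknown_destinations(&mut self, allow: bool) {
        self.allow_unknown_destinations = allow;
    }

    fn on_unknown_destination(&self, dest: usize) -> Result<(), String> {
        if self.allow_unknown_destinations {
            Ok(()) // tolerated: discard the message during this interval
        } else {
            Err(format!("message addressed to non-existent node {}", dest))
        }
    }
}

fn main() {
    let mut net = VirtualNet { allow_unknown_destinations: false };
    net.set_allow_unknown_destinations(true); // around the removal window
    assert!(net.on_unknown_destination(7).is_ok());
    net.set_allow_unknown_destinations(false); // strict mode again
    assert!(net.on_unknown_destination(7).is_err());
}
```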

afck (Collaborator) commented on Dec 30, 2018:

I agree with all of @mbr's points. ➕
But let's try to find a way to avoid turning this PR into a major tests/net extension, in addition to the test fix.

vkomenda (Author):

These are all very thoughtful comments, thanks @mbr and @afck.

> I think it's problematic to add these automatic removal features to the simulated networking code, as this seems a bit like fixing the tests instead of fixing the code.

Fixing the node removal feature in VirtualNet is probably the most important part of this PR. Without it, the test was struggling to fight off ghost messages addressed to the removed node that were still sitting in the VirtualNet queue. I would stress that this feature, based on a set of removed nodes, is more selective than a flag would have been: it only lets DynamicHoneyBadger work with messages of the matching "lifetime", and it does that without a major overhaul.

afck (Collaborator) left a comment:

I'm happy with merging this now. But let's keep the above points in mind. (Maybe we should create issues for them.)

Successfully merging this pull request may close: Fix sender queue messaging when a validator is removed and then rejoined (#369).