-
Notifications
You must be signed in to change notification settings - Fork 96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes the net_dynamic_hb test #372
Conversation
src/sender_queue/mod.rs
Outdated
if let Some(q) = self.outgoing_queue.get(id) { | ||
if q.keys().any(|epoch| epoch <= last_epoch) { | ||
if id == self.our_id() { | ||
// TODO: Schedule our own shutdown? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine the way it is; the user can now drop the object or restart.
tests/net_dynamic_hb.rs
Outdated
@@ -121,7 +131,8 @@ fn do_drop_and_readd(cfg: TestConfig) { | |||
|
|||
// We generate a list of transaction we want to propose, for each node. All nodes will propose | |||
// a number between 0..total_txs, chosen randomly. | |||
let mut queues: collections::BTreeMap<_, Vec<usize>> = net | |||
let mut queues: BTreeMap<_, Vec<usize>> = state | |||
.get_net_mut() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is get_net_mut
required? (Also below.)
tests/net_dynamic_hb.rs
Outdated
if node_id != pivot_node_id | ||
&& awaiting_addition_input.contains(&node_id) | ||
&& state.shutdown_epoch.is_some() | ||
&& era + hb_epoch == state.shutdown_epoch.unwrap() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That restriction shouldn't be necessary, I think? We can vote for re-adding in any epoch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It fails without the validators waiting for this epoch. Is this incorrect, you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure. At least it shouldn't fail with >=
instead of ==
, I think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right. It doesn't fail with >=
but does fail without this condition at all, that is if the readd input comes before the node is removed. I think that this condition can be replaced with a query of is_removed
property of the pivot node but that's technically cheating because the validator should be able to decide without knowing the local state of the pivot node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I'd just use >=
then. 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still need to review the bulk of the actual test changes, submitting now to not tread on @afck's work.
I am a bit concerned over the filter that filters messages for removed nodes, it seems to remove almost all usefulness from the error that gets thrown when messages are sent to non-existing nodes.
I: Iterator<Item = N>, | ||
{ | ||
if !self.is_removed { | ||
// TODO: return an error? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should definitely return an error if this is a possible occurance. Otherwise, maybe turn this into an assert!
?
It's not a case of an idempotent operation being okay if called twice though, right? is_removed
won't be affected by this function at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is_removed
is reset to false
by this function as a result of calling SenderQueueBuilder::build
. So, calling it twice in a row is an error.
At the moment the type of SenderQueue
error is the same as the type of underlying DistAlgorithm
error. It looks like adding an error variant in dynamic_honey_badger::error::Error
to be used in the sender queue is not optimal. Possibly I have to introduce a wrapper for errors in sender_queue
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds like an unfortunate situation. The way you describe it, a proper error type for the sender queue is probably correct though.
I'm finished responding to comments. It appears some new functions in
Only messages to removed nodes are filtered out. If the node is not removed using the provided method |
tests/net_dynamic_hb.rs
Outdated
if node_id != pivot_node_id | ||
&& awaiting_addition_input.contains(&node_id) | ||
&& state.shutdown_epoch.is_some() | ||
&& era + hb_epoch == state.shutdown_epoch.unwrap() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure. At least it shouldn't fail with >=
instead of ==
, I think?
tests/net_dynamic_hb.rs
Outdated
if node_id != pivot_node_id | ||
&& awaiting_addition_input.contains(&node_id) | ||
&& state.shutdown_epoch.is_some() | ||
&& era + hb_epoch == state.shutdown_epoch.unwrap() | ||
{ | ||
// Now we can add the node again. Public keys will be reused. | ||
let step = state | ||
.get_net_mut() | ||
let _ = state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still a Step
, isn't it? Let's assert!
that its output is empty instead of dropping it right away.
(We don't actually expect the step to be completely empty, because it will contain a SignedVote
message, right?)
I think it's problematic to add these automatic removal features to the simulated networking code, as this seems a bit like fixing the tests instead of fixing the code. We do have to decide what we want the In a real networking environment, packets destined for non-existing nodes would either silently be dropped or be handled by an upper layer that provided the encrypted and reliable peer connections. Inside Now there are some cases in which messages into the void are unavoidable, due to the fact that they are delivered and acted upon in essentially random order. This should be a special case though, which could be handled by simply adding a flag to virtual net that, when set, makes it okay for messages for non-existing nodes to be discarded. This could be turned on and off for short intervals. Keep in mind that we probably would still need to purge the queue of "stale" messages meant for the old one, if we are reusing the node ID, because in a real life situation those messages would simply be dropped due to failing signatures/connections by lower level layers. At that point we have to question the merits of reusing the ID though. Furthermore, as @afck pointed out, due to changes in the algorithm model itself, the test would nowadays probably be better served by removing a (random) subset of validators and adding a random amount, instead of just a single one. To summarize: I think adding all these special filters to make the code work for this specific test is adding a lot of complexity which bears little value. The only remaining value would be catching algorithms that generate packets for IDs that were never valid. We should either drop the error-on-nonexisting-ID feature altogether, or add a feature to turn it off or suspend it. |
I agree with all of @mbr's points. ➕ |
These are all very thoughtful comments, thanks @mbr and @afck.
Fixing the node removal feature in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy with merging this now. But let's keep the above points in mind. (Maybe we should create issues for them.)
948b126
to
242568a
Compare
Fixes #369.
The fix mostly amounts to improving how removed nodes are handled in
VirtualNet
.The sender queue algorithm now flags itself if it was removed. The user can check the flag
is_removed
to shut down the node.