Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Abort query if not enough stake is connected #3345

Merged
merged 1 commit into from
Sep 6, 2024

Conversation

yacovm
Copy link
Contributor

@yacovm yacovm commented Aug 28, 2024

This pull request makes the snowman engine to abort early and avoid sending a query if the node considers the connected stake as insufficient to achieve a successful poll.

The rational for this, is to prevent the node to try to agree on blocks when it is in a partition.

The intent here is twofold: (1) To prevent needless resource expenditure, and (2) to prevent snowball from increasing its preference strength while the protocol has no way to make progress.

The higher preference strength in a partition, the longer it takes to recover and achieve consensus once the partition heals.

Why this should be merged

Without this change, nodes in a partition consistently try to poll each other and depending on the configuration, it may be that all nodes in a partition manage to increase the preference strength over time in snowball (but fail to finalize blocks, because they cannot consistently achieve the required confidence thresholds).
Once the nodes have increased the preference strength of the next block they prefer, even if the partition heals, it will take a long time for the nodes to change their preference, and the consensus will stall for a long time or until a restart of the nodes.

With this change, the nodes will avoid sending polls once they detect they are in a partition, and thus the preference strength of snowball won't increase needlessly.

How this works

Before sending a query, a node checks if it senses enough stake is reachable, and aborts early if it is not the case.

How this was tested

I ran a Fuji node with verbose logging level, and blocked its communication to 33% of the stake via iptables rules.
Then, without this change, the node still tries to query other nodes.
However, with this change, the node aborts and does not query anymore:
Screenshot 2024-08-30 at 01 34 41

Instead, it prints that it failed fast and aborted the query:
Screenshot 2024-08-30 at 00 13 35

I also did a more comprehensive test:

I deployed a 4 node subnet on Fuji (testnet) and then setup a network partition where two nodes are on one side and the other two nodes are on the other.
Then I submitted transactions to some nodes on each side of the partition, and monitored a custom metric I added which measures the preference strength of snowball. Afterwards, I let the partition heal and attempted to submit more transactions to the network.

I repeated the experiment above for both a build from master and a build based on this fix.

In the build based on master, i observed the snowball preference strength go wild, and the network was not functional after the partition healed:

image (1)

In the build based on this fix, the metric didn't change as the queries are not being sent in the first place.
After the partition healed, conflicting blocks were rejected as consensus was re-established:

image

@yacovm yacovm marked this pull request as draft August 28, 2024 22:41
@yacovm yacovm force-pushed the throttlePolls branch 3 times, most recently from f795532 to 5ad5305 Compare August 30, 2024 00:20
@yacovm yacovm changed the title Throttle polls Abort query if not enough stake is connected Aug 30, 2024
@yacovm yacovm marked this pull request as ready for review August 30, 2024 00:34
@yacovm yacovm force-pushed the throttlePolls branch 2 times, most recently from 9d7193a to 19a7551 Compare September 4, 2024 17:45
@yacovm yacovm force-pushed the throttlePolls branch 3 times, most recently from 31ce708 to 86d4b0c Compare September 5, 2024 17:04
snow/engine/snowman/engine.go Show resolved Hide resolved
snow/engine/snowman/engine.go Outdated Show resolved Hide resolved
snow/engine/snowman/engine_test.go Show resolved Hide resolved
@yacovm yacovm force-pushed the throttlePolls branch 2 times, most recently from a922d0f to 1529dde Compare September 5, 2024 19:07
This commit makes the snowman engine to abort early and avoid sending a query
if the node considers the connected stake as insufficient to achieve a successful poll.

The rational for this, is to prevent the node to try to agree on blocks when it is in a partition.
The intent here is twofold: (1) To prevent needless resource expenditure,
and (2) to prevent snowball from increasing its preference strength while the protocol has no way to make progress.
The higher preference strength in a partition, the longer it takes to recover and achieve consensus once the partition heals.

Signed-off-by: Yacov Manevich <yacov.manevich@avalabs.org>
@StephenButtolph StephenButtolph added this pull request to the merge queue Sep 6, 2024
@StephenButtolph StephenButtolph added this to the v1.11.12 milestone Sep 6, 2024
@StephenButtolph StephenButtolph added consensus This involves consensus incident response labels Sep 6, 2024
Merged via the queue into ava-labs:master with commit 6e1a905 Sep 6, 2024
21 checks passed
michaelkaplan13 pushed a commit that referenced this pull request Sep 11, 2024
Signed-off-by: Yacov Manevich <yacov.manevich@avalabs.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
consensus This involves consensus incident response
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants