prospective-parachains rework #4035

Merged
merged 67 commits into master from alindima/prospective-parachains-unconnected on May 13, 2024

Conversation

alindima
Contributor

@alindima alindima commented Apr 9, 2024

Reworks prospective-parachains so that we allow a number of unconnected candidates (for which we don't know the parent candidate yet). Needed for elastic scaling: #3541. Without this, candidate B will not be validated and backed until candidate A (its parent) is validated and a backing statement reaches the validator.

Due to the high complexity of the subsystem, I rewrote parts of it so that we don't concern ourselves with candidates which form cycles or which form parachain forks. We now have "Fragment chains" instead of "Fragment trees". This greatly simplifies some of the code and is a compromise we can make. We just need to make sure that cycle-producing parachains don't brick the relay chain and that fork-producing parachains can still make some progress (on one core at least). The only forks that are allowed are those on the relay chain, obviously.

Unconnected candidates are kept in the CandidateStorage and whenever a new candidate is introduced, we try to repopulate the chain with as many candidates as we can.
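As a rough illustration of that repopulation step, here is a minimal sketch with hypothetical types and method names (the real subsystem code is considerably more involved):

```rust
use std::collections::HashMap;

type HeadDataHash = [u8; 32];

// Hypothetical stand-in for a prospective candidate.
struct Candidate {
    parent_head_data_hash: HeadDataHash,
    output_head_data_hash: HeadDataHash,
}

struct FragmentChain {
    // The connected chain, built on top of the latest included block.
    chain: Vec<Candidate>,
    // Unconnected candidates, keyed by the parent head data they wait on.
    unconnected: HashMap<HeadDataHash, Candidate>,
}

impl FragmentChain {
    // A candidate whose parent is not in the chain yet is parked as unconnected.
    fn insert_unconnected(&mut self, candidate: Candidate) {
        self.unconnected
            .insert(candidate.parent_head_data_hash, candidate);
    }

    // Whenever the chain tip advances, keep appending unconnected candidates
    // whose parent head data has just been produced, until no progress is made.
    fn repopulate(&mut self, mut tip_output: HeadDataHash) {
        while let Some(candidate) = self.unconnected.remove(&tip_output) {
            tip_output = candidate.output_head_data_hash;
            self.chain.push(candidate);
        }
    }
}
```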

Also fixes #3219

Guide changes will be done as part of: #3699

TODOs:

  • see if we can replace the Cow over the candidate commitments with an Arc over the entire ProspectiveCandidate. It's only being overwritten in unit tests. We can work around that.
  • finish fragment_chain unit tests
  • add more prospective-parachains subsystem tests
  • test with zombienet what happens if a parachain is creating cycles (it should not brick the relay chain).
  • test with zombienet a parachain that is creating forks. it should keep producing blocks from time to time (one bad collator should not DOS the parachain, even if throughput decreases)
  • add some more logs and metrics
  • add prdoc and remove the "silent" label

@alindima alindima marked this pull request as draft April 9, 2024 06:53
@alindima alindima added the T8-polkadot This PR/Issue is related to/affects the Polkadot network. label Apr 10, 2024
@alindima
Contributor Author

alindima commented May 9, 2024

No issues found when testing on Versi. It's ready for some approvals @sandreim @tdimitrov @alexggh 😁

Contributor

@sandreim sandreim left a comment


Awesome work here and testing on Versi @alindima!

Just have some more nits and questions, but it's good to go anyway.


// Capacity check: together with the `==` branch below, this accepts up to
// `max_depth + 1` candidates in total.
if (self.chain.len() + unconnected) < self.scope.max_depth {
	PotentialAddition::Anyhow
} else if (self.chain.len() + unconnected) == self.scope.max_depth {
Contributor


Why can we add one more if we are already at max_depth?

Contributor Author


That's a good question.

Even prior to this PR, we were accepting max_depth + 1 unincluded candidates at all times.
I'm not sure why it was designed this way. I assume it's because a max_candidate_depth of 0 is equivalent to synchronous backing and would still allow for one candidate to be backed (therefore the +1 is needed).
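To illustrate the arithmetic (a worked example only, not subsystem code):

```rust
fn can_add(chain_len: usize, unconnected: usize, max_depth: usize) -> bool {
    // Combined effect of the `<` and `==` branches above: up to
    // `max_depth + 1` candidates are accepted in total.
    (chain_len + unconnected) <= max_depth
}

fn main() {
    assert!(can_add(0, 0, 0)); // synchronous backing: one candidate still fits
    assert!(!can_add(1, 0, 0)); // but a second one does not
}
```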

Contributor


I see, that makes some sense indeed.

Contributor


The question is: now that we have async backing enabled, do we still want to keep this?

Contributor Author


The question is: now that we have async backing enabled, do we still want to keep this?

I don't have a good reason for changing this. Moreover, we seem to have discovered that we need a max_depth + 1 unincluded segment size on the collator as well. So maybe it's here for a reason.
Nevertheless, I'd leave it as is unless we find a good reason to change it.

// Check that the candidate's relay parent is in the scope's ancestry and
// does not move backwards relative to the chain's earliest relay parent.
let Some(relay_parent) = self.scope.ancestor(relay_parent) else { return false };

if relay_parent.number < earliest_rp.number {
	return false // relay parent moved backwards.
Contributor


Does it make sense to log a trace here, just in case, for debugging?

Contributor Author


I don't think it helps, because this function is also called when pruning the candidate storage on a new leaf update.
Therefore, it's bound to happen most of the time under normal circumstances.

polkadot/node/core/prospective-parachains/src/metrics.rs (two resolved review threads, now outdated)
/// Identifier for a collation fetched but blocked from being seconded because we don't have
/// access to the parent's HeadData. Can be retried once the candidate outputting this head data is
/// seconded.
#[derive(Debug, Clone, Eq, PartialEq, Hash)]
pub struct BlockedCollationId {
Contributor


Is this really an ID? Maybe just BlockedCollation?

Contributor Author


It's only the key used in the HashMap, hence it's an ID. It maps to a fetched collation.
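Roughly, the shape is something like this (a sketch with assumed field and value types, since only the struct name is visible in the excerpt above):

```rust
use std::collections::HashMap;

type H256 = [u8; 32];
type ParaId = u32;

// Hypothetical stand-in for a fetched-but-not-yet-seconded collation.
struct FetchedCollation;

// The key: which para and which not-yet-seconded parent head data a
// blocked collation is waiting on.
#[derive(Debug, Clone, Eq, PartialEq, Hash)]
struct BlockedCollationId {
    para_id: ParaId,
    parent_head_data_hash: H256,
}

// When a candidate outputting `parent_head_data_hash` is seconded, the
// matching entry is drained and seconding of these collations is retried.
type BlockedCollations = HashMap<BlockedCollationId, Vec<FetchedCollation>>;
```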

- audience: Node Dev
description: |
Reworks prospective-parachains so that we allow a number of unconnected candidates (for which we don't yet know
the parent candidate). Needed for elastic scaling. Also simplifies it to not allow parachain forks and cycles.
Contributor


I'd first mention that this fundamentally changes what information the subsystem stores and operates on: from a tree to just a chain and a bunch of unconnected candidates.

Contributor Author


Reworked the prdoc, let me know if it sounds better now.

Contributor

@alexggh alexggh left a comment


Good job! Left some nits, but couldn't find any problems with it.

if self.is_fork_or_cycle(
candidate.parent_head_data_hash(),
Some(candidate.output_head_data_hash()),
) {
continue
Contributor


We don't expect this to happen, right? So let's log a debug if it happens.

Contributor Author


Well, it can happen if there's a parachain fork or cycle on another relay chain fork. The candidate storage holds all candidates across all relay parents, so when populating the chain, we may attempt to introduce a candidate that is present on another relay chain fork (but those candidates will probably not pass these sanity checks). It doesn't necessarily mean something's wrong.
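For intuition, here is a minimal sketch of what such a check guards against (a hypothetical, simplified signature; the real check works against the chain's internal indexes):

```rust
use std::collections::HashSet;

type HeadDataHash = [u8; 32];

// A fork is a second candidate claiming parent head data already claimed by
// a candidate in the chain; a cycle is output head data that re-creates head
// data the chain has already produced (or the candidate's own parent).
fn is_fork_or_cycle(
    claimed_parents: &HashSet<HeadDataHash>,
    produced_outputs: &HashSet<HeadDataHash>,
    parent: HeadDataHash,
    output: HeadDataHash,
) -> bool {
    let is_fork = claimed_parents.contains(&parent);
    let is_cycle = parent == output || produced_outputs.contains(&output);
    is_fork || is_cycle
}
```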

polkadot/node/core/prospective-parachains/src/lib.rs (resolved review thread, now outdated)
Contributor

@tdimitrov tdimitrov left a comment


Excellent work @alindima

@alindima alindima enabled auto-merge May 13, 2024 13:04
@alindima alindima added this pull request to the merge queue May 13, 2024
Merged via the queue into master with commit d36da12 May 13, 2024
146 of 148 checks passed
@alindima alindima deleted the alindima/prospective-parachains-unconnected branch May 13, 2024 15:09
ordian added a commit that referenced this pull request May 14, 2024
* master:
  improve MockValidationDataInherentDataProvider to support async backing (#4442)
  Bump `proc-macro-crate` to the latest version (#4409)
  [ci] Run check-runtime-migration in GHA (#4441)
  prospective-parachains rework (#4035)
  [ci] Add forklift to GHA ARC (#4372)
  `CheckWeight` SE: Check for extrinsic length + proof size combined (#4326)
  Add generate and verify logic for `AncestryProof` (#4430)
  Rococo AH: undeploy trie migration (#4414)
  Remove `substrate-frame-cli` (#4403)
  migrations: `take()`should consume read and write operation weight (#4302)
  `remote-externalities`: store block header in snapshot (#4349)
  xcm-emulator: Use `BlockNumberFor` instead of `parachains_common::BlockNumber=u32` (#4434)
  Remove `pallet::getter` usage from authority-discovery pallet (#4091)
  Remove pallet::getter usage from pallet-contracts-mock-network (#4417)
  Add docs to request_core_count (#4423)
@kamilsa

kamilsa commented May 23, 2024

Are fragment chains backwards compatible with fragment trees? Should we do the corresponding updates in KAGOME ASAP?

@alindima
Contributor Author

Fragment trees would maintain parachain forks of backable candidates. In the end, the provisioner subsystem on the block author would only choose the first one to back.

Now, fragment chains enable validators to not maintain parachain forks. If the block author gets two backable candidates that have the same parent, it will not even fetch/store the second one.

We make the assumption that parachains shouldn't be creating forks (often or ever) and if they do, they won't utilise the full throughput.

This shouldn't break compatibility with nodes that still hold fragment trees. We don't expect every validator to use this implementation right away.

That being said, this PR is more than a switch from fragment trees to fragment chains. It enables parallel validation of candidates of the same para across different backing groups (needed for elastic scaling).
Whether or not KAGOME should implement this ASAP depends on your goals, but it shouldn't break compatibility.

hitchhooker pushed a commit to ibp-network/polkadot-sdk that referenced this pull request Jun 5, 2024
liuchengxu pushed a commit to liuchengxu/polkadot-sdk that referenced this pull request Jun 19, 2024

@cryptokin92 cryptokin92 left a comment


Good update

TarekkMA pushed a commit to moonbeam-foundation/polkadot-sdk that referenced this pull request Aug 2, 2024
github-merge-queue bot pushed a commit that referenced this pull request Aug 12, 2024
Resolves #4800

# Problem
In #4035, we removed
support for parachain forks and cycles and added support for backing
unconnected candidates (candidates for which we don't yet know the full
path to the latest included block), which is useful for elastic scaling
(parachains using multiple cores).

Removing support for backing forks turned out to be a bad idea, as there
are legitimate cases for a parachain to fork (if they have another
consensus mechanism, for example BABE or PoW). This leads to validators
getting lower backing rewards (depending on whether they back the
winning fork or not) and higher pressure on only half of the backing
group (during availability-distribution, for example). Since we don't
yet have approval voting rewards, backing rewards are a pretty big deal
(which may change in the future).

# Description

A backing group is now allowed to back forks. Once a candidate becomes
backed (has the minimum backing votes), we don't accept new forks unless
they adhere to the new fork selection rule (have a lower candidate
hash).
This helps with keeping the implementation simpler, since forks will
only be taken into account for candidates which are not backed yet (only
seconded).
Having this fork selection rule also helps with reducing the work
backing validators need to do, since they have a shared way of picking
the winning fork. Once they see a candidate backed, they can all decide
to back a fork and not accept new ones.
But they still accept new ones during the seconding phase (until the
backing quorum is reached).
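As an illustration of the rule (a minimal sketch with hypothetical types, not the actual implementation):

```rust
type CandidateHash = [u8; 32];

// Between two candidates with the same parent, the one with the lower
// candidate hash wins. Because every backer applies the same comparison,
// the group converges on the same fork without extra coordination.
fn accept_new_fork(existing: &CandidateHash, incoming: &CandidateHash) -> bool {
    incoming < existing
}
```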

Therefore, a block author which is not part of the backing group will
likely not even see the forks (only the winning one).

Just as before, a parachain producing forks will still not be able to
leverage elastic scaling but will still work with a single core. Also,
cycles are still not accepted.

## Some implementation details

`CandidateStorage` is no longer a subsystem-wide construct. It was
previously holding candidates from all relay chain forks and complicated
the code. Each fragment chain now holds its own candidate chain and its
potential candidates. This should not increase the storage consumption,
since the heavy candidate data is already wrapped in an `Arc` and shared.
However, it allows for great simplifications and increased readability.
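To illustrate the sharing argument (a sketch, assuming hypothetical type names):

```rust
use std::sync::Arc;

// Hypothetical stand-in for the heavy candidate data.
struct CommittedCandidateReceipt;

// An entry in a fragment chain's own candidate storage.
struct CandidateEntry {
    candidate: Arc<CommittedCandidateReceipt>,
}

impl CandidateEntry {
    // Duplicating an entry into another fragment chain's storage only bumps
    // the `Arc` reference count; the receipt itself is never cloned.
    fn share(&self) -> CandidateEntry {
        CandidateEntry { candidate: Arc::clone(&self.candidate) }
    }
}
```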

`FragmentChain`s now only create a chain of backed candidates, applying
the fork selection rule. As said before, `FragmentChain`s are now also
responsible for maintaining their own potential candidate storage.

Since we no longer have the subsystem-wide `CandidateStorage`, when
getting a new leaf update, we use the storage of our latest ancestor,
which may contain candidates seconded/backed that are still in scope.
When a candidate is backed, the fragment chains which hold it are
recreated (due to the fork selection rule, it could trigger a "reorg" of
the fragment chain).
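In rough pseudocode terms (hypothetical method names, sketching the flow only):

```rust
type CandidateHash = [u8; 32];

struct FragmentChain;

impl FragmentChain {
    fn mark_backed(&mut self, _hash: &CandidateHash) {
        // Flag the candidate as backed in this chain's storage.
    }

    fn rebuild(&mut self) {
        // Re-populate the best chain from storage; with the fork selection
        // rule now applying to the newly backed candidate, the best chain
        // may switch to a different fork ("reorg").
    }

    fn on_candidate_backed(&mut self, hash: CandidateHash) {
        self.mark_backed(&hash);
        self.rebuild();
    }
}
```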

I generally tried to simplify the subsystem and not introduce
unnecessary optimisations that would otherwise complicate the code and
not gain us much (fragment chains wouldn't realistically ever hold many
candidates).

TODO:
- [x] update metrics
- [x] update docs and comments
- [x] fix and add unit tests
- [x] tested with fork-producing parachain
- [x] tested with cycle-producing parachain
- [x] versi test
- [x] prdoc
alindima added a commit that referenced this pull request Aug 13, 2024
Labels
T8-polkadot This PR/Issue is related to/affects the Polkadot network.
Projects
Status: Backlog
6 participants