Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Fix missing block number issue on forced canonicalization #12949

Merged
merged 5 commits into from
Dec 16, 2022

Conversation

bkchr
Copy link
Member

@bkchr bkchr commented Dec 15, 2022

There is this issue about missing block numbers on forced canonicalization. I looked over the code now 10000 times and there are possible ways this can be triggered, but I don't really know how this is triggered. So, this pr is going to solve the symptom and not the cause. The block number to hash mapping is set when we import a new best block. Forced canonicalization will now stop at the best block and it will canonicalize the other blocks later when the best block moved. As the error reports indicated that this issue mainly happened on major sync, there should not be any forks, so not doing the canonicalization directly shouldn't be that harmful. All known implementations should import all blocks as best block on major sync anyway (I mean somewhere there is the bug, but I didn't yet found it).

I will also do some changes to Cumulus around some potential culprit for this issue.

Closes: #12613

There is this issue about missing block numbers on forced canonicalization. I looked over the code
now 10000 times and there are possible ways this can be triggered, but I don't really know how this
is triggered. So, this pr is going to solve the symptom and not the cause. The block number to hash
mapping is set when we import a new best block. Forced canonicalization will now stop at the best
block and it will canonicalize the other blocks later when the best block moved. As the error
reports indicated that this issue mainly happened on major sync, there should not be any forks, so
not doing the canonicalization directly shouldn't be that harmful. All known implementations should
import all blocks as best block on major sync anyway (I mean somewhere there is the bug, but I
didn't yet found it).

I will also do some changes to Cumulus around some potential culprit for this issue.

Closes: #12613
@bkchr bkchr added A0-please_review Pull request needs code review. B5-clientnoteworthy C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit labels Dec 15, 2022
@bkchr bkchr requested review from cheme and arkpar December 15, 2022 23:07
Copy link
Contributor

@cheme cheme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

calling force canonicalize on the whole range of blocks sounds like a good idea to me, but on this decision I might be missing some subtleties.
a bit out of context, but there is one thing I sometime would find useful when working on these part of code, would be a better fork-tree (storing parent-child relation bidirectional and managed from client db with a good caching layer), not strictly needed but would make some things generally simpler and help debugging. (mentioning it as in the past I remember trying to implement that in some experimental branch)

client/db/src/lib.rs Show resolved Hide resolved
client/db/src/lib.rs Outdated Show resolved Hide resolved
@arkpar
Copy link
Member

arkpar commented Dec 16, 2022

IMO it would be simpler for force_delayed_canonicalize to not accept a block hash/number at all. Simply call state_db.canonicalize() while state_db.best_canonical() returns a block that's further than canonicalization_delay from the best block.

@bkchr
Copy link
Member Author

bkchr commented Dec 16, 2022

IMO it would be simpler for force_delayed_canonicalize to not accept a block hash/number at all. Simply call state_db.canonicalize() while state_db.best_canonical() returns a block that's further than canonicalization_delay from the best block.

But then canonicalization_delay == 0 will not work. However, I like the general idea

@bkchr
Copy link
Member Author

bkchr commented Dec 16, 2022

Okay, I just checked the state_db. There is no canonicalize function?

@arkpar
Copy link
Member

arkpar commented Dec 16, 2022

Sorry, I mean canonicalize_block
canonicalization_delay == 0 makes no sense in real life. It means that any imported block should be canonicalized immediately so no re-orgs are possible. The forced canonicalization is a mechanism to prevent state build-up in memory when finality stalls for whatever reason. The delay value should be high enough to allow a plausible re-org and low enough to prevent out-of-memory situation.

@bkchr
Copy link
Member Author

bkchr commented Dec 16, 2022

Sorry, I mean canonicalize_block canonicalization_delay == 0 makes no sense in real life.

Yeah, I just wanted to hint this :P

let best_canonical = || self.storage.state_db.best_canonical().unwrap_or(0);
let info = self.blockchain.info();
let best_number: u64 = self.blockchain.info().best_number.saturated_into();
let best_hash = info.best_hash;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be moved to line 1332 below, or removed entirely

@bkchr
Copy link
Member Author

bkchr commented Dec 16, 2022

bot merge

@paritytech-processbot
Copy link

Waiting for commit status.

@paritytech-processbot paritytech-processbot bot merged commit 85f660f into master Dec 16, 2022
@paritytech-processbot paritytech-processbot bot deleted the bkchr-db-fix-canonicalize branch December 16, 2022 14:01
jasl added a commit to Phala-Network/khala-parachain that referenced this pull request Dec 20, 2022
jasl added a commit to Phala-Network/khala-parachain that referenced this pull request Dec 30, 2022
@Polkadot-Forum
Copy link

This pull request has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/polkadot-release-analysis-v0-9-37/1736/1

ltfschoen pushed a commit to ltfschoen/substrate that referenced this pull request Feb 22, 2023
…#12949)

* Fix missing block number issue on forced canonicalization

There is this issue about missing block numbers on forced canonicalization. I looked over the code
now 10000 times and there are possible ways this can be triggered, but I don't really know how this
is triggered. So, this pr is going to solve the symptom and not the cause. The block number to hash
mapping is set when we import a new best block. Forced canonicalization will now stop at the best
block and it will canonicalize the other blocks later when the best block moved. As the error
reports indicated that this issue mainly happened on major sync, there should not be any forks, so
not doing the canonicalization directly shouldn't be that harmful. All known implementations should
import all blocks as best block on major sync anyway (I mean somewhere there is the bug, but I
didn't yet found it).

I will also do some changes to Cumulus around some potential culprit for this issue.

Closes: paritytech#12613

* Add some docs

* Fix fix

* Review comments

* Review comments
ark0f pushed a commit to gear-tech/substrate that referenced this pull request Feb 27, 2023
…#12949)

* Fix missing block number issue on forced canonicalization

There is this issue about missing block numbers on forced canonicalization. I looked over the code
now 10000 times and there are possible ways this can be triggered, but I don't really know how this
is triggered. So, this pr is going to solve the symptom and not the cause. The block number to hash
mapping is set when we import a new best block. Forced canonicalization will now stop at the best
block and it will canonicalize the other blocks later when the best block moved. As the error
reports indicated that this issue mainly happened on major sync, there should not be any forks, so
not doing the canonicalization directly shouldn't be that harmful. All known implementations should
import all blocks as best block on major sync anyway (I mean somewhere there is the bug, but I
didn't yet found it).

I will also do some changes to Cumulus around some potential culprit for this issue.

Closes: paritytech#12613

* Add some docs

* Fix fix

* Review comments

* Review comments
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A0-please_review Pull request needs code review. C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit
Projects
None yet
4 participants