
research and refactor syncing algorithm #1659

Closed
noot opened this issue Jun 24, 2021 · 2 comments · Fixed by #1881

noot commented Jun 24, 2021

Task summary

  • refactor the syncing algorithm so that it lives in the sync package rather than the network package
  • this will require updating the API that the network package exposes; we will need to add a function that sends a request and receives a response over request/response streams (e.g. the sync substream), as sketched after this list
  • the current sync algorithm uses a queue of block requests that are sent out in succession
  • responses are sent to the sync package for processing
  • there are two "modes" that need to be handled better: first, syncing an established chain, and second, syncing while near the head of the chain
  • this will require some research into the sync algorithms used by other nodes to figure out a good approach
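A minimal sketch of what the new network API could look like, in Go. The interface and method names (RequestResponder, DoRequest) and the Message placeholder are assumptions for illustration, not Gossamer's actual types:

```go
package network

import "context"

// Message is a placeholder for any encodable network message, such as
// a block request or block response.
type Message interface {
	Encode() ([]byte, error)
}

// RequestResponder is the kind of function the network package would
// need to expose: send a request over a request/response stream (such
// as the sync substream) to a peer and wait for the matching response.
type RequestResponder interface {
	DoRequest(ctx context.Context, peerID string, req Message) (Message, error)
}
```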
arijitAD commented Sep 1, 2021

I investigated the syncing code and found a few issues:

1. HandleTransactionMessage()

Issue Description: After executing the nth block, the (n+1)th block throws the error below: failed to find root key.

WARN[09-01|02:37:45] failed to load state for block           pkg=sync block=0xdb367756f79d5269fbf4443b9b3cc09599121adb911abaacc585f533eae5dfd1 error="failed to find root key=0xd9dd920475455acfb5f8a3d7396586ed5d00311e54ea4ac80f7375862ab34023: Key not found" caller=syncer.go:203

Analysis: HandleTransactionMessage() and handleBlock() execute in parallel for the nth block while syncing. HandleTransactionMessage() modifies the trie state using the (n-1)th block. HandleBlockImport() therefore stores the incorrect (n-1)th block trie in the DB instead of the nth block trie, yet imports the nth block successfully. Executing the (n+1)th block then throws an error in TrieState() because the nth block's trie storage is absent from the DB.

Suggestion: We should acquire a lock on storageState everywhere we set the storage context for the runtime. However, this may cause heavy lock contention, since transaction messages are broadcast frequently.
Instead, we should batch-process transaction messages at a regular interval, as in the sketch below.
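A minimal sketch of the batching suggestion. The batcher type, its fields, and the process callback are all hypothetical names, not Gossamer's actual code; the point is only that validation takes the storage lock once per batch instead of once per gossiped message:

```go
package sync

import (
	"context"
	"sync"
	"time"
)

// TransactionMessage is a placeholder for a gossiped transaction message.
type TransactionMessage struct{ Data []byte }

// batcher queues incoming transaction messages and validates them at a
// fixed interval, so validation never races with block execution over
// the runtime's trie state.
type batcher struct {
	storageLock *sync.Mutex                // shared with block execution
	msgCh       chan *TransactionMessage   // fed by the network handler
	pending     []*TransactionMessage
	process     func(*TransactionMessage)  // e.g. validate + add to tx pool
}

func (b *batcher) run(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case msg := <-b.msgCh:
			// Queue the message instead of executing it immediately.
			b.pending = append(b.pending, msg)
		case <-ticker.C:
			// Take the storage lock once per batch instead of once per
			// message, limiting contention with handleBlock().
			b.storageLock.Lock()
			for _, msg := range b.pending {
				b.process(msg)
			}
			b.pending = b.pending[:0]
			b.storageLock.Unlock()
		}
	}
}
```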

Note: Created a new issue for this #1781

2. Failure to call the `Core_execute_block` exported function at some block x. This seems to be the reason Kusama syncing fails (#1770)

EROR[09-01|16:30:04] failed to handle block                   pkg=sync number=14383 error="failed to execute block 14383: Failed to call the `Core_execute_block` exported function." caller=syncer.go:268

Analysis: On a failure while syncing a block, handleBlockDataFailure() will reset the requestData map and push a request for q.current. This leads to fetching block responses from q.current to q.current+128.
The issue arises when we have more than 128 responses while executing blocks in handleBlock() and a block past the 128th index (the nth block) fails: handleBlockDataFailure() will push a request for q.start to q.start+128. This range does not include the failed block, so the response for the failed block will never be fetched.

Also, handleResponseQueue() will try to pushRequest(), but since we previously had data for our block-data range, the (n-(n%128)+1)th to [(n-(n%128)+1)+128]th blocks, the requestData map still holds the old values (data.sent: true, data.received: true). This will never trigger fetching block data for the above range again; see the sketch below.
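A minimal sketch of one possible fix, assuming a hypothetical queue whose requestData map is keyed by each 128-block batch's start block; none of these names are Gossamer's actual code:

```go
package sync

const blockRequestSize = 128

type requestStatus struct {
	sent     bool
	received bool
}

type queue struct {
	requestData map[uint64]requestStatus // keyed by each batch's start block
}

// handleFailure computes the batch that actually contains the failed
// block and deletes its stale (sent: true, received: true) entry, so
// handleResponseQueue() is willing to push a request for this range
// again instead of skipping it forever.
func (q *queue) handleFailure(failedBlock uint64) (start uint64) {
	start = failedBlock - (failedBlock % blockRequestSize)
	delete(q.requestData, start)
	return start // the caller re-requests [start, start+blockRequestSize)
}
```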

3. handleBlockAnnounce()

  • Analysis: handleBlockAnnounce() has no mechanism to detect redundant requests; it immediately pushes a request for every announced block. This leads to fetching responses for blocks near block 9 million and mixing them in with the responses for the blocks currently being executed. Handling those blocks always fails with a parent-key-not-found error (error="failed to find root key"). A deduplication sketch follows this list.
  • handleBlockDataFailure() will create a request for the parent block, but that block also belongs to the 9 million range. Thus, nowhere while handling blocks do we create a request for the block after head.
  • syncAtHead() will trigger a request for the block after head, and syncing will continue.
  • Because block announcements arrive continuously, this happens repeatedly and also slows down syncing.
  • Since the requestch channel has a buffer size of 6, this might also block the actual syncing process.
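A minimal sketch of deduplicating block-announce requests; the announceFilter type and the horizon constant are illustrative assumptions, not Gossamer's actual code:

```go
package sync

import "sync"

const announceHorizon = 128 // don't request announced blocks this far past head

// announceFilter drops announce-driven requests that are redundant or
// too far ahead of our best block, so announcements near block 9 million
// no longer crowd the small buffered request channel while we are still
// syncing the established chain.
type announceFilter struct {
	mu      sync.Mutex
	head    uint64              // our current best block number
	pending map[uint64]struct{} // block numbers we have already requested
}

func newAnnounceFilter(head uint64) *announceFilter {
	return &announceFilter{head: head, pending: make(map[uint64]struct{})}
}

// shouldRequest reports whether an announced block is worth requesting.
func (f *announceFilter) shouldRequest(num uint64) bool {
	f.mu.Lock()
	defer f.mu.Unlock()

	switch {
	case num <= f.head:
		return false // we already have this block
	case num > f.head+announceHorizon:
		return false // too far ahead; let the main sync loop catch up first
	default:
		if _, ok := f.pending[num]; ok {
			return false // a request for this block is already in flight
		}
		f.pending[num] = struct{}{}
		return true
	}
}
```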

github-actions bot commented Dec 3, 2021

🎉 This issue has been resolved in version 0.6.0 🎉

The release is available as a GitHub release.

Your semantic-release bot 📦🚀
