Slow block import #13
Open · 2 tasks done
purestakeoskar opened this issue Apr 18, 2023 · 91 comments
Labels: I10-unconfirmed (Issue might be valid, but it's not yet known.)

@purestakeoskar

Is there an existing issue?

  • I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

  • This is not a support question.

Description of bug

When syncing a node, block import is very slow, to the point where block production is faster than block import.
Instead of Syncing, the log shows Preparing.

2023-04-17 14:41:30.389  INFO tokio-runtime-worker substrate: [🌗] ⚙️  Preparing  0.0 bps, target=#3373001 (31 peers), best: #3372537 (0x41ba…81b4), finalized #3372219 (0x602b…446f), ⬇ 35.2kiB/s ⬆ 4.8kiB/s
2023-04-17 14:41:34.478 DEBUG tokio-runtime-worker runtime::system: [🌗] [3372538] 0 extrinsics, length: 10962 (normal 0%, op: 0%, mandatory 0%) / normal weight:Weight(ref_time: 265621300000, proof_size: 0) (70%) op weight Weight(ref_time: 0, proof_size: 0) (0%) / mandatory weight Weight(ref_time: 7235415758, proof_size: 0) (0%)
2023-04-17 14:41:34.500 TRACE tokio-runtime-worker sync::import-queue: [🌗] Block imported successfully Some(3372538) (0x10fe…ec7d)
2023-04-17 14:41:34.500 TRACE tokio-runtime-worker sync::import-queue: [🌗] Header 0xb8de…8fa7 has 4 logs
2023-04-17 14:41:35.389  INFO tokio-runtime-worker substrate: [🌗] ⚙️  Preparing  0.2 bps, target=#3373001 (31 peers), best: #3372538 (0x10fe…ec7d), finalized #3372219 (0x602b…446f), ⬇ 3.8kiB/s ⬆ 4.8kiB/s
2023-04-17 14:41:38.733 TRACE tokio-runtime-worker sync::import-queue: [🌗] Scheduling 1 blocks for import
2023-04-17 14:41:40.389  INFO tokio-runtime-worker substrate: [🌗] ⚙️  Preparing  0.0 bps, target=#3373002 (31 peers), best: #3372538 (0x10fe…ec7d), finalized #3372219 (0x602b…446f), ⬇ 22.6kiB/s ⬆ 3.9kiB/s
2023-04-17 14:41:45.389  INFO tokio-runtime-worker substrate: [🌗] ⚙️  Preparing  0.0 bps, target=#3373002 (31 peers), best: #3372538 (0x10fe…ec7d), finalized #3372219 (0x602b…446f), ⬇ 7.0kiB/s ⬆ 3.8kiB/s
2023-04-17 14:41:50.389  INFO tokio-runtime-worker substrate: [🌗] ⚙️  Preparing  0.0 bps, target=#3373002 (31 peers), best: #3372538 (0x10fe…ec7d), finalized #3372219 (0x602b…446f), ⬇ 4.2kiB/s ⬆ 4.5kiB/s
2023-04-17 14:41:51.372 TRACE tokio-runtime-worker sync::import-queue: [🌗] Scheduling 1 blocks for import
2023-04-17 14:41:53.983 DEBUG tokio-runtime-worker runtime::system: [🌗] [3372539] 0 extrinsics, length: 45566 (normal 1%, op: 0%, mandatory 0%) / normal weight:Weight(ref_time: 357569150000, proof_size: 0) (95%) op weight Weight(ref_time: 0, proof_size: 0) (0%) / mandatory weight Weight(ref_time: 7235415758, proof_size: 0) (0%)
2023-04-17 14:41:54.008 TRACE tokio-runtime-worker sync::import-queue: [🌗] Block imported successfully Some(3372539) (0xb8de…8fa7)
2023-04-17 14:41:54.008 TRACE tokio-runtime-worker sync::import-queue: [🌗] Header 0x1835…434d has 4 logs
2023-04-17 14:41:55.389  INFO tokio-runtime-worker substrate: [🌗] ⚙️  Preparing  0.2 bps, target=#3373003 (31 peers), best: #3372539 (0xb8de…8fa7), finalized #3372219 (0x602b…446f), ⬇ 11.9kiB/s ⬆ 3.2kiB/s
2023-04-17 14:42:00.390  INFO tokio-runtime-worker substrate: [🌗] ⚙️  Preparing  0.0 bps, target=#3373003 (31 peers), best: #3372539 (0xb8de…8fa7), finalized #3372219 (0x602b…446f), ⬇ 10.7kiB/s ⬆ 4.1kiB/s
2023-04-17 14:42:00.390  WARN tokio-runtime-worker telemetry: [🌗] ❌ Error while dialing /dns/telemetry.polkadot.io/tcp/443/x-parity-wss/%2Fsubmit%2F: Custom { kind: Other, error: Timeout }
2023-04-17 14:42:05.390  INFO tokio-runtime-worker substrate: [🌗] ⚙️  Preparing  0.0 bps, target=#3373003 (31 peers), best: #3372539 (0xb8de…8fa7), finalized #3372219 (0x602b…446f), ⬇ 11.4kiB/s ⬆ 3.0kiB/s
2023-04-17 14:42:06.384 TRACE tokio-runtime-worker sync::import-queue: [🌗] Scheduling 1 blocks for import
2023-04-17 14:42:09.607 DEBUG tokio-runtime-worker runtime::system: [🌗] [3372540] 0 extrinsics, length: 26287 (normal 0%, op: 0%, mandatory 0%) / normal weight:Weight(ref_time: 327287250000, proof_size: 0) (87%) op weight Weight(ref_time: 0, proof_size: 0) (0%) / mandatory weight Weight(ref_time: 7235415758, proof_size: 0) (0%)
2023-04-17 14:42:09.632 TRACE tokio-runtime-worker sync::import-queue: [🌗] Block imported successfully Some(3372540) (0x1835…434d)
2023-04-17 14:42:09.632 TRACE tokio-runtime-worker sync::import-queue: [🌗] Header 0x620c…caa3 has 4 logs
2023-04-17 14:42:10.390  INFO tokio-runtime-worker substrate: [🌗] ⚙️  Preparing  0.2 bps, target=#3373004 (31 peers), best: #3372540 (0x1835…434d), finalized #3372219 (0x602b…446f), ⬇ 29.9kiB/s ⬆ 4.6kiB/s
2023-04-17 14:42:15.390  INFO tokio-runtime-worker substrate: [🌗] ⚙️  Preparing  0.0 bps, target=#3373004 (31 peers), best: #3372540 (0x1835…434d), finalized #3372219 (0x602b…446f), ⬇ 20.8kiB/s ⬆ 4.2kiB/s

The node is connected to peers that have the blocks we need.

2023-04-17 14:45:38.109  INFO tokio-runtime-worker substrate: [🌗] ⚙️  Preparing  0.0 bps, target=#3373020 (18 peers), best: #3372550 (0x18a5…4ca5), finalized #3372219 (0x602b…446f), ⬇ 5.2kiB/s ⬆ 2.3kiB/s
2023-04-17 14:45:39.845 TRACE tokio-runtime-worker sync: [🌗] New peer 12D3KooWH3nhNXgsiPhVREEcrbbWwVyNfuLnwfrgwdkTPKjtHS2N BlockAnnouncesHandshake { roles: FULL, best_number: 3373019, best_hash: 0x2bb9696090704a6ca516b482a1661177894ba7a63f1dffa3d958edb2d19a980f, genesis_hash: 0xfe58ea77779b7abda7da4ec526d14db9b1e9cd40a217c34892af80a9b332b76d }
2023-04-17 14:45:39.845 DEBUG tokio-runtime-worker sync: [🌗] New peer with known best hash 0x2bb9…980f (3373019).
2023-04-17 14:45:39.845 DEBUG tokio-runtime-worker sync: [🌗] Connected 12D3KooWH3nhNXgsiPhVREEcrbbWwVyNfuLnwfrgwdkTPKjtHS2N
2023-04-17 14:45:41.464 TRACE tokio-runtime-worker sync: [🌗] New peer 12D3KooWGNoqQTFj92X8S2x42uZFexUQRQFeLDiWzarJmXJBV42F BlockAnnouncesHandshake { roles: FULL, best_number: 3373019, best_hash: 0x2bb9696090704a6ca516b482a1661177894ba7a63f1dffa3d958edb2d19a980f, genesis_hash: 0xfe58ea77779b7abda7da4ec526d14db9b1e9cd40a217c34892af80a9b332b76d }
2023-04-17 14:45:41.464 DEBUG tokio-runtime-worker sync: [🌗] New peer with known best hash 0x2bb9…980f (3373019).
2023-04-17 14:45:41.464 DEBUG tokio-runtime-worker sync: [🌗] Connected 12D3KooWGNoqQTFj92X8S2x42uZFexUQRQFeLDiWzarJmXJBV42F
2023-04-17 14:45:41.474 TRACE tokio-runtime-worker sync: [🌗] 12D3KooWGNoqQTFj92X8S2x42uZFexUQRQFeLDiWzarJmXJBV42F Ignoring transactions while major syncing

There are also queued synced blocks (sync_queued_blocks metric)
[screenshot: sync_queued_blocks metric]

Another interesting note is that the node does not know how far it needs to sync; its sync_target is equal to best_block.
[screenshot: sync_target equal to best_block]

Steps to reproduce

Start syncing a Moonbeam or Moonriver node with an archive parachain and a pruned relay chain.
We are running Moonbeam version 0.30.3 (using Polkadot 0.9.37).
Main flags used:

--execution=wasm
--pruning=archive
--rpc-cors=all
--unsafe-rpc-external
--unsafe-ws-external
--rpc-methods=safe
--db-cache=1024
--trie-cache-size=1073741824
--runtime-cache-size=32
--eth-log-block-cache=1000000000
--eth-statuses-cache=1000000000
--detailed-log-output
--no-hardware-benchmarks
--ws-max-connections=800
@altonen (Contributor) commented Apr 19, 2023

Can you run the node with -lsync=trace? Are you able to reproduce this issue reliably every time you run the node?

@purestakeoskar (Author)

@altonen thanks for your response.
Yes, this is reproducible; every time I restart, the node goes back to the same state, and it happens on multiple nodes.
Here are 5 minutes of sync=trace logs: moonbeam-sync-trace.log

@altonen (Contributor) commented Apr 20, 2023

From the logs it looks like no block announcements are coming in for the relay chain, which could explain why the target block is not progressing. For most of the peers you're connected to, the best block is not progressing either, because either a) they are major-syncing as well and don't announce blocks, or b) the inbound /block-announces/1 substream is closed and your node hasn't detected it; either way, your node is making very little progress.

Why it takes a minute between importing blocks 15159517 and 15159518 is weird, and I don't know why that would be happening. Any ideas, @bkchr?

@purestakeoskar (Author)

Just for clarification, the relay chain in question is Polkadot.

@bkchr (Member) commented Apr 20, 2023

Why it takes a minute between importing block 15159517 and 15159518 is weird and I don't know why that would be happening, any idea @bkchr?

No. It started checking the header of 15159518 right away, but as you said, it took over a minute to import it. Not sure what is happening here.

@altonen (Contributor) commented Apr 26, 2023

@purestakeoskar is this reproducible for you on v0.9.36?

@bLd75 commented Apr 27, 2023

Chiming in: we have the same problem on Astar with an HTTP endpoint that is heavily queried, tested on both ports 9933 and 9944.
The abnormal behavior strangely resembles what happened more than a year ago, when RPC nodes started to lose sync while heavily queried. CPU usage is above 60%.
The client is on v0.9.39 and the runtime on v0.9.37.

@purestakeoskar (Author)

@purestakeoskar is this reproducible for you on v0.9.36?

Unfortunately, I do not have a client with v0.9.36 to test with. Is there any change in particular in 0.9.37 that stands out?

@skunert (Contributor) commented Apr 28, 2023

@purestakeoskar Does this node use ParityDB or RocksDB? And does this problem persist forever, or are block imports normal again after some time? Is the sync mode normal?
I want to reproduce this.

@purestakeoskar (Author) commented Apr 28, 2023

@skunert The node is using RocksDB, and blocks do import slowly, some faster than others; syncing does eventually finish, and after that the node follows the chain.

Another interesting note (this might be a separate issue): when the node restarts, the GRANDPA sessions take a while to be established on the relay side, so we sometimes don't see blocks finalizing for up to 15 minutes. Then, once the GRANDPA notification streams are open, finalization starts moving again. During this time, block import still happens.

Also of note: we see the same behaviour when we use an external RPC for the relay chain.

@jasl (Contributor) commented Apr 30, 2023

@purestakeoskar Does this node use paritydb or rocksdb? And does this problem persist forever or are block imports normal again after some time? Sync mode normal? I want to reproduce this.

Perhaps you can reproduce the issue using the Khala network; we're on the latest Polkadot v0.9.41 and can reproduce with both ParityDB and RocksDB.

docker run -dti --name khala-node \
	-e NODE_NAME=khala-node \
	-e NODE_ROLE=FULL \
	-e PARACHAIN_DB="paritydb" \
	-e RELAYCHAIN_DB="paritydb" \
	-e PARACHAIN_EXTRA_ARGS='--ws-max-connections 512 --rpc-max-response-size 64 --runtime-cache-size 16 --no-telemetry' \
	-e RELAYCHAIN_EXTRA_ARGS='--ws-max-connections 512 --rpc-max-response-size 64 --runtime-cache-size 16 --no-telemetry' \
	--network host \
	--restart always \
	--log-driver local --log-opt max-size=100m --log-opt max-file=3 \
	-v /data/khala-node:/root/data \
	phalanetwork/khala-node-with-launcher:v0.1.24-1

You should see a significant sync speed regression when progress reaches ~50%.

Some users see the sync speed drop to 0 bps (on both the parachain and the relay chain); luckier users still get ~50 bps (in my experience).

@jasl (Contributor) commented Apr 30, 2023

We left some comments on paritytech/substrate#9360

@arkpar (Member) commented May 2, 2023

You can use https://github.com/koute/not-perf for profiling

# once the node is stuck and shows 0 bps with busy CPU, let nperf run for ~20 seconds and quit with Ctrl-C
nperf record -p $PID_OF_NODE_PROCESS -o datafile

nperf flamegraph datafile > flame.svg

@jasl (Contributor) commented May 3, 2023

@arkpar Here it is. The node uses the default arguments; the recording was taken while sync was slow.

I see [2023-05-02T23:42:11Z ERROR nperf_core::profiler] Failed to load '/memfd:wasm-memory-image (deleted)' from "/memfd:wasm-memory-image (deleted)": No such file or directory (os error 2) and am not sure whether it impacts the result.

[flamegraph screenshots: sync2, sync]

flaminggraph.zip

@arkpar (Member) commented May 3, 2023

There seems to be a lot of time wasted in cumulus_client_consensus_common::parachain_consensus::follow_finalized_head.
In particular, there are a lot of header queries that don't hit the cache for some reason.
@bkchr

@bkchr (Member) commented May 3, 2023

Yeah, @skunert is already working on this: paritytech/cumulus#2495

@skunert (Contributor) commented May 3, 2023

Yes, the problem is finality notifications that reach the node while it is syncing, when the finalized block is far away from the tip of the chain.

For each of these finalization attempts, we calculate multiple tree routes: one from last_finalized -> new_finalized and one from new_finalized -> best_block. This last route is very long in the collator case and takes up to 15 seconds on my machine. During this time, the import lock is held and no other import operations are possible. We do not hit the cache since the distance between finalized and best can be millions of blocks.
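
To make those two routes concrete, here is a rough sketch using the sp_blockchain::tree_route helper; the wrapper function and argument names are illustrative, not the actual Cumulus code:

use sp_blockchain::{tree_route, HeaderMetadata, TreeRoute};
use sp_runtime::traits::Block as BlockT;

// Hypothetical helper: compute the two routes described above.
fn finalization_routes<B, BE>(
    backend: &BE,
    last_finalized: B::Hash,
    new_finalized: B::Hash,
    best_block: B::Hash,
) -> Result<(TreeRoute<B>, TreeRoute<B>), BE::Error>
where
    B: BlockT,
    BE: HeaderMetadata<B>,
{
    // Route 1: last finalized -> new finalized (usually short).
    let finalized_route = tree_route(backend, last_finalized, new_finalized)?;
    // Route 2: new finalized -> best block. During major sync the best block can be
    // millions of blocks ahead, so walking this route is what holds the import lock.
    let best_route = tree_route(backend, new_finalized, best_block)?;
    Ok((finalized_route, best_route))
}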

I see the following options:

  1. Cache the tree route from finalized -> best block and verify that the new finalized block is on the correct route (looking into this currently)
  2. Move tree route computation out of the import lock for finalization and verify only that this computed route is still valid
  3. Disable collator finalization during major sync. This is the easiest fix but will lead to a multi-minute finalization operation from genesis to tip once fully synced.

@arkpar Any opinions on this?

@arkpar (Member) commented May 3, 2023

So this looks like paritytech/substrate#9360, p.2

For each of these finalization attempts, we calculate multiple tree routes. One from last finalized -> new finalized and one from new_finalized -> best_block.

What is that last route used for? Maybe we can make some assumptions if the difference in block numbers is over some threshold?

Disable collator finalization during major sync. This is the easiest fix but will lead to a multi-minute finalization operation from genesis to tip once fully synced.

In theory, we can finalize any chain in constant time, without having to check all the intermediate blocks. IIRC we only query tree routes to get a list of discarded blocks and actually discard them. This list should be empty after the initial sync anyway. Implementing this at the database level could be a bit involved. state-db in particular requires finalizing blocks one at a time, with the exception of warp sync.
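
As a rough illustration of that point, assuming sp_blockchain's TreeRoute type (the helper function is mine, not the actual state-db code):

use sp_blockchain::TreeRoute;
use sp_runtime::traits::Block as BlockT;

// The route is essentially only needed for the blocks that fall off the canonical
// chain when finalizing; after the initial sync this list is normally empty.
fn discarded_blocks<B: BlockT>(route: &TreeRoute<B>) -> Vec<B::Hash> {
    route.retracted().iter().map(|block| block.hash).collect()
}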

@bkchr (Member) commented May 3, 2023

We could check the number of leaves. If the number of leaves is one, we don't need to calculate the tree route as there isn't any other chain. This should also hold for the initial sync and should give us the speed up we need.
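
A minimal sketch of that check, assuming sp_blockchain's Backend::leaves() accessor (the helper function is illustrative, not the actual patch):

use sp_blockchain::Backend as BlockchainBackend;
use sp_runtime::traits::Block as BlockT;

fn can_skip_tree_route<B, BE>(blockchain: &BE) -> sp_blockchain::Result<bool>
where
    B: BlockT,
    BE: BlockchainBackend<B>,
{
    // A single leaf means there is only one chain, so nothing can be retracted by
    // the finalization and the expensive tree-route walk can be skipped.
    Ok(blockchain.leaves()?.len() == 1)
}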

@skunert (Contributor) commented May 3, 2023

We could check the number of leaves. If the number of leaves is one, we don't need to calculate the tree route as there isn't any other chain. This should also hold for the initial sync and should give us the speed up we need.

This sounds reasonable, will give it a try.

@arkpar (Member) commented May 3, 2023

We could check the number of leaves. If the number of leaves is one, we don't need to calculate the tree route as there isn't any other chain. This should also hold for the initial sync and should give us the speed up we need.

This sounds reasonable for this particular issue, but it won't fix the problem reported in paritytech/cumulus#2495. There will still be a million blocks finalized at once.

@skunert (Contributor) commented May 3, 2023

We could check the number of leaves. If the number of leaves is one, we don't need to calculate the tree route as there isn't any other chain. This should also hold for the initial sync and should give us the speed up we need.

This sounds reasonable for this particular issue, but it won't fix the problem reported in paritytech/cumulus#2495 There will still be a million blocks finalized at once.

True, but this issue here is particularly annoying since it blocks the entire sync. paritytech/cumulus#2495 is for sure still problematic, but it will only stall the node for a few minutes before it continues operating normally. We should look into further improvements.

@skunert (Contributor) commented May 5, 2023

This change is now in cumulus, closing here.

skunert closed this as completed May 5, 2023
@jasl (Contributor) commented May 14, 2023

I just built a new node based on polkadot-v0.9.42 and did a resync on my Azure VM (where sync previously got stuck at ~50% progress).
Unfortunately, I still see the node stuck:

2023-05-14 13:27:50 [Relaychain] ⚙️  Syncing  0.0 bps, target=#17909361 (40 peers), best: #2732144 (0x618b…c94a), finalized #2732032 (0x0071…6b63), ⬇ 367.0kiB/s ⬆ 260.9kiB/s
2023-05-14 13:27:53 [Parachain] ⚙️  Syncing  0.0 bps, target=#3885948 (40 peers), best: #689249 (0x1ad0…a551), finalized #0 (0xd435…be8d), ⬇ 5.5kiB/s ⬆ 6.0kiB/s
2023-05-14 13:27:55 [Relaychain] ⚙️  Syncing  0.0 bps, target=#17909361 (40 peers), best: #2732144 (0x618b…c94a), finalized #2732032 (0x0071…6b63), ⬇ 303.7kiB/s ⬆ 229.6kiB/s
2023-05-14 13:27:58 [Parachain] ⚙️  Syncing  0.0 bps, target=#3885949 (39 peers), best: #689249 (0x1ad0…a551), finalized #0 (0xd435…be8d), ⬇ 30.4kiB/s ⬆ 6.0kiB/s
2023-05-14 13:28:00 [Relaychain] ⚙️  Syncing  0.0 bps, target=#17909362 (40 peers), best: #2732144 (0x618b…c94a), finalized #2732032 (0x0071…6b63), ⬇ 598.7kiB/s ⬆ 222.9kiB/s
2023-05-14 13:28:03 [Parachain] ⚙️  Syncing  0.0 bps, target=#3885949 (40 peers), best: #689249 (0x1ad0…a551), finalized #0 (0xd435…be8d), ⬇ 11.1kiB/s ⬆ 7.3kiB/s
2023-05-14 13:28:05 [Relaychain] ⚙️  Syncing  0.0 bps, target=#17909363 (40 peers), best: #2732144 (0x618b…c94a), finalized #2732032 (0x0071…6b63), ⬇ 208.4kiB/s ⬆ 194.0kiB/s
2023-05-14 13:28:08 [Parachain] ⚙️  Syncing  0.0 bps, target=#3885949 (40 peers), best: #689249 (0x1ad0…a551), finalized #0 (0xd435…be8d), ⬇ 43.7kiB/s ⬆ 9.3kiB/s
2023-05-14 13:28:10 [Relaychain] ⚙️  Syncing  0.0 bps, target=#17909364 (40 peers), best: #2732144 (0x618b…c94a), finalized #2732032 (0x0071…6b63), ⬇ 284.6kiB/s ⬆ 183.3kiB/s
2023-05-14 13:28:13 [Parachain] ⚙️  Syncing  0.0 bps, target=#3885949 (39 peers), best: #689249 (0x1ad0…a551), finalized #0 (0xd435…be8d), ⬇ 27.1kiB/s ⬆ 5.6kiB/s
2023-05-14 13:28:15 [Relaychain] ⚙️  Syncing  0.0 bps, target=#17909365 (40 peers), best: #2732144 (0x618b…c94a), finalized #2732032 (0x0071…6b63), ⬇ 230.8kiB/s ⬆ 202.6kiB/s
2023-05-14 13:28:18 [Parachain] ⚙️  Syncing  0.0 bps, target=#3885949 (40 peers), best: #689249 (0x1ad0…a551), finalized #0 (0xd435…be8d), ⬇ 34.1kiB/s ⬆ 8.1kiB/s
2023-05-14 13:28:20 [Relaychain] ⚙️  Syncing  0.0 bps, target=#17909366 (40 peers), best: #2732144 (0x618b…c94a), finalized #2732032 (0x0071…6b63), ⬇ 285.0kiB/s ⬆ 232.5kiB/s
2023-05-14 13:28:23 [Parachain] ⚙️  Syncing  0.0 bps, target=#3885949 (40 peers), best: #689249 (0x1ad0…a551), finalized #0 (0xd435…be8d), ⬇ 8.2kiB/s ⬆ 6.7kiB/s
2023-05-14 13:28:25 [Relaychain] ⚙️  Syncing  0.0 bps, target=#17909367 (40 peers), best: #2732144 (0x618b…c94a), finalized #2732032 (0x0071…6b63), ⬇ 269.0kiB/s ⬆ 225.0kiB/s
2023-05-14 13:28:28 [Parachain] ⚙️  Syncing  0.0 bps, target=#3885949 (40 peers), best: #689249 (0x1ad0…a551), finalized #0 (0xd435…be8d), ⬇ 42.2kiB/s ⬆ 10.2kiB/s
2023-05-14 13:28:30 [Relaychain] ⚙️  Syncing  0.0 bps, target=#17909368 (40 peers), best: #2732144 (0x618b…c94a), finalized #2732032 (0x0071…6b63), ⬇ 252.5kiB/s ⬆ 211.8kiB/s

I'll try to record with nperf tomorrow.

BTW, I see that the node can't stop gracefully (the previous version couldn't either):

docker stop khala-node
Error response from daemon: cannot stop container: khala-node: tried to kill container, but did not receive an exit event

@jasl (Contributor) commented Jun 1, 2023

@bkchr I think the txpool is in trouble. Here is the log; I only ran the node for less than 5 minutes and it generated an 8 GB log:

https://storage.googleapis.com/phala-misc/trace_txpool.log.zip

NOTE: I must confess that in the early days of Khala, our pallet had serious bugs and we had to clear our collators' txpools to mitigate them.

But we found that syncing Phala (on Polkadot) shows the same trouble, even though those old problems have since been fixed properly.

For now I don't have a Phala node so I can't share info yet, but I will next week.

@arkpar (Member) commented Jun 1, 2023

The log is huge because tree_route is called multiple times for a long route and the whole thing is just dumped into the log.

2023-06-01 11:40:55.922 DEBUG tokio-runtime-worker txpool: [Parachain] resolve hash:0xfd9bc5828f075da985eefc978d6da257b4a435c5cc8b8495b67a7d2f853b50e5 finalized:true tree_route:TreeRoute { route: [HashAndNumber { number: 3242935, .... 

There was a PR (paritytech/substrate#13004) to prevent this, but apparently it does not help much.

@skunert (Contributor) commented Jun 1, 2023

In the logs we see the transaction pool act on finality notifications. Usually we should only see one finality justification for each incoming block that has a justification attached. A justification should be attached every justification_period blocks or at the end of an era.

@jasl However, I found that in Khala the justification_period is set to 1 here. This is really bad, as the node will store every justification and also hand it out during sync. This generates finality notifications which in turn trigger actions in the node, like transaction pool maintenance, during sync, and it will also impact all nodes that request blocks from it while syncing.
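
As a toy illustration of that rule (the function and the era-end flag are mine, not the actual GRANDPA code):

// With the default period of 512 only a small fraction of blocks carry a
// justification; with a period of 1, every block does.
fn should_attach_justification(block_number: u64, justification_period: u64, is_era_end: bool) -> bool {
    is_era_end || block_number % justification_period == 0
}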

@jasl (Contributor) commented Jun 1, 2023

In the logs we see the transaction pool act on finality notifications. Usually we should only see one finality justification for each incoming block with an attached justification. The justification should be attached every justification_period block or at the end of the era.

@jasl However I found that in khala the justification_period is set to 1 here. This is really bad as the node will store every justification and also give it out during sync. This generated finality notifications which in turn trigger actions in the node like in the transaction pool during sync, so it will impact all nodes that request blocks to it from syncing.

Yeah, we patched the justification_period for historical reasons. Do you mean the txpool will be processed when a finality justification is generated? If we revert to the default value of 512, would the txpool only be processed every 512 blocks, reducing the pressure on the node?

@skunert (Contributor) commented Jun 1, 2023

I mean it's not just the transaction pool; basically every location that listens for finality notifications in Substrate will be notified. In addition, we verify the justification and do extra operations in the database, state-db, etc. This brings extra overhead that we should not have for every single block.
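
For context, a minimal sketch of how such a listener typically subscribes, assuming sc_client_api's BlockchainEvents trait (the function itself is illustrative):

use futures::StreamExt;
use sc_client_api::BlockchainEvents;
use sp_runtime::traits::Block as BlockT;

async fn watch_finality<B: BlockT, C: BlockchainEvents<B>>(client: &C) {
    let mut finality = client.finality_notification_stream();
    // Every component holding such a stream wakes up once per finality notification,
    // which with justification_period = 1 means roughly once per block during sync.
    while let Some(notification) = finality.next().await {
        let _finalized_hash = notification.hash;
    }
}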

@jasl (Contributor) commented Jun 1, 2023

I mean its not just the transaction pool, basically every location that listens for finality notifications in substrate will be notified. In addition we verify the justification, do extra operations in the database, state-db etc.. This brings extra overhead that we should not have for every single block.

I understand that patching justification_period to 1 introduces an additional performance impact; I'll ask our devs whether I can remove the patch.

But I still don't quite understand:
Do you mean the txpool is woken up by finality notifications and that this is required for justification verification?
Or do you mean that during major sync the txpool shouldn't wake up even if it receives a finality notification?

PS: I'm quite new to this domain; is there any documentation I can learn from?

@bkchr (Member) commented Jun 1, 2023

I can understand we patched justification_period to 1 introducing additional performance impact, I'll ask our dev for whether I can remove the patch.

There is an RPC to prove finality that you can use.

Do you mean txpool awakens by finality notifications and this is required for justification verification?
or do you mean in major sync, txpool shouldn't awaken even if it got a finality notification?

paritytech/substrate#14285 will improve the situation. However, I also think that there may still be more to it.

@jasl (Contributor) commented Jun 2, 2023

I can understand we patched justification_period to 1 introducing additional performance impact, I'll ask our dev for whether I can remove the patch.

There is a RPC to proof finality that you can use.

Do you mean txpool awakens by finality notifications and this is required for justification verification?
or do you mean in major sync, txpool shouldn't awaken even if it got a finality notification?

paritytech/substrate#14285 this will improve the situation. However, I also think that there is maybe still more to it.

Thank you, I can backport your PR to our node to test

@skunert (Contributor) commented Jun 2, 2023

Do you mean txpool awakens by finality notifications and this is required for justification verification?
or do you mean in major sync, txpool shouldn't awaken even if it got a finality notification?

The transaction pool listens for these notifications and does some maintenance work. We have logic in place to skip this based on block distance; however, when the node is finalizing every block, this maintenance is triggered too often.
But as I said, there are other problems as well. Finalizing each block individually also takes more resources than finalizing in batches.

Even if you disable this now, your existing nodes will still have the justifications in their DB. Ideally, the node would not finalize for every incoming justification during sync, so that will probably change.
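
A minimal sketch of the kind of distance-based skip mentioned above (the threshold value and names are illustrative, not the actual pool code):

// Hypothetical threshold: skip full pool maintenance when the notified block is far
// from the pool's current view, as happens while finalizing every block during sync.
const SKIP_MAINTENANCE_THRESHOLD: u32 = 20;

fn should_skip_maintenance(pool_best_number: u32, notified_number: u32) -> bool {
    pool_best_number.abs_diff(notified_number) > SKIP_MAINTENANCE_THRESHOLD
}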

@jasl (Contributor) commented Jun 3, 2023

I've backported paritytech/substrate#14285 to our Khala node.
Here are the new flame graphs.

Short run (about 20 s): sync_short.svg.zip

Long run (about 1.5 min): sync_long.svg.zip

@bkchr (Member) commented Jun 4, 2023

@jasl can you please share the exact CLI args you are using to spawn your node?

This could still be a result of your relay chain node importing a justification every block. We need to wait for @andresilva to provide a fix that ensures that we don't import justifications every block as you have configured it (which is wrong and should be reverted!).

@jasl (Contributor) commented Jun 4, 2023

@jasl can you please share the exact CLI args you are using to spawn your node?

This could still be a result of your relay chain node importing a justification every block. We need to wait for @andresilva to provide a fix that ensures that we don't import justifications every block as you have configured it (which is wrong and should be reverted!).

./khala-node \
  --chain khala \
  --base-path $DATA_PATH \
  --name $NODE_NAME \
  --port 30333 \
  --prometheus-port 9615 \
  --rpc-port 9933 \
  --ws-port 9944 \
  --database paritydb \
  --no-hardware-benchmarks \
  --no-telemetry \
  --rpc-max-response-size 64 \
  --max-runtime-instances 16 \
  --runtime-cache-size 8 \
  --state-pruning archive-canonical \
  --blocks-pruning archive-canonical \
  -- \
  --chain kusama \
  --port 30334 \
  --prometheus-port 9616 \
  --rpc-port 9934 \
  --ws-port 9945 \
  --database paritydb \
  --no-hardware-benchmarks \
  --no-telemetry \
  --rpc-max-response-size 64 \
  --max-runtime-instances 16 \
  --runtime-cache-size 8 \
  --state-pruning archive-canonical \
  --blocks-pruning archive-canonical

(which is wrong and should be reverted!).

Thank you, now I understand how bad it is to set a very short justification_period; I'll forward this warning to our team.

@skunert (Contributor) commented Jun 5, 2023

@jasl If you want, you could try to run your node with --relay-chain-rpc-url <external-relay-chain-rpc> and point it at one of your Polkadot nodes. This will not start an internal Polkadot node, and thus the justification period will not have an impact there. This way you could at least see whether it syncs faster with this issue eliminated, even before we have the proper fix in place.

@jasl (Contributor) commented Jun 5, 2023

@jasl If you want you could try to run your node with --relay-chain-rpc-url <external-relay-chain-rpc> and point it to one of your polkadot nodes. This will not start an internal polkadot node and thus the justification period will not have an impact there. This way you could at least see if it is syncing faster with this issue eliminated even before we have the proper fix in place.

Our collators are using --relay-chain-rpc-url <external-relay-chain-rpc> and don't seem to have any sync issue; I'll try it on our normal nodes later.

But running two apps in one Docker container is not recommended, so while I'd like to test this, for end users I'll wait for the proper fix.

@bLd75 commented Jun 14, 2023

We're seeing the same behaviour here on an Astar node.
It has been tested with relay chain RPC sync too, but the issue is the same.

It's important to note that it happens only on RPC nodes under a high number of requests.
In the screenshot below we can see the same behavior: the node loses sight of the head of the chain pretty quickly after starting, and the gap increases over time, while finalization gets completely stuck.
The node gets restarted after an hour (18:15 on the graph), resyncs to the right chain height, then loses it again.
[screenshot: node best block falling behind the chain head while finalization is stuck]

@skunert (Contributor) commented Jun 19, 2023

@bLd75 Can you provide logs with -lsync=trace for the time period where the issue appears? The issues we had previously discussed in the comments should not occur when you run with an external relay chain node.

@crystalin

This is still an issue on Polkadot 0.9.43:
Looking briefly at the -lsync=trace logs, it seems that blocks are imported quickly at first, but as time passes, fewer and fewer concurrent blocks are processed. It looks like some block imports might get stuck forever.
sync.log

@crystalin commented Jul 19, 2023

Additional information: it seems that the import is limited by disk IOPS (this is especially the case on cloud services providing low default IOPS).
Surprisingly, even when sync is at 0.0 bps it still uses the max IOPS (3000).

I think something changed in a recent version (probably around 0.9.37) that significantly increases IOPS, reaching a limit where it snowballs and blocks most of the processes.

Ex:

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
nvme6n1        3032.20     15671.60         0.00         0.00     156716          0          0

Increasing it (in real time) to 10k IOPS restores some bps, but it quickly goes back down to 0.2 or 0.0 bps. I think something is inefficient and snowballing.

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
nvme6n1        10114.10     52228.00       356.00         0.00     522280       3560          0
Jul 19 08:13:35 ip-10-0-0-228 parity-moonbeam[375045]: 2023-07-19 08:13:35 [Relaychain] ⚙️  Syncing  0.2 bps, target=#18838858 (43 peers), best: #14447387 (0x7cb5…d01f), finalized #14447104 (0x8c38…91fb), ⬇ 517.3kiB/s ⬆ 449.1kiB/s
Jul 19 08:13:35 ip-10-0-0-228 parity-moonbeam[375045]: 2023-07-19 08:13:35 [🌗] ⚙️  Syncing  0.2 bps, target=#4697322 (27 peers), best: #3451127 (0x41cd…1d03), finalized #2565171 (0xb87f…89bb), ⬇ 8.2kiB/s ⬆ 1.4kiB/s

=== Experiment 1: RocksDB from scratch ===
I tried syncing from scratch using RocksDB:

Jul 19 08:30:58 ip-10-0-0-228 parity-moonbeam[1354684]: 2023-07-19 08:30:58 [Relaychain] ⚙️  Syncing 368.8 bps, target=#16465081 (1 peers), best: #19264 (0x4555…7066), finalized #18944 (0x48b9…c40b), ⬇ 210.2kiB/s ⬆ 28.6kiB/s
Jul 19 08:30:58 ip-10-0-0-228 parity-moonbeam[1354684]: 2023-07-19 08:30:58 [🌗] ⚙️  Syncing 131.8 bps, target=#4025691 (1 peers), best: #5315 (0x7cea…36da), finalized #0 (0xfe58…b76d), ⬇ 510.3kiB/s ⬆ 0.3kiB/s

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
nvme6n1          74.20        26.40     17532.80         0.00        264     175328          0

(for ~370 bps, IOPS is only ~100; however, the chain/state was empty as this was the beginning of the sync)

=== Experiment 2: ParityDB from scratch ===

Jul 19 08:34:57 ip-10-0-0-228 parity-moonbeam[1358078]: 2023-07-19 08:34:57 [Relaychain] ⚙️  Syncing 639.2 bps, target=#16465119 (1 peers), best: #46670 (0xf6a7…84f9), finalized #46592 (0x1456…92dd), ⬇ 261.1kiB/s ⬆ 13.7kiB/s
Jul 19 08:34:57 ip-10-0-0-228 parity-moonbeam[1358078]: 2023-07-19 08:34:57 [🌗] ⚙️  Syncing 152.2 bps, target=#4025708 (1 peers), best: #11722 (0x8186…cf93), finalized #0 (0xfe58…b76d), ⬇ 465.1kiB/s ⬆ 0.3kiB/s

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
nvme6n1        3137.40        24.80     55084.00         0.00        248     550840          0

(for ~370 bps, IOPS is ~3000, which is a lot compared to RocksDB)

@jasl (Contributor) commented Jul 19, 2023

This is still an issue on Polkadot 0.9.43: Looking breifly at the logs of sync=trace, it seems that blocks are imported fast initially but the more and more time passes, the less concurrent block are processed. It looks like some block import might get stuck forever. sync.log

paritytech/substrate#14285 does not seem to be bundled in 0.9.43, only in 0.9.44.
paritytech/substrate#14423 has to wait for 0.9.45.

You have to backport them yourself; we forked 0.9.43 and cherry-picked these changes.
Here's our sample: https://github.com/Phala-Network/khala-parachain/blob/main/Cargo.toml#L60-L65

@crystalin

Thank you, I'll try those

@lexnv (Contributor) commented Jul 9, 2024

@jasl @crystalin Do you still get this issue with the latest release?
