This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Bandwidth usage 10x higher with v9420 #2563

Open
tugytur opened this issue May 10, 2023 · 29 comments

Comments

@tugytur (Contributor) commented May 10, 2023

Bandwidth usage with release v9420 is more than 10x higher than with v9400.

This was observed on both bootnodes and RPC nodes for the following system chains:

  • statemint
  • statemine
  • westmint
  • collectives-polkadot
  • collectives-kusama
  • collectives-westend
  • bridgehub-kusama

These are the flags from the collectives-westend rpc node:

ExecStart=/usr/local/bin/polkadot-parachain \
        --name="rpcwco-001" \
        --public-addr=/ip4/X.X.X.X/tcp/30346 \
        --listen-addr=/ip4/X.X.X.X/tcp/30346 \
        --chain=collectives-westend \
        --pruning=archive \
        --database=paritydb \
        --prometheus-port=9615 \
        --rpc-port 9933 \
        --ws-port 9944 \
        --no-telemetry \
        --no-hardware-benchmarks \
        --no-mdns \
        --no-private-ip \
        --max-runtime-instances=1 \
        --in-peers=5 \
        --out-peers=5 \
        --max-parallel-downloads=2 \
        --runtime-cache-size=1 \
        --execution=wasm \
        --wasm-execution=compiled \
        --rpc-methods=safe \
        --rpc-cors=all \
        --relay-chain-rpc-urls="ws://X.X.X.X:9944" \
        --relay-chain-rpc-urls="ws://X.X.X.X:9944" \
        --base-path=/var/blockchain

Both upload and download averaged under 1 Mb/s with the flags above. After the update it went above 10 Mb/s. The decrease at the end is when we downgraded.
image
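For anyone reproducing the measurement: throughput figures like the above can be derived from two samples of an interface's byte counter. A minimal sketch, with made-up counter values for illustration; in practice the counters would come from /proc/net/dev for the node's real interface:

```shell
# Average throughput in Mb/s from two byte-counter samples taken 10 s apart.
# rx_before/rx_after are hypothetical sample values.
rx_before=1000000
rx_after=13500000
interval=10
awk -v a="$rx_before" -v b="$rx_after" -v t="$interval" \
    'BEGIN { printf "%.1f Mb/s\n", (b - a) * 8 / t / 1000000 }'
# prints "10.0 Mb/s"
```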

@altonen (Contributor) commented May 11, 2023

Can you check in Grafana (substrate_sub_libp2p_notifications_sizes) if there is a protocol that is obviously sending more data than before?
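Besides Grafana, the same histogram can be queried straight from the Prometheus HTTP API. A sketch that only assembles the request URL; the localhost:9090 endpoint, the 5m window, and the protocol label grouping are assumptions:

```shell
# Build a Prometheus HTTP API request for a per-protocol rate of the
# substrate_sub_libp2p_notifications_sizes histogram's _sum series.
# The spaces must be URL-encoded before use, e.g. via curl -G --data-urlencode.
query='sum by (protocol) (rate(substrate_sub_libp2p_notifications_sizes_sum[5m]))'
printf 'http://localhost:9090/api/v1/query?query=%s\n' "$query"
```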

@tugytur (Contributor, author) commented May 11, 2023

@altonen sure,

The results are for all RPC and bootnodes that were updated.

substrate_sub_libp2p_notifications_sizes_count{service=~"bootnode|rpc",network!~"polkadot|kusama|westend"}

Right after the update the value is somewhere between 6-30
image

substrate_sub_libp2p_notifications_sizes_bucket{service=~"bootnode|rpc",network!~"polkadot|kusama|westend"}

Right after the update the value is somewhere between 6-30
image

substrate_sub_libp2p_notifications_sizes_sum{service=~"bootnode|rpc",network!~"polkadot|kusama|westend"}
image

There is no peak, and some protocols are completely missing in the 9420 release.

@altonen (Contributor) commented May 15, 2023

@tugytur

I don't see any changes in these graphs. Did you compare the notification data usage between versions, or is it indistinguishable in the graphs you posted? I will see if I can reproduce the data usage issue locally.

Are you able to notice this only when you're running polkadot-parachain, i.e., is the data usage the same between versions if you run a Polkadot binary?

/block-announces/1 is not part of the notification sizes in 0.9.42 because we had to create a hack to make it work, but I'll try to add it to the Prometheus metrics this week.

@bkchr (Member) commented May 26, 2023

@tugytur any update?

@tugytur (Contributor, author) commented May 31, 2023

> @tugytur
>
> I don't see any changes in these graphs. Did you compare the notification data usage between versions, or is it indistinguishable in the graphs you posted? I will see if I can reproduce the data usage issue locally.
>
> Are you able to notice this only when you're running polkadot-parachain, i.e., is the data usage the same between versions if you run a Polkadot binary?
>
> /block-announces/1 is not part of the notification sizes in 0.9.42 because we had to create a hack to make it work, but I'll try to add it to the Prometheus metrics this week.

I just took a screenshot of the time period before the update was applied and after.

I checked our relay chain RPC nodes, which are on v0.9.42, and I don't see this behavior on those.

@altonen (Contributor) commented May 31, 2023

@bkchr @skunert

any idea?

@altonen (Contributor) commented May 31, 2023

@tugytur
can you graph substrate_sub_libp2p_requests_in/out_success/failure_total before and after the release?

@tugytur (Contributor, author) commented May 31, 2023

@altonen sure. These graphs are from the westend collectives nodes.
The update to 0.9.420 was at 20:52 and the downgrade to 0.9.400 was at 21:33.

rate(substrate_sub_libp2p_requests_in_failure_total{instance="X"}[$__rate_interval])
image

substrate_sub_libp2p_requests_out_failure_total{instance="X"}

No Data

rate(substrate_sub_libp2p_requests_in_success_total_count{instance="X"}[$__rate_interval])
image

rate(substrate_sub_libp2p_requests_out_success_total_count{instance="X"}[$__rate_interval])
image

@skunert (Contributor) commented May 31, 2023

Which state is your node in? Do you see anything interesting in the logs?
I started a similar node for statemine with an external Kusama RPC node, and the machine has less than 500 kb/s network usage.

@tugytur (Contributor, author) commented May 31, 2023

> Which state is your node in? Do you see anything interesting in the logs? I started a similar node for statemine with an external Kusama RPC node, and the machine has less than 500 kb/s network usage.

I just tested it again with statemine.
Logs show nothing out of ordinary (default logging) and rpc works as intended.

Would you like me to start it up with other logging settings?
Did you also start your node with --relay-chain-rpc-urls?

image

@skunert (Contributor) commented Jun 1, 2023

The command I used was:

        polkadot-parachain \
                -lsync=trace \
                --sync warp \
                --chain=statemine \
                --database=paritydb \
                --prometheus-external \
                --prometheus-port=9615 \
                --rpc-port 9944 \
                --max-parallel-downloads=2 \
                --runtime-cache-size=1 \
                --execution=wasm \
                --wasm-execution=compiled \
                --relay-chain-rpc-urls="wss://kusama-rpc.polkadot.io" \
                --base-path=./statemine-rpc-node

so quite similar to yours. Version was 0.9.420-c608131f0c6.

You could enable -lsync=trace and post the logs here, maybe there is something interesting.

@skunert (Contributor) commented Jun 6, 2023

I kept the node running a bit longer with the settings from my last post, so far nothing looks suspicious.
image

@altonen (Contributor) commented Jun 7, 2023

Could this have something to do with light clients? If you reduce the number of inbound light peers with --in-peers-light, does bandwidth usage decrease?

@tugytur (Contributor, author) commented Jun 8, 2023

> The command I used was polkadot-parachain -lsync=trace --sync warp --chain=statemine --database=paritydb --prometheus-external --prometheus-port=9615 --rpc-port 9944 --max-parallel-downloads=2 --runtime-cache-size=1 --execution=wasm --wasm-execution=compiled --relay-chain-rpc-urls="wss://kusama-rpc.polkadot.io" --base-path=./statemine-rpc-node, so quite similar to yours. Version was 0.9.420-c608131f0c6.
>
> You could enable -lsync=trace and post the logs here, maybe there is something interesting.

I've tested with the exact same flags as you and still have the issue. The logs don't show anything out of the ordinary. I've also synced from scratch and the issue is still there. I noticed your 0.9.420 build is different from the one provided by the repo; I'll build the main branch later and try again.

> Could this have something to do with light clients? If you reduce the number of inbound light peers with --in-peers-light, does bandwidth usage decrease?

Reducing the number didn't have any effect.

@tugytur (Contributor, author) commented Jun 9, 2023

@skunert

I've built the latest main branch, 0.9.420-9260459eb32.

After restarting the service I see the usual high traffic, but it goes down rather quickly, which wasn't the case with the repo build.

Now I see around ~1.9 MB/s for download and upload, which is still higher than before, but not as bad. With 0.9.400 we had around ~1.2 MB/s for download and upload.

It seems I'm able to drop it further by adjusting --in-peers-light, but I would need to let it run longer to be sure.

(The first white line is where I updated; the remaining white lines are where I restarted with changed parameters.)
image

Would you like me to do some other tests or should I close the issue?

@skunert (Contributor) commented Jun 12, 2023

It is strange that the traffic still looks a bit higher than before. Just making sure, this is still with these flags?
polkadot-parachain -lsync=trace --sync warp --chain=statemine --database=paritydb --prometheus-external --prometheus-port=9615 --rpc-port 9944 --max-parallel-downloads=2 --runtime-cache-size=1 --execution=wasm --wasm-execution=compiled --relay-chain-rpc-urls="wss://kusama-rpc.polkadot.io" --base-path=./statemine-rpc-node
Can you upload some logs for us with -lsync=trace so we can take a look?

@altonen Any ideas what else we could check?

@altonen (Contributor) commented Jun 12, 2023

-lsync,sub-libp2p=trace would be good

I don't have any new ideas at this point, but it's very strange, that's for sure. Maybe the logs will reveal something.

@tugytur (Contributor, author) commented Jun 12, 2023

@skunert, I used those flags.

@altonen I've used both log flags and you can find the logs here: https://drive.google.com/file/d/177WOWTtP_DFUh9TK219sAJ3o1vg6Kjex/view?usp=sharing

For this one I used our statemine bootnode. The logs are from the provided 0.9.420 release from the repo, polkadot-parachain_v0.9.420 0.9.420-843a5095544.

Flags used were:

ExecStart=/usr/local/bin/polkadot-parachain \
        --name="Amforc Bootnode" \
        -lsync,sub-libp2p=trace \
        --public-addr=/dns/statemine.bootnode.amforc.com/tcp/30333/wss \
        --listen-addr=/ip4/127.0.0.1/tcp/30336/ws \
        --public-addr=/ip4/213.167.226.186/tcp/30336 \
        --listen-addr=/ip4/10.10.150.51/tcp/30336 \
        --chain=statemine \
        --pruning=10 \
        --database=paritydb \
        --prometheus-port=9615 \
        --rpc-port 9933 \
        --ws-port 9944 \
        --no-telemetry \
        --no-mdns \
        --no-private-ip \
        --no-hardware-benchmarks \
        --max-runtime-instances=1 \
        --in-peers=5 \
        --out-peers=4 \
        --max-parallel-downloads=2 \
        --runtime-cache-size=1 \
        --execution=wasm \
        --wasm-execution=compiled \
        --rpc-methods=safe \
        --rpc-cors=all \
        --relay-chain-rpc-urls="ws://10.10.152.11:9944" \
        --relay-chain-rpc-urls="ws://10.10.152.12:9944" \
        --base-path=/var/blockchain

@altonen (Contributor) commented Jun 14, 2023

I can't see anything concerning in the logs. The parachain only received ~60 block announcements, each of them 650 bytes, and a handful of block responses, which are likely not the cause.

There was recently a report about finality stalling for some people, and their logs were filled with GRANDPA messages, hundreds and hundreds of them. That could possibly explain high bandwidth usage, but in these logs, at least for the parachain, I can't see anything. If you want to run the old problematic version of the node again, this time with -lsync,sub-libp2p=trace for both the relay and parachain, I can take a look again. We can also close this issue if you feel the issue has been fixed and the bandwidth usage is acceptable.
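A quick back-of-envelope check on the announcement numbers above, as a shell one-liner, shows why block announcements alone can't explain multi-MB/s traffic:

```shell
# ~60 block announcements at ~650 bytes each over the whole log window:
echo "$(( 60 * 650 )) bytes total"
# prints "39000 bytes total", i.e. ~39 kB, negligible next to MB/s-scale traffic
```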

@tugytur (Contributor, author) commented Jun 20, 2023

@altonen I've tested the latest release polkadot-parachain_v0.9.430 0.9.430-9cb14fe3cee

The issue is still there with the flags from above. But I've tested a little more, and I think it's caused by the relay chain instance that is started (despite the --relay-chain-rpc-urls flag).

I've appended the following additional flags to the node:

        --rpc-methods=safe \
        --rpc-cors=all \
        --base-path=/var/blockchain \
        --relay-chain-rpc-urls="ws://10.10.152.11:9944" \
        --relay-chain-rpc-urls="ws://10.10.152.12:9944" \
        -- \
        --chain=kusama \
        --in-peers=0 \
        --out-peers=0 \
        --no-telemetry \
        --no-mdns \
        --no-private-ip \
        --no-hardware-benchmarks \
        --in-peers-light=0

With these flags it went from an average of 15-20 MB/s download/upload to ~2 MB/s download and ~5 MB/s upload. That's still quite a lot compared to before, but maybe the issue is somewhere in there?

@altonen (Contributor) commented Jun 21, 2023

I'll get back to this issue soon and see if I can reproduce the issue.

> ... I think it's caused by the relay-chain instance that is started (despite --relay-chain-rpc flag)

Is this correct behavior @skunert?

@altonen (Contributor) commented Jun 26, 2023

@tugytur

Are you able to reproduce this in another environment? This is the only report on bandwidth we've received, and I'm also unable to reproduce it. My total bandwidth is 1-3 Mbps on latest master.

Can you graph substrate_sub_libp2p_network_bytes_total and compare that to your RX/TX?
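To line the metric up against the host's own counters, the kernel's view can be read from /proc/net/dev. A sketch that parses one line of that file; the line and the interface name eth0 are made-up samples, and in practice it would come from grep eth0 /proc/net/dev:

```shell
# Extract RX/TX byte counters from a /proc/net/dev-style line.
# In that file the first numeric field is RX bytes and the ninth is TX bytes.
line='  eth0: 123456789 0 0 0 0 0 0 0 987654321 0 0 0 0 0 0 0'
echo "$line" | awk '{ sub(":", "", $1); printf "%s rx_bytes=%s tx_bytes=%s\n", $1, $2, $10 }'
# prints "eth0 rx_bytes=123456789 tx_bytes=987654321"
```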

@skunert (Contributor) commented Jun 27, 2023

> I'll get back to this issue soon and see if I can reproduce the issue.
>
> > ... I think it's caused by the relay-chain instance that is started (despite --relay-chain-rpc flag)
>
> Is this correct behavior @skunert?

@altonen Sorry, I missed this comment. The subsystems that are started in the minimal node have network activity, but it should not be significant. I will start one up again and see if the subsystems send anything unusual.

@skunert (Contributor) commented Jun 27, 2023

Checked again, still cannot reproduce. If the relay chain node were the culprit, I would expect to see some messages from the parachain log target, which I enabled this time. But there is nothing unexpected.

Some more ideas:

  1. What does network activity look like if you start a node without --relay-chain-rpc-urls? This will spawn a full Polkadot node internally, but even with that I would not expect 20 Mbps up/download all the time, so it is worth checking whether it makes a difference.
  2. For the node with --relay-chain-rpc-urls, can you additionally enable the -lparachain=trace,cumulus-pov-recovery=trace log targets?

@bkchr (Member) commented Jun 28, 2023

As @tugytur is running a bootnode, we should look into the stats of one of our bootnodes.

@altonen (Contributor) commented Jun 29, 2023

@tugytur

Do you experience higher syncing speeds between the versions? I'm able to reproduce the higher bandwidth usage on my other computer: while on this computer I get ~70 bps for syncing, on the other computer I get 500+ bps, which does account for a large portion of the used bandwidth. substrate_sub_libp2p_network_bytes_total reports TX to be around 1 KB and RX to be 1000-1200 KB per interval.

I'm wondering if the remaining traffic is accounted for by the relay chain RPC connection, but I don't know how to see those numbers, and I couldn't find any Prometheus metrics for it either.

@skunert (Contributor) commented Jul 5, 2023

> I'm wondering if the other traffic is accounted for by the relay chain RPC connection but I don't know how to see those numbers and I couldn't find any Prometheus metrics for it either.

The traffic over the RPC connection is really low. We are listening to import and finality notifications, and that's it. There are some runtime calls we make, but those are mostly for collators. But yeah, I should maybe introduce some metrics for this.

@tugytur (Contributor, author) commented Jul 5, 2023

I tested today by running the relay chain node as well, instead of using the relay-chain-rpc-urls flag; still the same issue.

To be sure that the issue is not caused by our own hypervisor/infrastructure, I did some testing today with an EC2 instance on AWS.

There were no firewall restrictions, and I removed many of the flags just in case.
I did tests for all bridge-hub system chains; the results were quite different.
The binary is the latest available version from the repo.

ExecStart=/usr/local/bin/polkadot-parachain \
        --name="rpcpbh-003" \
        --chain=bridge-hub-polkadot \
        --database=paritydb \
        --sync=warp \
        --prometheus-port=9615 \
        --rpc-port 9944 \
        --max-parallel-downloads=2 \
        --runtime-cache-size=1 \
        --execution=wasm \
        --wasm-execution=compiled \
        --relay-chain-rpc-urls="wss://rpc.polkadot.io" \
        --base-path=/var/blockchain

^This one averaged around 18Mb/s for each direction.

[Service]
User=blockchain
Group=blockchain
ExecStart=/usr/local/bin/polkadot-parachain \
        --name="rpcpbh-003" \
        --chain=bridge-hub-kusama \
        --database=paritydb \
        --sync=warp \
        --prometheus-port=9615 \
        --rpc-port 9944 \
        --max-parallel-downloads=2 \
        --runtime-cache-size=1 \
        --execution=wasm \
        --wasm-execution=compiled \
        --relay-chain-rpc-urls="wss://kusama-rpc.polkadot.io" \
        --base-path=/var/blockchain

^This one averaged around 13Mb/s for each direction.

[Service]
User=blockchain
Group=blockchain
ExecStart=/usr/local/bin/polkadot-parachain \
        --name="rpcpbh-003" \
        --chain=bridge-hub-westend \
        --database=paritydb \
        --sync=warp \
        --prometheus-port=9615 \
        --rpc-port 9944 \
        --max-parallel-downloads=2 \
        --runtime-cache-size=1 \
        --execution=wasm \
        --wasm-execution=compiled \
        --relay-chain-rpc-urls="wss://westend-rpc.polkadot.io" \
        --base-path=/var/blockchain

^This one is the interesting one. With westend the average is 2.4Mb/s for each direction.

[Service]
User=blockchain
Group=blockchain
ExecStart=/usr/local/bin/polkadot-parachain \
        --name="rpcpbh-003" \
        --chain=bridge-hub-westend \
        --database=paritydb \
        --sync=warp \
        --prometheus-port=9615 \
        --rpc-port 9944 \
        --max-parallel-downloads=2 \
        --runtime-cache-size=1 \
        --execution=wasm \
        --wasm-execution=compiled \
        --relay-chain-rpc-urls="wss://westend-rpc.polkadot.io" \
        --base-path=/var/blockchain \
        -- \
        --in-peers=0 \
        --out-peers=0 \
        --in-peers-light=0

^Here I re-added some flags and got down to 300 kb/s for each direction.

@altonen (Contributor) commented Jul 17, 2023

@tugytur

Are the number of peers and syncing speeds the same between your test runs across different chains? Setting inbound/outbound full/light peers to zero basically prevents the SyncingEngine from accepting any peers (except reserved peers), so that would explain why the used bandwidth is much smaller.

If the bandwidth is explained by syncing-related block downloads, I think the interesting question is why it is so low on 0.9.40 if the peer counts are the same between the versions.
