Libp2p stop never resolves when shutting down node #6053

Open
nflaig opened this issue Oct 19, 2023 · 18 comments
Labels
meta-bug Issues that identify a bug and require a fix. prio-medium Resolve this some time soon (tm).

Comments

@nflaig
Member

nflaig commented Oct 19, 2023

Describe the bug

There seems to be an issue since we updated libp2p in #6015. In some cases, the beacon node never shuts down and just hangs indefinitely. This does not happen all the time, but if the node has been running for a longer time it seems to happen quite frequently.

Based on the logs it is clear that the issue is await this.libp2p.stop(), which just never resolves.

See libp2p-shutdown-issue-full.log

devops@Ubuntu-2204-jammy-amd64-base:~/goerli/data/beacon$ cat beacon-2023-10-19.log | grep "closing network core running in network worker"
Oct-19 11:40:56.818[network]         debug: closing network core running in network worker
devops@Ubuntu-2204-jammy-amd64-base:~/goerli/data/beacon$ cat beacon-2023-10-19.log | grep "network sent goodbye to all peers"
Oct-19 11:40:57.826[network]         debug: network sent goodbye to all peers
devops@Ubuntu-2204-jammy-amd64-base:~/goerli/data/beacon$ cat beacon-2023-10-19.log | grep "network peerManager closed"
Oct-19 11:40:57.838[network]         debug: network peerManager closed
devops@Ubuntu-2204-jammy-amd64-base:~/goerli/data/beacon$ cat beacon-2023-10-19.log | grep "network reqResp closed"
Oct-19 11:40:57.863[network]         debug: network reqResp closed
devops@Ubuntu-2204-jammy-amd64-base:~/goerli/data/beacon$ cat beacon-2023-10-19.log | grep "network lib2p closed"
devops@Ubuntu-2204-jammy-amd64-base:~/goerli/data/beacon$ cat beacon-2023-10-19.log | grep "terminating network worker"

This was also brought up by Sea Monkey on Discord; corresponding logs: sea-monkey-shutdown-issue.log

Expected behavior

Libp2p stop should always resolve in a timely manner

Steps to reproduce

Run beacon node on any network

Additional context

Operating system

Linux

Lodestar version or commit hash

unstable (ab2dfdd)

@nflaig nflaig added the meta-bug Issues that identify a bug and require a fix. label Oct 19, 2023
@nflaig nflaig added this to the v1.12.0 milestone Oct 19, 2023
@nazarhussain
Contributor

@nflaig Do you know if there was any special configuration passed to the node when this behavior triggered?

@nflaig
Member Author

nflaig commented Oct 19, 2023

@nflaig Do you know if there was any special configuration passed to the node when this behavior triggered?

I don't think it is related to a specific configuration, it really just happens if the node has been running for a longer while. I had the bn running now for 15 minutes and didn't see the issue, but on two consecutive runs which had an uptime of 12+ hours I did. I can't tell what the minimum amount of time the bn has to run is, or whether it happens consistently.

My command

./lodestar beacon \
    --dataDir /home/devops/goerli/data/beacon \
    --rest \
    --rest.namespace '*' \
    --rest.address "0.0.0.0" \
    --metrics \
    --execution.urls http://localhost:8551 \
    --jwt-secret /home/devops/goerli/data/jwtsecret \
    --logLevel info \
    --network goerli \
    --checkpointSyncUrl "https://beaconstate-goerli.chainsafe.io/"  \
    --subscribeAllSubnets

@nflaig
Member Author

nflaig commented Oct 19, 2023

Just to highlight the difference between this issue and #5642 and #5775:

What happens now is that libp2p stop never resolves; we basically have a hanging promise. The other two issues are related to an active handle that is left behind by libp2p after stop has already resolved. The active handle causes issues shutting down the main process, and since we switched to a worker it prevents the worker from being terminated by Thread.terminate, which ends up as a hanging promise in the end as well.
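
For illustration, a minimal standalone sketch of the difference (contrived functions, not Lodestar or libp2p code):

// Failure mode in this issue: stop() returns a promise that never settles.
async function stopThatHangs(): Promise<void> {
  return new Promise<void>(() => {
    // neither resolve nor reject is ever called, so the caller awaits forever
  });
}

// Failure mode in #5642/#5775: stop() resolves, but an active handle remains.
async function stopThatLeavesHandle(): Promise<void> {
  setInterval(() => {}, 60_000); // this timer handle keeps the event loop alive
}

async function main(): Promise<void> {
  await stopThatLeavesHandle(); // resolves, but the process will not exit on its own
  await stopThatHangs();        // never resolves, nothing below is ever reached
  console.log("never reached");
}

void main();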

Let me know if you need more details. As you can see from those issues, I spent quite some time debugging this as well.

@wemeetagain
Member

@achingbrain any ideas?

@wemeetagain
Member

Can we look into implementing a hack to unblock our v1.12 while we take time to figure out the root cause?

@nflaig
Member Author

nflaig commented Oct 23, 2023

Can we look into implementing a hack to unblock our v1.12 while we take time to figure out the root cause?

Maybe wrap libp2p.stop in a withTimeout, like we already do for sendGoodbye?

await withTimeout(() => this.reqResp.sendGoodbye(peer, BigInt(goodbye)), 1_000);
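
Roughly like this for libp2p.stop (a minimal sketch assuming the same withTimeout helper; the 5s value and the log call are illustrative):

// Hypothetical workaround: bound libp2p.stop with withTimeout so shutdown
// cannot hang forever; timeout value and logging are illustrative only.
try {
  await withTimeout(() => this.libp2p.stop(), 5_000);
} catch (e) {
  // if stop never resolves in time, log it and continue tearing down the rest of the node
  this.logger.error("libp2p.stop did not resolve in time", {}, e as Error);
}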

@achingbrain

any ideas?

Nothing specific but I find why-is-node-running very helpful when tracking down active handles that stop processes from exiting.
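
For reference, a minimal sketch of how why-is-node-running can be used (the 5s delay and where you hook it in are illustrative, not Lodestar code):

// Dump all active handles if the process is still alive well after shutdown should have finished.
import whyIsNodeRunning from "why-is-node-running";

// ... trigger the shutdown path here ...

// If something is still keeping the event loop alive a few seconds later, this prints
// every active handle and the stack that created it; the timer itself is unref'd so
// it does not keep the process running on its own.
setTimeout(() => whyIsNodeRunning(), 5_000).unref();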

@nflaig
Member Author

nflaig commented Oct 23, 2023

any ideas?

Nothing specific but I find why-is-node-running very helpful when tracking down active handles that stop processes from exiting.

It's not related to active handles; the libp2p.stop promise just never resolves.

@achingbrain

The libp2p.stop method just stops each component. Do you have any logs available to see which component is causing the problem?

@nflaig
Member Author

nflaig commented Oct 23, 2023

The libp2p.stop method just stops each component. Do you have any logs available to see which component is causing the problem?

The issue is not that easy to reproduce; right now I only have the logs mentioned here: #6053 (comment). I will run Lodestar with DEBUG="libp2p:*" and see if I can catch it again.

@nflaig
Member Author

nflaig commented Oct 24, 2023

@achingbrain I caught some DEBUG logs (libp2p-debug.log). I didn't see any libp2p:* logs afterwards, but the libp2p.stop promise never resolved, and the process was not shutting down because of that.

You can ignore the Error: PublishError.InsufficientPeers logs; this seems to be an issue where our REST API still allows requests to come in even though the network is closing.

@achingbrain

achingbrain commented Oct 24, 2023

It's not clear from the logs when shutdown is started (this line is not present), but it looks like it's still accepting incoming connections, so perhaps the TCP socket isn't being closed properly now.

Does it happen with @libp2p/tcp from before libp2p/js-libp2p#2058? E.g. when pinned to 8.0.7

Or could one of these lines be throwing and the error swallowed?

@nflaig
Member Author

nflaig commented Oct 24, 2023

It's not clear from the logs when shutdown is started (this line is not present)

I was capturing the logs in tmux, which seems to only capture the last 2k lines...

Does it happen with @libp2p/tcp from before libp2p/js-libp2p#2058? E.g. when pinned to 8.0.7

It has been happening since #6015, which bumped @libp2p/tcp from 8.0.2 to 8.0.8, but that PR also includes a bunch of other libp2p dependency updates.

Or could one of these lines be throwing and the error swallowed?

No, those are really harmless and just remove listeners. The code there has not been modified recently.

E.g. when pinned to 8.0.7

I can try pinning @libp2p/tcp to 8.0.7, but the problem is that it is only easy to confirm that the issue still exists; I can't really confirm that it doesn't. We had issues before with libp2p active handles (#5642 and #5775), hadn't seen those for a week and regarded them as resolved, but then caught them again...

@nflaig
Member Author

nflaig commented Oct 27, 2023

Does it happen with @libp2p/tcp from before libp2p/js-libp2p#2058? E.g. when pinned to 8.0.7

I haven't seen the issue so far after downgrading @libp2p/tcp to 8.0.7. It's been only 3 days so I can't tell for sure, but there seems to be a high probability that the issue was introduced in @libp2p/tcp v8.0.8.

@nflaig
Member Author

nflaig commented Oct 31, 2023

@achingbrain it's been a week now and I haven't seen the issue, whereas before it would happen at least once a day on a long-running beacon node process. If libp2p/js-libp2p#2058 was the only change included in v8.0.8, I'm pretty confident that it is the cause of the problem.

@achingbrain

The release notes for @libp2p/tcp@8.0.8 are here and the only change was libp2p/js-libp2p#2058

The change in that PR was around TCP shutdown, and since Lodestar is now failing to shut down, it seems like a thread worth pulling on.

@nazarhussain @wemeetagain @maschad - can you please take a look and see if the changes in that PR have introduced this issue?

@philknows philknows removed this from the v1.12.0 milestone Nov 24, 2023
@philknows philknows added the prio-low This is nice to have. label Nov 27, 2023
@nflaig nflaig added prio-medium Resolve this some time soon (tm). and removed prio-low This is nice to have. labels Feb 28, 2024
@nflaig
Member Author

nflaig commented Apr 19, 2024

@achingbrain is libp2p/js-libp2p#2421 expected to fix this issue?

I can still see timeouts due to libp2p.stop not resolving in time, but our unstable branch should have that change, as we recently updated @libp2p/tcp in #6676.

This issue is quite easy to reproduce now; since we bumped our peer count to 100 it has become much more frequent.

I was thinking maybe 5s is no longer sufficient for the timeout, but removing it (unstable...nflaig/remove-libp2p-stop-timeout) introduces this issue again where the process is just stuck, as the promise never resolves.

@nflaig
Member Author

nflaig commented Oct 15, 2024
