Libp2p stop never resolves when shutting down node #6053

Open
nflaig opened this issue Oct 19, 2023 · 18 comments
Labels
meta-bug Issues that identify a bug and require a fix. prio-medium Resolve this some time soon (tm).

Comments

@nflaig
Member

nflaig commented Oct 19, 2023

Describe the bug

There seems to be an issue since we updated libp2p in #6015. In some cases, the beacon node never shuts down and just hangs indefinitely. This does not happen all the time, but if the node has been running for a longer time it seems to happen quite frequently.

Based on the logs it is clear that the issue is await this.libp2p.stop(), which just never resolves.

See libp2p-shutdown-issue-full.log

devops@Ubuntu-2204-jammy-amd64-base:~/goerli/data/beacon$ cat beacon-2023-10-19.log | grep "closing network core running in network worker"
Oct-19 11:40:56.818[network]         debug: closing network core running in network worker
devops@Ubuntu-2204-jammy-amd64-base:~/goerli/data/beacon$ cat beacon-2023-10-19.log | grep "network sent goodbye to all peers"
Oct-19 11:40:57.826[network]         debug: network sent goodbye to all peers
devops@Ubuntu-2204-jammy-amd64-base:~/goerli/data/beacon$ cat beacon-2023-10-19.log | grep "network peerManager closed"
Oct-19 11:40:57.838[network]         debug: network peerManager closed
devops@Ubuntu-2204-jammy-amd64-base:~/goerli/data/beacon$ cat beacon-2023-10-19.log | grep "network reqResp closed"
Oct-19 11:40:57.863[network]         debug: network reqResp closed
devops@Ubuntu-2204-jammy-amd64-base:~/goerli/data/beacon$ cat beacon-2023-10-19.log | grep "network lib2p closed"
devops@Ubuntu-2204-jammy-amd64-base:~/goerli/data/beacon$ cat beacon-2023-10-19.log | grep "terminating network worker"

This was also brought up by Sea Monkey on Discord; corresponding logs: sea-monkey-shutdown-issue.log

Expected behavior

Libp2p stop should always resolve in a timely manner

Steps to reproduce

Run beacon node on any network

Additional context

Operating system

Linux

Lodestar version or commit hash

unstable (ab2dfdd)

@nflaig nflaig added the meta-bug Issues that identify a bug and require a fix. label Oct 19, 2023
@nflaig nflaig added this to the v1.12.0 milestone Oct 19, 2023
@nazarhussain
Contributor

@nflaig Do you know if there was any special configuration passed to the node when this behavior triggered?

@nflaig
Member Author

nflaig commented Oct 19, 2023

@nflaig Do you know if there was any special configuration passed to the node when this behavior triggered?

I don't think it is related to a specific configuration, it really just happens if the node has been running for a longer while. I had the bn running now for 15 minutes and didn't see the issue, but on two consecutive runs which had an uptime of 12+ hours I did. I can't tell what the minimum amount of time the bn has to run is, or whether it happens consistently.

My command

./lodestar beacon \
    --dataDir /home/devops/goerli/data/beacon \
    --rest \
    --rest.namespace '*' \
    --rest.address "0.0.0.0" \
    --metrics \
    --execution.urls http://localhost:8551 \
    --jwt-secret /home/devops/goerli/data/jwtsecret \
    --logLevel info \
    --network goerli \
    --checkpointSyncUrl "https://beaconstate-goerli.chainsafe.io/"  \
    --subscribeAllSubnets

@nflaig
Member Author

nflaig commented Oct 19, 2023

Just to highlight the difference between this issue and #5642 and #5775:

What happens now is that libp2p stop never resolves; we basically have a hanging promise. The other two issues are related to an active handle that is left behind by libp2p after stop has already resolved. The active handle causes issues shutting down the main process, and since we switched to a worker it prevents the worker from being terminated by Thread.terminate, which ends up as a hanging promise in the end as well.
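
For illustration, a minimal standalone sketch of the difference (contrived functions, not Lodestar or libp2p code):

// Failure mode in this issue: stop() returns a promise that never settles.
async function stopThatHangs(): Promise<void> {
  return new Promise<void>(() => {
    // neither resolve nor reject is ever called, so the caller awaits forever
  });
}

// Failure mode in #5642/#5775: stop() resolves, but an active handle remains.
async function stopThatLeavesHandle(): Promise<void> {
  setInterval(() => {}, 60_000); // this timer handle keeps the event loop alive
}

async function main(): Promise<void> {
  await stopThatLeavesHandle(); // resolves, but the process will not exit on its own
  await stopThatHangs();        // never resolves, nothing below is ever reached
  console.log("never reached");
}

void main();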

Let me know if you need more details. As you can see from those issues, I spent quite some time debugging this as well.

@wemeetagain
Member

@achingbrain any ideas?

@wemeetagain
Member

Can we look into implementing a hack to unblock our v1.12 while we take time to figure out the root cause?

@nflaig
Member Author

nflaig commented Oct 23, 2023

Can we look into implementing a hack to unblock our v1.12 while we take time to figure out the root cause?

Maybe wrap libp2p.stop in a withTimeout, like we already do for sendGoodbye?

await withTimeout(() => this.reqResp.sendGoodbye(peer, BigInt(goodbye)), 1_000);
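
Roughly like this for libp2p.stop (a minimal sketch assuming the same withTimeout helper; the 5s value and the log call are illustrative):

// Hypothetical workaround: bound libp2p.stop with withTimeout so shutdown
// cannot hang forever; timeout value and logging are illustrative only.
try {
  await withTimeout(() => this.libp2p.stop(), 5_000);
} catch (e) {
  // if stop never resolves in time, log it and continue tearing down the rest of the node
  this.logger.error("libp2p.stop did not resolve in time", {}, e as Error);
}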

@achingbrain

any ideas?

Nothing specific but I find why-is-node-running very helpful when tracking down active handles that stop processes from exiting.
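
For reference, a minimal sketch of how why-is-node-running can be used (the 5s delay and where you hook it in are illustrative, not Lodestar code):

// Dump all active handles if the process is still alive well after shutdown should have finished.
import whyIsNodeRunning from "why-is-node-running";

// ... trigger the shutdown path here ...

// If something is still keeping the event loop alive a few seconds later, this prints
// every active handle and the stack that created it; the timer itself is unref'd so
// it does not keep the process running on its own.
setTimeout(() => whyIsNodeRunning(), 5_000).unref();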

@nflaig
Member Author

nflaig commented Oct 23, 2023

any ideas?

Nothing specific but I find why-is-node-running very helpful when tracking down active handles that stop processes from exiting.

It's not related to active handles; the libp2p.stop promise just never resolves.

@achingbrain

The libp2p.stop method just stops each component. Do you have any logs available to see which component is causing the problem?

@nflaig
Member Author

nflaig commented Oct 23, 2023

The libp2p.stop method just stops each component. Do you have any logs available to see which component is causing the problem?

The issue is not that easy to reproduce; right now I only have the logs mentioned here: #6053 (comment). I will run Lodestar with DEBUG="libp2p:*" and see if I can catch it again.

@nflaig
Member Author

nflaig commented Oct 24, 2023

@achingbrain I caught some DEBUG logs (libp2p-debug.log). I didn't see any libp2p:* logs afterwards, but the libp2p.stop promise never resolved, and the process was not shutting down because of that.

You can ignore the Error: PublishError.InsufficientPeers logs; this seems to be an issue where our REST API still allows requests to come in even though the network is closing.

@achingbrain

achingbrain commented Oct 24, 2023

It's not clear from the logs when shutdown is started (this line is not present), but it looks like it's still accepting incoming connections, so perhaps the TCP socket isn't being closed properly now.

Does it happen with @libp2p/tcp from before libp2p/js-libp2p#2058? E.g. when pinned to 8.0.7

Or could one of these lines be throwing and the error swallowed?

@nflaig
Member Author

nflaig commented Oct 24, 2023

It's not clear from the logs when shutdown is started (this line is not present)

I was capturing the logs in tmux, which seems to only capture the last 2k lines...

Does it happen with @libp2p/tcp from before libp2p/js-libp2p#2058? E.g. when pinned to 8.0.7

It has been happening since #6015, which bumped @libp2p/tcp from 8.0.2 to 8.0.8, but that PR also includes a bunch of other libp2p dependency updates.

Or could one of these lines be throwing and the error swallowed?

No, those are really harmless and just remove listeners. The code there has not been modified recently.

E.g. when pinned to 8.0.7

I can try pinning @libp2p/tcp to 8.0.7, but the problem is that it is only easy to confirm that the issue still exists; I can't really confirm that it doesn't. We had issues before with libp2p active handles (#5642 and #5775), hadn't seen those for a week and regarded them as resolved, but then caught them again...

@nflaig
Member Author

nflaig commented Oct 27, 2023

Does it happen with @libp2p/tcp from before libp2p/js-libp2p#2058? E.g. when pinned to 8.0.7

I haven't seen the issue so far after downgrading @libp2p/tcp to 8.0.7. It's been only 3 days so I can't tell for sure, but there seems to be a high probability that the issue was introduced in @libp2p/tcp v8.0.8.

@nflaig
Member Author

nflaig commented Oct 31, 2023

@achingbrain it's been a week now and I haven't seen the issue, whereas before it would happen at least once a day on a long-running beacon node process. If libp2p/js-libp2p#2058 was the only change included in v8.0.8, I'm pretty confident that it is the cause of the problem.

@achingbrain

The release notes for @libp2p/tcp@8.0.8 are here and the only change was libp2p/js-libp2p#2058

The change in that PR was around TCP shutdown, and since Lodestar is now failing to shut down, it seems like a thread worth pulling on.

@nazarhussain @wemeetagain @maschad - can you please take a look and see if the changes in that PR have introduced this issue?

@philknows philknows removed this from the v1.12.0 milestone Nov 24, 2023
@philknows philknows added the prio-low This is nice to have. label Nov 27, 2023
@nflaig nflaig added prio-medium Resolve this some time soon (tm). and removed prio-low This is nice to have. labels Feb 28, 2024
@nflaig
Member Author

nflaig commented Apr 19, 2024

@achingbrain is libp2p/js-libp2p#2421 expected to fix this issue?

I can still see timeouts due to libp2p.stop not resolving in time, but our unstable branch should have that change, as we recently updated @libp2p/tcp in #6676.

This issue is quite easy to reproduce now; since we bumped our peer count to 100 it has become much more frequent.

I was thinking maybe 5s is no longer sufficient for the timeout, but removing it (unstable...nflaig/remove-libp2p-stop-timeout) introduces this issue again where the process is just stuck, as the promise never resolves.

@nflaig
Member Author

nflaig commented Oct 15, 2024
