Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Beacon node active handles after close sequence is completed #5642

Open
nflaig opened this issue Jun 14, 2023 · 7 comments
Open

Beacon node active handles after close sequence is completed #5642

nflaig opened this issue Jun 14, 2023 · 7 comments
Labels
meta-bug Issues that identify a bug and require a fix.

Comments

@nflaig
Copy link
Member

nflaig commented Jun 14, 2023

Describe the bug

In some rare cases when the node is running for a longer time it does not exit when receiving a exit signal.

Active handles

[WTF Node?] open handles:
- File descriptors: (note: stdio always exists)
  - fd 2 (tty) (stdio)
  - fd 1 (tty) (stdio)
  - fd 0 (tty)
- Sockets:
  - 135.181.2.45:9000 -> undefined:undefined
  - 135.181.2.45:9000 -> undefined:undefined
.... (45 more)
  - 135.181.2.45:9000 -> undefined:undefined
  - 135.181.2.45:9000 -> 57.128.86.167:45906
  - 135.181.2.45:9000 -> 34.222.49.115:9252
- Servers:
  - 0.0.0.0:9000 (TCP)
    - Listeners:
      - connection: bound onSocket @ file:///home/devops/goerli/lodestar/node_modules/@libp2p/tcp/src/listener.ts:78
- Timers:
  - (30000 ~ 30 s) (anonymous) @ /home/devops/goerli/lodestar/node_modules/retimer/retimer.js:25

The Sockets that show 135.181.2.45:9000 -> undefined:undefined can probably be ignored as those are also there if beacon node exits correctly.

Debug logs

Jun-14 09:58:42.886[network]       verbose: Failed to send goodbye peer=16...3dHdMu Timeout
Error: Timeout
    at timeoutPromise (file:///home/devops/goerli/lodestar/packages/utils/src/timeout.ts:15:11)
    at runNextTicks (node:internal/process/task_queues:60:5)
    at listOnTimeout (node:internal/timers:538:9)
    at processTimers (node:internal/timers:512:7)
    at withTimeout (file:///home/devops/goerli/lodestar/packages/utils/src/timeout.ts:19:12)
    at PeerManager.goodbyeAndDisconnect (file:///home/devops/goerli/lodestar/packages/beacon-node/src/network/peers/peerManager.ts:648:7)
    at async Promise.all (index 50)
    at PeerManager.goodbyeAndDisconnectAllPeers (file:///home/devops/goerli/lodestar/packages/beacon-node/src/network/peers/peerManager.ts:223:5)
    at NetworkCore.close (file:///home/devops/goerli/lodestar/packages/beacon-node/src/network/core/networkCore.ts:255:5)
    at Network.close (file:///home/devops/goerli/lodestar/packages/beacon-node/src/network/network.ts:197:5)
Jun-14 09:58:42.887[network]       verbose: Req  done method=goodbye, version=1, encoding=ssz_snappy, client=Lighthouse, peer=16...3dHdMu, requestId=1769
Jun-14 09:58:42.887[network]       verbose: Failed to send goodbye peer=16...GW2Ein Timeout
Error: Timeout
    at timeoutPromise (file:///home/devops/goerli/lodestar/packages/utils/src/timeout.ts:15:11)
    at withTimeout (file:///home/devops/goerli/lodestar/packages/utils/src/timeout.ts:19:12)
    at PeerManager.goodbyeAndDisconnect (file:///home/devops/goerli/lodestar/packages/beacon-node/src/network/peers/peerManager.ts:648:7)
    at async Promise.all (index 54)
    at PeerManager.goodbyeAndDisconnectAllPeers (file:///home/devops/goerli/lodestar/packages/beacon-node/src/network/peers/peerManager.ts:223:5)
    at NetworkCore.close (file:///home/devops/goerli/lodestar/packages/beacon-node/src/network/core/networkCore.ts:255:5)
    at Network.close (file:///home/devops/goerli/lodestar/packages/beacon-node/src/network/network.ts:197:5)
    at BeaconNode.close (file:///home/devops/goerli/lodestar/packages/beacon-node/src/node/nodejs.ts:316:7)
    at EventTarget.abortController.signal.addEventListener.once (file:///home/devops/goerli/lodestar/packages/cli/src/cmds/beacon/handler.ts:137:11)
Jun-14 09:58:42.889[network]         debug: network sent goodbye to all peers
Jun-14 09:58:42.889[network]       verbose: Req  done method=goodbye, version=1, encoding=ssz_snappy, client=Teku, peer=16...GW2Ein, requestId=1773
Jun-14 09:58:42.932[network]         debug: network peerManager closed
Jun-14 09:58:42.934[network]         debug: network gossip closed
Jun-14 09:58:43.038[network]         debug: Dialing discovered peer peer=16...HRUh4M
Jun-14 09:58:43.038[network]         debug: Dialing discovered peer peer=16...zqJUKj
Jun-14 09:58:43.130[network]         debug: network reqResp closed
Jun-14 09:58:43.146[network]         debug: Error dialing discovered peer peer=16...m56aPB All promises were rejected
AggregateError: All promises were rejected
Jun-14 09:58:43.146[network]         debug: Error dialing discovered peer peer=16...MS9Luf All promises were rejected
AggregateError: All promises were rejected
Jun-14 09:58:43.146[network]         debug: Error dialing discovered peer peer=16...HRUh4M All promises were rejected
AggregateError: All promises were rejected
Jun-14 09:58:43.152[network]         debug: Error dialing discovered peer peer=16...zqJUKj All promises were rejected
AggregateError: All promises were rejected
Jun-14 09:58:43.152[network]         debug: Error dialing discovered peer peer=16...yH119U All promises were rejected
AggregateError: All promises were rejected
Jun-14 09:58:43.153[network]         debug: network lib2p closed
Jun-14 09:58:43.153[network]         debug: network core closed
Jun-14 09:58:43.649[chain]         verbose: Archived finalized state finalizedEpoch=182885
Jun-14 09:58:44.617[]                debug: Beacon node closed

What looks really suspicious is that we are dialing peers after network is already closed

Jun-14 09:58:43.038[network]         debug: Dialing discovered peer peer=16...HRUh4M
Jun-14 09:58:43.038[network]         debug: Dialing discovered peer peer=16...zqJUKj

Expected behavior

Beacon node should always exit process after close sequence succeeds. No explicit process.exit should be required.

Steps to reproduce

Run beacon node for a while with the following command (or similar)

./lodestar beacon \
    --dataDir /home/devops/goerli/data/beacon \
    --rest \
    --rest.address "0.0.0.0" \
    --rest.namespace '*' \
    --metrics \
    --execution.urls http://localhost:8551 \
    --jwt-secret /home/devops/goerli/data/jwtsecret \
    --logLevel info \
    --network goerli \
    --checkpointSyncUrl "https://beaconstate-goerli.chainsafe.io/"

CTRL + C to exit gracefully and observe that it is not shutting down even though close sequence is executed successfully.

Operating system

Linux

Lodestar version or commit hash

unstable (bf58427)

@nflaig
Copy link
Member Author

nflaig commented Jun 26, 2023

Another issue reported on discord, the BN process is not shutting down but based on the logs (beacon-2023-06-25.log) it is hard to tell what the issue is.

We might have to call process.exit explicitly after beacon node closed if this issue can't be resolved.

logger.debug("Beacon node closed");

@nflaig nflaig changed the title Beacon node process does not always exit Beacon node active handles after close sequence is completed Jun 27, 2023
@nflaig
Copy link
Member Author

nflaig commented Jul 7, 2023

The issue seems to be fixed with upgrade to libp2p to 0.45.9 in 7280234.

The PR that fixed the issue:

@nflaig
Copy link
Member Author

nflaig commented Jul 11, 2023

Reopening as issue does not yet seem to be resolved.

Looks like we are pinging peers after sending goodbye

Jun-25 02:58:27.655[network]         �[34mdebug�[39m: Req  dialing peer method=goodbye, version=1, encoding=ssz_snappy, client=Lighthouse, peer=16...VYqoz3, requestId=295952
Jun-25 02:58:27.903[network]         �[34mdebug�[39m: Req  received method=ping, client=Lighthouse, peer=16...VYqoz3, requestId=295961
Jun-25 02:58:27.903[network]       �[36mverbose�[39m: Resp done method=ping, client=Lighthouse, peer=16...VYqoz3, requestId=295961

or still dialing peers while disconnecting (goodbyeAndDisconnectAllPeers)

Jun-25 02:58:28.311[network]         �[34mdebug�[39m: Req  dialing peer method=metadata, version=2, encoding=ssz_snappy, client=Unknown, peer=16...ohL5Xo, requestId=295963

Based on the current network close sequence it looks like we close the peer manager after disconnecting peers which could explain this behavior

await this.peerManager.goodbyeAndDisconnectAllPeers();
this.logger.debug("network sent goodbye to all peers");
await this.peerManager.close();
this.logger.debug("network peerManager closed");

as only after closing, the intervals and event listeners are removed

async close(): Promise<void> {
await this.discovery?.stop();
this.libp2p.services.components.events.removeEventListener(Libp2pEvent.connectionOpen, this.onLibp2pPeerConnect);
this.libp2p.services.components.events.removeEventListener(
Libp2pEvent.connectionClose,
this.onLibp2pPeerDisconnect
);
this.networkEventBus.off(NetworkEvent.reqRespRequest, this.onRequest);
for (const interval of this.intervals) clearInterval(interval);
}

Changing the order of the closing sequence will likely fix the issue, but extensive testing is required as this issue is hard to reproduce.

@nflaig nflaig reopened this Jul 11, 2023
@nflaig nflaig added the meta-investigate Issues found that require further investigation and may not have a specific resolution/fix label Jul 12, 2023
@nflaig
Copy link
Member Author

nflaig commented Jul 12, 2023

Proposed solution #5642 (comment) does not resolve the issue.

The problem seems to be with libp2p which in the end must take care of closing all connections / removing tcp listeners. There are several closed but also open issues regarding connections not being closed properly.

This comment libp2p/js-libp2p#436 (comment) summarizes open tasks but there was no progress in a while.

For now, we just have to explicitly exit process until upstream issues are fixed.

@nflaig nflaig added meta-bug Issues that identify a bug and require a fix. and removed meta-investigate Issues found that require further investigation and may not have a specific resolution/fix labels Jul 19, 2023
@nflaig
Copy link
Member Author

nflaig commented Oct 13, 2023

Potential fix has been merged to unstable

It is hard to verify if this actually fixed the issue due to the fact that it is not really reproducible and happens rarely.

Two things have to be done to confirm a fix

  • set --network.useWorker false and
  • explicit exit has to be removed
  • then run beacon node for multiple hours and shutdown process, likely after a week of doing that without hanging process we can be confident enough and close this issue

Note: there is a chance that the process would also hang with --network.useWorker true set (#5775 <-- this issue was still observed recently) but this happened even less often, hence it is harder to confirm a fix of the issue.

@wemeetagain
Copy link
Member

@nflaig I think this issue is resolved now?

@nflaig
Copy link
Member Author

nflaig commented Oct 15, 2024

I don't think so, did libp2p releae a fix for this? Since we switched to running libp2p in a worker we get a differnet issue #5775 but this one should be reproducible by setting --network.useWorker false

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
meta-bug Issues that identify a bug and require a fix.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants