
Overall improvement of the Debug Controller #2437

Merged: 12 commits into 2-dev from improve-node-eviction-prevention, Mar 6, 2023

Conversation

@Shiranuit (Contributor) commented Feb 28, 2023

What does this PR do?

This PR makes several improvements to the stability of the Kuzzle Cluster and of the Debugger while debugging:

Node eviction prevention

The node eviction prevention mode has been improved. Previously, when a node was being used for debugging, the cluster was notified not to evict that node, even if the other subscribed nodes did not receive its heartbeat in time.
This is necessary when taking a snapshot of the Node.js process, since the process is unresponsive while the snapshot is taken.
However, once a node protected this way recovered from taking a heap snapshot, it could in turn evict other nodes: its threads had been frozen during the snapshot, so it had not received any of their heartbeats in the meantime.

A change has been made so that a node currently protected from eviction also loses the right to evict other nodes from the cluster.
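
As an illustration, here is a minimal sketch of that rule, assuming a hypothetical node state object; this is not Kuzzle's actual cluster code:

function canEvictOtherNodes(node) {
  // A node currently protected from eviction was likely frozen (e.g. by a
  // heap snapshot) and holds stale heartbeat data, so it temporarily loses
  // the right to evict its peers. (`evictionPrevented` is hypothetical.)
  return !node.evictionPrevented;
}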

Another change concerns the ID Card Renewer. Before this PR, the ID Card Renewer was a worker thread whose sole purpose was to keep the node's ID Card alive inside Redis.
But when taking a heap snapshot, the debugger snapshots every thread owned by the process, which froze the ID Card Renewer and prevented it from renewing the ID Card in time; as a result, the node evicted itself and could not be seen by new nodes.
The ID Card Renewer is now created in a separate process via a fork and maintains a communication channel with the main process.
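
Here is a minimal sketch of that fork-based design, assuming a hypothetical id-card-renewer.js script and illustrative message shapes; this is not Kuzzle's actual code:

const { fork } = require("node:child_process");

// The renewer now lives in its own process: a heap snapshot freezes every
// thread of the debugged process, but not a forked child.
const renewer = fork("./id-card-renewer.js");

// Communication channel with the main process (all names are illustrative).
renewer.send({ action: "start", nodeId: "knode-example", refreshDelay: 2000 });
renewer.on("message", (message) => {
  if (message.error) {
    console.error("ID Card renewal failed:", message.error);
  }
});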

Segfault when taking a Heap Snapshot

We discovered an issue where taking a heap snapshot via the Inspector caused Node.js to segfault when the reportProgress flag was set to true, because the inspector tried to make a call into the JS heap while the heap was being walked for the snapshot.
See this issue: nodejs/node#44634

Now, when reportProgress is set to true, we override it and change it to false. This, however, prevents the Chrome Inspector from parsing the snapshot even though it has received every chunk of it.
The Chrome Inspector relies on the events emitted by reportProgress to show a progress bar representing the state of the ongoing snapshot, and to know when the snapshot is finished, so that it can wait for all the snapshot chunks to be received before starting to parse.
To force the Chrome Inspector to parse the snapshot anyway, we fake the HeapProfiler.reportHeapSnapshotProgress event and send the following params directly after receiving reportProgress: true:

{
  done: 0,
  finished: true,
  total: 0,
}

The done and total parameters are only used to draw the progress bar, so their values do not matter; only the finished property is used by the Chrome Inspector, which then spawns a worker that waits for all the chunks to be received and starts parsing the snapshot.
Sending the HeapProfiler.reportHeapSnapshotProgress event before any of the HeapProfiler.addHeapSnapshotChunk events has no impact; in fact, this is what happens on every snapshot: multiple HeapProfiler.reportHeapSnapshotProgress events are emitted until one carries finished: true, then all the HeapProfiler.addHeapSnapshotChunk events are emitted with the snapshot chunks.

Allow socket connections used for debugging to bypass some of Kuzzle's limitations

When taking a heap snapshot, the debugger emits a lot of HeapProfiler.addHeapSnapshotChunk events containing the snapshot chunks. Those chunks are often bigger than 1 MB, and there are so many of them that they can accumulate in the backpressure buffer dedicated to the socket connection and exceed the backpressure buffer limit from the KuzzleRC, causing Kuzzle to close the socket.
This is a huge problem, since it disconnects the debugger and prevents us from receiving all the chunks of a snapshot.

To avoid this, when a WebSocket connection uses the debugger and listens to any of its events by calling debug:addListener, the connection is marked as used for debugging and is allowed to exceed those limits, as shown in the sketch below.
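
The exemption could look like this minimal sketch, where the isMarkedForDebug flag and the socket wrapper are hypothetical, not Kuzzle's actual WebSocket code:

function canBufferChunk(socket, chunk, backpressureLimit) {
  // Connections marked for debugging may exceed the KuzzleRC backpressure
  // limit, so snapshot chunks are never the reason a socket gets closed.
  if (socket.isMarkedForDebug) {
    return true;
  }
  return socket.bufferedBytes + chunk.length <= backpressureLimit;
}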

Another issue could occur when Kuzzle requests are added to the pending buffer because there are too many requests to process: requests issued to the debug controller, meant to debug Kuzzle, might be put at the end of the buffer and never be processed, or be processed so late that their result becomes useless.
To counter that, requests from WebSocket connections previously marked as used for debugging are put at the top of the queue, allowing them to be processed first when Kuzzle is under pressure.
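
A minimal sketch of that priority queueing, assuming a hypothetical pending queue and the same debug flag as above; this is not the actual code from lib/api/funnel.js:

function enqueuePendingRequest(pendingQueue, request, connection) {
  if (connection.isMarkedForDebug) {
    // Debugging traffic jumps the line so it is still processed promptly
    // while Kuzzle is under pressure.
    pendingQueue.unshift(request);
  } else {
    pendingQueue.push(request);
  }
}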

@kuzzle (Contributor) commented Feb 28, 2023

Executed Checks

✔️ Branch Merge Check

✔️ Changelog Tag Check

✔️ PR Body Check

Generated by 🚫 dangerJS against 1783d66

@Shiranuit Shiranuit marked this pull request as ready for review February 28, 2023 14:56
lib/api/funnel.js: review comment (outdated, resolved)
@sonarcloud bot commented Mar 1, 2023

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

No coverage information
0.0% duplication

@Shiranuit Shiranuit merged commit 3b87675 into 2-dev Mar 6, 2023
@Shiranuit Shiranuit mentioned this pull request Mar 6, 2023
@Shiranuit Shiranuit deleted the improve-node-eviction-prevention branch April 13, 2023 07:51