test: avoid test-cluster-master-kill flakiness #6531

stefanmb · 2016-05-02T20:54:46Z

Checklist

tests and code linting passes
a test and/or benchmark is included
the commit message follows commit guidelines

Affected core subsystem(s)

test, cluster

Description of change

I've observed that test-cluster-master-kill fails intermittently on an AIX 6.1 machine (oslevel 6100-07-08-1339) because the test times out before the worker terminates. A previous PR (nodejs/node-v0.x-archive#9431) arbitrarily increased the timeout for AIX to 1 second, but this value is still a guess and appears to be insufficient. The arbitrary timeouts have also caused problems on other platforms; see #2891 (comment). Arbitrary timeouts cannot compensate for external factors such as system load.

In this PR I propose removing the timeout mechanism entirely. Here is how the test currently works:

1. Fork a master process from the test harness.
2. The master uses cluster.fork to create a worker.
3. The worker starts an HTTP server.
4. The master will kill itself with process.exit.
5. The worker will then also call process.exit as part of its IPC pipe disconnect handler. (*)
6. The test harness will wait an arbitrary amount of time after the master's exit and then check the child's liveness. If the child is still alive, the test fails.

(*) Without this mechanism the worker would become an orphan child of init.

Step 6 is inherently flaky. The test was originally added as part of nodejs/node-v0.x-archive@94d337e#diff-0faa53fc02580d5de2ebb484c41d691cR498 where it specifically tested the new disconnect pathway.
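
For reference, a liveness check of this kind is commonly written by sending signal 0 to the pid. The helper below is a minimal sketch under that assumption: the test's own isAlive implementation is not shown in this thread, and the EPERM handling is added here for illustration only.

```javascript
// Sketch of a pid liveness check; not taken from the PR diff.
function isAlive(pid) {
  try {
    // Signal 0 performs the existence and permission checks
    // without delivering any signal to the target process.
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM: the process exists but we are not allowed to signal it.
    // Any other error (typically ESRCH) is treated as "not alive".
    return err.code === 'EPERM';
  }
}
```

Calling a helper like this exactly once after a fixed delay is what makes the check sensitive to system load: a slow worker is indistinguishable from one that never exits.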

The test boils down to the following actions:

1. Cause the worker's disconnect handler to be run by calling exit(0) in the cluster master.
2. Ensure that the disconnect handler also exits the worker process.

Since step 2 does not actually require step 1, I propose the following alternate flow:

1. Fork a master process from the test harness.
2. The master uses cluster.fork to create a worker.
3. The worker starts an HTTP server.
4. The master kills the IPC pipe (note this is not the same as using the graceful cluster shutdown API in nodejs/node-v0.x-archive@6c383c6), which causes the child to disconnect itself. Ungracefully closing the IPC pipe causes the child to execute the same disconnect handler it executes when the master calls exit(0).
5. The master waits for the child's exit event.
6. The test harness checks for the child's liveness after the master's exit event.

With this setup there is no need for the arbitrary wait time. The obvious problem is that the test will hang if the child never exits. In that case, however, the test will still be killed by the test harness's global timeout.

I do believe a timeout mechanism is useful for detecting liveness issues, but the presence of arbitrary timeouts in the tests themselves should be minimized. The single timeout in the test harness suffices.
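
The ordering in the proposed flow can be sketched with plain child_process instead of cluster. This is an illustration under assumptions, not the PR's code: it uses the documented subprocess.disconnect() rather than the undocumented worker.process._channel.close(), the child installs its own disconnect handler, and childSource and runDisconnectDemo are hypothetical names.

```javascript
const { spawn } = require('child_process');

// Hypothetical child script: it exits only because its 'disconnect'
// handler calls process.exit; the interval would otherwise keep it alive.
const childSource = [
  "process.on('disconnect', () => process.exit(0));",
  "process.send('ready');",
  "setInterval(() => {}, 1000);",
].join('\n');

function runDisconnectDemo(onDone) {
  // The 'ipc' entry in stdio gives the child an IPC channel to the parent.
  const child = spawn(process.execPath, ['-e', childSource], {
    stdio: ['ignore', 'ignore', 'ignore', 'ipc'],
  });

  // Close the channel only once the child has installed its handler.
  child.once('message', () => child.disconnect());

  // No timer here: the parent simply waits for the child's exit event.
  child.once('exit', (code, signal) => onDone(code, signal));
}
```

If the child never exits, onDone is never called and nothing in this sketch bounds the wait, which mirrors the reliance on the harness's global timeout described above.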

Any comments are appreciated, especially from @AndreasMadsen.

Removed reliance on worker exit before arbitrary timeout. Instead of killing the parent process, destroy the IPC pipe for the same effect, and wait for the worker's exit before also exiting the parent. Ensuring these steps are well-ordered removes the need for timeouts and reduces intermittent failures. In case of an actual hang in the child's exit code, the test harness global timeout will kick in, and the test will still fail.

stefanmb · 2016-05-02T20:56:48Z

CI: https://ci.nodejs.org/job/node-test-pull-request/2466/

sam-github · 2016-05-02T21:58:17Z

test/parallel/test-cluster-master-kill.js

+/* Cluster.disconnect() will exercise a different 'graceful' shutdown path.
+   From the perspective of the worker, closing the channel is equivalent
+   to the parent calling process.exit(0). */
+      worker.process._channel.close();


I don't object, but this does use undocumented APIs, and the last test didn't. That can be annoying when trying to rearrange things that are supposed to be internal.

Why not change the check at the end so that, instead of running once after some randomly chosen time interval, it runs every half-second, indefinitely, until it passes or the runner kills the test?

@sam-github That would be another way to do this - I'm not a huge fan of polling but it may be the lesser evil. I can do another version with your suggestion. Thanks.

stefanmb · 2016-05-03T02:19:00Z

@sam-github I've adjusted the test as you suggested to simply loop indefinitely.

CI: https://ci.nodejs.org/job/node-test-pull-request/2473/

santigimeno · 2016-05-03T09:52:13Z

Maybe it makes sense to apply the same changes to test-cluster-master-error.
Other than that LGTM.

stefanmb · 2016-05-03T14:40:22Z

@santigimeno Thanks for pointing that out - I've updated test-cluster-master-error with the same approach (and also adjusted some variable naming and comments in it for easier understanding).

CI: https://ci.nodejs.org/job/node-test-pull-request/2479/

mhdawson · 2016-05-03T17:08:22Z

LGTM.

santigimeno · 2016-05-03T18:30:45Z

Duplicate of #5056?

ncopa · 2016-05-03T19:46:19Z

test/parallel/test-cluster-master-kill.js

+    var pollWorker = function() {
+      alive = isAlive(pid);
+      if (alive) {
+        setTimeout(pollWorker, 500);


I think you can safely set the timeout value lower, so we don't slow down the test by up to half a second.

stefanmb · 2016-05-03T20:07:41Z

@ncopa Updated with 50ms polling interval instead of 500ms. Thanks.
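
The polling approach discussed above can be sketched roughly as follows. This is not the code that landed: pollUntilExit and its parameters are hypothetical names, and the injectable schedule argument exists only so the loop can be exercised without real timers.

```javascript
// Keep checking until the worker is gone; there is no timeout of its own.
// checkAlive: function returning true while the worker still exists.
// onExit: called once with the number of checks that were needed.
// schedule: how to defer the next check (defaults to a 50 ms timer).
function pollUntilExit(checkAlive, onExit,
                       schedule = (fn) => setTimeout(fn, 50)) {
  let polls = 0;

  function poll() {
    polls += 1;
    if (checkAlive()) {
      schedule(poll); // still alive: look again later, with no upper bound
    } else {
      onExit(polls);
    }
  }

  poll();
}
```

In the test, checkAlive would wrap the pid check, for example pollUntilExit(() => isAlive(pid), done). A shorter interval only reduces the delay between the worker's exit and the test noticing it; a worker that never exits is still caught by the harness timeout.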

AndreasMadsen · 2016-05-04T14:00:52Z

Nice analysis. I too am much more comfortable with polling than with just closing the IPC. I remember there were issues in the past where the disconnect event did not fire when the parent died, but that was a long time ago and I could be wrong.

LGTM

stefanmb · 2016-05-04T15:54:54Z

One more CI run with the last change: https://ci.nodejs.org/job/node-test-pull-request/2496/

@ncopa After the CI run I'll merge this PR and close #5056 as they are largely equivalent.

Removed reliance on worker exit before arbitrary timeout. Instead of failing the test after 200 or 1000 ms, wait indefinitely for child process exit. If the test hangs, the test harness global timeout will kick in and fail the test. Note that if the orphaned children are not reaped correctly (in the absence of init, e.g. Docker), the test will hang and the harness will fail it.

PR-URL: #6531
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: Andreas Madsen <amwebdk@gmail.com>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>

stefanmb · 2016-05-04T18:08:49Z

Landed as fc66e55.

drewfish · 2016-05-06T05:16:22Z

I wonder if this will fix #6193 as well.

MylesBorins · 2016-06-02T00:18:50Z

@stefanmb lts?

stefanmb · 2016-06-02T14:09:00Z

@thealphanerd I think this is OK for LTS. Thanks for reminding me.

nodejs-github-bot added the test Issues and PRs related to the tests. label May 2, 2016

stefanmb added the cluster Issues and PRs related to the cluster subsystem. label May 2, 2016

stefanmb self-assigned this May 2, 2016

sam-github reviewed May 2, 2016

WIP: Poll indefinitely

fadbde5

WIP: Also change test-cluster-master-error

89b0ff4

stefanmb mentioned this pull request May 3, 2016

test: make cluster test more time tolerant on more platforms #5056

Closed

ncopa reviewed May 3, 2016

WIP: Reduce poll interval to 50ms

1246090

stefanmb closed this May 4, 2016

evanlucas mentioned this pull request May 17, 2016

2016-05-17, Version 6.2.0 (Stable) #6810

Merged

MylesBorins added the lts-watch-v4.x label Jun 2, 2016

MylesBorins added land-on-v4.x and removed lts-watch-v4.x labels Jul 11, 2016

MylesBorins mentioned this pull request Jul 12, 2016

v4.5.0 proposal #7688

Merged
