Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: avoid test-cluster-master-kill flakiness #6531

Closed
wants to merge 4 commits into from

Conversation

stefanmb
Copy link
Contributor

@stefanmb stefanmb commented May 2, 2016

Checklist
  • tests and code linting passes
  • a test and/or benchmark is included
  • the commit message follows commit guidelines
Affected core subsystem(s)

test, cluster

Description of change

I've observed that test-cluster-master-kill fails intermittently on an AIX 6.1 machine (oslevel 6100-07-08-1339) due to timeout before worker termination. There was a previous PR (nodejs/node-v0.x-archive#9431) which arbitrarily increased the timeout for AIX to 1 second, however this value is still a guess and appears to be insufficient. It's also worth noting the arbitrary timeouts have also caused problems for other platforms, see #2891 (comment). Arbitrary timeouts cannot compensate for external factors such as system load.

In this PR I propose removing the timeout mechanism entirely, here is how the test currently works:

  1. Fork a master process from the test harness.
  2. The master uses cluster.fork to create a worker.
  3. The worker starts an HTTP server.
  4. The master will kill itself with process.exit.
  5. The worker will then also call process.exit as part of its IPC pipe disconnect handler. (*)
  6. The test harness will wait an arbitrary amount of time after the master's exit to check for the child's liveness, if it is still alive the test fails.

(*) Without this mechanism the worker would become an orphan child of init.

Step 6 is inherently flaky. The test was originally added as part of nodejs/node-v0.x-archive@94d337e#diff-0faa53fc02580d5de2ebb484c41d691cR498 where it specifically tested the new disconnect pathway.

The test boils down to the following actions:

  1. Cause the worker's disconnect handler to be run by calling exit(0) in the cluster master.
  2. Insure that the disconnect handler also exits the worker process.

Since step 2 does not actually require step 1, I propose the following alternate flow:

  1. Fork a master process from the test harness.
  2. The master uses cluster.fork to create a worker.
  3. The worker starts an HTTP server.
  4. The master kills the IPC pipe (note this not the same as using the graceful cluster shutdown API in nodejs/node-v0.x-archive@6c383c6) which causes the child to disconnect itself. Ungracefully closing the IPC pipe causes the child to execute the same disconnect handler it executes when the master calls exit(0).
  5. The master waits for the child's exit event.
  6. The test harness checks for the child's liveness after the master's exit event.

With this setup there is no need for the arbitrary wait time. The obvious problem is if the child never exits the test will hang - however in that case the test will still be killed by the test harness's global timeout.

I do believe a timeout mechanism is useful for detecting liveness issues, but the presence of arbitrary timeouts in the tests themselves should be minimized, the single timeout in the test harness suffices.

Any comments are appreciated, especially from @AndreasMadsen.

Removed reliance on worker exit before arbitrary timeout. Instead of killing
the parent process, destroy the IPC pipe for the same effect, and wait for
the worker's exit before also exiting the parent. Insuring these steps are
well-ordered removes the need for timeouts and reduces intermittent failures.

In case of an actual hang in the child's exit code, the test harness global
timeout will kick in, and the test will still fail.
@nodejs-github-bot nodejs-github-bot added the test Issues and PRs related to the tests. label May 2, 2016
@stefanmb stefanmb added the cluster Issues and PRs related to the cluster subsystem. label May 2, 2016
@stefanmb stefanmb self-assigned this May 2, 2016
@stefanmb
Copy link
Contributor Author

stefanmb commented May 2, 2016

/* Cluster.disconnect() will exercise a different 'graceful' shutdown path.
From the perspective of the worker, closing the channel is equivalent
to the parent calling process.exit(0). */
worker.process._channel.close();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't object, but this does use undocumented APIs, and the last test didn't. That can be annoying when trying to rearrange things that are supposed to be internal.

Why not just change the check at the end to not do a single check after some randomly chosen time interval, but to run the check every half-second, indefinitely, until it passes or the runner kills this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sam-github That would be another way to do this - I'm not a huge fan of polling but it may be the lesser evil. I can do another version with your suggestion. Thanks.

@stefanmb
Copy link
Contributor Author

stefanmb commented May 3, 2016

@sam-github I've adjusted the test as you suggested to simply loop indefinitely.

CI: https://ci.nodejs.org/job/node-test-pull-request/2473/

@santigimeno
Copy link
Member

Maybe it makes sense applying the same changes to test-cluster-master-error.
Other than that LGTM.

@stefanmb
Copy link
Contributor Author

stefanmb commented May 3, 2016

@santigimeno Thanks for pointing that out - I've updated test-cluster-master-error with the same approach (and also adjusted some variable naming and comments in it for easier understanding).

CI: https://ci.nodejs.org/job/node-test-pull-request/2479/

@mhdawson
Copy link
Member

mhdawson commented May 3, 2016

LGTM.

@santigimeno
Copy link
Member

Duplicate of #5056?

var pollWorker = function() {
alive = isAlive(pid);
if (alive) {
setTimeout(pollWorker, 500);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think you safely can set the timeout value lower, so we don´t slow down the test up to a half second.

@stefanmb
Copy link
Contributor Author

stefanmb commented May 3, 2016

@ncopa Updated with 50ms polling interval instead of 500ms. Thanks.

@AndreasMadsen
Copy link
Member

Nice analysis. I too is much more comfortable about polling than just closing the IPC. I remember there have been issues in the past where the disconnect event did not fire when the parent died. But it's a long time ago, I could be wrong.

LGTM

@stefanmb
Copy link
Contributor Author

stefanmb commented May 4, 2016

One more CI run with the last change: https://ci.nodejs.org/job/node-test-pull-request/2496/

@ncopa After the CI run I'll merge this PR and close #5056 as they are largely equivalent.

stefanmb added a commit that referenced this pull request May 4, 2016
Removed reliance on worker exit before arbitrary timeout. Instead of failing
the test after 200 or 1000 ms wait indefinitely for child process exit. If
the test hangs the test harness global timeout will kick in and fail the test.

Note that if the orphaned children are not reaped correctly (in the absence
of init, e.g. Docker) the test will hang and the harness will fail it.

PR-URL: #6531
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: Andreas Madsen <amwebdk@gmail.com>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
@stefanmb
Copy link
Contributor Author

stefanmb commented May 4, 2016

Landed as fc66e55.

@stefanmb stefanmb closed this May 4, 2016
@drewfish
Copy link
Contributor

drewfish commented May 6, 2016

I wonder if this will fix #6193 as well.

evanlucas pushed a commit that referenced this pull request May 17, 2016
Removed reliance on worker exit before arbitrary timeout. Instead of failing
the test after 200 or 1000 ms wait indefinitely for child process exit. If
the test hangs the test harness global timeout will kick in and fail the test.

Note that if the orphaned children are not reaped correctly (in the absence
of init, e.g. Docker) the test will hang and the harness will fail it.

PR-URL: #6531
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: Andreas Madsen <amwebdk@gmail.com>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
@MylesBorins
Copy link
Contributor

@stefanmb lts?

@stefanmb
Copy link
Contributor Author

stefanmb commented Jun 2, 2016

@thealphanerd I think this is OK for LTS. Thanks for reminding me.

MylesBorins pushed a commit that referenced this pull request Jul 11, 2016
Removed reliance on worker exit before arbitrary timeout. Instead of failing
the test after 200 or 1000 ms wait indefinitely for child process exit. If
the test hangs the test harness global timeout will kick in and fail the test.

Note that if the orphaned children are not reaped correctly (in the absence
of init, e.g. Docker) the test will hang and the harness will fail it.

PR-URL: #6531
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: Andreas Madsen <amwebdk@gmail.com>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
MylesBorins pushed a commit that referenced this pull request Jul 11, 2016
Removed reliance on worker exit before arbitrary timeout. Instead of failing
the test after 200 or 1000 ms wait indefinitely for child process exit. If
the test hangs the test harness global timeout will kick in and fail the test.

Note that if the orphaned children are not reaped correctly (in the absence
of init, e.g. Docker) the test will hang and the harness will fail it.

PR-URL: #6531
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: Andreas Madsen <amwebdk@gmail.com>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
MylesBorins pushed a commit that referenced this pull request Jul 12, 2016
Removed reliance on worker exit before arbitrary timeout. Instead of failing
the test after 200 or 1000 ms wait indefinitely for child process exit. If
the test hangs the test harness global timeout will kick in and fail the test.

Note that if the orphaned children are not reaped correctly (in the absence
of init, e.g. Docker) the test will hang and the harness will fail it.

PR-URL: #6531
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: Andreas Madsen <amwebdk@gmail.com>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
MylesBorins pushed a commit that referenced this pull request Jul 12, 2016
Removed reliance on worker exit before arbitrary timeout. Instead of failing
the test after 200 or 1000 ms wait indefinitely for child process exit. If
the test hangs the test harness global timeout will kick in and fail the test.

Note that if the orphaned children are not reaped correctly (in the absence
of init, e.g. Docker) the test will hang and the harness will fail it.

PR-URL: #6531
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: Andreas Madsen <amwebdk@gmail.com>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
@MylesBorins MylesBorins mentioned this pull request Jul 12, 2016
MylesBorins pushed a commit that referenced this pull request Jul 14, 2016
Removed reliance on worker exit before arbitrary timeout. Instead of failing
the test after 200 or 1000 ms wait indefinitely for child process exit. If
the test hangs the test harness global timeout will kick in and fail the test.

Note that if the orphaned children are not reaped correctly (in the absence
of init, e.g. Docker) the test will hang and the harness will fail it.

PR-URL: #6531
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: Andreas Madsen <amwebdk@gmail.com>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
MylesBorins pushed a commit that referenced this pull request Jul 14, 2016
Removed reliance on worker exit before arbitrary timeout. Instead of failing
the test after 200 or 1000 ms wait indefinitely for child process exit. If
the test hangs the test harness global timeout will kick in and fail the test.

Note that if the orphaned children are not reaped correctly (in the absence
of init, e.g. Docker) the test will hang and the harness will fail it.

PR-URL: #6531
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: Andreas Madsen <amwebdk@gmail.com>
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cluster Issues and PRs related to the cluster subsystem. test Issues and PRs related to the tests.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants