Skip to content
This repository has been archived by the owner on Apr 22, 2023. It is now read-only.

cluster: don't assert if worker has no handles #8642

Closed
wants to merge 2 commits into from
Closed

cluster: don't assert if worker has no handles #8642

wants to merge 2 commits into from

Conversation

sam-github
Copy link

As it says.

@sam-github
Copy link
Author

@misterdjules this is a regression introduced in 0.11, trivial to repro and tiny to fix, do you have ability to get some motion on it?

@sam-github
Copy link
Author

At the very least it could be tagged as a bug.

@misterdjules
Copy link

Thank you @sam-github for the PR! A few observations:

  1. The dgram module is not loaded by the test, so it fails all the time.
  2. The master also doesn't know if a worker fails (failure to listen or to get online), and in this case the test times out.
  3. When running on Windows with the fix for 1), the test still fails with ENOTSUP:
C:\Users\jgilli\dev\node>.\Release\node.exe test\simple\test-cluster-disconnect-
unshared.js
events.js:85
      throw er; // Unhandled 'error' event
            ^
Error: write ENOTSUP
    at exports._errnoException (util.js:746:11)
    at ChildProcess.target._send (child_process.js:482:28)
    at ChildProcess.target.send (child_process.js:415:12)
    at sendHelper (cluster.js:675:8)
    at send (cluster.js:511:5)
    at cluster.js:487:7
    at SharedHandle.add (cluster.js:99:3)
    at queryServer (cluster.js:479:12)
    at Worker.onmessage (cluster.js:437:7)
    at ChildProcess.<anonymous> (cluster.js:691:8)

C:\Users\jgilli\dev\node>

Other minor suggestions:

  • There's a typo in the commit message: listenes -> listens.
  • It could be nice to add a comment in SharedHandle.prototype.remove to make why we allow for workers not to share a given handle a bit clearer. The cluster module is quite complicated to follow, and in my opinion a few more comments here and there could not hurt.

@sam-github
Copy link
Author

Thanks @misterdjules, I will fix this up.

@sam-github
Copy link
Author

I added a possible comment, but honestly, I think its the original assert that needed the comment, the code WITHOUT the comment is pretty clear. A worker is being removed from a handle, but what reason is there to expect every worker to listen on EVERY handle? For handles that a worker didn't listen on, no need to remove it, return.

I patched up the other small things.

Alas, as I don't have a windows node build env ATM, it'll take me some time tomorrow to get my dog-slow windows vm to grind out a build.

@misterdjules
Copy link

@sam-github Thanks!

Regarding the comment, I agree with you that it's actually overkill. My initial suggestion mostly came from the fact that I revisit the cluster's implementation only once in a while, and every time It takes me some time to remember how it all fits together. Sorry for that.

As for the Windows build, please let me know if you need any help with that, I have a pretty good Windows 7 VM running on my laptop.

@sam-github
Copy link
Author

Actually, I think I know what the deal is. dgram isn't supported by cluster on windows, and nothing but dgram uses the shared handle on linux.... so by default there isn't a single socket type that can be used to make a portable test, though the fix applies to any use of shared handle.

I'm going to force everything to be shared and use tcp, which should reliably show both the shared handle bug,and the fix, on windows and linux.

@sam-github
Copy link
Author

I spent about an hour trying to find a way to make this work on Windows, since dgram isn't supported on windows, and just made a mess. I'll try to get time to look at it again tomorrow, hopefully with all the Americans on vacation it will be quiet.

@sam-github
Copy link
Author

OK, I think I found my confusion. I thought I could trigger the SharedHandle to be used on linux and Windows by disabling round-robin, but that didn't work. It appears that all socket types other than dgram are not effected because there is some extra handle close messages that occur, but dgram doesn't do that.

So, it only effects dgram, and dgram isn't supported on Windows with cluster... so I'm comfortable saying this is a non-Windows test.

@sam-github
Copy link
Author

@misterdjules PTAL

@misterdjules
Copy link

While taking another look at it, I found that running the newly added test (test/simple/test-cluster-disconnect-unshared.js) doesn't fail (it doesn't assert) without the fix in lib/cluster.js. Shouldn't it assert without the fix?

If understand correctly, I think that we want the test to behave such as the worker that doesn't share the handle disconnects first, so that the master attempts to remove it from the handle. If the non-sharing worker disconnects after the sharing worker, then the test passes even without the fix.

I suggest maybe rewriting the fix like following:

if (cluster.isMaster) {
  var nonSharing = cluster.fork();
  nonSharing.on('online', bind);

  function bind() {
    var sharing = cluster.fork({BOUND: 'y'});
    sharing.on('listening', disconnect);
  }

  function disconnect() {
    /*
     * Make sure to disconnect the non-sharing worker first,
     * and only when it has disconnected, disconnect the rest
     * of the workers. Otherwise, the problematic behavior is
     * not triggered.
     */
    nonSharing.on('disconnect', function onNonSharedDisconnect() {
      cluster.disconnect();
    });

    nonSharing.disconnect();
  }
} else {
  if (process.env.BOUND === 'y') {
    var source = dgram.createSocket('udp4');

    source.bind(0);
  }
}

@sam-github What do you think?


if (cluster.isMaster) {
var messages = 0;
var ports = {};

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are messages and ports used in this test?

@misterdjules
Copy link

@sam-github Also, running the test modified as mentioned above but with the fix that removes the assert in SharedHandle.prototype.remove, I get the following error:

$ ./node test/simple/test-cluster-disconnect-unshared.js 
events.js:85
      throw er; // Unhandled 'error' event
            ^
Error: channel closed
    at ChildProcess.target.send (child_process.js:413:26)
    at sendHelper (cluster.js:676:8)
    at send (cluster.js:512:5)
    at Worker.disconnect (cluster.js:419:5)
    at EventEmitter.cluster.disconnect (cluster.js:411:30)
    at Worker.onNonSharedDisconnect (/Users/JulienGilli/dev/node/test/simple/test-cluster-disconnect-unshared.js:48:15)
    at Worker.emit (events.js:104:17)
    at ChildProcess.<anonymous> (cluster.js:392:14)
    at ChildProcess.g (events.js:199:16)
    at ChildProcess.emit (events.js:104:17)
$

Here's how I fixed it locally:

diff --git a/lib/cluster.js b/lib/cluster.js
index 6f355d0..5e681f4 100644
--- a/lib/cluster.js
+++ b/lib/cluster.js
@@ -407,7 +407,8 @@ function masterInit() {
     } else {
       for (var key in workers) {
         key = workers[key];
-        cluster.workers[key].disconnect();
+        if (cluster.workers[key].isConnected())
+          cluster.workers[key].disconnect();
       }
     }
     if (cb) intercom.once('disconnect', cb);

However, I want to validate with you that the test as currently written in this PR needs to be fixed before discussing about that error.

Thank you!

@sam-github
Copy link
Author

I agree, the test had a scheduling order dependency. It was always working (failing without my fix) on my machine, but I can see why that would be fluke.

I added a stripped down test for the bug exposed by disconnect before exit.

@sam-github
Copy link
Author

PTAL @misterdjules

// disconnected worker would error.
worker.on('disconnect', cluster.disconnect);
}
} else {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the else block here?

@misterdjules
Copy link

Thanks again @sam-github! It does look good to me. One last thing, could you please elaborate on:

It appears that all socket types other than dgram are not effected because there is some extra
handle close messages that occur

? I'm not sure I understand exactly how and when the "extra handle close close messages" happen for non-dgram.

Thank you!

@sam-github
Copy link
Author

@misterdjules I'm sorry, I was confused. The ordering dependency of the tests, wherein they always passed for dgram, but always failed for tcp, was confusing me. I wrote a tcp version that forces use of SharedHandle (non-round robin) and renamed the existing tests. I've confirmed the tests pass on linux and windows, and fail without my changes, on linux and windows (dgram is skipped on windows, though, because node cluster doesn't support dgram on windows).

@misterdjules
Copy link

@sam-github No worries, and thanks for the update. I don't know if you got that on IRC, but I'm off until January 5th, so I probably won't have time to review it until then.

@misterdjules
Copy link

@sam-github Did you have a chance to look at my comment? Otherwise LGTM. Depending on where we are with the merge from v0.10 to v0.12, we'll land it before or after the merge is done. I'll keep you posted.

Thanks!

@misterdjules misterdjules added this to the 0.11.15 milestone Jan 9, 2015
@misterdjules
Copy link

Adding it to the 0.11.15 milestone because it's been reviewed and tested, so it should be quick and easy to land.

@sam-github
Copy link
Author

Didn't see your comment, I'll drop the else.

piscisaureus pushed a commit to nodejs/node that referenced this pull request Jan 10, 2015
For shared handles that do not get connection close messages (UDP/dgram
is the only example of this), cluster must not assume that a port
listened on by one worker is listened on by all workers.

PR-URL: nodejs/node-v0.x-archive#8642
Reviewed-by: Bert Belder <bertbelder@gmail.com>
piscisaureus pushed a commit to nodejs/node that referenced this pull request Jan 10, 2015
Workers that are already disconnected but not yet exited should not be
disconnected, trying to do so raises exceptions.

PR-URL: nodejs/node-v0.x-archive#8642
Reviewed-by: Bert Belder <bertbelder@gmail.com>
@piscisaureus
Copy link

lgtm, landed in io.js: nodejs/node@a80b977...092c224
Thanks Sam!

For shared handles that do not get connection close messages (UDP/dgram
is the only example of this), cluster must not assume that a port
listened on by one worker is listened on by all workers.
Workers that are already disconnected but not yet exited should not be
disconnected, trying to do so raises exceptions.
@misterdjules
Copy link

@sam-github Thanks, landed in f260ef8 and 85360f0 (I switched the order of the commits as 85360f0 depends on f260ef8). I also slightly reworded the commit message for 85360f0, which in my opinion now better reflects what the change does.

Thank you very much again!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants