
cluster: centralize removal from workers list. #8193

Closed

Conversation

misterdjules

Currently, cluster workers can be removed from the workers list in three
different places:

  • In the exit event handler for the worker process.
  • In the disconnect event handler of the worker process.
  • In the disconnect event handler of the cluster master.

However, handles for a given worker are cleaned up only in one of these
places: in the cluster master's disconnect event handler.

Because these events happen asynchronously, it is possible that the
workers list is empty before we even clean up one handle. This causes
the assert that checks that no handle is left once the workers list is
empty to fail.

We could either remove the associated handles every time we remove a
worker from the list, or remove workers from the list in only one
handler. This commit implements the latter.

Fixes #8191 and #8192.
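
As a rough sketch of what "centralizing" means here (illustrative names only; removeWorker, workers and handlesByWorker are stand-ins, not the actual cluster.js code): both the removal and the handle cleanup are owned by a single helper that is called from exactly one place.

    'use strict';
    var assert = require('assert');

    // Hypothetical stand-ins for the cluster.js internals described above.
    var workers = {};               // mirrors cluster.workers
    var handlesByWorker = {};       // worker.id -> handles it still owns

    // A single helper owns both the workers-list removal and the handle
    // cleanup, and it is called from exactly one event handler, so the two
    // pieces of bookkeeping cannot get out of sync.
    function removeWorker(worker) {
      delete workers[worker.id];
      delete handlesByWorker[worker.id];

      if (Object.keys(workers).length === 0) {
        // The assert described above only holds if handle cleanup never
        // lags behind removal from the workers list.
        assert.strictEqual(Object.keys(handlesByWorker).length, 0);
      }
    }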

@misterdjules
Author

@sam-github Since you seem to know the cluster module very well, I thought I'd put you in the loop. If you had some time to look at this change that would be great! Thank you!

@sam-github

The change will need a doc change, since current behaviour is documented.

I've been bitten by the "when exactly does cluster.workers[id] return undefined" problem before (I saved an ID, sent the worker a signal, then if it didn't exit in some period, tried to look up the worker by ID... and it wasn't there).
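
For illustration, that lookup-by-id pattern looks roughly like this (the signal and the timeout are arbitrary; whether the final lookup still finds the worker depends on exactly when it was removed):

    'use strict';
    var cluster = require('cluster');

    if (cluster.isMaster) {
      var worker = cluster.fork();
      var id = worker.id;

      worker.kill('SIGTERM');            // ask the worker to shut down

      setTimeout(function() {
        // Depending on when 'disconnect'/'exit' fired, the entry may
        // already have been deleted, so this lookup can come back undefined.
        var w = cluster.workers[id];
        if (w === undefined)
          console.log('worker %d is already gone from cluster.workers', id);
        else
          w.kill('SIGKILL');             // escalate if it is still around
      }, 2000);
    } else {
      setInterval(function() {}, 1000);  // keep the worker alive until signalled
    }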

So, I like the idea, but I think it's really important that the ordering of event emissions and the deletion of the worker is documented and consistent. As-is, it appears to me that my docs are wrong. The worker is deleted AFTER the worker disconnect event is emitted, but BEFORE the cluster .disconnect event is fired (well, technically, it is deleted in the first cluster disconnect listener; the user-provided ones will always run after cluster's listener). Do you read it the same way?

Anyhow, for the sake of guaranteeing the state of cluster.workers at the time that either disconnect event occurs, what do you think of deleting the worker and its handles at https://github.com/joyent/node/blob/master/lib/cluster.js#L323, before either the worker or cluster disconnect event is emitted? Basically removing this handler, https://github.com/joyent/node/blob/master/lib/cluster.js#L348, and moving its code to before line 323?

@misterdjules
Author

@sam-github Yes, I think we read the code the same way. However, one thing you didn't mention is that the worker can also sometimes be deleted after the worker exit event is emitted, which can be before the worker disconnect event is emitted.

I agree with your suggestion about removing the handler for the 'disconnect' event on the cluster's master, and moving its implementation to the worker's disconnect handler.

However, I think we should also execute the same code in the worker's exit event handler for the reason mentioned above and if we want to be consistent with the documentation.

What do you think?

@sam-github

I'm worried about race conditions: a worker could send a number of messages, then exit; the master's exit event could occur, then a message, then another message, then disconnect... and when those messages come in, the worker ID won't be in the workers list.

I guess the question is: do we want the worker removed from the list as soon as possible, or as late as possible? ASAP means on first disconnect OR exit, and as late as possible is when exit, disconnect (and perhaps 'close'?) have occurred.

Since you can be disconnected, but not exited, indefinitely, it's possible for a Worker process to be alive (not exited) well past when the disconnect event is observed in the master.

Actually, the whole disconnect thing is a mess, what it does and what is reasonable and predictable are quite far from each other. :-( So, I'm not sure what to say.

The obvious backwards-compatible bug fix is to do as you suggest: make sure the handle delete code is run on exit as well as disconnect, and remove the duplicate delete in the worker.on disconnect and the cluster.on disconnect handlers.
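
To make the disconnected-but-not-exited case above concrete, a rough sketch (timings are arbitrary):

    'use strict';
    var cluster = require('cluster');

    if (cluster.isMaster) {
      var worker = cluster.fork();

      worker.on('disconnect', function() {
        // The IPC channel is gone, but the process has not necessarily
        // exited; whether cluster.workers still lists it is exactly the
        // behaviour being discussed here.
        console.log('disconnected; still listed?',
                    cluster.workers[worker.id] !== undefined);
      });

      worker.on('exit', function(code, signal) {
        console.log('exited:', code, signal);
      });

      setTimeout(function() { worker.disconnect(); }, 100);
    } else {
      // Non-IPC work keeps the worker alive after the channel closes, so
      // 'disconnect' is observed in the master well before 'exit'.
      setInterval(function() {}, 1000);
      setTimeout(function() { process.exit(0); }, 5000);
    }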

@misterdjules
Author

@sam-github I'm not sure I understand what you mean when you say that "a worker could send a number of messages, then exit, master exit event could occur, then message, then message, then disconnect.... and when those messages come in, the worker ID won't be in the worker list."

Regarding the rest of your comment, is it really a problem that a worker can stay alive well past when the disconnect event has been emitted? I guess in this case, even if the process itself is still running, it's just not considered part of the cluster anymore, and it's thus safe not to have it in the cluster.workers list.

As suggested in private, maybe let's discuss this on {Skype,Hangouts} and post our conclusions here when we're done?

Thank you!

@misterdjules force-pushed the fix-issue-8191 branch 2 times, most recently from 51f42c1 to a686d75 on August 19, 2014 at 23:39
@misterdjules
Author

@sam-github I updated the PR according to our earlier discussion. The comments and the documentation change pretty much summarize the content of this discussion, so I won't report it here.

One thing I'm not sure about is adding isDead() and isDisconnected() to Worker's prototype. If that's an issue, we can make these two functions internal.

_and_ exited. The order between these two events cannot be determined in
advance. However, it is guaranteed that the
removal from the cluster.workers list happens before the last of
`'disconnect'` and `'exit'` events is emitted.

@sam-github

I'd express as "before last 'disconnect' or 'exit' event is emitted.", and wrap to 80 columns.

@misterdjules
Author

Thanks @sam-github, fixed in the updated PR!

@sam-github

LGTM

I don't have an opinion on whether adding more private/undocumented methods to Worker is good or bad; better to ask somebody who knows more node core history and what has or has not caused problems in the past. It's definitely safer to go the c++y way and not add as a method anything that can be expressed as a private function, but what's the point of OO if you can't add methods to objects?

I noticed a pre-existing weirdness... the final worker.state is either 'dead' or 'disconnected', depending on which event happened last. This strikes me as odd. Even more strangely, until this change, I can find no evidence that the worker state was actually used... some trouble is gone through to assign it various values, but no comparisons are made against it (your code is the first to use it, at least in cluster.js). This implies that it exists to be public and used elsewhere, but it's undocumented. Hm. A mystery.

Fwiw, it would be possible to delete the (apparently unused) .state, and instead write isDead() as `this.process.exitCode != null || this.process.signalCode != null`, and isDisconnected() as `!this.process.connected`, reducing the amount of double state tracking.
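
Expressed as methods, that would look roughly like this; a sketch only, assuming it sits inside lib/cluster.js where the Worker constructor is in scope, not the code in this PR:

    // Sketch: derive both answers from worker.process instead of the
    // separately tracked .state property.
    Worker.prototype.isDead = function() {
      return this.process.exitCode != null || this.process.signalCode != null;
    };

    Worker.prototype.isDisconnected = function() {
      return !this.process.connected;
    };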

@misterdjules force-pushed the fix-issue-8191 branch 2 times, most recently from 7b432fb to 2b19fa3 on August 21, 2014 at 17:12
@@ -595,7 +633,7 @@ function workerInit() {

   Worker.prototype.destroy = function() {
     this.suicide = true;
-    if (!process.connected) process.exit(0);
+    if (!this.isCOnnected()) process.exit(0);

@sam-github

"COnnected"? You will find this when you test.. :-)

@misterdjules
Author

@sam-github Good catch. For some reason that I haven't had the time to investigate, all tests pass on my computer with this typo. Do they pass on yours too?

@misterdjules
Author

@sam-github Fixed the typo. I will check why the tests didn't catch the typo and update tests if needed.

@sam-github

I didn't run the tests, just read the diff, but I would bet the code isn't
covered. You need to call kill or destroy in a worker to reach it, quite
unusual. Particularly since this function accidentally became partially
undocumented when destroy() was re-named to kill()...
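
A minimal regression test along those lines might look something like the following; the structure and the assumption that a self-killed worker exits with code 0 are mine, not part of the PR:

    'use strict';
    var assert = require('assert');
    var cluster = require('cluster');

    if (cluster.isMaster) {
      var worker = cluster.fork();
      var sawExit = false;

      worker.on('exit', function(code, signal) {
        sawExit = true;
        // Assumption: a worker that kills itself exits cleanly with code 0.
        assert.strictEqual(code, 0);
        assert.strictEqual(signal, null);
      });

      process.on('exit', function() {
        assert(sawExit, 'worker never exited');
      });
    } else {
      // Calling kill()/destroy() from *inside* the worker is the code path
      // the existing tests apparently never reach.
      cluster.worker.kill();
    }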

@misterdjules
Author

@sam-github Created a new issue here: #8223.

@misterdjules
Author

@sam-github Thank you Sam. Indeed, my implementation of Worker.prototype.isDead and Worker.prototype.isDisconnected was broken. They wouldn't be usable outside of this specific use case, which would completely defeat the purpose of adding them to the prototype.

In the latest update to this PR, I changed their implementation so that they don't rely on the state property, which, as you said, can't be both 'dead' and 'disconnected', but instead on the process's state. I also renamed isDisconnected to isConnected and updated the implementation accordingly.

If we decide to leave them in the Worker's prototype, the documentation will need to be updated to reflect that before merging these changes. We will also need to provide some tests for the two newly added methods. I will wait for feedback from the core team before doing that.
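
Assuming the two methods stay on the prototype, usage from the master would look something like this:

    'use strict';
    var cluster = require('cluster');

    if (cluster.isMaster) {
      var worker = cluster.fork();

      worker.on('disconnect', function() {
        console.log('isConnected:', worker.isConnected()); // false
        console.log('isDead:', worker.isDead());           // may still be false here
      });

      worker.on('exit', function() {
        console.log('isDead:', worker.isDead());           // true
      });
    } else {
      // Nothing to do: the worker runs to completion, disconnecting and
      // exiting on its own.
    }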

Thanks for the review @sam-github!

/cc @tjfontaine

@indutny
Member

indutny commented Aug 27, 2014

LGTM, please add the tests and docs.

Currently, cluster workers can be removed from the workers list in three
different places:
- In the exit event handler for the worker process.
- In the disconnect event handler of the worker process.
- In the disconnect event handler of the cluster master.

However, handles for a given worker are cleaned up only in one of these
places: in the cluster master's disconnect event handler.

Because these events happen asynchronously, it is possible that the
workers list is empty before we even clean up one handle. This causes
the assert that checks that no handle is left once the workers list is
empty to fail.

This commit removes the worker from the cluster.workers list only when
the worker is dead _and_ disconnected, at which point we're sure that
its associated handles are cleaned up.

Fixes nodejs#8191 and nodejs#8192.
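
In other words, both event handlers funnel into one check along these lines; this is a sketch of the approach with an illustrative helper name (maybeRemoveWorker is not the real function), not the exact code that landed:

    // Somewhere on the master side of lib/cluster.js (names illustrative):
    function maybeRemoveWorker(worker) {
      // Only drop the worker once *both* conditions hold; only then are its
      // associated handles guaranteed to have been cleaned up.
      if (worker.isDead() && !worker.isConnected()) {
        delete cluster.workers[worker.id];
      }
    }

    worker.process.once('exit', function() { maybeRemoveWorker(worker); });
    worker.process.once('disconnect', function() { maybeRemoveWorker(worker); });
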
@@ -587,7 +587,7 @@
   };

   startup.processKillAndExit = function() {
-    process.exitCode = 0;
@misterdjules
Author

Initializing process.exitCode to 0 here doesn't seem to be consistent with how it's used elsewhere in the code. It means that worker.process.exitCode is 0 in a worker when it starts executing, although it has not exited yet.

The convention in the rest of the code seems to be that process.exitCode is undefined until the process has exited.

With this change, all tests pass, but I understand it could have a big impact, so I'd rather highlight it now so that we consider it carefully.
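
For reference, the convention is easy to observe directly; this is plain Node.js behaviour, not specific to this patch:

    'use strict';

    // Before anything assigns to it, process.exitCode is unset.
    console.log(process.exitCode);             // undefined

    process.on('exit', function(code) {
      // The actual exit code is only known once the process is exiting.
      console.log('exiting with code', code);  // 0, since nothing set it
    });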

@misterdjules
Author

@indutny Updated my changes, ready to review again. Thank you!

@indutny
Member

indutny commented Sep 2, 2014

Landed in 90d1147, thank you!

@indutny closed this on Sep 2, 2014