cluster: centralize removal from workers list. #8193
Conversation
@sam-github Since you seem to know the cluster module very well, I thought I'd put you in the loop. If you had some time to look at this change, that would be great! Thank you!
The change will need a doc change, since current behaviour is documented. I've been bitten by the "when exactly does

So, I like the idea, but I think it's really important that the ordering of event emissions and deletion of the worker is documented and consistent. As-is, it appears to me that my docs are wrong. The worker is deleted AFTER the worker disconnect event is emitted, but BEFORE the cluster disconnect event is fired (well, technically, deleted in the first cluster disconnect listener; the user-provided ones will always run after cluster's listener). Do you read it the same way?

Anyhow, for the sake of guaranteeing the state of cluster.workers at the time that either disconnect event occurs, what do you think of deleting the worker and its handles at https://github.com/joyent/node/blob/master/lib/cluster.js#L323, before either the worker or cluster disconnect event is emitted? Basically removing this handler, https://github.com/joyent/node/blob/master/lib/cluster.js#L348, and moving its code to before line 323?
@sam-github Yes, I think we read the code the same way. However, one thing you didn't mention is that the worker can also sometimes be deleted after the worker exit event is emitted, which can be before the worker disconnect event is emitted.

I agree with your suggestion about removing the handler for the 'disconnect' event on the cluster's master, and moving its implementation to the worker's disconnect handler. However, I think we should also execute the same code in the worker's exit event handler, for the reason mentioned above and if we want to be consistent with the documentation. What do you think?
I'm worried about race conditions: a worker could send a number of messages, then exit, the master exit event could occur, then message, then message, then disconnect... and when those messages come in, the worker ID won't be in the worker list.

I guess the question is: do we want the worker removed from the list as soon as possible, or as late as possible? ASAP means on the first of disconnect OR exit, and as late as possible is when exit, disconnect (and perhaps 'close'?) have all occurred. Since you can be disconnected, but not exited, indefinitely, it's possible for a Worker process to be alive (not-exited) well past when the disconnect event is observed in the master.

Actually, the whole disconnect thing is a mess; what it does and what is reasonable and predictable are quite far from each other. :-(

So, I'm not sure what to say. The obvious backwards-compatible bug-fix is to do as you suggest: make sure the handle delete code is run on exit as well as disconnect, and remove the duplicate delete in the worker.on disconnect and the cluster.on disconnect.
@sam-github I'm not sure I understand what you mean when you say that

Regarding the rest of your comment, is it really a problem that a worker can stay alive well past when the disconnect event has been emitted? I guess in this case, even if the process itself is still running, it's just not considered part of the cluster anymore, and it's thus safe to not have it in the cluster.workers list.

As suggested in private, maybe let's discuss this on {Skype,Hangouts} and post our conclusions here when we're done? Thank you!
Force-pushed from 51f42c1 to a686d75.
@sam-github I updated the PR according to our earlier discussion. The comments and the documentation change pretty much summarize the content of this discussion, so I won't repeat it here. One thing I'm not sure about is adding
_and_ exited. The order between these two events cannot be determined in
advance. However, it is guaranteed that the
removal from the cluster.workers list happens before the last of
`'disconnect'` and `'exit'` events is emitted.
I'd express as "before last `'disconnect'` or `'exit'` event is emitted.", and wrap to 80 columns.
Thanks @sam-github, fixed in the updated PR!
LGTM

I don't have an opinion on whether adding more private/undocumented methods to Worker is good or bad; better to ask somebody who knows more node core history, what has or has not caused problems in the past. It's definitely safer to go the C++y way, and not add as a method anything that can be expressed as a private function, but what's the point of OO if you can't add methods to objects?

I noticed a pre-existing weirdness... final

Fwiw, it would be possible to delete the (apparently unused) .state, and instead write
Force-pushed from 7b432fb to 2b19fa3.
@@ -595,7 +633,7 @@ function workerInit() {

   Worker.prototype.destroy = function() {
     this.suicide = true;
-    if (!process.connected) process.exit(0);
+    if (!this.isCOnnected()) process.exit(0);
"COnnected"? You will find this when you test.. :-)
@sam-github Good catch. For a reason I haven't had the time to investigate, all tests pass on my computer with this typo. Do they pass on yours too?
@sam-github Fixed the typo. I will check why the tests didn't catch the typo and update tests if needed.
I didn't run the tests, just read the diff, but I would bet the code isn't covered. You need to call kill or destroy in a worker to reach it, quite unusual. Particularly since this function accidentally became partially undocumented when `destroy()` was re-named to `kill()`...
@sam-github Created a new issue here: #8223.
@sam-github Thank you Sam. Indeed, my implementation of

In the latest update to this PR, I changed their implementation so that they don't rely on the state property, which as you said can't be 'dead' and 'disconnected', but instead on the process' state. I also renamed

If we decide to leave them in the Worker's prototype, the documentation will need to be updated to reflect that before merging these changes. We will also need to provide some tests for the two newly added methods. I will wait for feedback from the core team before doing that.

Thanks for the review @sam-github! /cc @tjfontaine
Force-pushed from 2b19fa3 to 90c2918.
LGTM, please add the tests and docs.
Currently, cluster workers can be removed from the workers list in three different places:

- In the exit event handler for the worker process.
- In the disconnect event handler of the worker process.
- In the disconnect event handler of the cluster master.

However, handles for a given worker are cleaned up only in one of these places: in the cluster master's disconnect event handler.

Because these events happen asynchronously, it is possible that the workers list is empty before we even clean up one handle. This makes the assert that makes sure that no handle is left when the workers list is empty fail.

This commit removes the worker from the cluster.workers list only when the worker is dead _and_ disconnected, at which point we're sure that its associated handles are cleaned up.

Fixes nodejs#8191 and nodejs#8192.
Force-pushed from 90c2918 to f5d8660.
@@ -587,7 +587,7 @@
   };

   startup.processKillAndExit = function() {
     process.exitCode = 0;
Initializing `process.exitCode` to `0` here doesn't seem to be consistent with how it's used elsewhere in the code. It means that `worker.process.exitCode` is `0` in a worker when it starts executing, although it has not exited yet. The convention in the rest of the code seems to be that `process.exitCode` is `undefined` until the process has exited.

With this change, all tests pass, but I understand it could have a big impact, so I'd rather highlight it now so that we consider it carefully.
@indutny Updated my changes, ready to review again. Thank you!
Landed in 90d1147, thank you!
Currently, cluster workers can be removed from the workers list in three
different places:

- In the exit event handler for the worker process.
- In the disconnect event handler of the worker process.
- In the disconnect event handler of the cluster master.

However, handles for a given worker are cleaned up only in one of these
places: in the cluster master's disconnect event handler.
Because these events happen asynchronously, it is possible that the
workers list is empty before we even clean up one handle. This makes
the assert that makes sure that no handle is left when the workers
list is empty fail.
We could either remove the associated handles every time we remove a
worker from the list, or remove workers from the list in only one
handler. This commit implements the latter.
Fixes #8191 and #8192.