
cluster: centralize removal from workers list. #8193

Closed

Conversation

misterdjules

Currently, cluster workers can be removed from the workers list in three
different places:

  • In the exit event handler for the worker process.
  • In the disconnect event handler of the worker process.
  • In the disconnect event handler of the cluster master.

However, handles for a given worker are cleaned up only in one of these
places: in the cluster master's disconnect event handler.

Because these events happen asynchronously, it is possible that the
workers list is empty before we even clean up one handle. This causes
the assert that checks that no handle is left once the workers list is
empty to fail.

We could either remove the associated handles every time we remove a
worker from the list, or remove workers from the list in only one
handler. This commit implements the latter.

Fixes #8191 and #8192.
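
As a rough sketch of what "centralizing" means here (illustrative names only; removeWorker, workers and handlesByWorker are stand-ins, not the actual cluster.js code): both the removal and the handle cleanup are owned by a single helper that is called from exactly one place.

    'use strict';
    var assert = require('assert');

    // Hypothetical stand-ins for the cluster.js internals described above.
    var workers = {};               // mirrors cluster.workers
    var handlesByWorker = {};       // worker.id -> handles it still owns

    // A single helper owns both the workers-list removal and the handle
    // cleanup, and it is called from exactly one event handler, so the two
    // pieces of bookkeeping cannot get out of sync.
    function removeWorker(worker) {
      delete workers[worker.id];
      delete handlesByWorker[worker.id];

      if (Object.keys(workers).length === 0) {
        // The assert described above only holds if handle cleanup never
        // lags behind removal from the workers list.
        assert.strictEqual(Object.keys(handlesByWorker).length, 0);
      }
    }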

@misterdjules
Author

@sam-github Since you seem to know the cluster module very well, I thought I'd put you in the loop. If you had some time to look at this change that would be great! Thank you!

@sam-github

The change will need a doc change, since current behaviour is documented.

I've been bitten by the "when exactly does cluster.workers[id] return undefined" problem before (I saved an ID, sent the worker a signal, then if it didn't exit in some period, tried to look up the worker by ID... and it wasn't there).
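
For illustration, that lookup-by-id pattern looks roughly like this (the signal and the timeout are arbitrary; whether the final lookup still finds the worker depends on exactly when it was removed):

    'use strict';
    var cluster = require('cluster');

    if (cluster.isMaster) {
      var worker = cluster.fork();
      var id = worker.id;

      worker.kill('SIGTERM');            // ask the worker to shut down

      setTimeout(function() {
        // Depending on when 'disconnect'/'exit' fired, the entry may
        // already have been deleted, so this lookup can come back undefined.
        var w = cluster.workers[id];
        if (w === undefined)
          console.log('worker %d is already gone from cluster.workers', id);
        else
          w.kill('SIGKILL');             // escalate if it is still around
      }, 2000);
    } else {
      setInterval(function() {}, 1000);  // keep the worker alive until signalled
    }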

So, I like the idea, but I think it's really important that the ordering of event emissions and the deletion of the worker is documented and consistent. As-is, it appears to me that my docs are wrong. The worker is deleted AFTER the worker disconnect event is emitted, but BEFORE the cluster .disconnect event is fired (well, technically, it is deleted in the first cluster disconnect listener; the user-provided ones will always run after cluster's listener). Do you read it the same way?

Anyhow, for the sake of guaranteeing the state of cluster.workers at the time that either disconnect event occurs, what do you think of deleting the worker and its handles at https://github.com/joyent/node/blob/master/lib/cluster.js#L323, before either the worker or cluster disconnect event is emitted? Basically removing this handler, https://github.com/joyent/node/blob/master/lib/cluster.js#L348, and moving its code to before line 323?

@misterdjules
Author

@sam-github Yes, I think we read the code the same way. However, one thing you didn't mention is that the worker can also sometimes be deleted after the worker exit event is emitted, which can be before the worker disconnect event is emitted.

I agree with your suggestion about removing the handler for the 'disconnect' event on the cluster's master, and moving its implementation to the worker's disconnect handler.

However, I think we should also execute the same code in the worker's exit event handler for the reason mentioned above and if we want to be consistent with the documentation.

What do you think?

@sam-github

I'm worried about race conditions: a worker could send a number of messages, then exit; the master's exit event could occur, then a message, then another message, then disconnect... and when those messages come in, the worker ID won't be in the workers list.

I guess the question is: do we want the worker removed from the list as soon as possible, or as late as possible? ASAP means on first disconnect OR exit, and as late as possible is when exit, disconnect (and perhaps 'close'?) have occurred.

Since you can be disconnected, but not exited, indefinitely, it's possible for a Worker process to be alive (not exited) well past when the disconnect event is observed in the master.

Actually, the whole disconnect thing is a mess, what it does and what is reasonable and predictable are quite far from each other. :-( So, I'm not sure what to say.

The obvious backwards-compatible bug fix is to do as you suggest: make sure the handle delete code is run on exit as well as disconnect, and remove the duplicate delete in the worker.on disconnect and the cluster.on disconnect handlers.
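
To make the disconnected-but-not-exited case above concrete, a rough sketch (timings are arbitrary):

    'use strict';
    var cluster = require('cluster');

    if (cluster.isMaster) {
      var worker = cluster.fork();

      worker.on('disconnect', function() {
        // The IPC channel is gone, but the process has not necessarily
        // exited; whether cluster.workers still lists it is exactly the
        // behaviour being discussed here.
        console.log('disconnected; still listed?',
                    cluster.workers[worker.id] !== undefined);
      });

      worker.on('exit', function(code, signal) {
        console.log('exited:', code, signal);
      });

      setTimeout(function() { worker.disconnect(); }, 100);
    } else {
      // Non-IPC work keeps the worker alive after the channel closes, so
      // 'disconnect' is observed in the master well before 'exit'.
      setInterval(function() {}, 1000);
      setTimeout(function() { process.exit(0); }, 5000);
    }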

@misterdjules
Author

@sam-github I'm not sure I understand what you mean when you say that "a worker could send a number of messages, then exit, master exit event could occur, then message, then message, then disconnect.... and when those messages come in, the worker ID won't be in the worker list."

Regarding the rest of your comment, is it really a problem that a worker can stay alive well past when the disconnect event has been emitted? I guess in this case, even if the process itself is still running, it's just not considered part of the cluster anymore, and it's thus safe not to have it in the cluster.workers list.

As suggested in private, maybe let's discuss this on {Skype,Hangouts} and post our conclusions here when we're done?

Thank you!

@misterdjules force-pushed the fix-issue-8191 branch 2 times, most recently from 51f42c1 to a686d75 on August 19, 2014 at 23:39
@misterdjules
Author

@sam-github I updated the PR according to our earlier discussion. The comments and the documentation change pretty much summarize the content of this discussion, so I won't report it here.

One thing I'm not sure about is adding isDead() and isDisconnected() to Worker's prototype. If that's an issue, we can make these two functions internal.

_and_ exited. The order between these two events cannot be determined in
advance. However, it is guaranteed that the
removal from the cluster.workers list happens before the last of
`'disconnect'` and `'exit'` events is emitted.

@sam-github

I'd express as "before last 'disconnect' or 'exit' event is emitted.", and wrap to 80 columns.

@misterdjules
Author

Thanks @sam-github, fixed in the updated PR!

@sam-github

LGTM

I don't have an opinion on whether adding more private/undocumented methods to Worker is good or bad; better to ask somebody who knows more node core history and what has or has not caused problems in the past. It's definitely safer to go the c++y way and not add as a method anything that can be expressed as a private function, but what's the point of OO if you can't add methods to objects?

I noticed a pre-existing weirdness... the final worker.state is either 'dead' or 'disconnected', depending on which event happened last. This strikes me as odd. Even more strangely, until this change, I can find no evidence that the worker state was actually used... some trouble is gone through to assign it various values, but no comparisons are made against it (your code is the first to use it, at least in cluster.js). This implies that it exists to be public and used elsewhere, but it's undocumented. Hm. A mystery.

Fwiw, it would be possible to delete the (apparently unused) .state, and instead write isDead() as `this.process.exitCode != null || this.process.signalCode != null`, and isDisconnected() as `!this.process.connected`, reducing the amount of double state tracking.
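
Expressed as methods, that would look roughly like this; a sketch only, assuming it sits inside lib/cluster.js where the Worker constructor is in scope, not the code in this PR:

    // Sketch: derive both answers from worker.process instead of the
    // separately tracked .state property.
    Worker.prototype.isDead = function() {
      return this.process.exitCode != null || this.process.signalCode != null;
    };

    Worker.prototype.isDisconnected = function() {
      return !this.process.connected;
    };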

@misterdjules force-pushed the fix-issue-8191 branch 2 times, most recently from 7b432fb to 2b19fa3 on August 21, 2014 at 17:12
@@ -595,7 +633,7 @@ function workerInit() {

   Worker.prototype.destroy = function() {
     this.suicide = true;
-    if (!process.connected) process.exit(0);
+    if (!this.isCOnnected()) process.exit(0);

@sam-github

"COnnected"? You will find this when you test.. :-)

@misterdjules
Author

@sam-github Good catch. For some reason that I haven't had the time to investigate, all tests pass on my computer with this typo. Do they pass on yours too?

@misterdjules
Author

@sam-github Fixed the typo. I will check why the tests didn't catch the typo and update tests if needed.

@sam-github

I didn't run the tests, just read the diff, but I would bet the code isn't
covered. You need to call kill or destroy in a worker to reach it, quite
unusual. Particularly since this function accidentally became partially
undocumented when destroy() was re-named to kill()...
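
A minimal regression test along those lines might look something like the following; the structure and the assumption that a self-killed worker exits with code 0 are mine, not part of the PR:

    'use strict';
    var assert = require('assert');
    var cluster = require('cluster');

    if (cluster.isMaster) {
      var worker = cluster.fork();
      var sawExit = false;

      worker.on('exit', function(code, signal) {
        sawExit = true;
        // Assumption: a worker that kills itself exits cleanly with code 0.
        assert.strictEqual(code, 0);
        assert.strictEqual(signal, null);
      });

      process.on('exit', function() {
        assert(sawExit, 'worker never exited');
      });
    } else {
      // Calling kill()/destroy() from *inside* the worker is the code path
      // the existing tests apparently never reach.
      cluster.worker.kill();
    }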

@misterdjules
Author

@sam-github Created a new issue here: #8223.

@misterdjules
Author

@sam-github Thank you Sam. Indeed, my implementation of Worker.prototype.isDead and Worker.prototype.isDisconnected was broken. They wouldn't be usable outside of this specific use case, which would completely defeat the purpose of adding them to the prototype.

In the latest update to this PR, I changed their implementation so that they don't rely on the state property, which, as you said, can't be both 'dead' and 'disconnected', but instead on the process's state. I also renamed isDisconnected to isConnected and updated the implementation accordingly.

If we decide to leave them in the Worker's prototype, the documentation will need to be updated to reflect that before merging these changes. We will also need to provide some tests for the two newly added methods. I will wait for feedback from the core team before doing that.
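
Assuming the two methods stay on the prototype, usage from the master would look something like this:

    'use strict';
    var cluster = require('cluster');

    if (cluster.isMaster) {
      var worker = cluster.fork();

      worker.on('disconnect', function() {
        console.log('isConnected:', worker.isConnected()); // false
        console.log('isDead:', worker.isDead());           // may still be false here
      });

      worker.on('exit', function() {
        console.log('isDead:', worker.isDead());           // true
      });
    } else {
      // Nothing to do: the worker runs to completion, disconnecting and
      // exiting on its own.
    }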

Thanks for the review @sam-github!

/cc @tjfontaine

@indutny
Member

indutny commented Aug 27, 2014

LGTM, please add the tests and docs.

Currently, cluster workers can be removed from the workers list in three
different places:
- In the exit event handler for the worker process.
- In the disconnect event handler of the worker process.
- In the disconnect event handler of the cluster master.

However, handles for a given worker are cleaned up only in one of these
places: in the cluster master's disconnect event handler.

Because these events happen asynchronously, it is possible that the
workers list is empty before we even clean up one handle. This causes
the assert that checks that no handle is left once the workers list is
empty to fail.

This commit removes the worker from the cluster.workers list only when
the worker is dead _and_ disconnected, at which point we're sure that
its associated handles are cleaned up.

Fixes nodejs#8191 and nodejs#8192.
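
In other words, both event handlers funnel into one check along these lines; this is a sketch of the approach with an illustrative helper name (maybeRemoveWorker is not the real function), not the exact code that landed:

    // Somewhere on the master side of lib/cluster.js (names illustrative):
    function maybeRemoveWorker(worker) {
      // Only drop the worker once *both* conditions hold; only then are its
      // associated handles guaranteed to have been cleaned up.
      if (worker.isDead() && !worker.isConnected()) {
        delete cluster.workers[worker.id];
      }
    }

    worker.process.once('exit', function() { maybeRemoveWorker(worker); });
    worker.process.once('disconnect', function() { maybeRemoveWorker(worker); });
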
@@ -587,7 +587,7 @@
   };

   startup.processKillAndExit = function() {
-    process.exitCode = 0;
@misterdjules
Author

Initializing process.exitCode to 0 here doesn't seem to be consistent with how it's used elsewhere in the code. It means that worker.process.exitCode is 0 in a worker when it starts executing, although it has not exited yet.

The convention in the rest of the code seems to be that process.exitCode is undefined until the process has exited.

With this change, all tests pass, but I understand it could have a big impact, so I'd rather highlight it now so that we consider it carefully.
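
For reference, the convention is easy to observe directly; this is plain Node.js behaviour, not specific to this patch:

    'use strict';

    // Before anything assigns to it, process.exitCode is unset.
    console.log(process.exitCode);             // undefined

    process.on('exit', function(code) {
      // The actual exit code is only known once the process is exiting.
      console.log('exiting with code', code);  // 0, since nothing set it
    });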

@misterdjules
Author

@indutny Updated my changes, ready to review again. Thank you!

@indutny
Member

indutny commented Sep 2, 2014

Landed in 90d1147, thank you!

@indutny closed this on Sep 2, 2014