-
Notifications
You must be signed in to change notification settings - Fork 7.3k
cluster: don't assert if worker has no handles #8642
cluster: don't assert if worker has no handles #8642
Conversation
@misterdjules this is a regression introduced in 0.11, trivial to repro and tiny to fix, do you have ability to get some motion on it? |
At the very least it could be tagged as a bug. |
Thank you @sam-github for the PR! A few observations:
Other minor suggestions:
|
Thanks @misterdjules, I will fix this up. |
I added a possible comment, but honestly, I think its the original assert that needed the comment, the code WITHOUT the comment is pretty clear. A worker is being removed from a handle, but what reason is there to expect every worker to listen on EVERY handle? For handles that a worker didn't listen on, no need to remove it, I patched up the other small things. Alas, as I don't have a windows node build env ATM, it'll take me some time tomorrow to get my dog-slow windows vm to grind out a build. |
@sam-github Thanks! Regarding the comment, I agree with you that it's actually overkill. My initial suggestion mostly came from the fact that I revisit the cluster's implementation only once in a while, and every time It takes me some time to remember how it all fits together. Sorry for that. As for the Windows build, please let me know if you need any help with that, I have a pretty good Windows 7 VM running on my laptop. |
Actually, I think I know what the deal is. dgram isn't supported by cluster on windows, and nothing but dgram uses the shared handle on linux.... so by default there isn't a single socket type that can be used to make a portable test, though the fix applies to any use of shared handle. I'm going to force everything to be shared and use tcp, which should reliably show both the shared handle bug,and the fix, on windows and linux. |
I spent about an hour trying to find a way to make this work on Windows, since dgram isn't supported on windows, and just made a mess. I'll try to get time to look at it again tomorrow, hopefully with all the Americans on vacation it will be quiet. |
OK, I think I found my confusion. I thought I could trigger the SharedHandle to be used on linux and Windows by disabling round-robin, but that didn't work. It appears that all socket types other than dgram are not effected because there is some extra handle close messages that occur, but dgram doesn't do that. So, it only effects dgram, and dgram isn't supported on Windows with cluster... so I'm comfortable saying this is a non-Windows test. |
@misterdjules PTAL |
While taking another look at it, I found that running the newly added test ( If understand correctly, I think that we want the test to behave such as the worker that doesn't share the handle disconnects first, so that the master attempts to remove it from the handle. If the non-sharing worker disconnects after the sharing worker, then the test passes even without the fix. I suggest maybe rewriting the fix like following:
@sam-github What do you think? |
|
||
if (cluster.isMaster) { | ||
var messages = 0; | ||
var ports = {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are messages
and ports
used in this test?
@sam-github Also, running the test modified as mentioned above but with the fix that removes the assert in
Here's how I fixed it locally:
However, I want to validate with you that the test as currently written in this PR needs to be fixed before discussing about that error. Thank you! |
I agree, the test had a scheduling order dependency. It was always working (failing without my fix) on my machine, but I can see why that would be fluke. I added a stripped down test for the bug exposed by disconnect before exit. |
PTAL @misterdjules |
// disconnected worker would error. | ||
worker.on('disconnect', cluster.disconnect); | ||
} | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need the else block here?
Thanks again @sam-github! It does look good to me. One last thing, could you please elaborate on:
? I'm not sure I understand exactly how and when the "extra handle close close messages" happen for non-dgram. Thank you! |
@misterdjules I'm sorry, I was confused. The ordering dependency of the tests, wherein they always passed for dgram, but always failed for tcp, was confusing me. I wrote a tcp version that forces use of SharedHandle (non-round robin) and renamed the existing tests. I've confirmed the tests pass on linux and windows, and fail without my changes, on linux and windows (dgram is skipped on windows, though, because node cluster doesn't support dgram on windows). |
@sam-github No worries, and thanks for the update. I don't know if you got that on IRC, but I'm off until January 5th, so I probably won't have time to review it until then. |
@sam-github Did you have a chance to look at my comment? Otherwise LGTM. Depending on where we are with the merge from v0.10 to v0.12, we'll land it before or after the merge is done. I'll keep you posted. Thanks! |
Adding it to the 0.11.15 milestone because it's been reviewed and tested, so it should be quick and easy to land. |
Didn't see your comment, I'll drop the else. |
For shared handles that do not get connection close messages (UDP/dgram is the only example of this), cluster must not assume that a port listened on by one worker is listened on by all workers. PR-URL: nodejs/node-v0.x-archive#8642 Reviewed-by: Bert Belder <bertbelder@gmail.com>
Workers that are already disconnected but not yet exited should not be disconnected, trying to do so raises exceptions. PR-URL: nodejs/node-v0.x-archive#8642 Reviewed-by: Bert Belder <bertbelder@gmail.com>
lgtm, landed in io.js: nodejs/node@a80b977...092c224 |
For shared handles that do not get connection close messages (UDP/dgram is the only example of this), cluster must not assume that a port listened on by one worker is listened on by all workers.
Workers that are already disconnected but not yet exited should not be disconnected, trying to do so raises exceptions.
As it says.