
[cluster.js] removeHandlesForWorker() on both 'exit' and 'disconnect' events. #9418

Closed
wants to merge 2 commits

Conversation

jshkurti

[cluster.js][worker] Calls removeHandlesForWorker(worker) on both the 'exit' and 'disconnect' events instead of just 'disconnect', because sometimes the 'disconnect' event is not emitted at all; in that case some handles are left behind, and once all workers are dead an exception is raised: "AssertionError: Resource leak detected."

Hello, I previously posted this issue: #9409
I finally came up with this piece of code, which intentionally triggers the exception:

var http    = require('http');
var cluster = require('cluster');

if (cluster.isMaster) {

  cluster.on('disconnect', function(worker, code, signal) {
    console.log('worker ' + worker.process.pid + ' disconnected');
  });
  cluster.on('exit', function(worker, code, signal) {
    console.log('worker ' + worker.process.pid + ' died');
  });

  var worker = cluster.fork();
  // Simulate the missing 'disconnect' event by stripping its listeners.
  worker.process.removeAllListeners('disconnect');

  setInterval(function keepAlive() {}, 1000);

} else {
  console.log("Hello, I'm the worker with pid", process.pid);

  setTimeout(function() {
    process.exit(0);
  }, 2000);

  http.createServer(function(req, res) {
    res.writeHead(200);
    res.end("hello world\n");
  }).listen(8000);
}
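For context: when run, the worker exits after two seconds; because the master's 'disconnect' handling has been stripped, the shared handle for port 8000 is never removed, and once the worker is dead the master crashes with "AssertionError: Resource leak detected."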

And finally I found the solution, which is to double-check that the handles are removed when a worker dies, even if it emits only the 'exit' event and never 'disconnect', as sometimes happens (see the sketch below).
Calling removeHandlesForWorker() twice isn't much of a problem, because it does nothing if the worker has already been removed from that handle.
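For illustration, here is a minimal sketch of the idea rather than the exact diff; removeHandlesForWorker() is the existing internal cleanup helper in lib/cluster.js, and the surrounding handler bodies are abbreviated:

// Somewhere in lib/cluster.js, where the master wires up a forked worker:
worker.process.once('disconnect', function() {
  removeHandlesForWorker(worker);  // normal cleanup path
});

worker.process.once('exit', function(exitCode, signalCode) {
  removeHandlesForWorker(worker);  // fallback: 'disconnect' may never be emitted
});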
Thanks.

@jshkurti
Author

@misterdjules @sam-github @tjfontaine: Any thoughts on this?

@misterdjules

@jshkurti Thank you for the pull request! For now, most of the team is focused on releasing v0.12.1, which is why it's going to take some time to review it. It's at the top of my review list though, so I'll get there soon.

@sam-github

A couple of comments.

  1. This isn't a reproduction... it's deliberately damaging the cluster module by removing all listeners for 'disconnect'. I could add a line to remove all 'exit' listeners, too, and then things would still be "broken".
  2. Is it possible that you actually have code that is calling removeAllListeners, and that you are accidentally provoking this scenario?
  3. When child processes exit, the OS closes their IPC pipes, unquestionably. If 'disconnect' is not firing, I can only think of a few reasons why:
    • exit before the IPC channel was made
    • damage to the event system (what you are demoing above)
    • ... some other bug in node

Figuring out why 'disconnect' isn't occurring should be the focus, I think, but I'll look carefully at your patch to see if it is safe.
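One way to chase that down (an illustrative snippet, not from the thread) is to instrument the ordering of the two ChildProcess events on the master side:

var cluster = require('cluster');

if (cluster.isMaster) {
  var worker = cluster.fork();
  var order = [];
  worker.process.on('disconnect', function() { order.push('disconnect'); });
  worker.process.on('exit', function() {
    order.push('exit');
    // Give a late 'disconnect' a chance to land before reporting.
    setTimeout(function() {
      console.log('pid %d saw events in order: %j', worker.process.pid, order);
    }, 100);
  });
} else {
  process.exit(0);
}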

@jshkurti
Author

Thank you for the quick response :)

1. This isn't a reproduction... it's deliberately damaging the cluster module by removing all listeners for 'disconnect'. I could add a line to remove all 'exit' listeners, too, and then things would still be "broken".

I agree; it was the only way to trigger the exception on purpose, though.
In production it happens quite randomly when a worker dies.

2. Is it possible that you actually have code that is calling removeAllListeners, and that you are accidentally provoking this scenario?

No, I double-checked everything. As I said above, it happens quite randomly, say once in 50 restarts. For some reason the worker dies without firing 'disconnect', or maybe 'disconnect' is fired after 'exit'.

I updated the code:

worker.process.once('exit', function(exitCode, signalCode) {
  /*
   * Remove the handles associated with this worker
   * a second time, just in case the 'disconnect'
   * event was never emitted.
   */
  if (worker.state !== 'disconnected')
    removeHandlesForWorker(worker);

  // ... (rest of the existing 'exit' handler)
});
to avoid redundant calls to removeHandlesForWorker(), and thus only call it there if the 'exit' event fires before 'disconnect'.

@gzurbach

gzurbach commented Jun 9, 2015

Any update on this? We also use clusters, and every few hours or so all Node processes are down because of this bug. We did not have this issue last week with 0.10.

@jshkurti
Author

jshkurti commented Jun 9, 2015

+1

You can use this workaround for now:
#9409 (comment)

@gzurbach

gzurbach commented Jun 9, 2015

Thanks, I will look at this today. I was also considering using PM2 and no longer managing my own clusters, but I'd rather not add another dependency on my servers.

@jasnell
Member

jasnell commented Aug 15, 2015

@jshkurti @gzurbach ... how would you like to proceed on this one? If this is something that needs to be fixed in v0.12, then opening a new PR targeted there would be best. Otherwise, this may need to be moved over to http://github.com/nodejs/node master.

@nullivex

nullivex commented Oct 7, 2015

I am still seeing this on 4.1.2

Stopping mock cluster...

assert.js:89
  throw new assert.AssertionError({
  ^
AssertionError: Resource leak detected.
    at removeWorker (cluster.js:328:9)
    at ChildProcess.<anonymous> (cluster.js:348:34)
    at ChildProcess.g (events.js:260:16)
    at emitTwo (events.js:87:13)
    at ChildProcess.emit (events.js:172:7)
    at Process.ChildProcess._handle.onexit (internal/child_process.js:200:12)
      2) "after all" hook

@alsotang

os: Linux VM_56_101_centos 3.10.0-123.el7.x86_64 #1 SMP Mon Jun 30 12:09:22 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
node version: 4.2.1

Each time I kill a worker, the master exits too, with "AssertionError: Resource leak detected".

The cluster file is simple:

var cluster = require('cluster');
var http = require('http');

var app = require('../app');

//var numCPUs = require('os').cpus().length;
var numCPUs = 1; // bump the core count back up once the offline stats are split out


var port = process.env.PORT || '3000';

if (cluster.isMaster) {
    // Fork workers.
    for (var i = 0; i < numCPUs; i++) {
        cluster.fork();
    }

    cluster.on('fork', function (worker) {
        console.log('[%s] [worker:%d] new worker start', Date(), worker.process.pid);
    })

    cluster.on('exit', function (worker) {
        console.error('[%s] [master:%s] worker:%s disconnect, suicide: %s, state: %s.',
            Date(), process.pid, worker.process.pid, worker.suicide, worker.state);
        setTimeout(function () {
            console.log('restart worker...')
            cluster.fork();
        }, 1000)
    })
} else {
    http.createServer(app).listen(port);
}
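For completeness, a worker can also be killed from code rather than from a shell; this is an illustrative sketch (the two-second delay and the SIGKILL choice are assumptions, not from the report):

// Add inside the isMaster branch: kill the first worker abruptly,
// so it dies without a clean IPC shutdown.
setTimeout(function () {
  var id = Object.keys(cluster.workers)[0];
  cluster.workers[id].process.kill('SIGKILL');
}, 2000);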

@jshkurti
Author

nodejs/node#3510

@jshkurti jshkurti closed this Oct 30, 2015