
parallel/test-child-process-exec-timeout not cleaning up child process on win7 #11052

Closed
BethGriggs opened this issue Jan 28, 2017 · 24 comments
Labels: child_process, test, windows

Comments

@BethGriggs
Member

  • Version: Seen on 6.9.4
  • Platform: Windows (seen on win7)
  • Subsystem: Test

The test parallel/test-child-process-exec-timeout is not always cleaning up its child processes. When I run the test 10 times (tools/test.py parallel/test-child-process-exec-timeout), I get around 4-5 leftover child processes.

Occasionally, when the processes are left lying around, the parallel test suite 'hangs' on its 1157th test. The test runner does not time out in this instance - the test hangs permanently and the suite never completes.

/cc @cjihrig @gibfahn

[screenshot: 2017-01-28 13:37:17]

@gibfahn added the windows, child_process, and test labels on Jan 28, 2017
@gibfahn
Member

gibfahn commented Jan 28, 2017

cc/ @nodejs/platform-windows

@gibfahn
Member

gibfahn commented Jan 28, 2017

Looks like the test was added in #9208.

Based on that also cc/ @santigimeno

@cjihrig
Contributor

cjihrig commented Jan 29, 2017

@BethGriggs does the test pass, even though the child processes are left behind?

@gibfahn
Member

gibfahn commented Jan 29, 2017

@cjihrig Yes, it does pass. You can run tools/test.py --repeat 10 parallel/test-child-process-exec-timeout and it will finish fine; it's only when you do a full test run on a machine that already has a bunch of test-child-process-exec-timeout processes lying around that one of the tests hangs indefinitely.

@gdams
Member

gdams commented Jan 31, 2017

@gibfahn I was able to recreate the issue by running tools/test.py --repeat 100 parallel/test-child-process-exec-timeout with no processes lying around at the start.

@BethGriggs
Member Author

BethGriggs commented Jan 31, 2017

The last parallel run I did ended up with three tests hanging - test-child-process-flush-stdio, test-child-process-exec-stdout-stderr-data-string, and test-child-process-exit-code. These tests all hung until I manually killed a few of the test-child-process-exec-timeout child processes.

There seem to be two issues here:

  1. parallel/test-child-process-exec-timeout is not always cleaning up its child processes.
  2. The test runner is not timing out when these tests hang. By default the tests should time out after 60 seconds; as shown in the screenshot, one test reached 290 seconds (which is when I manually killed some of the leftover child processes).

[screenshot: 2017-01-31 12:31:43]

To reproduce:

  1. Run tools\test.py --repeat 10 parallel/test-child-process-exec-timeout to create some leftover processes.
  2. Run tools\test.py -J -p tap parallel and the parallel suite will hang about 1 in every 5 times.

@sam-github
Contributor

It's too bad exec doesn't allow an ipc channel to be attached to the child.

It may be possible to rewrite the test so that the child replaces or augments its internal timeout with a TCP connection to its parent process, plus a handler that self-exits when that connection cannot be made or is lost. That would limit the lifetime of the child to just a bit more than the parent's; right now it looks like the child won't terminate if its parent terminates. A rough sketch of the idea follows.
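
Something along these lines, purely as an illustration of the idea (the argv-based port handoff, exec options, and timings are my own invention, not the test's actual code):

// The parent listens on an ephemeral TCP port and passes it to the child;
// the child exits as soon as the connection cannot be made or is lost, so it
// cannot outlive the parent by much even if its own timers misbehave.
'use strict';
const net = require('net');
const cp = require('child_process');

if (process.argv[2] === 'child') {
  const tether = net.connect(Number(process.argv[3]), '127.0.0.1');
  tether.on('error', () => process.exit(1));  // parent was already gone
  tether.on('close', () => process.exit(0));  // parent went away later
  // ...the child's real work / internal timeout would go here...
  return;
}

const sockets = [];
const server = net.createServer((sock) => sockets.push(sock));
server.listen(0, '127.0.0.1', () => {
  const port = server.address().port;
  cp.exec(`"${process.execPath}" "${__filename}" child ${port}`,
          { timeout: 1000 },
          (err, stdout, stderr) => {
    // Assertions would go here. Destroying the sockets (or simply letting the
    // parent exit) is what reaps any grandchild the timeout failed to kill.
    sockets.forEach((s) => s.destroy());
    server.close();
  });
});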

@cjihrig
Contributor

cjihrig commented Feb 3, 2017

Is it possible to identify which of the three child processes is the problematic one (the test could be modified to pass an additional unique argument to each child)?

Shouldn't the child processes be exiting once their setTimeout() fires? Could something be going wrong with common.platformTimeout() on Windows? If the timeout is working properly, would adding a forced process.exit() in the child after the console.error() call work?
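
For illustration, the kind of change being suggested might look roughly like this (a sketch only - the tags, delays, and exec options here are invented, not the test's real source):

// Tag each child with a unique argument so leftovers can be identified in the
// process list, and force an exit after the output so the child cannot linger
// even if its timer behaves oddly on Windows.
'use strict';
const cp = require('child_process');

if (process.argv[2] === 'child') {
  const tag = process.argv[3];  // e.g. 'sigterm', 'sigkill', 'expired'
  setTimeout(() => {
    console.log(`child ${tag} stdout`);
    console.error(`child ${tag} stderr`);
    process.exit(0);  // forced exit right after the output
  }, 3000);
  return;
}

for (const tag of ['sigterm', 'sigkill', 'expired']) {
  cp.exec(`"${process.execPath}" "${__filename}" child ${tag}`,
          { timeout: 1000 },
          (err, stdout, stderr) => { /* assertions elided */ });
}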

@sam-github
Contributor

sam-github commented Mar 31, 2017

This test is pretty subtle. For one thing, node is not the "child" process that is getting killed; the shell is. That means that node never gets signalled - it always times out and attempts to write its output. I couldn't figure out how it ever passed, even on Linux, so I read child_process. It turns out that the timeout implementation has an undocumented behaviour: before it signals its child (the shell), it destroys the stdout/stderr pipes from the shell. It's because of this pipe destruction that the output is not seen, not because the child is killed before the timeout. As I understand it, bash on OS X has a unique behaviour where it signals its pgrp on exit, causing its child (the test's grandchild) to die, so that's a bit different, but it also works.

The reason the test works is so non-obvious that, even though it does work (on unix), I don't think it works as the author intended.
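
To make that undocumented behaviour concrete, here is a minimal standalone sketch (my own reading of it, not the test itself; the -e script and timings are made up):

// With { timeout }, exec() tears down the shell's stdout/stderr pipes before
// signalling the shell, so output the grandchild produces after the timeout
// never reaches the callback, even though the grandchild itself was never
// signalled and keeps running.
'use strict';
const cp = require('child_process');

const script = "setTimeout(() => console.log('late output'), 2000)";
cp.exec(`"${process.execPath}" -e "${script}"`,
        { timeout: 500, killSignal: 'SIGTERM' },
        (err, stdout, stderr) => {
  // Expected on Linux: err.killed === true and err.signal === 'SIGTERM'
  // (reported for the shell), with an empty stdout - the pipes were destroyed
  // at the timeout, not the grandchild.
  console.log('signal:', err && err.signal, 'stdout:', JSON.stringify(stdout));
});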

As for Windows, the description above is confusing. From talking to Jeremiah on my team (EDIT: @Akpotohwo), what sounds like is happening is that the test never completes and is eventually killed by the test runner, so the real problem is the test never completing - not that children leak (that's a follow-on effect of the test runner killing the top-level node process). However, he also says that only the first exec with timeout, using SIGTERM, is leaking. That is odd, because on Windows SIGTERM and SIGKILL are handled identically (see uv__kill in deps/uv/src/windows), though after the child terminates UV lies and claims it was killed with a signal (it wasn't, Windows has no signals, but the lie makes node behave more uniformly across Windows and Linux).

I don't know why this is, but I'm wondering if there is something about the destruction of the stdout/stderr pipes that is confusing the child (remember, the child is cmd.exe, not node), or possibly confusing the node grandchild, and preventing its exit on timeout.

@santigimeno
Member

For one thing, node is not the "child" process that is getting killed, the shell is. That means that node never gets signalled, ...

This is what we observed (on Linux) when looking into test-child-process-exec-kill-throws flakiness. So, IIUIC, killing a process launched with exec() will never kill the node grandchild process, just the shell, right? If that's the case, would it be a good idea to document that, as I think people would expect it to do so? Is there a way to propagate the signal to the node grandchild?

@sam-github
Contributor

sam-github commented Mar 31, 2017

There is a big note: https://nodejs.org/api/child_process.html#child_process_child_process_execfile_file_args_options_callback (EDIT: the note is above that link), but it maybe doesn't go far enough in describing the implications of this. I've seen people complain that child.pid, for example, isn't the pid of the command they ran but of the shell, and people often complain about the kill behaviour. It's in theory possible to kill groups of processes (on unix, not sure about windows), but you have to put them in a group, and it's not clear that node should do that by itself, but maybe it should. Worth some thought.
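
For what it's worth, a unix-only sketch of the "put them in a group" idea (this is not something exec() does today; the command and timings below are invented):

// spawn() with detached: true makes the shell a process group leader, and
// signalling the negative pid reaches the whole group, including the node
// grandchild that a plain child.kill() would miss.
'use strict';
const cp = require('child_process');

const shellCmd = `"${process.execPath}" -e "setInterval(() => {}, 1000)"`;
const child = cp.spawn('/bin/sh', ['-c', shellCmd],
                       { detached: true, stdio: 'ignore' });

setTimeout(() => {
  process.kill(-child.pid, 'SIGTERM');  // negative pid => signal the group
}, 1000);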

@sam-github
Contributor

And the fact that the child's pipes are destroyed should be documented as well.

@Trott
Member

Trott commented Jul 31, 2017

Can someone confirm whether this is still an issue on Windows? I'm guessing it is, but the test has had a few changes since the problem was reported, so ¯\_(ツ)_/¯

@BethGriggs
Member Author

@Trott, I'll see if I can reproduce with the latest test changes

@BethGriggs
Member Author

BethGriggs commented Aug 1, 2017

Processes are still being left over with the latest test changes. 10 runs of parallel/test-child-process-exec-timeout result in about 6-7 leftover processes. I tested with master, 8.x, and 6.x.

However, I have not been able to reproduce the second part of the issue where the parallel test suite hangs until I kill some of the remaining processes.

@Trott
Member

Trott commented Apr 30, 2018

I hate to ask every six months and just hope that the problem goes away, but.... Is this still an issue?

@gireeshpunathil
Member

Yes, I saw this yesterday in a local run. And I believe this could potentially cascade into other failures as well, for example sequential/test-benchmark-http, which we saw in #20389. I can spare some time next week to debug this.

@Trott
Member

Trott commented May 7, 2018

FWIW, I'm able to emulate the leaving-a-process-lying-around on macOS if I run enough of these in parallel.
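
Rough sketch of the kind of parallel stress run meant here (the exact script isn't shown in the thread; the instance count matches the 64 mentioned in the next comment, and the path assumes the repository root as the working directory):

// Launch many copies of the test at once, then check `ps` afterwards for
// node processes that were left behind.
'use strict';
const { spawn } = require('child_process');
const path = require('path');

const test = path.join('test', 'parallel', 'test-child-process-exec-timeout.js');
const INSTANCES = 64;

for (let i = 0; i < INSTANCES; i++) {
  spawn(process.execPath, [test], { stdio: 'inherit' });
}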

@gireeshpunathil
Member

Thanks @Trott, your simple test case was wonderful; it reproduced the hang consistently on macOS within a couple of iterations when running 64 instances altogether.

#ps
  PID TTY           TIME CMD
 1707 ttys000    0:00.66 -bash
39634 ttys000    0:00.14 node 20414.js
39701 ttys000    0:01.23 /usr/local/bin/node /Users/gireeshpunathil/Desktop/support/issues/20414.js chil
 1714 ttys001    0:00.01 -bash
 8182 ttys002    0:00.03 -bash
38346 ttys003    0:00.06 -bash

Attaching lldb showed something surprising: the child had indeed succumbed to a SIGSEGV, but the signal was not terminating the process; instead it was causing high CPU - probably the crash handler and the tracing agent running in a tight loop or something like that. I have no clue about that, and don't know whether --prof will help here or not.

#lldb 
(lldb) attach 39701
Process 39701 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x440)
    frame #0: 0x00000001000a7c2d node`node::tracing::Agent::Stop() + 9
node`node::tracing::Agent::Stop:
->  0x1000a7c2d <+9>:  cmpb   $0x0, 0x440(%rbx)
    0x1000a7c34 <+16>: je     0x1000a7c69               ; <+69>
    0x1000a7c36 <+18>: movq   0x460(%rbx), %rdi
    0x1000a7c3d <+25>: callq  0x100abf040               ; v8::platform::tracing::TracingController::StopTracing()
Target 0: (node) stopped.

Executable module set to "/usr/local/bin/node".
Architecture set to: x86_64h-apple-macosx.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x440)
  * frame #0: 0x00000001000a7c2d node`node::tracing::Agent::Stop() + 9
    frame #1: 0x0000000100030beb node`node::SignalExit(int) + 25
    frame #2: 0x00007fff5d7faf5a libsystem_platform.dylib`_sigtramp + 26
    frame #3: 0x00007fff5d7ff3e9 libsystem_platform.dylib`_platform_memmove + 1
    frame #4: 0x00007fff5b5ba272 libc++.1.dylib`std::__1::basic_streambuf<char, std::__1::char_traits<char> >::xsputn(char const*, long) + 76
    frame #5: 0x00007fff5b5cc2ec libc++.1.dylib`std::__1::ostreambuf_iterator<char, std::__1::char_traits<char> > std::__1::__pad_and_output<char, std::__1::char_traits<char> >(std::__1::ostreambuf_iterator<char, std::__1::char_traits<char> >, char const*, char const*, char const*, std::__1::ios_base&, char) + 294
    frame #6: 0x00007fff5b5cc5c9 libc++.1.dylib`std::__1::num_put<char, std::__1::ostreambuf_iterator<char, std::__1::char_traits<char> > >::do_put(std::__1::ostreambuf_iterator<char, std::__1::char_traits<char> >, std::__1::ios_base&, char, unsigned long) const + 305
    frame #7: 0x00007fff5b5c03c6 libc++.1.dylib`std::__1::basic_ostream<char, std::__1::char_traits<char> >::operator<<(unsigned long) + 240
    frame #8: 0x000000010050ce17 node`v8::internal::ComputeFlagListHash() + 295
    frame #9: 0x0000000100032afb node`node::Init(int*, char const**, int*, char const***) + 848
    frame #10: 0x0000000100033294 node`node::Start(int, char**) + 96
    frame #11: 0x0000000100001034 node`start + 52

(lldb) disassemble -f
node`node::tracing::Agent::Stop:
    0x1000a7c24 <+0>:  pushq  %rbp
    0x1000a7c25 <+1>:  movq   %rsp, %rbp
    0x1000a7c28 <+4>:  pushq  %rbx
    0x1000a7c29 <+5>:  pushq  %rax
    0x1000a7c2a <+6>:  movq   %rdi, %rbx
->  0x1000a7c2d <+9>:  cmpb   $0x0, 0x440(%rbx)

(lldb) reg read rbx
     rbx = 0x0000000000000000
(lldb) 

This corresponds to this source line

The crash happened when trying to access the member field started_ through a null this pointer (rbx is 0), which essentially means the Agent is either corrupted or not set up at this point.

#ps -fr
  UID   PID  PPID   C STIME   TTY           TIME CMD
  501 39701 39634   0 10:47AM ttys000    0:54.36 /usr/local/bin/node /Users/gireeshpunathil/Desktop/support/issues/20414.js child
  501 39634     1   0 10:47AM ttys000    0:00.14 node 20414.js
  501 38346 38345   0  9:48AM ttys003    0:00.06 -bash
  501  8182  8181   0 Sun07PM ttys002    0:00.03 -bash
  501  1714  1713   0 Sun12PM ttys001    0:00.01 -bash
  501  1707  1706   0 Sun12PM ttys000    0:00.68 -bash

If I manually kill the child, the parent springs into action with the usual failure message. I guess the python parent (the test runner) in the actual test case does the same.

#kill -9 39701
#assert.js:77
  throw new AssertionError(obj);
  ^

AssertionError [ERR_ASSERTION]: Input A expected to strictly equal input B:
+ expected - actual

- 'SIGKILL'
+ 'SIGTERM'
    at cp.exec (/Users/gireeshpunathil/Desktop/support/issues/20414.js:14:9)
    at ChildProcess.exithandler (child_process.js:289:5)
    at ChildProcess.emit (events.js:182:13)
    at maybeClose (internal/child_process.js:947:16)
    at Process.ChildProcess._handle.onexit (internal/child_process.js:236:5)

@gireeshpunathil
Member

edit: source line is this:

if (!started_)

@gireeshpunathil
Member

The Mac spindump showed that these frames were present in every one of the 1000 samples taken over 10 seconds! I am trying to make some sense of it.

1000  start + 52 (node + 4148) [0x100001034]
1000  node::Start(int, char**) + 96 (node + 209556) [0x100033294]
1000  node::Init(int*, char const**, int*, char const***) + 37 (node + 206800) [0x1000327d0]
1000  uv_default_loop + 33 (node + 9613071) [0x10092af0f]
1000  uv_loop_init + 361 (node + 9633437) [0x10092fe9d]
1000  uv_once + 9 (node + 9659457) [0x100936441]
1000  pthread_once + 56 (libsystem_pthread.dylib + 14681) [0x7fff5d804959]
1000  _os_once + 33 (libsystem_platform.dylib + 3840) [0x7fff5d7f9f00]
1000  __pthread_once_handler + 66 (libsystem_pthread.dylib + 14782) [0x7fff5d8049be]
1000  uv__signal_global_init + 47 (node + 9641262) [0x100931d2e]
1000  ??? (node + 22593584) [0x10158c030]
1000  dyld_stub_binder + 282 (libdyld.dylib + 5058) [0x7fff5d57a3c2]
1000  dyld::fastBindLazySymbol(ImageLoader**, unsigned long) + 89 (dyld + 17752) [0x1113e2558]
1000  ImageLoaderMachOCompressed::doBindFastLazySymbol(unsigned int, ImageLoader::LinkContext const&, void (*)(), void (*)()) + 250 (dyld + 121610) [0x1113fbb0a]
1000  ImageLoaderMachOCompressed::resolve(ImageLoader::LinkContext const&, char const*, unsigned char, long, ImageLoader const**, ImageLoaderMachOCompressed::LastLookup*, bool) + 308 (dyld + 115412) [0x1113fa2d4]
1000  ImageLoaderMachOCompressed::resolveTwolevel(ImageLoader::LinkContext const&, char const*, ImageLoader const*, ImageLoader const*, unsigned int, bool, bool, ImageLoader const**) + 86 (dyld + 114774) [0x1113fa056]
1000  ImageLoader::findExportedSymbolAddress(ImageLoader::LinkContext const&, char const*, ImageLoader const*, int, bool, ImageLoader const**, unsigned long*) const + 50 (dyld + 67152) [0x1113ee650]
1000  ImageLoader::weakBind(ImageLoader::LinkContext const&) + 1204 (dyld + 70078) [0x1113ef1be]
1000  _sigtramp + 26 (libsystem_platform.dylib + 8026) [0x7fff5d7faf5a]
1000  node::SignalExit(int) + 25 (node + 199659) [0x100030beb]
1000  node::tracing::Agent::Stop() + 9 (node + 687149) [0x1000a7c2d]

@bnoordhuis
Member

@gireeshpunathil That's #14802.

@gireeshpunathil
Member

This is awesome, thanks @bnoordhuis!

With that info it is now pretty easy to reproduce without needing a child process at all. Insert this patch and just run node.

--- a/deps/uv/src/unix/signal.c
+++ b/deps/uv/src/unix/signal.c
@@ -73,6 +73,7 @@ static void uv__signal_global_init(void) {
      */
     if (pthread_atfork(NULL, NULL, &uv__signal_global_reinit))
       abort();
+    raise(SIGTERM);

This loops between the signal handler and the Agent::Stop() method with high CPU.

#lldb ./node
(lldb) target create "./node"
Current executable set to './node' (x86_64).
(lldb) r
Process 46709 launched: './node' (x86_64)
Process 46709 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGTERM
    frame #0: 0x00007fff5d6c9fce libsystem_kernel.dylib`__pthread_kill + 10
libsystem_kernel.dylib`__pthread_kill:
->  0x7fff5d6c9fce <+10>: jae    0x7fff5d6c9fd8            ; <+20>
    0x7fff5d6c9fd0 <+12>: movq   %rax, %rdi
    0x7fff5d6c9fd3 <+15>: jmp    0x7fff5d6c176c            ; cerror_nocancel
    0x7fff5d6c9fd8 <+20>: retq   
Target 0: (node) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGTERM
  * frame #0: 0x00007fff5d6c9fce libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff5d807150 libsystem_pthread.dylib`pthread_kill + 333
    frame #2: 0x00007fff5d5d88fe libsystem_c.dylib`raise + 26
    frame #3: 0x0000000100932568 node`uv__signal_global_init at signal.c:76 [opt]
    frame #4: 0x00007fff5d8049be libsystem_pthread.dylib`__pthread_once_handler + 66
    frame #5: 0x00007fff5d7f9f00 libsystem_platform.dylib`_os_once + 33
    frame #6: 0x00007fff5d804959 libsystem_pthread.dylib`pthread_once + 56
    frame #7: 0x0000000100936c2d node`uv_once(guard=<unavailable>, callback=<unavailable>) at thread.c:363 [opt]
    frame #8: 0x00000001009306fe node`uv_loop_init(loop=0x000000010162de40) at loop.c:70 [opt]
    frame #9: 0x000000010092ba13 node`uv_default_loop at uv-common.c:599 [opt]
    frame #10: 0x0000000100030f65 node`node::Init(argc=0x00007ffeefbff99c, argv=0x0000000102500370, exec_argc=0x00007ffeefbff998, exec_argv=0x00007ffeefbff990) at node.cc:4245 [opt]
    frame #11: 0x00000001000319e7 node`node::Start(argc=1, argv=0x0000000102500370) at node.cc:4625 [opt]
    frame #12: 0x0000000100001034 node`start + 52
(lldb) c
Process 46709 resuming
node was compiled with optimization - stepping may behave oddly; variables may not be available.
Process 46709 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x440)
    frame #0: 0x00000001000a1865 node`node::tracing::Agent::Stop(this=0x0000000000000000) at agent.cc:40 [opt]
   37  	}
   38  	
   39  	void Agent::Stop() {
-> 40  	  if (!started_)
   41  	    return;
   42  	  // Perform final Flush on TraceBuffer. We don't want the tracing controller
   43  	  // to flush the buffer again on destruction of the V8::Platform.
Target 0: (node) stopped.
(lldb) f 1
frame #1: 0x000000010002f345 node`node::SignalExit(int) [inlined] node::$_0::StopTracingAgent() at node.cc:326 [opt]
   323 	  }
   324 	
   325 	  void StopTracingAgent() {
-> 326 	    tracing_agent_->Stop();
   327 	  }
   328 	
   329 	  tracing::Agent* GetTracingAgent() const {
(lldb) p tracing_agent_
error: warning: Stopped in a C++ method, but 'this' isn't available; pretending we are in a generic context
error: use of undeclared identifier 'tracing_agent_'

Looks like the agent is simply unavailable at such an early point in bootup?

@gireeshpunathil
Member

This is fixed through #20637.
