Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xsnap crashes (SIGSEGV) in GC after snapshot; error stack bug; missing callback #3350

Closed
dckc opened this issue Jun 17, 2021 · 15 comments · Fixed by #3358
Closed

xsnap crashes (SIGSEGV) in GC after snapshot; error stack bug; missing callback #3350

dckc opened this issue Jun 17, 2021 · 15 comments · Fixed by #3358
Assignees
Labels
bug Something isn't working xsnap the XS execution tool

Comments

@dckc
Copy link
Member

dckc commented Jun 17, 2021

Description

The v11:bootstrap vat runs for a while until it calls gc() inside gcAndFinalize(), at which point it crashes with a SIGSEGV.

The following is an excerpt of a stack trace from #2273 (comment)

#0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:301
#1  0x000055cddbe73e0e in fxSweep (the=0x55cddc49e4e0)
    at /home/connolly/projects/agoric/agoric-sdk/packages/xsnap/makefiles/lin/../../moddable/xs/sources/xsMemory.c:1534
#2  0x000055cddbe70fdb in fxCollect (the=0x55cddc49e4e0, theFlag=1)
    at /home/connolly/projects/agoric/agoric-sdk/packages/xsnap/makefiles/lin/../../moddable/xs/sources/xsMemory.c:251
#3  0x000055cddbdfb538 in fxCollectGarbage (the=0x55cddc49e4e0)
    at /home/connolly/projects/agoric/agoric-sdk/packages/xsnap/makefiles/lin/../../moddable/xs/sources/xsAPI.c:1717
#4  0x000055cddbeff618 in fx_gc (the=0x55cddc49e4e0) at ../../src/xsnap.c:912
#5  0x000055cddbea336e in fxRunID (the=0x55cddc49e4e0, generator=0x7f23807663d0, count=0)
    at /home/connolly/projects/agoric/agoric-sdk/packages/xsnap/makefiles/lin/../../moddable/xs/sources/xsRun.c:835

As it turns out, the snapshots are not essential to the problem, but I originally discovered it a la...

$ agoric start --reset -v
...
SwingSet: kernel: delivery problem, terminating vat v11 v11:bootstrap exited due to signal SIGSEGV

vat v1:bank starts by loading the SES shim and supervisor and taking a snapshot; vat v11:bootstrap starts from that snapshot and runs for a while until it calls gc() inside gcAndFinalize(), at which point it crashes with a SIGSEGV.

To Reproduce: replay xsnap recording

You will need node 14 plus:

  1. the recording: xsnap-tests.zip
    • unzip in /tmp/
  2. the replay tool: xs-replay.zip
    • unzip somewhere; you'll get an xs-replay directory.
  3. an xsnap binary:
    • git clone https://github.com/Agoric/agoric-sdk/; cd agoric-sdk; yarn; yarn build
    • export AGORIC_SDK=${PWD}

Then:

$ cd xs-replay
$ make SDK=${AGORIC_SDK}
node --version | grep 14.
v14.16.0
mkdir -p build/bin/lin/debug
cp -rp /.../agoric-sdk/packages/xsnap/build/bin/lin/debug/xsnap build/bin/lin/debug/xsnap
mkdir -p build/bin/lin/release
cp -rp /.../agoric-sdk/packages/xsnap/build/bin/lin/release/xsnap build/bin/lin/release/xsnap
node replay-bin.cjs /tmp/xsnap-tests/1/ /tmp/xsnap-tests/3/
/tmp/xsnap-tests/1/ 00000-options.json : { os: 'Linux', name: 'v1:bank', debug: true, meteringLimit: 0 }
/tmp/xsnap-tests/1/ 1 evaluate
/tmp/xsnap-tests/1/ 2 evaluate
/tmp/xsnap-tests/1/ 3 snapshot
/tmp/xsnap-tests/1/ 00003-snapshot.dat ignoring remaining steps from /tmp/xsnap-tests/1/
/tmp/xsnap-tests/3/ 00000-options.json already spawned; ignoring: {
  os: 'Linux',
  name: 'v11:bootstrap',
  debug: true,
  snapshot: '/home/connolly/projects/agoric/dapp-2865/_agstate/agoric-servers/dev/fake-chain-sim-chain-state/xs-snapshots/c5d82df2198c546034d7df17565503156f6e8684be1eb243c507550b3c06c770-load-aNDEbV.xss',
  meteringLimit: 10000000
}
/tmp/xsnap-tests/3/ 1 evaluate
/tmp/xsnap-tests/3/ 2 issueCommand
/tmp/xsnap-tests/3/ 3 reply
/tmp/xsnap-tests/3/ 4 issueCommand
/tmp/xsnap-tests/3/ 5 reply
/tmp/xsnap-tests/3/ 6 reply
/tmp/xsnap-tests/3/ 7 reply
/tmp/xsnap-tests/3/ 8 issueCommand
/tmp/xsnap-tests/3/ 9 reply
/tmp/xsnap-tests/3/ 10 reply
/tmp/xsnap-tests/3/ 11 issueCommand
/tmp/xsnap-tests/3/ 12 reply
/tmp/xsnap-tests/3/ 13 reply
/tmp/xsnap-tests/3/ 14 reply
/tmp/xsnap-tests/3/ 15 reply
/tmp/xsnap-tests/3/ 16 reply
/tmp/xsnap-tests/3/ 17 reply
/tmp/xsnap-tests/3/ 18 reply
/tmp/xsnap-tests/3/ 19 reply
/tmp/xsnap-tests/3/ 20 reply
/tmp/xsnap-tests/3/ 21 reply
/tmp/xsnap-tests/3/ 22 reply
/tmp/xsnap-tests/3/ 23 reply
/tmp/xsnap-tests/3/ 24 reply
/tmp/xsnap-tests/3/ 25 issueCommand
/tmp/xsnap-tests/3/ 26 issueCommand
/tmp/xsnap-tests/3/ 27 issueCommand
/tmp/xsnap-tests/3/ 28 issueCommand
/tmp/xsnap-tests/3/ 29 issueCommand
/tmp/xsnap-tests/3/ 30 issueCommand
/tmp/xsnap-tests/3/ 31 reply
/tmp/xsnap-tests/3/ 32 reply
/tmp/xsnap-tests/3/ 33 reply
/tmp/xsnap-tests/3/ 34 reply
/tmp/xsnap-tests/3/ 35 reply
/tmp/xsnap-tests/3/ 36 reply
/tmp/xsnap-tests/3/ 37 reply
/tmp/xsnap-tests/3/ 38 reply
/tmp/xsnap-tests/3/ 39 issueCommand
/tmp/xsnap-tests/3/ 40 issueCommand
/tmp/xsnap-tests/3/ 41 issueCommand
/tmp/xsnap-tests/3/ 42 reply
/tmp/xsnap-tests/3/ 43 reply
/tmp/xsnap-tests/3/ 44 issueCommand
/tmp/xsnap-tests/3/ 45 reply
/tmp/xsnap-tests/3/ 46 reply
/tmp/xsnap-tests/3/ 47 issueCommand
/tmp/xsnap-tests/3/ 48 reply
/tmp/xsnap-tests/3/ 49 reply
/tmp/xsnap-tests/3/ 50 issueCommand
/tmp/xsnap-tests/3/ 51 issueCommand
/tmp/xsnap-tests/3/ 52 reply
/tmp/xsnap-tests/3/ 53 reply
/tmp/xsnap-tests/3/ 54 reply
/tmp/xsnap-tests/3/ 55 reply
/tmp/xsnap-tests/3/ 56 reply
/tmp/xsnap-tests/3/ 57 reply
/tmp/xsnap-tests/3/ 58 issueCommand
/tmp/xsnap-tests/3/ 59 issueCommand
/tmp/xsnap-tests/3/ 60 issueCommand
/tmp/xsnap-tests/3/ 61 reply
/tmp/xsnap-tests/3/ 62 reply
/tmp/xsnap-tests/3/ 63 reply
/tmp/xsnap-tests/3/ 64 reply
/tmp/xsnap-tests/3/ 65 reply
/tmp/xsnap-tests/3/ 66 reply
/tmp/xsnap-tests/3/ 67 reply
/tmp/xsnap-tests/3/ 68 reply
/tmp/xsnap-tests/3/ 69 reply
/tmp/xsnap-tests/3/ 70 reply
/tmp/xsnap-tests/3/ 71 reply
/tmp/xsnap-tests/3/ 72 reply
/tmp/xsnap-tests/3/ 73 reply
/tmp/xsnap-tests/3/ 74 reply
/tmp/xsnap-tests/3/ 75 reply
/tmp/xsnap-tests/3/ 76 reply
/tmp/xsnap-tests/3/ 77 issueCommand
/tmp/xsnap-tests/3/ 78 reply
/tmp/xsnap-tests/3/ 79 reply
/tmp/xsnap-tests/3/ 80 snapshot
/tmp/xsnap-tests/3/ 81 issueCommand
(node:28438) UnhandledPromiseRejectionWarning: ExitSignal: v1:bank exited due to signal SIGSEGV
    at ChildProcess.<anonymous> ...

Platform Environment

  • what OS are you using? linux
    • the problem has been reproduced on a mac
  • is there anything special/unusual about your platform? 2273-snapstore branch
  • what version of the Agoric-SDK are you using? git describe --tags --always says @agoric/access-token@0.4.1-36-g36c6a80ef

cc @phoddie @warner @dtribble @kriskowal @FUDCo @michaelfig @rowgraus

@dckc dckc added the bug Something isn't working label Jun 17, 2021
@dckc
Copy link
Member Author

dckc commented Jun 17, 2021

@phoddie I'm struggling to reproduce the problem from that wed9_23.zip recording.

wed9_23.zip

created using 002c2e0
based on https://github.com/agoric-labs/xsnap-pub/blob/c2cebe78c723ec4e7ebf452aac39e08aea889d3a/xsnap/sources/xsnap-ava.c

When I run moddable's xsnap replay tool, I get:

xsnap-tests/2021-06-16-21-03-39-091$ xsnap -r xs-snapshots/tmp-ZT5xPL.xss -t
### param-0.js
stack overflow (-1)!
stack overflow

For the replay tool, I'm using

Meanwhile, I have been working on adding recording and playback at the JS level for use with the C code we're currently using: #3338.

@dckc dckc self-assigned this Jun 17, 2021
@dckc dckc added the xsnap the XS execution tool label Jun 17, 2021
@dckc dckc added this to the Testnet: Stress Test Phase milestone Jun 17, 2021
@phoddie
Copy link

phoddie commented Jun 17, 2021

I'm not sure that a JavaScript recording & playback will help. That is almost certainly going to change the runtime behavior. That's why xsnap does recording and playback outside the VM.

@dckc
Copy link
Member Author

dckc commented Jun 17, 2021

This JS recording is outside the XS VM; it's JS on the node side. We drive the xsnap binary from a JS API on the node side. I added the recording / playback there.

It seems to work; I can reproduce the SIGSEGV this way:

vat3.zip
https://github.com/Agoric/agoric-sdk/blob/2273-snapstore/packages/xsnap/src/replay.js

~/projects/agoric/agoric-sdk/packages/xsnap
01:13 connolly@jambox$ node -r esm src/replay.js /tmp/xsnap-tests/3/
@@replay {
  storedOpts: {
    os: 'Linux',
    name: 'v11:bootstrap',
    debug: true,
    snapshot: '/home/connolly/projects/agoric/dapp-2865/_agstate/agoric-servers/dev/fake-chain-sim-chain-state/xs-snapshots/tmp-jSuiTl.xss',
    meteringLimit: 10000000
  }
}
@@replay { step: '00001-evaluate.dat', seq: 1, kind: 'evaluate' }
@@replay { step: '00002-issueCommand.dat', seq: 2, kind: 'issueCommand' }
@@replay { step: '00003-reply.dat', seq: 3, kind: 'reply' }
@@replay { step: '00004-issueCommand.dat', seq: 4, kind: 'issueCommand' }
@@replay { step: '00005-reply.dat', seq: 5, kind: 'reply' }
@@replay { step: '00006-reply.dat', seq: 6, kind: 'reply' }
@@replay { step: '00007-reply.dat', seq: 7, kind: 'reply' }
@@replay { step: '00008-issueCommand.dat', seq: 8, kind: 'issueCommand' }
@@replay { step: '00009-reply.dat', seq: 9, kind: 'reply' }
@@replay { step: '00010-reply.dat', seq: 10, kind: 'reply' }
@@replay { step: '00011-issueCommand.dat', seq: 11, kind: 'issueCommand' }
@@replay { step: '00012-reply.dat', seq: 12, kind: 'reply' }
@@replay { step: '00013-reply.dat', seq: 13, kind: 'reply' }
@@replay { step: '00014-reply.dat', seq: 14, kind: 'reply' }
@@replay { step: '00015-reply.dat', seq: 15, kind: 'reply' }
@@replay { step: '00016-reply.dat', seq: 16, kind: 'reply' }
@@replay { step: '00017-reply.dat', seq: 17, kind: 'reply' }
@@replay { step: '00018-reply.dat', seq: 18, kind: 'reply' }
@@replay { step: '00019-reply.dat', seq: 19, kind: 'reply' }
@@replay { step: '00020-reply.dat', seq: 20, kind: 'reply' }
@@replay { step: '00021-reply.dat', seq: 21, kind: 'reply' }
@@replay { step: '00022-reply.dat', seq: 22, kind: 'reply' }
@@replay { step: '00023-reply.dat', seq: 23, kind: 'reply' }
@@replay { step: '00024-reply.dat', seq: 24, kind: 'reply' }
@@replay { step: '00025-issueCommand.dat', seq: 25, kind: 'issueCommand' }
@@replay { step: '00026-issueCommand.dat', seq: 26, kind: 'issueCommand' }
@@replay { step: '00027-issueCommand.dat', seq: 27, kind: 'issueCommand' }
@@replay { step: '00028-issueCommand.dat', seq: 28, kind: 'issueCommand' }
@@replay { step: '00029-issueCommand.dat', seq: 29, kind: 'issueCommand' }
@@replay { step: '00030-issueCommand.dat', seq: 30, kind: 'issueCommand' }
@@replay { step: '00031-reply.dat', seq: 31, kind: 'reply' }
@@replay { step: '00032-reply.dat', seq: 32, kind: 'reply' }
@@replay { step: '00033-reply.dat', seq: 33, kind: 'reply' }
@@replay { step: '00034-reply.dat', seq: 34, kind: 'reply' }
@@replay { step: '00035-reply.dat', seq: 35, kind: 'reply' }
@@replay { step: '00036-reply.dat', seq: 36, kind: 'reply' }
@@replay { step: '00037-reply.dat', seq: 37, kind: 'reply' }
@@replay { step: '00038-reply.dat', seq: 38, kind: 'reply' }
@@replay { step: '00039-issueCommand.dat', seq: 39, kind: 'issueCommand' }
@@replay { step: '00040-issueCommand.dat', seq: 40, kind: 'issueCommand' }
@@replay { step: '00041-issueCommand.dat', seq: 41, kind: 'issueCommand' }
@@replay { step: '00042-reply.dat', seq: 42, kind: 'reply' }
@@replay { step: '00043-reply.dat', seq: 43, kind: 'reply' }
@@replay { step: '00044-issueCommand.dat', seq: 44, kind: 'issueCommand' }
@@replay { step: '00045-reply.dat', seq: 45, kind: 'reply' }
@@replay { step: '00046-reply.dat', seq: 46, kind: 'reply' }
@@replay { step: '00047-issueCommand.dat', seq: 47, kind: 'issueCommand' }
@@replay { step: '00048-reply.dat', seq: 48, kind: 'reply' }
@@replay { step: '00049-reply.dat', seq: 49, kind: 'reply' }
@@replay { step: '00050-issueCommand.dat', seq: 50, kind: 'issueCommand' }
@@replay { step: '00051-issueCommand.dat', seq: 51, kind: 'issueCommand' }
@@replay { step: '00052-reply.dat', seq: 52, kind: 'reply' }
@@replay { step: '00053-reply.dat', seq: 53, kind: 'reply' }
@@replay { step: '00054-reply.dat', seq: 54, kind: 'reply' }
@@replay { step: '00055-reply.dat', seq: 55, kind: 'reply' }
@@replay { step: '00056-reply.dat', seq: 56, kind: 'reply' }
@@replay { step: '00057-reply.dat', seq: 57, kind: 'reply' }
@@replay { step: '00058-issueCommand.dat', seq: 58, kind: 'issueCommand' }
@@replay { step: '00059-issueCommand.dat', seq: 59, kind: 'issueCommand' }
@@replay { step: '00060-issueCommand.dat', seq: 60, kind: 'issueCommand' }
@@replay { step: '00061-reply.dat', seq: 61, kind: 'reply' }
@@replay { step: '00062-reply.dat', seq: 62, kind: 'reply' }
@@replay { step: '00063-reply.dat', seq: 63, kind: 'reply' }
@@replay { step: '00064-reply.dat', seq: 64, kind: 'reply' }
@@replay { step: '00065-reply.dat', seq: 65, kind: 'reply' }
@@replay { step: '00066-reply.dat', seq: 66, kind: 'reply' }
@@replay { step: '00067-reply.dat', seq: 67, kind: 'reply' }
@@replay { step: '00068-reply.dat', seq: 68, kind: 'reply' }
@@replay { step: '00069-reply.dat', seq: 69, kind: 'reply' }
@@replay { step: '00070-reply.dat', seq: 70, kind: 'reply' }
@@replay { step: '00071-reply.dat', seq: 71, kind: 'reply' }
@@replay { step: '00072-reply.dat', seq: 72, kind: 'reply' }
@@replay { step: '00073-reply.dat', seq: 73, kind: 'reply' }
@@replay { step: '00074-reply.dat', seq: 74, kind: 'reply' }
@@replay { step: '00075-reply.dat', seq: 75, kind: 'reply' }
@@replay { step: '00076-reply.dat', seq: 76, kind: 'reply' }
@@replay { step: '00077-issueCommand.dat', seq: 77, kind: 'issueCommand' }
@@replay { step: '00078-reply.dat', seq: 78, kind: 'reply' }
@@replay { step: '00079-reply.dat', seq: 79, kind: 'reply' }
@@replay { step: '00080-snapshot.dat', seq: 80, kind: 'snapshot' }
@@replay { step: '00081-issueCommand.dat', seq: 81, kind: 'issueCommand' }
(node:17731) UnhandledPromiseRejectionWarning: ExitSignal: v11:bootstrap exited due to signal SIGSEGV
    at ChildProcess.<anonymous> (/home/connolly/projects/agoric/agoric-sdk/packages/xsnap/src/xsnap.js:116:22)
    at ChildProcess.emit (events.js:210:5)
    at ChildProcess.EventEmitter.emit (domain.js:475:20)
    at Process.ChildProcess._handle.onexit (internal/child_process.js:272:12)

@dckc
Copy link
Member Author

dckc commented Jun 17, 2021

I extracted the replay tool from the rest of the agoric sdk. node 14 should be the only dependency (tested on node 14.6 and node 14.7 I think):

it runs like this:

make -k 
node --version | grep 14.
v14.16.0
node replay-bin.cjs /tmp/xsnap-tests/3/
... as above
@@replay { step: '00079-reply.dat', seq: 79, kind: 'reply' }
@@replay { step: '00080-snapshot.dat', seq: 80, kind: 'snapshot' }
@@replay { step: '00081-issueCommand.dat', seq: 81, kind: 'issueCommand' }
(node:12681) UnhandledPromiseRejectionWarning: ExitSignal: v11:bootstrap exited due to signal SIGSEGV
    at ChildProcess.<anonymous> (file:///home/connolly/Desktop/xs-replay/src/xsnap.js:116:22)
    at ChildProcess.emit (events.js:315:20)
    at Process.ChildProcess._handle.onexit (internal/child_process.js:277:12)
(Use `node --trace-warnings ...` to show where the warning was created)
(node:12681) UnhandledPromiseRejectionWarning: ...

@phoddie
Copy link

phoddie commented Jun 17, 2021

This JS recording is outside the XS VM; it's JS on the node side

Thank you for the clarification. When I saw "JavaScript" I assumed it was in XS.

@dckc
Copy link
Member Author

dckc commented Jun 17, 2021

bytecode trace of the replay shows FERAL_EVAL, part of the SES bootstrap

I turned on mxTrace and got a 4,153,132 line bytecode trace:

It ends with:

...

204: get_variable [setImmediate] 0 3
222: begin_strict
222: reserve 6
228: retrieve 4 [FERAL_EVAL] [localObject] [reflectGet] [globalObject]
228: new_local [_shadow] 4
228: new_local [prop] 5
228: file [#0]
228: line 3782
228: argument 0
229: var_local_1 4
229: pop
228: argument 1
229: var_local_1 5
229: pop
228: get_local 5
229: typeof
229: string "symbol"
230: strict_equal
229: branch_else_2
228: line 3790
228: get_local 5
229: string "eval"
230: strict_equal
229: branch_else_2
228: line 3799
228: get_local 5
229: get_closure 1
230: in
229: branch_else_2
228: line 3816
228: undefined
229: get_closure 2
230: call
234: get_closure 3
235: get_local 5
236: run_tail_1 2

204: run_1 1 2

198: await
 1 0
205: begin_strict
205: reserve 1
206: retrieve 1 [value]
206: file [#0]
206: line 7820
206: get_closure 0
207: result
206: end



 1
 1 1 [?]

198: branch_status_2
198: pop
197: line 4608
197: undefined
198: get_closure 0
199: call
203: run_1 0(node:19077) UnhandledPromiseRejectionWarning: ExitSignal: v11:bootstrap exited due to signal SIGSEGV

FERAL_EVAL doesn't occur in xsnap-tests/3/; it only occurs in xsnap-tests/1/:

$ grep -r FERAL_EVAL /tmp/xsnap-tests/
/tmp/xsnap-tests/1/00001-evaluate.dat:  const FERAL_EVAL = eval; /**
/tmp/xsnap-tests/1/00001-evaluate.dat:            this.useUnsafeEvaluator = false;return FERAL_EVAL;

I'm attaching all of /tmp/xsnap-tests/ as xsnap-tests.zip.

Trace 1 generates snapshot used to start trace 3

I clarified the names of the intermediate snapshot files in 7057981 ; now I can see what's going on:

The /tmp/xsnap-tests/3/ trace that leads to the SIGSEGV is from an xsnap process called v11:bootstrap. Note that it starts life by reading a snapshot:

$ jq . /tmp/xsnap-tests/3/00000-options.json 
{
  "os": "Linux",
  "name": "v11:bootstrap",
  "debug": true,
  "snapshot": "/home/connolly/projects/agoric/dapp-2865/_agstate/agoric-servers/dev/fake-chain-sim-chain-state/xs-snapshots/c5d82df2198c546034d7df17565503156f6e8684be1eb243c507550b3c06c770-load-aNDEbV.xss",
  "meteringLimit": 10000000
}

So where did that snapshot come from? Using xs-snapshots.zip along with xsnap-tests/ we can see:

$ ls xs-snapshots/c5d82df*save*
xs-snapshots/c5d82df2198c546034d7df17565503156f6e8684be1eb243c507550b3c06c770-save-raw-9Caj6D.xss

$ grep -r save-raw-9Caj6D xsnap-tests/
xsnap-tests/1/00003-snapshot.dat:/home/connolly/projects/agoric/dapp-2865/_agstate/agoric-servers/dev/fake-chain-sim-chain-state/xs-snapshots/save-raw-9Caj6D.xss

This xsnap-tests/1/ trace comes from an xsnap process that started without a snapshot; so now we should have the whole execution trace leading up to the SIGSEGV:

$ jq . xsnap-tests/1/00000-options.json 
{
  "os": "Linux",
  "name": "v1:bank",
  "debug": true,
  "meteringLimit": 0
}

I don't yet have a handy way to combine xsnap-tests/1/ and xsnap-tests/3/ to replay them all together.

@dckc
Copy link
Member Author

dckc commented Jun 17, 2021

SEGV is in gc()?

Looks like we're in 00002-evaluate.dat, not 00001-evaluate.dat:

$ grep -nr setImmediate /tmp/xsnap-tests/
...
/tmp/xsnap-tests/1/00002-evaluate.dat:4527:/* global setImmediate */
/tmp/xsnap-tests/1/00002-evaluate.dat:4605:    await new Promise(setImmediate);
/tmp/xsnap-tests/1/00002-evaluate.dat:4607:    await new Promise(setImmediate);

The line 4608 seems to correlate with https://github.com/Agoric/agoric-sdk/blob/master/packages/SwingSet/src/gc-and-finalize.js#L82 , i.e.

function makeGcAndFinalize(gcPower) {
  if (typeof gcPower !== 'function') {
    console.warn(
    Error(`no gcPower() function; skipping finalizer provocation`));

  }
  return async function gcAndFinalize() {
    if (typeof gcPower !== 'function') {
      return;
    }

    /* on Node.js, GC seems to work better if the promise queue is empty first*/
    await new Promise(setImmediate);
    /* on xsnap, we must do it twice for some reason*/
    await new Promise(setImmediate);
    gcPower();  /// HERE
    /* this gives finalizers a chance to run*/
    await new Promise(setImmediate);
  };
}

@dckc
Copy link
Member Author

dckc commented Jun 17, 2021

@FUDCo managed to reproduce the SIGSEGV on a mac, with a little fiddling around with directory paths recorded on my machine that had to be adjusted for his.

I think I can teach the replay tool to ignore the differences. I'm also interested to see if we can skip the snapshots and reproduce the SIGSEGV just by running the steps in xsnap-tests/1/ and then xsnap-tests/3/.

@dckc
Copy link
Member Author

dckc commented Jun 17, 2021

I'm also interested to ... reproduce the SIGSEGV just by running the steps in xsnap-tests/1/ and then xsnap-tests/3/.

Done in 74019f6 / xs-replay.zip. Now I can clean up the issue description with a concise description of how to reproduce it.

Should work on mac too.

make
node --version | grep 14.
v14.16.0
mkdir -p build/bin/lin/debug
cp -rp /home/connolly/projects/agoric/agoric-sdk/packages/xsnap/build/bin/lin/debug/xsnap build/bin/lin/debug/xsnap
mkdir -p build/bin/lin/release
cp -rp /home/connolly/projects/agoric/agoric-sdk/packages/xsnap/build/bin/lin/release/xsnap build/bin/lin/release/xsnap
node replay-bin.cjs /tmp/xsnap-tests/1/ /tmp/xsnap-tests/3/
/tmp/xsnap-tests/1/ 00000-options.json : { os: 'Linux', name: 'v1:bank', debug: true, meteringLimit: 0 }
/tmp/xsnap-tests/1/ 1 evaluate
/tmp/xsnap-tests/1/ 2 evaluate
/tmp/xsnap-tests/1/ 3 snapshot
/tmp/xsnap-tests/1/ 00003-snapshot.dat ignoring remaining steps from /tmp/xsnap-tests/1/
/tmp/xsnap-tests/3/ 00000-options.json already spawned; ignoring: {
  os: 'Linux',
  name: 'v11:bootstrap',
  debug: true,
  snapshot: '/home/connolly/projects/agoric/dapp-2865/_agstate/agoric-servers/dev/fake-chain-sim-chain-state/xs-snapshots/c5d82df2198c546034d7df17565503156f6e8684be1eb243c507550b3c06c770-load-aNDEbV.xss',
  meteringLimit: 10000000
}
/tmp/xsnap-tests/3/ 1 evaluate
/tmp/xsnap-tests/3/ 2 issueCommand
/tmp/xsnap-tests/3/ 3 reply
/tmp/xsnap-tests/3/ 4 issueCommand
/tmp/xsnap-tests/3/ 5 reply
/tmp/xsnap-tests/3/ 6 reply
/tmp/xsnap-tests/3/ 7 reply
/tmp/xsnap-tests/3/ 8 issueCommand
/tmp/xsnap-tests/3/ 9 reply
/tmp/xsnap-tests/3/ 10 reply
/tmp/xsnap-tests/3/ 11 issueCommand
/tmp/xsnap-tests/3/ 12 reply
/tmp/xsnap-tests/3/ 13 reply
/tmp/xsnap-tests/3/ 14 reply
/tmp/xsnap-tests/3/ 15 reply
/tmp/xsnap-tests/3/ 16 reply
/tmp/xsnap-tests/3/ 17 reply
/tmp/xsnap-tests/3/ 18 reply
/tmp/xsnap-tests/3/ 19 reply
/tmp/xsnap-tests/3/ 20 reply
/tmp/xsnap-tests/3/ 21 reply
/tmp/xsnap-tests/3/ 22 reply
/tmp/xsnap-tests/3/ 23 reply
/tmp/xsnap-tests/3/ 24 reply
/tmp/xsnap-tests/3/ 25 issueCommand
/tmp/xsnap-tests/3/ 26 issueCommand
/tmp/xsnap-tests/3/ 27 issueCommand
/tmp/xsnap-tests/3/ 28 issueCommand
/tmp/xsnap-tests/3/ 29 issueCommand
/tmp/xsnap-tests/3/ 30 issueCommand
/tmp/xsnap-tests/3/ 31 reply
/tmp/xsnap-tests/3/ 32 reply
/tmp/xsnap-tests/3/ 33 reply
/tmp/xsnap-tests/3/ 34 reply
/tmp/xsnap-tests/3/ 35 reply
/tmp/xsnap-tests/3/ 36 reply
/tmp/xsnap-tests/3/ 37 reply
/tmp/xsnap-tests/3/ 38 reply
/tmp/xsnap-tests/3/ 39 issueCommand
/tmp/xsnap-tests/3/ 40 issueCommand
/tmp/xsnap-tests/3/ 41 issueCommand
/tmp/xsnap-tests/3/ 42 reply
/tmp/xsnap-tests/3/ 43 reply
/tmp/xsnap-tests/3/ 44 issueCommand
/tmp/xsnap-tests/3/ 45 reply
/tmp/xsnap-tests/3/ 46 reply
/tmp/xsnap-tests/3/ 47 issueCommand
/tmp/xsnap-tests/3/ 48 reply
/tmp/xsnap-tests/3/ 49 reply
/tmp/xsnap-tests/3/ 50 issueCommand
/tmp/xsnap-tests/3/ 51 issueCommand
/tmp/xsnap-tests/3/ 52 reply
/tmp/xsnap-tests/3/ 53 reply
/tmp/xsnap-tests/3/ 54 reply
/tmp/xsnap-tests/3/ 55 reply
/tmp/xsnap-tests/3/ 56 reply
/tmp/xsnap-tests/3/ 57 reply
/tmp/xsnap-tests/3/ 58 issueCommand
/tmp/xsnap-tests/3/ 59 issueCommand
/tmp/xsnap-tests/3/ 60 issueCommand
/tmp/xsnap-tests/3/ 61 reply
/tmp/xsnap-tests/3/ 62 reply
/tmp/xsnap-tests/3/ 63 reply
/tmp/xsnap-tests/3/ 64 reply
/tmp/xsnap-tests/3/ 65 reply
/tmp/xsnap-tests/3/ 66 reply
/tmp/xsnap-tests/3/ 67 reply
/tmp/xsnap-tests/3/ 68 reply
/tmp/xsnap-tests/3/ 69 reply
/tmp/xsnap-tests/3/ 70 reply
/tmp/xsnap-tests/3/ 71 reply
/tmp/xsnap-tests/3/ 72 reply
/tmp/xsnap-tests/3/ 73 reply
/tmp/xsnap-tests/3/ 74 reply
/tmp/xsnap-tests/3/ 75 reply
/tmp/xsnap-tests/3/ 76 reply
/tmp/xsnap-tests/3/ 77 issueCommand
/tmp/xsnap-tests/3/ 78 reply
/tmp/xsnap-tests/3/ 79 reply
/tmp/xsnap-tests/3/ 80 snapshot
/tmp/xsnap-tests/3/ 81 issueCommand
(node:27458) UnhandledPromiseRejectionWarning: ExitSignal: v1:bank exited due to signal SIGSEGV
    at ChildProcess.<anonymous> (file:///home/connolly/Desktop/xs-replay/src/xsnap.js:116:22)
    at ChildProcess.emit (events.js:315:20)
    at Process.ChildProcess._handle.onexit (internal/child_process.js:277:12)
(Use `node --trace-warnings ...` to show where the warning was created)
(node:27458) UnhandledPromiseRejectionWarning: ...

@dckc
Copy link
Member Author

dckc commented Jun 17, 2021

@phoddie I verified the fix Patrick sent this morning. Thanks!

I look forward to seeing it in https://github.com/Moddable-OpenSource/moddable ; meanwhile, I checked it in as agoric-labs/moddable@e365a89

@phoddie
Copy link

phoddie commented Jun 17, 2021

Excellent news. Thank you for the confirmation.

@dckc
Copy link
Member Author

dckc commented Jun 17, 2021

@phoddie agoric start gets further now, but it runs into another SIGSEGV.

recording: segv611.zip

$ node -r esm src/replay.js /tmp/xsnap-tests/1/ /tmp/xsnap-tests/14/
/tmp/xsnap-tests/1/ 00000-options.json : { os: 'Linux', name: 'v1:bank', debug: false, meteringLimit: 0 }
...
/tmp/xsnap-tests/14/ 157 reply
/tmp/xsnap-tests/14/ 158 reply
ExitSignal: v1:bank exited due to signal SIGSEGV
    at ChildProcess.<anonymous> (/home/connolly/projects/agoric/agoric-sdk/packages/xsnap/src/xsnap.js:116:22)
    at ChildProcess.emit (events.js:210:5)
    at ChildProcess.EventEmitter.emit (domain.js:475:20)
    at Process.ChildProcess._handle.onexit (internal/child_process.js:272:12) {
  name: 'ExitSignal',
  code: 'SIGSEGV'
}

gdb shows it deep in GC:

#0  0x0000559f98eb301c in fxMarkValue ()
#1  0x0000559f98eb39fb in fxMarkValue ()
#2  0x0000559f98eb39bb in fxMarkValue ()
...
#70 0x0000559f98eb39bb in fxMarkValue ()
#71 0x0000559f98eb49cf in fxMark.constprop.7 ()
#72 0x0000559f98eb4fae in fxCollect.part.4 ()
#73 0x0000559f98ed1978 in fxRunID ()
#74 0x0000559f98e94f64 in fxStepAsync ()
#75 0x0000559f98ed1978 in fxRunID ()
#76 0x0000559f98ec26bc in fxOnResolvedPromise ()
#77 0x0000559f98ed1978 in fxRunID ()
#78 0x0000559f98ec6db6 in fxRunPromiseJobs ()
#79 0x0000559f98f14e9e in fxRunLoop ()
#80 0x0000559f98e3deb0 in main ()

I tried to run in debug mode to get a better stack trace, but I ran into another mystery:

$ XSNAP_DEBUG=1 XSNAP_TEST_RECORD=1 agoric start -v --reset
...
cannot write snapshot /home/connolly/projects/agoric/dapp-2865/_agstate/agoric-servers/dev/fake-chain-sim-chain-state/xs-snapshots/save-raw-01LKNs.xss: Operation now in progress
error running ag-solo: (ExitCode#1)

This is the same vat (xsnap-tests/) that crashes in non-debug mode"

$ grep -r save-raw-01LKNs.xss /tmp/xsnap-tests/
/tmp/xsnap-tests/14/00126-snapshot.dat:/home/connolly/projects/agoric/dapp-2865/_agstate/agoric-servers/dev/fake-chain-sim-chain-state/xs-snapshots/save-raw-01LKNs.xss

@dckc
Copy link
Member Author

dckc commented Jun 18, 2021

To diagonse that Operation now in progress, I tried #include <libexplain/fwrite.h>; it showed size = -2147483635 among the args to fwrite. So I added a thing to abort() on writes larger than 16M. It hit size=0xffffffff8000000d and gave me a stack trace:

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007ff6f4252921 in __GI_abort () at abort.c:79
#2  0x0000563758f5a3fc in fxSnapshotWrite (stream=0x563759b0c620, address=0x7ff6f1210640, size=18446744071562067981)
    at ../../src/xsnap.c:159
#3  0x0000563758f2e69e in fxWriteChunkData (the=0x563759b054e0, snapshot=0x7ffee85e3010, address=0x7ff6f1210650)
    at /home/connolly/projects/agoric/agoric-sdk/packages/xsnap/makefiles/lin/../../moddable/xs/sources/xsSnapshot.c:1141
#4  0x0000563758f2e4d4 in fxWriteChunk (the=0x563759b054e0, snapshot=0x7ffee85e3010, slot=0x7ff6f186ba50)
    at /home/connolly/projects/agoric/agoric-sdk/packages/xsnap/makefiles/lin/../../moddable/xs/sources/xsSnapshot.c:1099
#5  0x0000563758f2e901 in fxWriteChunks (the=0x563759b054e0, snapshot=0x7ffee85e3010)
    at /home/connolly/projects/agoric/agoric-sdk/packages/xsnap/makefiles/lin/../../moddable/xs/sources/xsSnapshot.c:1180
#6  0x0000563758f2ff3e in fxWriteSnapshot (the=0x563759b054e0, snapshot=0x7ffee85e3010)
    at /home/connolly/projects/agoric/agoric-sdk/packages/xsnap/makefiles/lin/../../moddable/xs/sources/xsSnapshot.c:1520
#7  0x0000563758f5c508 in main (argc=6, argv=0x7ffee85e3888) at ../../src/xsnap.c:536

around xsSnapshot.c:1099 is:

	case XS_KEY_KIND:
		if (slot->value.key.string)
			fxWriteChunkData(the, snapshot, slot->value.key.string);
		break;

@dckc
Copy link
Member Author

dckc commented Jun 18, 2021

Error stack fix confirmed

@phoddie I can confirm that the segv611.zip bug is fixed by the error stack fix.

agoric-labs/moddable@31364f0

next problem: unknown callback! (trace at 14:32)

xsnap-traces-2021-06-18T14:32-05:00.zip

To replay multiple trace, be sure to use the updated xs-replay.zip from 74019f6 above.

node replay-bin.cjs /tmp/xsnap-tests/1/ /tmp/xsnap-tests/2/
/tmp/xsnap-tests/1/ 00000-options.json : { os: 'Linux', name: 'v1:bank', debug: false, meteringLimit: 0 }
/tmp/xsnap-tests/1/ 1 evaluate
/tmp/xsnap-tests/1/ 2 evaluate
/tmp/xsnap-tests/1/ 3 snapshot
/tmp/xsnap-tests/1/ 00003-snapshot.dat ignoring remaining steps from /tmp/xsnap-tests/1/
/tmp/xsnap-tests/2/ 00000-options.json already spawned; ...
...
/tmp/xsnap-tests/2/ 43 reply
/tmp/xsnap-tests/2/ 44 snapshot
# snapshot: unknown callback!
cannot write snapshot /home/connolly/projects/agoric/dapp-2865/_agstate/agoric-servers/dev/fake-chain-sim-chain-state/xs-snapshots/save-raw-D6fz47.xss: Invalid argument

For (my) reference, the context where I encountered it was starting a chain:

$ XSNAP_TEST_RECORD=1 agoric start -v --reset
...
2021-06-18T19:24:04.637Z ag-solo: init: ag-solo initialized in dev
...
2021-06-18T19:25:33.583Z SwingSet: kernel: @@DEBUG cannot make snapshot: { snapStore: true, makeSnapshot: false }
# snapshot: unknown callback!
cannot write snapshot /home/connolly/projects/agoric/dapp-2865/_agstate/agoric-servers/dev/fake-chain-sim-chain-state/xs-snapshots/save-raw-D6fz47.xss: Invalid argument
2021-06-18T19:25:33.603Z fake-chain: (ExitCode#1)
ExitCode#1: v2:board exited: I/O error

@dckc
Copy link
Member Author

dckc commented Jun 18, 2021

next problem: unknown callback! (trace at 14:32)

Diagnosis: get a core dump (add abort() before the unknown callback! assertion in xsSnapshot.c) and find:

(gdb) p callback
$1 = (txCallback) 0x5561ba341693 <fx_Function_prototype_bound>

Then add fx_Function_prototype_bound to gxCallbacks:
agoric-labs/moddable@68127cc

agoric start completes start-up!

$ XSNAP_TEST_RECORD=1 agoric start -v --reset
...
2021-06-18T20:25:43.970Z ag-solo: init: ag-solo initialized in dev
...
2021-06-18T20:27:21.916Z outbound: invoking deliverator; 0 new messages for sim-chain
Deployed Wallet!

TODO: update moddable submodule to 68127cc

Then we can close this.

p.s. or rather: to 4067acf52095

@dckc dckc changed the title agoric start crashes with SIGSEGV on 2273-snapstore branch xsnap crashes (SIGSEGV) in GC after snapshot; error stack bug; missing callback Jun 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working xsnap the XS execution tool
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants