Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate test-http2-createwritereq failure on Linux and AIX #17840

Closed
Trott opened this issue Dec 23, 2017 · 18 comments
Closed

Investigate test-http2-createwritereq failure on Linux and AIX #17840

Trott opened this issue Dec 23, 2017 · 18 comments
Labels
flaky-test Issues and PRs related to the tests with unstable failures on the CI. http2 Issues or PRs related to the http2 subsystem.

Comments

@Trott
Copy link
Member

Trott commented Dec 23, 2017

  • Version: 10.0.0-pre
  • Platform: aix61-ppc64
  • Subsystem: test http2

https://ci.nodejs.org/job/node-test-commit-aix/11376/nodes=aix61-ppc64/console

not ok 934 parallel/test-http2-createwritereq
  ---
  duration_ms: 1.818
  severity: crashed
  stack: |-
    oh no!
    exit code: CRASHED (Signal: 11)
@Trott Trott added aix Issues and PRs related to the AIX platform. http2 Issues or PRs related to the http2 subsystem. test Issues and PRs related to the tests. labels Dec 23, 2017
@Trott
Copy link
Member Author

Trott commented Dec 23, 2017

I guess I'll start by pinging the test author: @apapirovski

@Trott
Copy link
Member Author

Trott commented Dec 23, 2017

Signal 11 = segmentation fault. I'm guessing AIX on CI is not configured to preserve core files, but let's ask. Ping @nodejs/build.

@Trott
Copy link
Member Author

Trott commented Dec 23, 2017

And ping @nodejs/platform-aix for additional troubleshooting and/or suggestions on how to proceed.

@addaleax
Copy link
Member

Is there any way to run stress tests at a given git SHA or something like that? #17406 or #17718 would seem like possible causes to me

@Trott
Copy link
Member Author

Trott commented Dec 23, 2017

Is there any way to run stress tests at a given git SHA or something like that? #17406 or #17718 would seem like possible causes to me

In the past, I've created branches at specific SHAs, pushed them to my fork (possibly to my master branch, I can't remember if I got it to work off another branch or not), and run the stress test off of my fork.

@Trott
Copy link
Member Author

Trott commented Dec 23, 2017

I guess it might make sense to run a stress test against master to see if this is reproducible.

Running it serially: https://ci.nodejs.org/job/node-stress-single-test/1574/nodes=aix61-ppc64/

Running it in parallel (96 processes): https://ci.nodejs.org/job/node-stress-single-test/1575/nodes=aix61-ppc64/

@gireeshpunathil
Copy link
Member

Ran a 2000 times locally with no luck on recreate. Either wait for @apapirovski to seek hints on possible causes, or gain login access to CI. /cc @mhdawson @gibfahn

The test does not seem to use too much of memory (few bytes of data transported through http2 and validate it passes through properly with differing encodings) which rules out native memory constraints as a source of trouble (which was a known difference between CI and local AIX boxes that led to some failures.)

@Trott
Copy link
Member Author

Trott commented Dec 24, 2017

Reproduced it in CI stress tests, so at least it's reproducible in that environment. Running serially, it got 31 failures in 9999 runs:

9999   OK: 9968   NOT OK: 31   TOTAL: 9999
+ '[' 31 '!=' 0 ']'
+ echo The test is flaky.
The test is flaky.
+ exit 1
Build step 'Conditional steps (multiple)' marked build as failure
Notifying upstream projects of job completion
Finished: FAILURE

...so running under load is not a precondition for failure.

@apapirovski
Copy link
Member

I don't think that test does anything particularly weird. If I had to guess, other http2 tests that write stuff might be flaky on that platform too. It's reasonably likely that the recent changes to http2 are responsible. I'll try to have a look when I have time but as @addaleax mentioned, a good place to start is running a couple of stress tests against master rewound to the two PRs above.

@gireeshpunathil
Copy link
Member

thanks @apapirovski . @Trott - is it (artificially rewinding the two PRs and running stress on it) something which can be done through the existing build jobs or custom scripts?

In terms of isolation: both the PRs contain considerable amount of code changes that makes it difficult for me to manually locate the crash reason. On the other hand, if we isolate the issue into one of these PRs, still the failure reason needs to come bottom up (starting with the crashing context). So I prefer (in my way of problem determination) performing core dump analysis directly in the CI machine.

@Trott
Copy link
Member Author

Trott commented Dec 24, 2017

❌ = failure, ✅ = success

❌ Stress test rewound to e0e6b68 (right after #11718 landed): https://ci.nodejs.org/job/node-stress-single-test/1576/nodes=aix61-ppc64/

✅ Stress test rewound to e3b44b6 (one commit before the above so that it's just the first of the two commits that landed in #11718): https://ci.nodejs.org/job/node-stress-single-test/1577/nodes=aix61-ppc64/

✅ Stress test rewound to e554bc8 (right before #11718): https://ci.nodejs.org/job/node-stress-single-test/1579/nodes=aix61-ppc64/

@Trott
Copy link
Member Author

Trott commented Dec 24, 2017

@gireeshpunathil Stress tests aren't done running yet, but it sure looks like the issue is probably with e0e6b68. (Stress test at that commit failed, and the stress test at each of the two commits before it are both clean so far....

@Trott
Copy link
Member Author

Trott commented Dec 25, 2017

Stress tests are done and the evidence does point to e0e6b68....

@maclover7 maclover7 added the flaky-test Issues and PRs related to the tests with unstable failures on the CI. label Dec 25, 2017
@gireeshpunathil
Copy link
Member

thanks @Trott , for the info

@addaleax addaleax removed the aix Issues and PRs related to the AIX platform. label Dec 25, 2017
@addaleax
Copy link
Member

Btw, I could reproduce this locally on Linux, so it isn’t AIX-specific … it’s my patch that caused this so I’ll try to figure out what’s going on here :)

@gireeshpunathil
Copy link
Member

thanks @addaleax , that is good news as in we don't need to access CI. How frequent is the crash? similar to that of AIX?

@addaleax
Copy link
Member

It’s less than 1 in 1000 – I think that’s about the same rate.

From some initial poking around, it seems like one of the Http2Session instances is garbage collected while it is handling data from the socket (i.e. inside nghttp2_session_mem_recv()).

@Trott Trott changed the title Investigate test-http2-createwritereq failure on AIX Investigate test-http2-createwritereq failure on Linux and AIX Dec 25, 2017
addaleax added a commit to addaleax/node that referenced this issue Dec 25, 2017
Keep a local handle as a reference to the JS `Http2Session`
object so that it will not be garbage collected
when inside an `Http2Scope`, because the presence of the
latter usually indicates that further actions on
the session object are expected.

Strictly speaking, storing the `session_handle_` as a
property on the scope object is not necessary, but
this is not very costly and makes the code more
obviously correct.

Fixes: nodejs#17840
@addaleax
Copy link
Member

addaleax commented Dec 25, 2017

Reproducible test case + fix @ #17863

removing the test label here because it’s not an issue with the test itself

@addaleax addaleax added flaky-test Issues and PRs related to the tests with unstable failures on the CI. and removed flaky-test Issues and PRs related to the tests with unstable failures on the CI. test Issues and PRs related to the tests. labels Dec 25, 2017
jasnell pushed a commit that referenced this issue Dec 28, 2017
Keep a local handle as a reference to the JS `Http2Session`
object so that it will not be garbage collected
when inside an `Http2Scope`, because the presence of the
latter usually indicates that further actions on
the session object are expected.

Strictly speaking, storing the `session_handle_` as a
property on the scope object is not necessary, but
this is not very costly and makes the code more
obviously correct.

Fixes: #17840

PR-URL: #17863
Fixes: #17840
Reviewed-By: Anatoli Papirovski <apapirovski@mac.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
MylesBorins pushed a commit to jasnell/node that referenced this issue Jan 9, 2018
PR-URL: nodejs#17863
Fixes: nodejs#17840
Reviewed-By: Anatoli Papirovski <apapirovski@mac.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
MylesBorins pushed a commit to jasnell/node that referenced this issue Jan 9, 2018
Keep a local handle as a reference to the JS `Http2Session`
object so that it will not be garbage collected
when inside an `Http2Scope`, because the presence of the
latter usually indicates that further actions on
the session object are expected.

Strictly speaking, storing the `session_handle_` as a
property on the scope object is not necessary, but
this is not very costly and makes the code more
obviously correct.

Fixes: nodejs#17840

PR-URL: nodejs#17863
Fixes: nodejs#17840
Reviewed-By: Anatoli Papirovski <apapirovski@mac.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
MylesBorins pushed a commit that referenced this issue Jan 9, 2018
Backport-PR-URL: #18050
PR-URL: #17863
Fixes: #17840
Reviewed-By: Anatoli Papirovski <apapirovski@mac.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
MylesBorins pushed a commit that referenced this issue Jan 9, 2018
Keep a local handle as a reference to the JS `Http2Session`
object so that it will not be garbage collected
when inside an `Http2Scope`, because the presence of the
latter usually indicates that further actions on
the session object are expected.

Strictly speaking, storing the `session_handle_` as a
property on the scope object is not necessary, but
this is not very costly and makes the code more
obviously correct.

Fixes: #17840

Backport-PR-URL: #18050
PR-URL: #17863
Fixes: #17840
Reviewed-By: Anatoli Papirovski <apapirovski@mac.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
MylesBorins pushed a commit that referenced this issue Jan 10, 2018
Backport-PR-URL: #18050
PR-URL: #17863
Fixes: #17840
Reviewed-By: Anatoli Papirovski <apapirovski@mac.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
MylesBorins pushed a commit that referenced this issue Jan 10, 2018
Keep a local handle as a reference to the JS `Http2Session`
object so that it will not be garbage collected
when inside an `Http2Scope`, because the presence of the
latter usually indicates that further actions on
the session object are expected.

Strictly speaking, storing the `session_handle_` as a
property on the scope object is not necessary, but
this is not very costly and makes the code more
obviously correct.

Fixes: #17840

Backport-PR-URL: #18050
PR-URL: #17863
Fixes: #17840
Reviewed-By: Anatoli Papirovski <apapirovski@mac.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
kjin pushed a commit to kjin/node that referenced this issue May 1, 2018
Backport-PR-URL: nodejs#18050
PR-URL: nodejs#17863
Fixes: nodejs#17840
Reviewed-By: Anatoli Papirovski <apapirovski@mac.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
kjin pushed a commit to kjin/node that referenced this issue May 1, 2018
Keep a local handle as a reference to the JS `Http2Session`
object so that it will not be garbage collected
when inside an `Http2Scope`, because the presence of the
latter usually indicates that further actions on
the session object are expected.

Strictly speaking, storing the `session_handle_` as a
property on the scope object is not necessary, but
this is not very costly and makes the code more
obviously correct.

Fixes: nodejs#17840

Backport-PR-URL: nodejs#18050
PR-URL: nodejs#17863
Fixes: nodejs#17840
Reviewed-By: Anatoli Papirovski <apapirovski@mac.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
MylesBorins pushed a commit that referenced this issue May 2, 2018
Backport-PR-URL: #18050
Backport-PR-URL: #20456
PR-URL: #17863
Fixes: #17840
Reviewed-By: Anatoli Papirovski <apapirovski@mac.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
MylesBorins pushed a commit that referenced this issue May 2, 2018
Keep a local handle as a reference to the JS `Http2Session`
object so that it will not be garbage collected
when inside an `Http2Scope`, because the presence of the
latter usually indicates that further actions on
the session object are expected.

Strictly speaking, storing the `session_handle_` as a
property on the scope object is not necessary, but
this is not very costly and makes the code more
obviously correct.

Fixes: #17840

Backport-PR-URL: #18050
Backport-PR-URL: #20456
PR-URL: #17863
Fixes: #17840
Reviewed-By: Anatoli Papirovski <apapirovski@mac.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flaky-test Issues and PRs related to the tests with unstable failures on the CI. http2 Issues or PRs related to the http2 subsystem.
Projects
None yet
Development

No branches or pull requests

5 participants