
Revert #17914 #18238

Closed · wants to merge 6 commits
Conversation

joyeecheung (Member)

See if this fixes #18190

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included
  • documentation is changed or added
  • commit message follows commit guidelines
Affected core subsystem(s)

@nodejs-github-bot nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. fs Issues and PRs related to the fs subsystem / file system. labels Jan 18, 2018
jasnell (Member) commented Jan 18, 2018

That test is particularly sensitive. I'd prefer not to revert these commits. The test likely needs to be updated instead.

apapirovski (Member) commented Jan 18, 2018

@jasnell I think @joyeecheung is just trying to track down the cause rather than submitting an official revert PR. Right now we're not even sure which commit is responsible, as there's some contradictory evidence in past CI runs.

jasnell (Member) commented Jan 18, 2018

Yep, I get that. There are several things that could cause this. In my experience it has typically been the addition or removal of an AsyncWrap provider type, but there are several other ways this can fail non-obviously, given how async_hooks is implemented. It can be difficult to track down.
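For context, this is roughly why provider changes break the test: sequential/test-async-wrap-getasyncid keeps a ledger of provider types and verifies coverage in a beforeExit listener. The sketch below is a loose reconstruction of that shape, not the actual test source; the names and details are assumptions.

```js
'use strict';
// Loose sketch of the test's structure (Node ~9.x era, reconstructed).
const providers = Object.assign({}, process.binding('async_wrap').Providers);

function testInitialized(resource, providerName) {
  // ...assert that resource.getAsyncId() returns a valid id, etc...
  delete providers[providerName]; // mark this provider type as covered
}

process.on('beforeExit', () => {
  delete providers.NONE;
  // If C++ gains or loses a provider type without a matching change here,
  // or if beforeExit fires at an unexpected time, this check blows up.
  const untested = Object.keys(providers);
  if (untested.length > 0)
    throw new Error(`Untested providers: ${untested.join(', ')}`);
});
```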

joyeecheung (Member, author) commented Jan 18, 2018

@apapirovski @jasnell Yeah, I opened this because I had no idea how to run https://ci.nodejs.org/job/node-stress-single-test against specific SHAs, so I kind of have to do the bisecting this way...

(Although I still have no idea how to run the flaky tests with PR refs; there are tons of EACCES errors on common.PIPE...)

@joyeecheung joyeecheung added the wip Issues and PRs that are still a work in progress. label Jan 18, 2018
apapirovski (Member) commented Jan 18, 2018

Well, 4 CI runs on Linux and this test hasn't failed a single time on Alpine... it seems like the issue is somewhere in these commits and their interaction with that test. I still don't see any offending code, though... 🤔

Edit: it does seem like the Windows failure could be unrelated, though. So we potentially have two different bugs in that test.

jasnell (Member) commented Jan 18, 2018

That particular bit fails when something keeps the event loop open... ping @addaleax, who tracked this down previously. When we've seen this before, it was caused by very subtle differences in the timing of garbage collection. In one recent case, a change in node_perf.cc to how the GC callback was scheduled caused a bug that only became apparent because the completely unrelated http2 module happened to load just enough string constants to trigger a GC, which caused beforeExit to fire twice. Fun, eh? @addaleax may be able to give some pointers on how to track this down.
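A minimal, self-contained demonstration of that beforeExit behavior (not the node_perf.cc bug itself): beforeExit fires every time the event loop drains, so any side effect that schedules new work makes it fire again.

```js
'use strict';
let fired = 0;
process.on('beforeExit', () => {
  fired++;
  console.log(`beforeExit #${fired}`);
  if (fired === 1) {
    // Anything that re-arms the loop (an immediate, a timer, I/O) causes
    // beforeExit to fire a second time once the loop drains again.
    setImmediate(() => {});
  }
});
// Prints "beforeExit #1" and then "beforeExit #2".
```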

joyeecheung (Member, author) commented Jan 18, 2018

After a rebase against #18144, #17914 actually only touches sync APIs, and the only sync API used in that test is openSync, which #17914 does not touch... :/ So I think this probably has something to do with timing/load, like the previous flake, rather than with the APIs.

apapirovski (Member) commented Jan 18, 2018

@jasnell yeah, I worked on that particular issue, so it was the first thing that came to mind, but I don't see any changes that are responsible for keeping the event loop open. I can't replicate it locally either, which makes it all the more difficult to debug.

On that note: is there a particular reason we're checking in beforeExit? Could that check be moved to exit? (See the sketch below.)

Although... anything that keeps the event loop open after beforeExit fires, but doesn't originate from the listener, is concerning.
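For reference, a hedged sketch of the trade-off between the two hooks, assuming the check in question is a plain assertion at process teardown:

```js
'use strict';
process.on('exit', (code) => {
  // Fires exactly once, after the loop has finished; only synchronous
  // work runs here, so a check in this listener cannot be re-triggered
  // by stray async work.
});

process.on('beforeExit', () => {
  // Fires each time the loop drains, possibly more than once. That makes
  // it sensitive to anything that keeps the loop alive as a side effect,
  // which is exactly the failure mode being discussed.
});
```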

jasnell (Member) commented Jan 18, 2018

Unfortunately it could be any unref'd handle; it doesn't have to be limited to this change. For instance, in the http2 PR that triggered this, there was zero code in the PR that was directly responsible. Hmm... going to have to think about that a bit more.

Re: why beforeExit is being used in the test, I'm not sure. You'd have to ask @trevnorris.
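To illustrate the unref'd-handle subtlety (a generic example, not code from this PR): an unref'd handle does not keep the loop alive by itself, but any code path that refs it again, or creates a fresh handle, will.

```js
'use strict';
const timer = setInterval(() => {}, 1000);
timer.unref(); // the process may exit even though the timer is pending

// ...later, in some seemingly unrelated code path...
timer.ref();   // side effect: the loop is kept alive again
```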

apapirovski (Member)

I think we're on the same page. I just find it concerning that we have another instance of a side effect that brings the loop alive. The SetImmediate bit was definitely a bug. Even if this PR just happened to surface an unrelated bug, it's important not to just sweep it under the rug by reverting, working around it, or switching the test to an exit listener.

Will try to spend some time digging around to see what other code we have that can bring a loop alive as a side-effect.
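One way to dig for the culprit (a debugging sketch; both accessors are undocumented Node internals, so treat this as throwaway scaffolding):

```js
'use strict';
process.on('beforeExit', () => {
  // Dump whatever the event loop is still holding on to.
  console.log('active handles:', process._getActiveHandles());
  console.log('active requests:', process._getActiveRequests());
});
```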

apapirovski (Member)

@joyeecheung @addaleax @jasnell I believe I have a fix in #18241

joyeecheung (Member, author)

Closed now that #18241 landed

Linked issue: Investigate flaky sequential/test-async-wrap-getasyncid