-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Attempt to diagnose stack overflow in test programs. Also, improve build.yaml logic. #1496
Conversation
Since August 2023, we've seen occasional GitHub Windows build test runs terminate with 0xC00000FD: stack overflow. We've usually responded by bumping up the default coroutine stack size. On closer examination, it's always llleap_test.cpp that blows up that way -- and llleap_test.cpp doesn't appear to use coroutines at all. So apparently we've been consuming more address space for ALL viewer coroutines without actually addressing the problem. Reset the default coroutine stack size to where it was before we started bumping it up in response to these llleap_test.cpp stack overflow failures. Note that LLCoros already catches and reports Windows structured exceptions, underscoring that the observed stack overflow is not from within a coroutine. While at it, restore the Windows llleap_test.cpp data volume to match Posix. We think the problem that led to reducing that data volume was an APR bug, which we hope has been fixed. Equip test.cpp, the test driver program for all our TUT unit and integration tests, with a Windows structured exception handler. Try to treat a Windows structured exception as a test failure -- instead of silently terminating with 0xC00000FD. Moreover, when a structured exception occurs, output a stack trace so we can try to track it down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems worth a shot.
@@ -35,7 +35,7 @@ | |||
// causes Windows abdominal pain such that it later fails code-signing in some | |||
// mysterious way. Entirely suppressing these LLLeap tests pushes the failure | |||
// rate MUCH lower. Can we re-enable them with a smaller data size on Windows? | |||
const size_t BUFFERED_LENGTH = 100*1024; | |||
const size_t BUFFERED_LENGTH = 1023*1024; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
although that comment is disturbing, possibly this would be better consolidated outside of the #if defined(LL_WINDOWS)
block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would dearly love to believe that the flaky behavior was due to an old APR bug which has since been fixed. Since I'm not yet positive, I left the #if
conditional -- but yes, my intention is to reunite the platforms.
# its value when false is "false", which is interpreted as true. | ||
RELEASE_RUN: ${{ (github.event.inputs.release_run || github.ref_type == 'tag' && startsWith(github.ref_name, 'Second_Life')) && 'Y' || '' }} | ||
# its value when false is "false", which (again) is interpreted as true. | ||
RELEASE_RUN: ${{ (github.event.inputs.release_run != 'false' || (github.ref_type == 'tag' && startsWith(github.ref_name, 'Second_Life'))) && 'Y' || '' }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch, sorry about that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mean to throw shade! Even if you echo github.event.inputs.relase_run
to stdout, you see false
either way. There's no way to directly ask the workflow engine for the type of that variable, and since you explicitly state that it's boolean
, it's entirely reasonable to think that type might persist. I only got here because empirically it wasn't working, and empirically this change made it better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah no offense taken, just lamenting how hard it is to test this CI stuff before it lands
n.b. This is what I was shooting for:
|
indra/test/test.cpp
Outdated
// function when the stack is already blown only terminates us faster. | ||
if (code != STATUS_STACK_FULL) | ||
{ | ||
std::cerr << boost::stacktrace::stacktrace() << std::endl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably not worth holding this up, but is there something else minimal we could do? is there a way to grab just the top function name of the stack trace without blowing the stack further?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know what part of a stack frame Boost.Stacktrace needs to touch to derive that lowest-level entry. What I got without that if
was an Access Violation, which I think means Boost.Stacktrace was trying to look at a part of the frame that overlapped beyond the end of the available stack. And it would be the lowest-level function that overlaps. So... I don't know if there's anything we can safely do there.
Introduce AlwaysReturn<void> specialization, which always discards any result of calling the specified callable with specified args. Derive new Windows_SEH_exception from LLException, not std::runtime_error. Put the various SEH functions in LL::seh nested namespace, e.g. LL::seh::catcher() as the primary API. Break out more levels of Windows SEH handler to work around the restrictions on functions containing __try/__except. The triadic catcher() overload now does little save declare a std::string stacktrace before forwarding the call to catcher_inner(), passing a reference to stacktrace along with the trycode, filter and handler functions. catcher_inner() accepts the stacktrace and the three function template arguments. It contains the __try/__except logic. It calls a new filter_() wrapper template, which calls fill_stacktrace() before forwarding the call to the caller's filter function. fill_stacktrace(), in the .cpp file, contains the logic to populate the stacktrace string -- unless the Structured Exception is stack overflow, in which case it puts an explanatory string instead. catcher_inner()'s __except clause passes not only the code, but also the stacktrace string, to the caller's handler function. It wraps the caller's handler function in always_return<rtype>(), where rtype is the type returned by the trycode function. This allows a handler to return a value, while also supporting the void handler case, e.g. one that throws a C++ exception. (This is why we need AlwaysReturn<void>: some trycode() functions are themselves void.) For the dyadic catcher() overload, introduce common_filter() containing the logic to distinguish a C++ exception from any other kind of Structured Exception. The fact that the stacktrace is captured before the filter function is called should permit capturing a stacktrace for a C++ exception as well as for most other Structured Exceptions. As before, the monadic catcher() overload supplies the rethrow() handler, in the .cpp file. Change existing calls from seh_catcher() to LL::seh::catcher().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Retargeting PR #1496 to Maint B.
No description provided.