Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attempt to diagnose stack overflow in test programs. Also, improve build.yaml logic. #1496

Merged
merged 12 commits into from
May 30, 2024

Conversation

nat-goodspeed
Copy link
Collaborator

No description provided.

Since August 2023, we've seen occasional GitHub Windows build test runs
terminate with 0xC00000FD: stack overflow. We've usually responded by bumping
up the default coroutine stack size.

On closer examination, it's always llleap_test.cpp that blows up that way --
and llleap_test.cpp doesn't appear to use coroutines at all. So apparently
we've been consuming more address space for ALL viewer coroutines without
actually addressing the problem.

Reset the default coroutine stack size to where it was before we started
bumping it up in response to these llleap_test.cpp stack overflow failures.
Note that LLCoros already catches and reports Windows structured exceptions,
underscoring that the observed stack overflow is not from within a coroutine.

While at it, restore the Windows llleap_test.cpp data volume to match Posix.
We think the problem that led to reducing that data volume was an APR bug,
which we hope has been fixed.

Equip test.cpp, the test driver program for all our TUT unit and integration
tests, with a Windows structured exception handler. Try to treat a Windows
structured exception as a test failure -- instead of silently terminating with
0xC00000FD. Moreover, when a structured exception occurs, output a stack trace
so we can try to track it down.
Copy link
Collaborator

@brad-linden brad-linden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems worth a shot.

@@ -35,7 +35,7 @@
// causes Windows abdominal pain such that it later fails code-signing in some
// mysterious way. Entirely suppressing these LLLeap tests pushes the failure
// rate MUCH lower. Can we re-enable them with a smaller data size on Windows?
const size_t BUFFERED_LENGTH = 100*1024;
const size_t BUFFERED_LENGTH = 1023*1024;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

although that comment is disturbing, possibly this would be better consolidated outside of the #if defined(LL_WINDOWS) block?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would dearly love to believe that the flaky behavior was due to an old APR bug which has since been fixed. Since I'm not yet positive, I left the #if conditional -- but yes, my intention is to reunite the platforms.

# its value when false is "false", which is interpreted as true.
RELEASE_RUN: ${{ (github.event.inputs.release_run || github.ref_type == 'tag' && startsWith(github.ref_name, 'Second_Life')) && 'Y' || '' }}
# its value when false is "false", which (again) is interpreted as true.
RELEASE_RUN: ${{ (github.event.inputs.release_run != 'false' || (github.ref_type == 'tag' && startsWith(github.ref_name, 'Second_Life'))) && 'Y' || '' }}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, sorry about that

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mean to throw shade! Even if you echo github.event.inputs.relase_run to stdout, you see false either way. There's no way to directly ask the workflow engine for the type of that variable, and since you explicitly state that it's boolean, it's entirely reasonable to think that type might persist. I only got here because empirically it wasn't working, and empirically this change made it better.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah no offense taken, just lamenting how hard it is to test this CI stuff before it lands

@nat-goodspeed
Copy link
Collaborator Author

n.b. This is what I was shooting for:

[llleap, 9: unknown] exception: Windows_SEH_exception: 'test threw c00000fd (Stack Overflow)'

// function when the stack is already blown only terminates us faster.
if (code != STATUS_STACK_FULL)
{
std::cerr << boost::stacktrace::stacktrace() << std::endl;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably not worth holding this up, but is there something else minimal we could do? is there a way to grab just the top function name of the stack trace without blowing the stack further?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what part of a stack frame Boost.Stacktrace needs to touch to derive that lowest-level entry. What I got without that if was an Access Violation, which I think means Boost.Stacktrace was trying to look at a part of the frame that overlapped beyond the end of the available stack. And it would be the lowest-level function that overlaps. So... I don't know if there's anything we can safely do there.

Introduce AlwaysReturn<void> specialization, which always discards any result
of calling the specified callable with specified args.

Derive new Windows_SEH_exception from LLException, not std::runtime_error.

Put the various SEH functions in LL::seh nested namespace, e.g.
LL::seh::catcher() as the primary API.

Break out more levels of Windows SEH handler to work around the restrictions on
functions containing __try/__except.

The triadic catcher() overload now does little save declare a std::string
stacktrace before forwarding the call to catcher_inner(), passing a reference
to stacktrace along with the trycode, filter and handler functions.

catcher_inner() accepts the stacktrace and the three function template
arguments. It contains the __try/__except logic. It calls a new filter_()
wrapper template, which calls fill_stacktrace() before forwarding the call to
the caller's filter function. fill_stacktrace(), in the .cpp file, contains
the logic to populate the stacktrace string -- unless the Structured Exception
is stack overflow, in which case it puts an explanatory string instead.

catcher_inner()'s __except clause passes not only the code, but also the
stacktrace string, to the caller's handler function. It wraps the caller's
handler function in always_return<rtype>(), where rtype is the type returned
by the trycode function. This allows a handler to return a value, while also
supporting the void handler case, e.g. one that throws a C++ exception. (This
is why we need AlwaysReturn<void>: some trycode() functions are themselves
void.)

For the dyadic catcher() overload, introduce common_filter() containing the
logic to distinguish a C++ exception from any other kind of Structured
Exception. The fact that the stacktrace is captured before the filter function
is called should permit capturing a stacktrace for a C++ exception as well as
for most other Structured Exceptions.

As before, the monadic catcher() overload supplies the rethrow() handler, in
the .cpp file.

Change existing calls from seh_catcher() to LL::seh::catcher().
Copy link
Collaborator

@brad-linden brad-linden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@nat-goodspeed nat-goodspeed changed the base branch from main to release/maint-b May 30, 2024 15:15
@nat-goodspeed nat-goodspeed merged commit a8b6112 into release/maint-b May 30, 2024
14 checks passed
@nat-goodspeed nat-goodspeed deleted the nat/catch-test-blown-stack branch May 30, 2024 16:36
@github-actions github-actions bot locked and limited conversation to collaborators May 30, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants