Attempt to diagnose stack overflow in test programs. Also, improve build.yaml logic. #1496

nat-goodspeed · 2024-05-16T21:07:14Z

No description provided.

Since August 2023, we've seen occasional GitHub Windows build test runs terminate with 0xC00000FD: stack overflow. We've usually responded by bumping up the default coroutine stack size. On closer examination, it's always llleap_test.cpp that blows up that way -- and llleap_test.cpp doesn't appear to use coroutines at all. So apparently we've been consuming more address space for ALL viewer coroutines without actually addressing the problem. Reset the default coroutine stack size to where it was before we started bumping it up in response to these llleap_test.cpp stack overflow failures. Note that LLCoros already catches and reports Windows structured exceptions, underscoring that the observed stack overflow is not from within a coroutine. While at it, restore the Windows llleap_test.cpp data volume to match Posix. We think the problem that led to reducing that data volume was an APR bug, which we hope has been fixed. Equip test.cpp, the test driver program for all our TUT unit and integration tests, with a Windows structured exception handler. Try to treat a Windows structured exception as a test failure -- instead of silently terminating with 0xC00000FD. Moreover, when a structured exception occurs, output a stack trace so we can try to track it down.

brad-linden

seems worth a shot.

brad-linden · 2024-05-16T22:16:05Z

indra/llcommon/tests/llleap_test.cpp

@@ -35,7 +35,7 @@
 // causes Windows abdominal pain such that it later fails code-signing in some
 // mysterious way. Entirely suppressing these LLLeap tests pushes the failure
 // rate MUCH lower. Can we re-enable them with a smaller data size on Windows?
-const size_t BUFFERED_LENGTH =  100*1024;
+const size_t BUFFERED_LENGTH = 1023*1024;


although that comment is disturbing, possibly this would be better consolidated outside of the #if defined(LL_WINDOWS) block?

I would dearly love to believe that the flaky behavior was due to an old APR bug which has since been fixed. Since I'm not yet positive, I left the #if conditional -- but yes, my intention is to reunite the platforms.

brad-linden · 2024-05-16T22:28:07Z

.github/workflows/build.yaml

-      # its value when false is "false", which is interpreted as true.
-      RELEASE_RUN: ${{ (github.event.inputs.release_run || github.ref_type == 'tag' && startsWith(github.ref_name, 'Second_Life')) && 'Y' || '' }}
+      # its value when false is "false", which (again) is interpreted as true.
+      RELEASE_RUN: ${{ (github.event.inputs.release_run != 'false' || (github.ref_type == 'tag' && startsWith(github.ref_name, 'Second_Life'))) && 'Y' || '' }}


good catch, sorry about that

I don't mean to throw shade! Even if you echo github.event.inputs.relase_run to stdout, you see false either way. There's no way to directly ask the workflow engine for the type of that variable, and since you explicitly state that it's boolean, it's entirely reasonable to think that type might persist. I only got here because empirically it wasn't working, and empirically this change made it better.

yeah no offense taken, just lamenting how hard it is to test this CI stuff before it lands

nat-goodspeed · 2024-05-17T15:17:48Z

n.b. This is what I was shooting for:

[llleap, 9: unknown] exception: Windows_SEH_exception: 'test threw c00000fd (Stack Overflow)'

brad-linden · 2024-05-17T17:40:53Z

indra/test/test.cpp

+        // function when the stack is already blown only terminates us faster.
+        if (code != STATUS_STACK_FULL)
+        {
+            std::cerr << boost::stacktrace::stacktrace() << std::endl;


probably not worth holding this up, but is there something else minimal we could do? is there a way to grab just the top function name of the stack trace without blowing the stack further?

I don't know what part of a stack frame Boost.Stacktrace needs to touch to derive that lowest-level entry. What I got without that if was an Access Violation, which I think means Boost.Stacktrace was trying to look at a part of the frame that overlapped beyond the end of the available stack. And it would be the lowest-level function that overlaps. So... I don't know if there's anything we can safely do there.

Introduce AlwaysReturn<void> specialization, which always discards any result of calling the specified callable with specified args. Derive new Windows_SEH_exception from LLException, not std::runtime_error. Put the various SEH functions in LL::seh nested namespace, e.g. LL::seh::catcher() as the primary API. Break out more levels of Windows SEH handler to work around the restrictions on functions containing __try/__except. The triadic catcher() overload now does little save declare a std::string stacktrace before forwarding the call to catcher_inner(), passing a reference to stacktrace along with the trycode, filter and handler functions. catcher_inner() accepts the stacktrace and the three function template arguments. It contains the __try/__except logic. It calls a new filter_() wrapper template, which calls fill_stacktrace() before forwarding the call to the caller's filter function. fill_stacktrace(), in the .cpp file, contains the logic to populate the stacktrace string -- unless the Structured Exception is stack overflow, in which case it puts an explanatory string instead. catcher_inner()'s __except clause passes not only the code, but also the stacktrace string, to the caller's handler function. It wraps the caller's handler function in always_return<rtype>(), where rtype is the type returned by the trycode function. This allows a handler to return a value, while also supporting the void handler case, e.g. one that throws a C++ exception. (This is why we need AlwaysReturn<void>: some trycode() functions are themselves void.) For the dyadic catcher() overload, introduce common_filter() containing the logic to distinguish a C++ exception from any other kind of Structured Exception. The fact that the stacktrace is captured before the filter function is called should permit capturing a stacktrace for a C++ exception as well as for most other Structured Exceptions. As before, the monadic catcher() overload supplies the rethrow() handler, in the .cpp file. Change existing calls from seh_catcher() to LL::seh::catcher().

brad-linden

👍

Retargeting PR #1496 to Maint B.

nat-goodspeed added 7 commits May 16, 2024 13:30

Need magic #define for <boost/stacktrace.hpp> on Mac.

f10ec30

Try to fix release_run logic in build.yaml.

3c07101

Try determining which-branch once for all platforms.

7e381d6

Add diagnostic output to setvar build step.

006b00b

Try harder to interpret github.event.inputs.release_run checkbox.

189711f

Fix test for github.event.inputs.release_run.

fe14534

nat-goodspeed requested a review from brad-linden May 16, 2024 21:07

github-actions bot added c/cpp llcommon labels May 16, 2024

Merge branch 'main' into nat/catch-test-blown-stack.

9edb436

brad-linden approved these changes May 16, 2024

View reviewed changes

Don't try to report a stack trace on stack overflow (sigh)

f06f84a

brad-linden approved these changes May 17, 2024

View reviewed changes

Promote seh_catcher() et al. to llexception.{h,cpp} for general use.

71d777e

github-actions bot added the llwindow label May 24, 2024

brad-linden approved these changes May 29, 2024

View reviewed changes

marchcat approved these changes May 30, 2024

View reviewed changes

Merge branch 'release/maint-b' into nat/catch-test-blown-stack.

dfdb883

Retargeting PR #1496 to Maint B.

github-actions bot added cmake python llappearance llcorehttp llimage llimagej2coj llmath llmessage llprimitive labels May 30, 2024

github-actions bot added llrender llplugin llxml labels May 30, 2024

nat-goodspeed changed the base branch from main to release/maint-b May 30, 2024 15:15

nat-goodspeed merged commit a8b6112 into release/maint-b May 30, 2024
14 checks passed

nat-goodspeed deleted the nat/catch-test-blown-stack branch May 30, 2024 16:36

github-actions bot locked and limited conversation to collaborators May 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Attempt to diagnose stack overflow in test programs. Also, improve build.yaml logic. #1496

Attempt to diagnose stack overflow in test programs. Also, improve build.yaml logic. #1496

nat-goodspeed commented May 16, 2024

brad-linden left a comment

brad-linden May 16, 2024

nat-goodspeed May 17, 2024

brad-linden May 16, 2024

nat-goodspeed May 17, 2024

brad-linden May 17, 2024

nat-goodspeed commented May 17, 2024

brad-linden May 17, 2024

nat-goodspeed May 17, 2024

brad-linden left a comment

Attempt to diagnose stack overflow in test programs. Also, improve build.yaml logic. #1496

Attempt to diagnose stack overflow in test programs. Also, improve build.yaml logic. #1496

Conversation

nat-goodspeed commented May 16, 2024

brad-linden left a comment

Choose a reason for hiding this comment

brad-linden May 16, 2024

Choose a reason for hiding this comment

nat-goodspeed May 17, 2024

Choose a reason for hiding this comment

brad-linden May 16, 2024

Choose a reason for hiding this comment

nat-goodspeed May 17, 2024

Choose a reason for hiding this comment

brad-linden May 17, 2024

Choose a reason for hiding this comment

nat-goodspeed commented May 17, 2024

brad-linden May 17, 2024

Choose a reason for hiding this comment

nat-goodspeed May 17, 2024

Choose a reason for hiding this comment

brad-linden left a comment

Choose a reason for hiding this comment