Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] thread_pool shutdown dead lock #3851

Open
LazyDodo opened this issue May 26, 2023 · 8 comments
Open

[BUG] thread_pool shutdown dead lock #3851

LazyDodo opened this issue May 26, 2023 · 8 comments
Labels
bug Crash or wrong behavior of an existing feature. internals Internal changes, not public APIs

Comments

@LazyDodo
Copy link
Contributor

LazyDodo commented May 26, 2023

Describe the bug

Occasionally idiff.exe refuses to shutdown on the blender CI environment, previously openexr was to blame here but with recent developments there that specific problem appears to be solved, however a new one has sprung up in its place. I have a process dump from the CI environment. which shows the following stack with only the main thread still being alive.

ntdll.dll!NtYieldExecution()
KERNELBASE.dll!SwitchToThread()
[Inline Frame] OpenImageIO_Util.dll!std::this_thread::yield() Line 179
[Inline Frame] OpenImageIO_Util.dll!OpenImageIO_v2_4::atomic_backoff::operator()() Line 146
OpenImageIO_Util.dll!OpenImageIO_v2_4::spin_mutex::lock() Line 219
[Inline Frame] OpenImageIO_Util.dll!std::unique_lock<OpenImageIO_v2_4::spin_mutex>::{ctor}(OpenImageIO_v2_4::spin_mutex &) Line 137
[Inline Frame] OpenImageIO_Util.dll!OpenImageIO_v2_4::pvt::ThreadsafeQueue<std::function<void __cdecl(int)> *>::pop(std::function<void __cdecl(int)> * &) Line 63
OpenImageIO_Util.dll!OpenImageIO_v2_4::thread_pool::Impl::clear_queue() Line 178
OpenImageIO_Util.dll!OpenImageIO_v2_4::thread_pool::Impl::stop(bool isWait) Line 249
[Inline Frame] OpenImageIO_Util.dll!OpenImageIO_v2_4::thread_pool::Impl::{dtor}() Line 122
OpenImageIO_Util.dll!OpenImageIO_v2_4::thread_pool::Impl::`scalar deleting destructor'(unsigned int)
OpenImageIO_Util.dll!`OpenImageIO_v2_4::default_thread_pool'::`2'::`dynamic atexit destructor for 'shared_pool''()
ucrtbase.dll!<lambda>(void)()
ucrtbase.dll!__crt_seh_guarded_call<int>::operator()<<lambda_7777bce6b2f8c936911f934f8298dc43>,<lambda>(void) &,<lambda_3883c3dff614d5e0c5f61bb1ac94921c>>()
ucrtbase.dll!_execute_onexit_table()
OpenImageIO_Util.dll!dllmain_crt_process_detach(const bool is_terminating) Line 182
OpenImageIO_Util.dll!dllmain_dispatch(HINSTANCE__ * const instance, const unsigned long reason, void * const reserved) Line 293
ntdll.dll!LdrpCallInitRoutine()
ntdll.dll!LdrShutdownProcess()
ntdll.dll!RtlExitUserProcess()
kernel32.dll!ExitProcessImplementation()
ucrtbase.dll!exit_or_terminate_process()
ucrtbase.dll!common_exit()
idiff.exe!__scrt_common_main_seh() Line 295
kernel32.dll!BaseThreadInitThunk()
ntdll.dll!RtlUserThreadStart()

To Reproduce

:loop_start
echo %TIME%
..\idiff -fail 1 -failpercent 100 -abs -scale 16 -o aaa.diff.png jpeg2000-rgba-12-jp2-90__from__rgba08.jp2 jpeg2000-rgba-12-jp2-90__from__rgba08.jp2
goto loop_start

Platform information:

  • OIIO branch/version: v2.4.11.0
  • OS: Windows 10
  • C++ compiler: MSVC 2019
  • Any non-default build flags when you build OIIO: Too many flags to list, build config can be found over here in the blender repository
@LazyDodo
Copy link
Contributor Author

LazyDodo commented Jun 3, 2023

Oh this is a "fun" one, first of I'm seemingly not the first one to run into this #2382 got close to figuring it out, but didn't cross the finish line. It had a good tip though, calling default_thread_pool()->resize(0); at the end of idiff made the problem less frequent, but alas the problem is still there (occurring every 30-40 minutes rather than every 5 mins or so with the repro posted)

What is happening is

  1. I diff runs till completion, and exits its main function, at this point in time the default thread pool is still very much alive and well.
  2. The standard windows process shut down commences which
    2a) Kills all threads except the main thread in the worst way possible
    2b) Starts to Unload any shared libraries and sends a DLL_PROCESS_DETACH message to OpenImageIO_Util.dll, which triggers cleanup and causing dtors to run which causes thread_pool::stop to run.
  3. Because 2a killed a thread at just the wrong moment, just as it was holding the spin_mutex lock inside ThreadsafeQueue this lock is now eternally held by a essentially a ghost.
  4. thread_pool::stop calls thread_pool::clear_queue() which calls this->q.pop(_f) which eternally will wait for the ghost to give up its lock, which will never happen.
  5. Deadlock

Which leads to the question, why does calling default_thread_pool()->resize(0); only partially solve the problem? The key there is when there's a decrease in threads, it will signal the threads it is time to exit, however resize() does not wait for the threads to actually join. Which most of the time gives them just enough heads up to get their affairs in order before the OS kills them making it less likely for them to be holding the lock but does not eliminate the possibility.

lessons learned: shut down your threads before exiting, you can trust the OS to clean up after you exit, but it will do so with a shotgun approach

additional notes: I have tried the same style solution as proposed in #3805 , however by the time the std::atexit callback runs the threads have already been taken out forcefully by the OS which does explain why #3805 didn't improve things for me.

@LazyDodo
Copy link
Contributor Author

This is what I feel needs to be done to resolve this problem once and for all, however given the change seems be too big to pull-off without the CLA I'll leave detailed instructions. (if my assumption is wrong, feel free to give the green light and i'll prepare a pull)

1- Have thread_pool::resize() wait for the threads to actually join by moving the member variable terminating_threads local to thread_pool::resize() and move the joining code there too.

2 - land #2382 , it was on the right track.

3- Add a function in some generic namespace called 'shutdown' that calls default_thread_pool_shutdown the function name is generic to allow it to service additional deterministic shutdown problems/cleanups (like 6) in the future.

4- Update the documentation to inform callers that 'shutdown' shall always be called before main exits (as it's the only way to have some determinism in the shutdown of oiio and any thread-pools it may have created)

5 - Update all tools to call shutdown

6 - #3805 needs to be revisited since the atexit has shown to run only after the threads have already been killed on windows, potentially still holding a lock in unpatched openexr versions. I'm certain now this pull offers no relief as is. The unpatched openexr versions do appear to have a thread join in their resize method, so a similar setup as 2 should work here given it is done before main exists.

7- Before working on this consult with @lgritz since he may disagree with all of or parts of this plan :)

A version of this that only implements 1+5(for idiff only) has been running on the blender CI env for several days now without lockups which used to lock up multiple times a day.

@lgritz
Copy link
Collaborator

lgritz commented Jun 12, 2023

Sounds like a plan to me, please do submit the PR that cleans all this up.

LazyDodo added a commit to LazyDodo/oiio that referenced this issue Jun 12, 2023
Part of solving AcademySoftwareFoundation#3851.

When decreasing the number of threads in `thread_pool::Impl::resize`
it was possible for the threads marked for termination to still be
fully alive at process exit time if the call to `resize` and the exit
of the main function were close enough to each other.

This change makes `resize` wait for the threads to actually join
guaranteeing the threads have been shut down in a orderly manner
when the `resize` function exits.
youle31 pushed a commit to UPBGE/upbge that referenced this issue Jun 12, 2023
idiff sometimes locks up while shutting down when the CPU is
oversubscribed. While blender does not rely on the idiff tool
the tests that run on the CI environment do, which causes tests
to occasionally fail due to a timeout.

The root cause is a bit complex but can be found on the oiio tracker
at AcademySoftwareFoundation/OpenImageIO#3851

This change fixes idiff by :

1- Shutting down the thread pool before the main function exits
2- Have the shutdown wait for the pool threads to actually join, to
prevent the OS from forcefully terminating them while they could
potentially still be holding a lock.
lgritz pushed a commit that referenced this issue Jun 12, 2023
…#3879)

Part of solving #3851.

When decreasing the number of threads in `thread_pool::Impl::resize` it
was possible for the threads marked for termination to still be fully
alive at process exit time if the call to `resize` and the exit of the
main function were close enough to each other.

This change makes `resize` wait for the threads to actually join
guaranteeing the threads have been shut down in a orderly manner when
the `resize` function exits.
LazyDodo added a commit to LazyDodo/oiio that referenced this issue Jun 13, 2023
This change adds a `shutdown` method to the API that can be used to perform
shutdown of the thread-pool(s) (or other clean-up activities). Applications
that utilize OpenImageIO must call it before exiting their main function to
prevent possible deadlocks during application shutdown (see AcademySoftwareFoundation#3851)

This change also updates the documentation and all oiio-tools to call the
new `shutdown` method.

This is the final change for fixing AcademySoftwareFoundation#3851
@LazyDodo LazyDodo mentioned this issue Jun 13, 2023
5 tasks
lgritz pushed a commit that referenced this issue Jun 13, 2023
This change adds a `shutdown()` method to the API that can be used to
perform shutdown of the thread-pool(s) (or other clean-up activities).
Applications that utilize OpenImageIO are advised to call it before exiting their
main function to prevent possible deadlocks during application shutdown
(see #3851)

This change also updates the documentation and all oiio-tools to call
the new `shutdown` method.

This is the final change for fixing #3851
@LazyDodo
Copy link
Contributor Author

think i'm politely gonna bow out of the #3805 rework, i dove in this morning and it's a bit of an uphill battle, the only way I could see this work is if we setup a callback system similar to std::atexit() that would be processed in the oiio::shutdown function. It surely can be done, but it would take quite a bit of finesse to get it right and involves adding another thread safe queue instance, not sure it's worth the risk/time given openexr released v3.1.8 last week that has their fix for the shutdown issue offering an upgrade path to existing users

given the fix in place doesn't work, i'd advocate reverting #3805

@lgritz
Copy link
Collaborator

lgritz commented Jun 14, 2023

@LazyDodo, please note that #3864 already mostly reverts #3805 -- it does the "atexit" stuff ONLY for Windows and only if the OpenEXR release is earlier than 3.1.8, which is when we believe the changes on the OpenEXR side fix the relevant bugs. (Do you think that's true?)

@lgritz
Copy link
Collaborator

lgritz commented Jun 14, 2023

There have been a number of patches and back-and-forth, and it's hard to follow. @LazyDodo, to be sure we're on the same page, can you please remind me, to your best assessment, which problems still persist

  • with the OpenEXR thread pool
  • with the OIIO internal thread pool

Does Blender use TBB?

OIIO has the ability to optionally use TBB, but its support is not quite 100% yet -- when enabled, all the parallel_for family of functions will use TBB, but there are still a few other uses of the internal thread pool that don't automatically use TBB when enabled. This shouldn't be hard for me to fix so that all thread pool use relies on TBB when enabled.

Also, I didn't know it until recently, but OpenEXR has for a long time had the ability to tell it to use an app's thread pool instead of its own internal one. I don't do that yet, but that's another thing we could try.

So, in short, it would be possible without a whole lot of work to make it so that all threading in both OIIO and OpenEXR could use TBB. Whether that's useful depends on (a) if we believe that TBB is likely to not have any of the same problems (I would like to think they got it right?), and (b) whether that would be an acceptable dependency from Blender's point of view.

@LazyDodo
Copy link
Contributor Author

There have been a number of patches and back-and-forth, and it's hard to follow. @LazyDodo, to be sure we're on the same page, can you please remind me, to your best assessment, which problems still persist
with the OpenEXR thread pool

Still a problem when you combine OIIO + OpenEXR 3.1.7 or earlier, since #3805 's atexit runs after the threads have been brutally shot in the face potentially holding a lock. The resizing the pool to side step the problem actually works, but due to when the atexit callback runs It's too late to actually prevent the problem.

not a problem for openEXR 3.1.8+

with the OIIO internal thread pool

This we fixed in the last few days, no longer an issue in oiio master.

We have been running OpenEXR 3.1.7 + AcademySoftwareFoundation/openexr@b18905772e applied on our CI env since may 11th, not a single stall in openexr since then so I am confident their fix is holding up.

A preliminary fix (just resize() waiting for the threads to actually join, and resizing the oiio pool to 0 at the end idiff) for the oiio/idiff version of the problem has been deployed since jun 7th (so nearly a week now) and we have had not a single stall since (multiple times a day previously) so i'm fairly confident about this one as well.

Does Blender use TBB?

Blender uses TBB, but until Blender 3.6 OIIO was build without TBB support, we enabled it hoping it would use the TBB thread pool rather than OIIO's internal one and maybe not have this problem, but as you pointed out there's still a few uses left of the internal pool, so that didn't pan out for us. Starting Blender 3.6 we ship OIIO with TBB support enabled nonetheless.

So, in short, it would be possible without a whole lot of work to make it so that all threading in both OIIO and OpenEXR could use TBB. Whether that's useful depends on (a) if we believe that TBB is likely to not have any of the same problems (I would like to think they got it right?), and (b) whether that would be an acceptable dependency from Blender's point of view.

a) No idea, given essentially the same problem has bitten us twice in a row, I'm unwilling to even guess if they got it right or not. We'd have to test / look at their code to see how well they do there.
b) Not an issue for us, as mentioned we recently started building OIIO with TBB enabled

Regardless the pending Blender 3.6 release is pretty much locked down at this point it will have to ship with the temporary patches in place.

@lgritz lgritz added bug Crash or wrong behavior of an existing feature. internals Internal changes, not public APIs labels Jun 17, 2023
@e4lam
Copy link

e4lam commented Jun 23, 2023

OIIO has the ability to optionally use TBB, but its support is not quite 100% yet -- when enabled, all the parallel_for family of functions will use TBB, but there are still a few other uses of the internal thread pool that don't automatically use TBB when enabled. This shouldn't be hard for me to fix so that all thread pool use relies on TBB when enabled.

Please do! We would really love this!

Also, I didn't know it until recently, but OpenEXR has for a long time had the ability to tell it to use an app's thread pool instead of its own internal one. I don't do that yet, but that's another thing we could try.

Reminder on my last failed attempt:
AcademySoftwareFoundation/openexr#1285

So, in short, it would be possible without a whole lot of work to make it so that all threading in both OIIO and OpenEXR could use TBB. Whether that's useful depends on (a) if we believe that TBB is likely to not have any of the same problems (I would like to think they got it right?), and (b) whether that would be an acceptable dependency from Blender's point of view.

Again, double +1 from me on using purely TBB from the number of threads that an application needs standpoint.

It should be mentioned though that using TBB does not solve the hang on shutdown problem. After working with the TBB team on this for over 2 releases (ie. even after oneapi-src/oneTBB#71 ), I gave up and made all our executables explicitly shutdown TBB prior to returning from main() or calling exit().

Then when we added OIIO, I had to again make sure to shutdown the started threadpools in the same manner. (Mind you, it wasn't bad though as we have infrastructure to do this in one place.)

enetheru added a commit to enetheru/smf_tools that referenced this issue Oct 25, 2023
Problem exists on windows, and is detailed here: AcademySoftwareFoundation/OpenImageIO#3851
requires OIIO::shutdown() at the end of executables that utilise openimageio
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Crash or wrong behavior of an existing feature. internals Internal changes, not public APIs
Projects
None yet
Development

No branches or pull requests

3 participants