fix: Controlled shutdown of IlmThread pool in all apps in which we use it. #3805

lgritz · 2023-04-15T18:24:23Z

We know that there are some corner cases on Windows when it can be unsafe to exit because the Ilm thread pool (which is used implicitly by OpenEXR) hangs. The exact conditions that cause this are not well understood.

Some time back, in PR 3582, we added an explicit shutdown in oiiotool right before we exit, to alleviate the problem. But this doesn't help for other apps like idiff, nor for apps that use libOpenImageIO and aren't aware of the problem, so wouldn't know to do this.

This patch changes to a different strategy: In set_exr_threads(), which will always be called any time we use openexr files, use std::atexit() to ensure that when the program terminates -- any program using OpenImageIO in cases when openexr was used -- we set the IlmThread global thread pool size to 0 to shut it down cleanly. Hopefully, that will cover even more cases where this OpenEXR bug might be tickled.

LazyDodo · 2023-04-16T21:32:10Z

Sadly, this didn't do the trick, I had not reported this problem on the OIIO tracker since the only repro I have is "less than ideal", then again, this being a threading issue, guess that's just the nature of the beast.

First off i'll go over where it gets stuck with this with this patch applied, and then i'll share my repro

I changed the patch slightly to determine when it gets stuck during the atexit() or elsewhere.... easy enough, there was already a handy printf there just uncomment and give 'm a buddy.

    std::atexit([]() {
        print("[PRE ] EXITING and setting ilmthreads = 0\n");
        IlmThread::ThreadPool::globalThreadPool().setNumThreads(0);
        print("[POST] EXITING and setting ilmthreads = 0\n");
    });

good run

f:\idiff\tools>..\idiff candle_glass.exr candle_glass.exr
Comparing "candle_glass.exr" and "candle_glass.exr"
PASS
[PRE ] EXITING and setting ilmthreads = 0
[POST] EXITING and setting ilmthreads = 0

bad run

f:\idiff\tools>..\idiff candle_glass.exr candle_glass.exr
Comparing "candle_glass.exr" and "candle_glass.exr"
PASS
[PRE ] EXITING and setting ilmthreads = 0

Callstack on the stalled run:

ntdll.dll!NtWaitForSingleObject()
KernelBase.dll!WaitForSingleObjectEx()
IlmThread.dll!IlmThread_3_1::Semaphore::wait() Line 81
IlmThread.dll!IlmThread_3_1::`anonymous namespace'::DefaultThreadPoolProvider::finish() Line 362
IlmThread.dll!IlmThread_3_1::ThreadPool::Data::setProvider(IlmThread_3_1::ThreadPoolProvider * p) Line 559
IlmThread.dll!IlmThread_3_1::ThreadPool::setNumThreads(int count) Line 732
OpenImageIO.dll!OpenImageIO_v2_4::pvt::set_exr_threads::__l2::<lambda>() Line 291
ucrtbase.dll!<lambda>(void)()
ucrtbase.dll!__crt_seh_guarded_call<int>::operator()<<lambda_7777bce6b2f8c936911f934f8298dc43>,<lambda>(void) &,<lambda_3883c3dff614d5e0c5f61bb1ac94921c>>()
ucrtbase.dll!_execute_onexit_table()
OpenImageIO.dll!00007ffd540a7acd()
OpenImageIO.dll!00007ffd540a7c16()
ntdll.dll!LdrpCallInitRoutine()
ntdll.dll!LdrShutdownProcess()
ntdll.dll!RtlExitUserProcess()
kernel32.dll!ExitProcessImplementation()
ucrtbase.dll!exit_or_terminate_process()
ucrtbase.dll!common_exit()
idiff.exe!__scrt_common_main_seh() Line 295
kernel32.dll!BaseThreadInitThunk()
ntdll.dll!RtlUserThreadStart()

Repro

This seems to mostly trigger when the box is under more load than it should be, the troubles on the Blender CI farm only started when we allowed ctest to run tests concurrently (one of the CI envs dropped to less than 1/3 of the time needed for tests, so while the oversubscription is certainly not an ideal situation, when shaving off 10+ minutes of a test run it's hard to argue with the time savings and additional CI builds we are able to do in a day)

So we need to make some synthetic load that mimics the CPU oversubscription... which... is not that bad actually... make a primitive spinlock.... make N threads, have them fight over it while there is no chance for them to ever get it.

// BusyBee.cpp
#include <vector>
#include <thread>
#include <Windows.h>

LONG busy = 1;

int main(int argc, char** argv)
{
    if (argc != 2)
    {
        printf("usage %s number_of_threads\n", argv[0]);
        return 0;
    }
    int threads = atoi(argv[1]);
    std::vector<std::thread> v(threads);
    for (auto& t : v)
        t = std::thread([]
            {
                while (InterlockedCompareExchange(&busy, 1, 2) == 1);
        });
    while (true); // don't worry about teardown, ctrl-c will do it and instantly drop the load to 0 
}

compile, and just start a BusyBee 16 on a console somewhere (You may need to adjust that 16 to be lower/higher depending on the number of cores on the box you are on, go too high and idiff will run too slowly to have a good throughput in the number of starting/stopping processes , too low and you won't trigger the issue, on my i7-3770 8 is the magic number)

So that takes care of the load, next up idiff, since this bug is kind of flaky to catch the only way to really catch it run many many many idiffs until one gets stuck then dump the process or attach a debugger. I used candle_glass.exr from the cycles test-suite since it's a relatively small file so it's quick for idiff to test.

from there just make a batch file

rem iloop.cmd
:loop_start
..\idiff candle_glass.exr candle_glass.exr 
goto loop_start

from a second console, start iloop.cmd and wait.....this will trigger it on my environment usually in under a minute , I feel randomly stopping busybee and restarting it sometimes helps but I'll be honest, I may very well be imagining that

lgritz · 2023-04-17T20:14:43Z

I agree with you that the true, final fix is going to need to be on the OpenEXR side.

Do you think this PR still has merit and should be merged? If we believe that the previous thread shutdown in oiiotool helped with at least some of the occurrences, this should bring the same benefit to all the OIIO apps and also other apps that use OIIO to read exr files (I'm still interested in any reduction of the problem, including in cases where users have no choice but to build OIIO against an older OpenEXR that doesn't contain the fix, once it is fixed). Or do you think this change just isn't worth the trouble at all and should be abandoned?

LazyDodo · 2023-04-17T20:54:41Z

That's a tough one....

Going off my personal experience I'd say don't merge it, as it doesn't appear to offer me any relief. On the other hand if people have reported relief from merging PR 3582 who am I to argue with them? Having this fix applied to all tools is without a doubt better than just oiiotool even if it's just limited relief.

lgritz · 2023-04-17T21:33:26Z

Can you think of any reason why accepting would be worse than what we have now? Maybe it does nothing, at worst? But if the oiiotool side ever helped, this will do the same for the other apps.

LazyDodo · 2023-04-17T23:54:11Z

I see no way it could be worse, i still think it doesn't do anything, but in the odd chance it actually does, it's good to have it equally applied to all tools and not just oiiotool.

One thing of note is when I ran idiff idiff candle_glass.exr candle_glass.exr -o test.exr it ran the atexit() printf's twice, i don't see any harm in it, but still figured i'd mention it.

lgritz · 2023-04-18T00:55:26Z

Oh, I should probably do a call_once to ensure it's not called multiple times, even though it probably wouldn't be harmful. I will fix that and update the PR.

We know that there are some corner cases on Windows when it can be unsafe to exit because the Ilm thread pool (which is used implicitly by OpenEXR) hangs. The exact conditions that cause this are not well understood. Some time back, in PR 3582, we added an explicit shutdown in oiiotool right before we exit, to alleviate the problem. But this doesn't help for other apps like idiff, nor for apps that use libOpenImageIO and aren't aware of the problem, so wouldn't know to do this. This patch changes to a different strategy: In set_exr_threads(), which will always be called any time we use openexr files, use `std::atexit()` to ensure that when the program terminates -- any program using OpenImageIO in cases when openexr was used -- we set the IlmThread global thread pool size to 0 to shut it down cleanly. Hopefully, that will cover even more cases where this OpenEXR bug might be tickled.

lgritz · 2023-04-18T00:59:21Z

Ah, yes, quite broken. Depending on how many times you read or write exr files, you could get a whole bunch of those atexit calls!

PR updated.

lgritz · 2023-04-21T17:35:44Z

Any final objections to this, especially now that I updated to ensure it would never call it more than once?

LazyDodo · 2023-04-21T18:22:19Z

Oh no, none what so ever! sorry I wasn't aware this was stalled because of me.

lgritz · 2023-04-21T19:54:25Z

Oh, no, not specifically you. I just wanted to be cautious about merging if anybody might have concerns, especially considering that this is a bit of a speculative halfway workaround for a problem that really is in another project.

lgritz mentioned this pull request Apr 15, 2023

Threadpool deadlocks during shutdown on Windows AcademySoftwareFoundation/openexr#890

Open

lgritz force-pushed the lg-exrthreads branch from 0150369 to 7c464dd Compare April 18, 2023 00:58

lgritz force-pushed the lg-exrthreads branch from 7c464dd to e012aa5 Compare April 18, 2023 00:59

lgritz merged commit eb2ceba into AcademySoftwareFoundation:master Apr 22, 2023

lgritz deleted the lg-exrthreads branch April 23, 2023 06:16

LazyDodo mentioned this pull request May 26, 2023

[BUG] thread_pool shutdown dead lock #3851

Open

LazyDodo mentioned this pull request Jun 13, 2023

API: Add shutdown method #3882

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: Controlled shutdown of IlmThread pool in all apps in which we use it. #3805

fix: Controlled shutdown of IlmThread pool in all apps in which we use it. #3805

lgritz commented Apr 15, 2023

LazyDodo commented Apr 16, 2023 •

edited

Loading

lgritz commented Apr 17, 2023

LazyDodo commented Apr 17, 2023

lgritz commented Apr 17, 2023

LazyDodo commented Apr 17, 2023

lgritz commented Apr 18, 2023

lgritz commented Apr 18, 2023

lgritz commented Apr 21, 2023

LazyDodo commented Apr 21, 2023

lgritz commented Apr 21, 2023 •

edited

Loading

fix: Controlled shutdown of IlmThread pool in all apps in which we use it. #3805

fix: Controlled shutdown of IlmThread pool in all apps in which we use it. #3805

Conversation

lgritz commented Apr 15, 2023

LazyDodo commented Apr 16, 2023 • edited Loading

Repro

lgritz commented Apr 17, 2023

LazyDodo commented Apr 17, 2023

lgritz commented Apr 17, 2023

LazyDodo commented Apr 17, 2023

lgritz commented Apr 18, 2023

lgritz commented Apr 18, 2023

lgritz commented Apr 21, 2023

LazyDodo commented Apr 21, 2023

lgritz commented Apr 21, 2023 • edited Loading

LazyDodo commented Apr 16, 2023 •

edited

Loading

lgritz commented Apr 21, 2023 •

edited

Loading