Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Controlled shutdown of IlmThread pool in all apps in which we use it. #3805

Merged
merged 1 commit into from
Apr 22, 2023

Conversation

lgritz
Copy link
Collaborator

@lgritz lgritz commented Apr 15, 2023

We know that there are some corner cases on Windows when it can be unsafe to exit because the Ilm thread pool (which is used implicitly by OpenEXR) hangs. The exact conditions that cause this are not well understood.

Some time back, in PR 3582, we added an explicit shutdown in oiiotool right before we exit, to alleviate the problem. But this doesn't help for other apps like idiff, nor for apps that use libOpenImageIO and aren't aware of the problem, so wouldn't know to do this.

This patch changes to a different strategy: In set_exr_threads(), which will always be called any time we use openexr files, use std::atexit() to ensure that when the program terminates -- any program using OpenImageIO in cases when openexr was used -- we set the IlmThread global thread pool size to 0 to shut it down cleanly. Hopefully, that will cover even more cases where this OpenEXR bug might be tickled.

@LazyDodo
Copy link
Contributor

LazyDodo commented Apr 16, 2023

Sadly, this didn't do the trick, I had not reported this problem on the OIIO tracker since the only repro I have is "less than ideal", then again, this being a threading issue, guess that's just the nature of the beast.

First off i'll go over where it gets stuck with this with this patch applied, and then i'll share my repro

I changed the patch slightly to determine when it gets stuck during the atexit() or elsewhere.... easy enough, there was already a handy printf there just uncomment and give 'm a buddy.

    std::atexit([]() {
        print("[PRE ] EXITING and setting ilmthreads = 0\n");
        IlmThread::ThreadPool::globalThreadPool().setNumThreads(0);
        print("[POST] EXITING and setting ilmthreads = 0\n");
    });

good run

f:\idiff\tools>..\idiff candle_glass.exr candle_glass.exr
Comparing "candle_glass.exr" and "candle_glass.exr"
PASS
[PRE ] EXITING and setting ilmthreads = 0
[POST] EXITING and setting ilmthreads = 0

bad run

f:\idiff\tools>..\idiff candle_glass.exr candle_glass.exr
Comparing "candle_glass.exr" and "candle_glass.exr"
PASS
[PRE ] EXITING and setting ilmthreads = 0

Callstack on the stalled run:

ntdll.dll!NtWaitForSingleObject()
KernelBase.dll!WaitForSingleObjectEx()
IlmThread.dll!IlmThread_3_1::Semaphore::wait() Line 81
IlmThread.dll!IlmThread_3_1::`anonymous namespace'::DefaultThreadPoolProvider::finish() Line 362
IlmThread.dll!IlmThread_3_1::ThreadPool::Data::setProvider(IlmThread_3_1::ThreadPoolProvider * p) Line 559
IlmThread.dll!IlmThread_3_1::ThreadPool::setNumThreads(int count) Line 732
OpenImageIO.dll!OpenImageIO_v2_4::pvt::set_exr_threads::__l2::<lambda>() Line 291
ucrtbase.dll!<lambda>(void)()
ucrtbase.dll!__crt_seh_guarded_call<int>::operator()<<lambda_7777bce6b2f8c936911f934f8298dc43>,<lambda>(void) &,<lambda_3883c3dff614d5e0c5f61bb1ac94921c>>()
ucrtbase.dll!_execute_onexit_table()
OpenImageIO.dll!00007ffd540a7acd()
OpenImageIO.dll!00007ffd540a7c16()
ntdll.dll!LdrpCallInitRoutine()
ntdll.dll!LdrShutdownProcess()
ntdll.dll!RtlExitUserProcess()
kernel32.dll!ExitProcessImplementation()
ucrtbase.dll!exit_or_terminate_process()
ucrtbase.dll!common_exit()
idiff.exe!__scrt_common_main_seh() Line 295
kernel32.dll!BaseThreadInitThunk()
ntdll.dll!RtlUserThreadStart()

Repro

This seems to mostly trigger when the box is under more load than it should be, the troubles on the Blender CI farm only started when we allowed ctest to run tests concurrently (one of the CI envs dropped to less than 1/3 of the time needed for tests, so while the oversubscription is certainly not an ideal situation, when shaving off 10+ minutes of a test run it's hard to argue with the time savings and additional CI builds we are able to do in a day)

So we need to make some synthetic load that mimics the CPU oversubscription... which... is not that bad actually... make a primitive spinlock.... make N threads, have them fight over it while there is no chance for them to ever get it.

// BusyBee.cpp
#include <vector>
#include <thread>
#include <Windows.h>

LONG busy = 1;

int main(int argc, char** argv)
{
    if (argc != 2)
    {
        printf("usage %s number_of_threads\n", argv[0]);
        return 0;
    }
    int threads = atoi(argv[1]);
    std::vector<std::thread> v(threads);
    for (auto& t : v)
        t = std::thread([]
            {
                while (InterlockedCompareExchange(&busy, 1, 2) == 1);
        });
    while (true); // don't worry about teardown, ctrl-c will do it and instantly drop the load to 0 
}

compile, and just start a BusyBee 16 on a console somewhere (You may need to adjust that 16 to be lower/higher depending on the number of cores on the box you are on, go too high and idiff will run too slowly to have a good throughput in the number of starting/stopping processes , too low and you won't trigger the issue, on my i7-3770 8 is the magic number)

So that takes care of the load, next up idiff, since this bug is kind of flaky to catch the only way to really catch it run many many many idiffs until one gets stuck then dump the process or attach a debugger. I used candle_glass.exr from the cycles test-suite since it's a relatively small file so it's quick for idiff to test.

from there just make a batch file

rem iloop.cmd
:loop_start
..\idiff candle_glass.exr candle_glass.exr 
goto loop_start

from a second console, start iloop.cmd and wait.....this will trigger it on my environment usually in under a minute , I feel randomly stopping busybee and restarting it sometimes helps but I'll be honest, I may very well be imagining that

@lgritz
Copy link
Collaborator Author

lgritz commented Apr 17, 2023

I agree with you that the true, final fix is going to need to be on the OpenEXR side.

Do you think this PR still has merit and should be merged? If we believe that the previous thread shutdown in oiiotool helped with at least some of the occurrences, this should bring the same benefit to all the OIIO apps and also other apps that use OIIO to read exr files (I'm still interested in any reduction of the problem, including in cases where users have no choice but to build OIIO against an older OpenEXR that doesn't contain the fix, once it is fixed). Or do you think this change just isn't worth the trouble at all and should be abandoned?

@LazyDodo
Copy link
Contributor

That's a tough one....

Going off my personal experience I'd say don't merge it, as it doesn't appear to offer me any relief. On the other hand if people have reported relief from merging PR 3582 who am I to argue with them? Having this fix applied to all tools is without a doubt better than just oiiotool even if it's just limited relief.

@lgritz
Copy link
Collaborator Author

lgritz commented Apr 17, 2023

Can you think of any reason why accepting would be worse than what we have now? Maybe it does nothing, at worst? But if the oiiotool side ever helped, this will do the same for the other apps.

@LazyDodo
Copy link
Contributor

I see no way it could be worse, i still think it doesn't do anything, but in the odd chance it actually does, it's good to have it equally applied to all tools and not just oiiotool.

One thing of note is when I ran idiff idiff candle_glass.exr candle_glass.exr -o test.exr it ran the atexit() printf's twice, i don't see any harm in it, but still figured i'd mention it.

@lgritz
Copy link
Collaborator Author

lgritz commented Apr 18, 2023

Oh, I should probably do a call_once to ensure it's not called multiple times, even though it probably wouldn't be harmful. I will fix that and update the PR.

We know that there are some corner cases on Windows when it can be
unsafe to exit because the Ilm thread pool (which is used implicitly
by OpenEXR) hangs. The exact conditions that cause this are not well
understood.

Some time back, in PR 3582, we added an explicit shutdown in oiiotool
right before we exit, to alleviate the problem. But this doesn't help
for other apps like idiff, nor for apps that use libOpenImageIO and
aren't aware of the problem, so wouldn't know to do this.

This patch changes to a different strategy: In set_exr_threads(),
which will always be called any time we use openexr files, use
`std::atexit()` to ensure that when the program terminates -- any
program using OpenImageIO in cases when openexr was used -- we set the
IlmThread global thread pool size to 0 to shut it down cleanly.
Hopefully, that will cover even more cases where this OpenEXR bug
might be tickled.
@lgritz
Copy link
Collaborator Author

lgritz commented Apr 18, 2023

Ah, yes, quite broken. Depending on how many times you read or write exr files, you could get a whole bunch of those atexit calls!

PR updated.

@lgritz
Copy link
Collaborator Author

lgritz commented Apr 21, 2023

Any final objections to this, especially now that I updated to ensure it would never call it more than once?

@LazyDodo
Copy link
Contributor

Oh no, none what so ever! sorry I wasn't aware this was stalled because of me.

@lgritz
Copy link
Collaborator Author

lgritz commented Apr 21, 2023

Oh, no, not specifically you. I just wanted to be cautious about merging if anybody might have concerns, especially considering that this is a bit of a speculative halfway workaround for a problem that really is in another project.

@lgritz lgritz merged commit eb2ceba into AcademySoftwareFoundation:master Apr 22, 2023
@lgritz lgritz deleted the lg-exrthreads branch April 23, 2023 06:16
@LazyDodo LazyDodo mentioned this pull request Jun 13, 2023
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants