
Threadpool deadlocks during shutdown on Windows #890

Open
debaetsd opened this issue Jan 6, 2021 · 52 comments

@debaetsd

debaetsd commented Jan 6, 2021

When using the C exit() function, the IlmThread ThreadPool deadlocks during shutdown on Windows.

This is happening because the worker threads have not run yet (and so have not posted on the "started" semaphore) when they get killed.
During shutdown, the pool waits on that semaphore once per worker, so it deadlocks when not every worker managed to post.
This is especially noticeable when oversubscribing threads (for example, 10 threads on a 2-core machine).
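
The handshake that goes wrong looks roughly like this (a simplified sketch of the startup/shutdown protocol described above, with assumed names; not the actual IlmThread source):

    #include <cstddef>
    #include "IlmThreadSemaphore.h" // IlmThread_::Semaphore

    struct Data
    {
        IlmThread_::Semaphore threadSemaphore; // the "started" handshake
        std::size_t           numThreads;
    };

    // Each worker posts exactly once when it starts running...
    void workerEntry (Data* data)
    {
        data->threadSemaphore.post (); // never happens if the OS kills the
                                       // thread before it is ever scheduled
        // ... process tasks ...
    }

    // ...and shutdown waits for one post per worker.
    void shutdown (Data* data)
    {
        for (std::size_t i = 0; i < data->numThreads; ++i)
            data->threadSemaphore.wait (); // blocks forever if any worker
                                           // never got to post
    }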

I have never seen it deadlock when calling exit from the main thread (though I'm not sure whether that means it cannot happen or is just random coincidence), and Windows seems to be the only platform affected.

This was initially discovered when trying to get the OpenImageIO test suite to run on Windows, but the following code can be used to reproduce it (you might have to run it a couple of times because, well, threading :) )

#include <cstdlib>
#include <thread>

#include <ImfThreading.h>

int main(int argc, char** argv)
{
    std::thread t([]() { exit(-1); });
    Imf::setGlobalThreadCount(5);
    t.join();
}
@meshula
Contributor

meshula commented Jan 6, 2021

exit() is defined to terminate an entire process, including child threads, and transfer control to the host environment. exit() does a certain amount of cleanup that is largely implementation-specific. It also doesn't terminate instantaneously, but causes termination at "some time" in the future. A likely fatal execution order is:

1. main stdlib startup
2. initialize t, queue lambda for execution
3. run lambda
4. invoke exit
5. clean up stdlib and other execution state
6. leave thread
7. stdlib already cleaned up, thread library in undefined state
8. Imf::setGlobalThreadCount locks a mutex
9. sadness
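
To make the ordering concrete, here is a standalone sketch (not from the issue itself) of exit() tearing down process state on one thread while another thread is still alive:

    #include <cstdio>
    #include <cstdlib>
    #include <thread>

    int main ()
    {
        // atexit handlers (and static destructors) run inside exit(), on
        // whichever thread happens to call it.
        std::atexit ([] { std::puts ("atexit handler: CRT cleanup running"); });

        // exit() from a secondary thread: the CRT tears down process state
        // while main() is still blocked in join() below.
        std::thread t ([] { std::exit (1); });

        t.join (); // likely never reached; the process exits first
    }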

Are you using exit(-1) in the thread to force the deadlock you see, or is there actually an exit involved in your test scenario?

@debaetsd
Author

debaetsd commented Jan 6, 2021

Some of the tools inside OIIO do call exit upon certain failures (and the tests invoke those tools).
Setting aside the fact that OIIO should strive for a cleaner error strategy, this can be an issue.

The sample above is just a repro case to trigger it, and gives the following call stack:

    IlmThread-2_5.dll!IlmThread_::Semaphore::wait() Line 110
    IlmThread-2_5.dll!IlmThread_::`anonymous namespace'::DefaultThreadPoolProvider::finish() Line 428
    IlmThread-2_5.dll!IlmThread_::ThreadPool::Data::~Data() Line 569
    IlmThread-2_5.dll!IlmThread_::ThreadPool::~ThreadPool() Line 738
    [External Code]
    Test01.exe!main::__l2::<lambda>() Line 98

Again, I'm not entirely sure the exit should be invoked from within that additional thread; it is just the easiest way to repro it.

I toyed a bit with it, and testing whether the worker threads are alive inside DefaultThreadPoolProvider::finish (and, if not, skipping the wait on the semaphore) seems able to fix it, though I'm not sure that is the correct solution.
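
A rough sketch of that idea, with a hypothetical workerStillAlive helper (this is the shape of the change, not the tested patch):

    size_t curT = _data.threads.size ();
    for (size_t i = 0; i != curT; ++i)
    {
        // Only hand a shutdown task to workers that are actually running;
        // a killed worker will never post the semaphore back.
        if (workerStillAlive (*_data.threads[i])) // hypothetical check
        {
            _data.taskSemaphore.post ();
            _data.threadSemaphore.wait ();
        }
    }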

@meshula
Contributor

meshula commented Jan 6, 2021

Can you try this?

Right now there's this code in IlmThreadPool.cpp:

    size_t curT = _data.threads.size();
    for (size_t i = 0; i != curT; ++i)
    {
        _data.taskSemaphore.post();
        _data.threadSemaphore.wait();
    }

    //
    // Join all the threads
    //
    for (size_t i = 0; i != curT; ++i)
    {
        _data.threads[i]->join();
        delete _data.threads[i];
    }

instead:

    size_t curT = _data.threads.size();
    for (size_t i = 0; i != curT; ++i)
    {
        if (_data.threads[i]->joinable()) {
            _data.taskSemaphore.post();
            _data.threadSemaphore.wait();
        }
    }

    //
    // Join all the threads
    //
    for (size_t i = 0; i != curT; ++i)
    {
        if (_data.threads[i]->joinable()) {
            _data.threads[i]->join();
        }
        delete _data.threads[i];
    }

@debaetsd
Author

I'll give it a try

@cary-ilm
Member

@kdt3rd, will your core rewrite address this?

@lgritz
Contributor

lgritz commented Jan 15, 2021

This is biting my CI badly right now (only on Windows) and I'm struggling to find a workaround. Would be nice to fix this for 3.0 (and for any future 2.5 we may have as well).

@meshula
Contributor

meshula commented Jan 15, 2021

@lgritz Are you able to check if my patch helps?

@lgritz
Contributor

lgritz commented Jan 16, 2021

Um... I don't have a Windows machine, but I can try to rig up my CI to pull and build OpenEXR from your branch and see if that helps. Will try to get to that over the weekend.

@meshula
Contributor

meshula commented Jan 16, 2021

@lgritz, I haven't pushed the proposed patch to my fork... I'd like to see this sorted out as well, so I can try it in the next couple of days if no one beats me to it.

@debaetsd
Author

Sorry for not getting back to you earlier about this.
I did a quick test with that patch (https://github.com/debaetsd/openexr/tree/winthread), and the dummy repro case from above seems to work now.
Testing it inside OIIO at the moment.

@meshula
Contributor

meshula commented Jan 16, 2021

Ah, thanks very much, I appreciate it :)

@debaetsd
Author

So it turned out that this patch by itself was not fixing the issue :(

In the end, I did find a solution (more in line with my initial thought of "check if the threads are still alive").
This is, however, a Windows-specific solution at the moment, as I don't think it can be done using just std::thread (though since this seems to be a Windows-only issue, that might not be a problem?).
It makes the OIIO Windows CI pass (well, to be fair, not exactly pass, but not deadlock; see https://github.com/debaetsd/oiio/runs/1714550750. It required a bit of whack-a-mole code changes, as OIIO + OpenEXR master does not compile on Windows).

So now the question remains what the way forward is.
I could clean up my changes (master...debaetsd:winthread), though somebody here would need to validate them, both on an "is it OK to have some Windows-specific stuff in here" level and on a higher "does this break the expected behavior of the ThreadPool class" level.

Let me know how you wish to proceed with this.

@lgritz
Contributor

lgritz commented Jan 16, 2021

I'm still mystified about why this is popping up now, fairly reliably for you and on the Windows CI on GitHub, when none of the Windows users of OIIO have previously reported this problem (nor any users of OpenEXR, to the best of my knowledge). I guess as far as OpenEXR goes, a change like this is probably the right idea (though I don't know enough about Windows threads to properly review it), and perhaps in addition to having the fix in 3.0 we should backport it to one last 2.5 patch. But it does bug me that this means we can't build and test OIIO against OpenEXR <= 2.4 on Windows. I wish there were some kind of (even awkward) OIIO-side workaround that would work even for older OpenEXR, without requiring the patch in OpenEXR itself.

@debaetsd
Author

To be honest, I do not know why it suddenly started being an issue; I have been using OpenEXR for years and never saw it until I started looking at the OIIO CI. Given that this sits somewhere in C++ undefined-behavior territory, it might even be something that Windows changed relatively recently.

@peterhillman
Contributor

I added the 'join' logic to IlmThreadPool to address use-after-free issues (see PR #828). It's possible that's where this issue started to show up.

@lgritz
Contributor

lgritz commented Jan 17, 2021

@peterhillman The OIIO CI tests where this is consistently showing up are building against OpenEXR 2.4.1, which was tagged earlier than #828 was introduced.

@peterhillman
Contributor

@lgritz well that's me off the hook then

@meshula
Contributor

meshula commented Jan 18, 2021

@debaetsd The proposed patch looks robust. I can't shake the feeling, though, that there's a fundamental problem with the design of ThreadPool itself. I haven't had time to look at it in depth, so it's more of a Spidey Sense tingle at the moment.

@debaetsd
Author

I just discovered that OIIO has a near-identical Windows workaround during thread_pool shutdown: AcademySoftwareFoundation/OpenImageIO#2038. While this somewhat validates my fix, I have the same feeling that it is more a workaround for a higher-level issue.

@lgritz
Contributor

lgritz commented Feb 2, 2021

What's the current thinking on this? Is @meshula's proposed fix helpful, and should it be turned into a PR?
Does that totally fix the problem, or are there other issues still lurking?

@cary-ilm
Member

cary-ilm commented Feb 9, 2021

I have a 2.5.5 release ready to go; do we want to include a fix for this issue?

@meshula
Contributor

meshula commented Feb 9, 2021

My proposed fix does solve the problem of signaling more threads than exist and attempting to join threads that are not joinable, both of which are potential deadlocking errors. Whether or not it solves all scenarios, I do believe the code is more correct than what's currently checked in. I can't spot or think of any races that would make the fix invalid, so my vote is to land it. I'd submit a PR, but my CLA is still mulching its way through the EasyCLA grinder :]

@cary-ilm
Member

Take a look at #921. The tests pass, but I don't have a way to test on Windows.

@lgritz
Contributor

lgritz commented Feb 15, 2021

I'm sorry to say that this does not fix all of the inside-OpenEXR deadlocks that the OIIO CI was experiencing on Windows.

@meshula
Contributor

meshula commented Feb 15, 2021

does or does not fix?

@lgritz
Contributor

lgritz commented Feb 15, 2021

does NOT. Sorry for the typo

@meshula
Contributor

meshula commented Feb 16, 2021

Are the deadlocks shutdown-related, or elsewhere? This patch only addresses shutdown.

@lgritz
Contributor

lgritz commented Feb 16, 2021

Yes, only happens at shutdown. Only Windows.

@debaetsd
Author

So I finally got around to looking into this some more.

The reason the proposed fix in #921 does not work is that the MSVC std::thread implementation of joinable only checks whether the object still holds the valid (integer) thread ID it received when creating the thread (see thrdaddr on https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/beginthread-beginthreadex?view=msvc-160). That check cannot account for the OS killing the thread: the stored value is just an integer ID, so it is never updated when the thread dies.
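
The kind of OS-level liveness check this implies might look like the following (a hedged sketch, assuming MSVC, where std::thread::native_handle() returns the Win32 HANDLE; this is not the actual patch):

    #include <windows.h>
    #include <thread>

    // Ask the OS, rather than the std::thread bookkeeping, whether the
    // thread is still running.
    bool threadStillAlive (std::thread& t)
    {
        HANDLE h = static_cast<HANDLE> (t.native_handle ());

        // A thread handle becomes signaled when the thread terminates, so
        // a zero-timeout wait returning WAIT_TIMEOUT means "still running".
        return WaitForSingleObject (h, 0) == WAIT_TIMEOUT;
    }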

I have found a fix for this issue, though it does change the (high-level) behavior of the thread pool somewhat:

At the moment, when "launching" the threads (for example, by calling setGlobalThreadCount), they are started without any acknowledgment that they have actually run; it is pretty much fire-and-forget. It is only during finish that this is checked:

// If we do not wait before we destroy the threads then it's

However, this fails in our case because the OS has already killed the thread (without it ever having run, so it never posted on that semaphore --> deadlock).

My proposed fix is to change this behavior by explicitly syncing on that semaphore when launching the threads, so that by the time setGlobalThreadCount returns, we know for sure that the threads themselves are running. This also means we no longer have to do the sync during finish.
See debaetsd@a2ea041 for a quick-and-dirty implementation of this.
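
In outline, the launch-time handshake would look something like this (assumed names following the existing code; a sketch of the idea, not the exact commit):

    void DefaultThreadPoolProvider::setNumThreads (int count)
    {
        // Each DefaultWorkerThread posts _data.threadSemaphore as soon as
        // its thread function actually starts executing.
        for (int i = 0; i < count; ++i)
            _data.threads.push_back (new DefaultWorkerThread (&_data));

        // Block until every worker has acknowledged startup, so the
        // "started" handshake no longer has to happen in finish().
        for (int i = 0; i < count; ++i)
            _data.threadSemaphore.wait ();
    }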

This does change the behavior somewhat: setGlobalThreadCount will now wait on those threads, and that could affect performance expectations. For example, where increasing the number of threads is currently a non-syncing event, my proposed fix could make it slower (it has to wait for the semaphore to acknowledge that the threads have started).

So the question is whether this is acceptable. IMHO it is: setGlobalThreadCount is a non-trivial function to begin with, so it potentially being slow is not that surprising.

What is the consensus here? Should I continue with that proposed solution, or are the potentially changed performance characteristics not acceptable?

@meshula
Contributor

meshula commented Feb 26, 2021

Your proposed change adds an explicit contract that after calling setGlobalThreadCount, the global thread count will reflect the request. I like the idea, and your sketch looks reasonable. Does it need any kind of bulletproofing to prevent deadlocking? In other words, are there failure modes (setting zero threads? setting thousands of threads?) that could result in not all the threads being joinable, and thus setGlobalThreadCount looping forever?

@debaetsd
Author

debaetsd commented Mar 2, 2021

I would probably add an additional if (_data.hasThreads) check around the loop (though this might not strictly be needed).

About bulletproofing, I don't think it needs any additional coverage for edge cases (like 0 or 10000 threads). The only real way I see it failing is if, for example, new DefaultWorkerThread were to fail (out of memory?), though that would be rather catastrophic anyway. If the threads themselves are not able to start, joinable should catch that.
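
A sketch of that guard, reusing names from the earlier sketch (hasThreads is the member suggested above, an assumption rather than existing code):

    // Skip the startup handshake entirely when no workers were created
    // (for example, setGlobalThreadCount(0)).
    if (_data.hasThreads)
    {
        for (int i = 0; i < count; ++i)
            _data.threadSemaphore.wait ();
    }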

@lgritz
Contributor

lgritz commented Oct 5, 2022

Ping on this! Anybody feel comfortable passing judgment on this PR?

@LazyDodo

Another ping: we recently allowed ctest to run tests in parallel on the Blender CI farm, which triggers the thread-oversubscription scenario the opening post mentions and causes idiff to occasionally stall with the following call stack

0:000> k
 # Child-SP          RetAddr               Call Site
00 000000ff`924ff268 00007ffd`f10e306e     ntdll!NtWaitForSingleObject+0x14
01 000000ff`924ff270 00007ffd`e068a67a     KERNELBASE!WaitForSingleObjectEx+0x8e
02 000000ff`924ff310 00007ffd`e0687b54     IlmThread_d!IlmThread_3_1::Semaphore::wait+0x2a
03 000000ff`924ff4f0 00007ffd`e0691cb1     IlmThread_d!std::_Atomic_integral_facade<int>::fetch_sub+0x124
04 000000ff`924ff540 00007ffd`37662931     IlmThread_d!`dllmain_dispatch'::`1'::filt$0+0x10d1
05 000000ff`924ff570 00007ffd`37662105     ucrtbased!<lambda_d121dba8a4adeaf3a9819e48611155df>::operator()+0x131 [minkernel\crts\ucrt\src\appcrt\startup\onexit.cpp @ 206] 
06 000000ff`924ff600 00007ffd`37662257     ucrtbased!__crt_seh_guarded_call<int>::operator()<<lambda_6a47f4c8fd0152770a780fc1d70204eb>,<lambda_d121dba8a4adeaf3a9819e48611155df> &,<lambda_6aaa2265f5b6a89667e7d7630012e97a> >+0x35 [VCCRT\vcruntime\inc\internal_shared.h @ 204] 
07 000000ff`924ff640 00007ffd`37662b44     ucrtbased!__acrt_lock_and_call<<lambda_d121dba8a4adeaf3a9819e48611155df> >+0x57 [minkernel\crts\ucrt\inc\corecrt_internal.h @ 974] 
08 000000ff`924ff6a0 00007ffd`e068b019     ucrtbased!_execute_onexit_table+0x34 [minkernel\crts\ucrt\src\appcrt\startup\onexit.cpp @ 231] 
09 000000ff`924ff6f0 00007ffd`e068bcdc     IlmThread_d!__scrt_dllmain_uninitialize_c+0x19 [d:\a01\_work\3\s\src\vctools\crt\vcstartup\src\utility\utility.cpp @ 399] 
0a 000000ff`924ff720 00007ffd`e068bab7     IlmThread_d!dllmain_crt_process_detach+0x4c [d:\a01\_work\3\s\src\vctools\crt\vcstartup\src\startup\dll_dllmain.cpp @ 182] 
0b 000000ff`924ff760 00007ffd`e068be4e     IlmThread_d!dllmain_crt_dispatch+0x67 [d:\a01\_work\3\s\src\vctools\crt\vcstartup\src\startup\dll_dllmain.cpp @ 220] 
0c 000000ff`924ff7a0 00007ffd`e068bfc1     IlmThread_d!dllmain_dispatch+0xfe [d:\a01\_work\3\s\src\vctools\crt\vcstartup\src\startup\dll_dllmain.cpp @ 293] 
0d 000000ff`924ff7f0 00007ffd`f3709a1d     IlmThread_d!_DllMainCRTStartup+0x31 [d:\a01\_work\3\s\src\vctools\crt\vcstartup\src\startup\dll_dllmain.cpp @ 335] 
0e 000000ff`924ff820 00007ffd`f374dcec     ntdll!LdrpCallInitRoutine+0x61
0f 000000ff`924ff890 00007ffd`f374da7d     ntdll!LdrShutdownProcess+0x24c
10 000000ff`924ff990 00007ffd`f26de86b     ntdll!RtlExitUserProcess+0xad
11 000000ff`924ff9c0 00007ffd`37661bba     kernel32!ExitProcessImplementation+0xb
12 000000ff`924ff9f0 00007ffd`37661b65     ucrtbased!exit_or_terminate_process+0x3a [minkernel\crts\ucrt\src\appcrt\startup\exit.cpp @ 144] 
13 000000ff`924ffa20 00007ffd`37661f06     ucrtbased!common_exit+0x85 [minkernel\crts\ucrt\src\appcrt\startup\exit.cpp @ 280] 
14 000000ff`924ffa80 00007ff7`a81e2437     ucrtbased!exit+0x16 [minkernel\crts\ucrt\src\appcrt\startup\exit.cpp @ 294] 
15 000000ff`924ffab0 00007ff7`a81e22de     idiff!__scrt_common_main_seh+0x147 [d:\a01\_work\3\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 297] 
16 000000ff`924ffb20 00007ff7`a81e260e     idiff!__scrt_common_main+0xe [d:\a01\_work\3\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 331] 
17 000000ff`924ffb50 00007ffd`f26d7614     idiff!mainCRTStartup+0xe [d:\a01\_work\3\s\src\vctools\crt\vcstartup\src\startup\exe_main.cpp @ 17] 
18 000000ff`924ffb80 00007ffd`f37426a1     kernel32!BaseThreadInitThunk+0x14
19 000000ff`924ffbb0 00000000`00000000     ntdll!RtlUserThreadStart+0x21

I haven't tried porting the mitigation oiio#3582 from oiiotool over to idiff yet, but a fix at the root of the problem is likely the way to go here.

@meshula
Contributor

meshula commented Apr 14, 2023

@LazyDodo Where we left this one was @lgritz asking if anyone has a judgement on the PR. Are you in a position to test this PR in your scenario? If it fixes the problem, then we should land this patch.

@LazyDodo

I'm up for testing anything. Are you referring to the quick-and-dirty implementation posted by @debaetsd, or is there an actual PR somewhere?

@meshula
Contributor

meshula commented Apr 15, 2023

Yes, debaetsd/openexr@a2ea041 ~ when I made my comment I misunderstood; Larry was asking for comments on the PR against OIIO, whereas I thought there was a PR associated with debaetsd's patch.

If you could give it a shot, that would be great.

@lgritz
Contributor

lgritz commented Apr 15, 2023

Do you think this will help on the OIIO side? AcademySoftwareFoundation/OpenImageIO#3805

@LazyDodo

Feels like it could. I'll make a build of it and pass it to the devops guys; it could be a few days to get a final verdict.

@LazyDodo

Scratch that, I was able to repro a stall even on my local machine, so AcademySoftwareFoundation/OpenImageIO#3805 does not appear to be a viable workaround. Given that it is more of an OIIO workaround, I'll pick it up in the comment section of 3805 with some more details.

Do note that even though we may be able to work around this issue on the OIIO side, I still feel this problem is ultimately best fixed on the OpenEXR side.

@LazyDodo

LazyDodo commented Apr 17, 2023

debaetsd@a2ea041 made it through testing on my local machine, where I could generally trigger the issue in a minute or so. I've deployed it to our CI environment; let's give it a few days there to see if the results hold up.

@kdt3rd
Contributor

kdt3rd commented Apr 17, 2023

There is also #1291, which I made based on Dieter's patch but simplified / changed to be how I liked it, although I don't remember why it was failing.

@meshula
Contributor

meshula commented Apr 17, 2023

If 1291 works, we can go with that, since you wrote there that you'd refined the patch.

@LazyDodo

I can confirm that both #1291 and debaetsd@a2ea041 appear to work on my local machine, where it was generally easy to repro this problem in under a minute.

@meshula
Contributor

meshula commented Apr 18, 2023

I see that 1291 is failing all the tests.

I would like to land 1291; is it possible to resolve the test failures there first? Do we know whether those failures have anything to do with the change, or whether they are mystery CI timeouts?

@kdt3rd
Contributor

kdt3rd commented Apr 18, 2023

@meshula - I am seeing a consistent abort / segfault on older VFX platform 19 builds under CentOS 7, which I am unable to recreate on my normal dev system and can't begin to explain. I've been continuing to "simplify" the default thread pool by making it more and more "modern" C++ that should be safe from crashes. Next step: I need to spin up a Docker container / VM and attempt to recreate it. Will get it fixed tomorrow.

@meshula
Contributor

meshula commented Apr 19, 2023

In the spirit of current year + 3, why don't we remove platform 19 (unless that's a requirement for your site?)

@lgritz
Contributor

lgritz commented Apr 19, 2023

2019 is the last year to "officially" specify Python 2.7, but lots of us still have Python 2.7 in our active pipelines at the moment, so we can't let it go quite yet.

kdt3rd added a commit that referenced this issue Apr 22, 2023
Fix Issue #890, an issue with Windows shutdown when exiting quickly, prior to
threads actually starting. We do this by having a data block that is passed
by shared_ptr to the thread, to avoid dereferencing a deleted object.

Further, greatly simplify the ThreadPool code by using atomic shared_ptr
functions instead of trying to manually implement something similar, and
otherwise modernize the thread code.

Fix a few potential null-dereference locations, and also fix an issue when
systems are overloaded that allowed the TaskGroup destructor to destroy
the semaphore while code was still using it, causing undefined memory
corruption if some other thread immediately allocates that same block.

Originally based on a proposed patch by Dieter De Baets @debaetsd

Signed-off-by: Kimball Thurston <kdt3rd@gmail.com>
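
The shared_ptr lifetime pattern the commit message describes can be illustrated roughly like this (an illustrative sketch with assumed names, not the actual OpenEXR code):

    #include <memory>
    #include <thread>

    struct PoolData
    {
        // semaphores, task queue, shutdown flags, ...
    };

    std::thread startWorker (std::shared_ptr<PoolData> data)
    {
        // The worker captures its own shared_ptr copy, so the data block
        // stays alive until the last holder releases it, even if the pool
        // object itself is destroyed first during process shutdown.
        return std::thread ([data] () {
            // *data is safe to use for the worker's whole lifetime
        });
    }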
@kdt3rd
Contributor

kdt3rd commented Apr 22, 2023

OK, I believe I have merged the PR that fixes this. I ended up making the thread pool much simpler while tracking down what turned out to be a separate (hidden) undefined behavior in TaskGroup, but hopefully I haven't regressed the fix for this. @LazyDodo, if you could confirm in the setup where you could repro the hang that my fix, now on main, works, I would appreciate it.

@meshula
Contributor

meshula commented Apr 22, 2023

Woohoo! Congrats on working through this

@LazyDodo

Built 358eec4 and ran it against the repro posted in AcademySoftwareFoundation/OpenImageIO#3805 for a couple of hours: no issues! Just to be sure it was @kdt3rd's commit that fixed it, I went back one commit (67f7d05), and that got stuck in under a minute. So it appears to do the job; the CI env will have to give the final verdict, though.

I may be pushing my luck here a little bit, but any chance of a backport and release for OpenEXR 3.1.x?

@meshula
Contributor

meshula commented Apr 22, 2023

I agree, this seems like a good thing to have in a 3.1.8...

@kdt3rd
Contributor

kdt3rd commented Apr 23, 2023

It won't have changed the ABI of public objects at all, and it is contained. I will cherry-pick it across, but it will need to wait for @cary-ilm to actually release.

kdt3rd added a commit that referenced this issue Apr 23, 2023