Threadpool deadlocks during shutdown on Windows #890
exit() is defined to terminate the entire process, including child threads, and transfer control to the host environment. exit() does a certain amount of cleanup that is largely implementation specific, and it doesn't terminate instantaneously, but causes termination at "some time" in the future. A likely fatal execution order is: exit() is called, the runtime kills the worker threads before they ever run, and static destruction then tears down the ThreadPool, which waits forever on a semaphore the dead workers can never post.
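A tiny standalone illustration of that cleanup phase (not from the original discussion, just plain standard C++):

```cpp
#include <cstdio>
#include <cstdlib>

// exit() runs atexit handlers and static destructors before the process
// ends, so "shutdown" code executes while any other threads may already
// have been killed by the OS.
struct Global
{
    ~Global () { std::puts ("static destructor runs inside exit()"); }
} g;

int main ()
{
    std::atexit ([] { std::puts ("atexit handler runs inside exit()"); });
    std::exit (0); // not instantaneous: the cleanup above still runs
}
```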
Are you using exit(-1) in the thread to force the deadlock you see, or is there actually an exit involved in your test scenario? |
Some of the tools inside OIIO do call exit upon certain failures (and the tests invoke those tools). The sample above is just a repro case to trigger it, and gives the following callstack:

```
IlmThread-2_5.dll!IlmThread_::Semaphore::wait() Line 110
IlmThread-2_5.dll!IlmThread_::`anonymous namespace'::DefaultThreadPoolProvider::finish() Line 428
IlmThread-2_5.dll!IlmThread_::ThreadPool::Data::~Data() Line 569
IlmThread-2_5.dll!IlmThread_::ThreadPool::~ThreadPool() Line 738
[External Code]
Test01.exe!main::__l2::<lambda>() Line 98
```

Again, I'm not entirely sure exit should be invoked from within that additional thread; it is just the easiest way to repro. I toyed a bit with it, and testing whether the worker threads are alive inside DefaultThreadPoolProvider::finish (and if not, not waiting on the semaphore) seems to be able to fix it, though I'm not sure that is the correct solution. |
Can you try this? Right now there's this code in IlmThreadPool.cpp:

```cpp
size_t curT = _data.threads.size();
for (size_t i = 0; i != curT; ++i)
{
    _data.taskSemaphore.post();
    _data.threadSemaphore.wait();
}

//
// Join all the threads
//
for (size_t i = 0; i != curT; ++i)
{
    _data.threads[i]->join();
    delete _data.threads[i];
}
```

Instead:

```cpp
size_t curT = _data.threads.size();
for (size_t i = 0; i != curT; ++i)
{
    // Only hand a "please exit" token to threads that can still take one.
    if (_data.threads[i]->joinable())
    {
        _data.taskSemaphore.post();
        _data.threadSemaphore.wait();
    }
}

//
// Join all the threads
//
for (size_t i = 0; i != curT; ++i)
{
    // Joining a non-joinable thread is an error; guard it.
    if (_data.threads[i]->joinable())
    {
        _data.threads[i]->join();
    }
    delete _data.threads[i];
}
```
|
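For context on the joinable() guards above: std::thread::joinable() returns false once a thread has been joined or detached (or was default-constructed), and calling join() on a non-joinable thread throws std::system_error, so the checks skip workers that can no longer take part in the semaphore handshake.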
I'll give it a try |
@kdt3rd, will your core rewrite address this? |
This is biting my CI badly right now (only on Windows) and I'm struggling to find a workaround. WBN to fix for 3.0 (and any future 2.5 we may have as well). |
@lgritz Are you able to check if my patch helps? |
Um... I don't have a Windows machine, but I can try to rig up my CI to pull and build OpenEXR from your branch and see if that helps. Will try to get to that over the weekend. |
@lgritz, I haven't pushed the proposed patch to my fork... I'd like to see this sorted out as well, so I can try it in the next couple of days if no one beats me to it. |
Sorry for not getting back earlier about this. |
Ah, thanks very much, I appreciate it :) |
So it turned out that this patch by itself did not fix the issue :( In the end, I did find a solution (more in line with my initial thought of "check if the threads are still alive"). So now the question remains what the way forward is. Let me know how you wish to proceed with this. |
I'm still mystified about why this is popping up now, fairly reliably for you and on the Windows CI on GH, when none of the Windows users of OIIO have previously reported this problem (nor any users of OpenEXR, to the best of my knowledge). I guess as far as OpenEXR goes, a change like this is probably the right idea (though I don't know enough about Windows threads to properly review it), and perhaps in addition to having the fix in 3.0, we should backport it to one last 2.5 patch. But it does bug me that it means we can't build and test OIIO against OpenEXR <= 2.4 on Windows; I wish there were some kind of (even awkward) OIIO-side workaround that would work even for older OpenEXR and/or without requiring the patch in OpenEXR itself. |
tbh I do not know why it suddenly started being an issue; I have been using OpenEXR for years and never saw it until I started looking at the OIIO CI. Given that this is somewhat in C++ undefined-behavior territory, it might even be something that Windows changed fairly recently. |
I added the 'join' logic to IlmThreadPool to address use-after-free issues (see PR #828). It's possible that's where this issue started to show up. |
@peterhillman The OIIO CI tests where this is consistently showing up are building against OpenEXR 2.4.1, which was tagged before #828 was introduced. |
@lgritz well that's me off the hook then |
@debaetsd The proposed patch looks robust. I can't shake the feeling, though, that there's a fundamental problem with the design of ThreadPool itself. I haven't had time to look at it in depth, so it's more of a Spidey Sense tingle at the moment. |
I just discovered that OIIO has a near-identical Windows workaround during thread_pool shutdown: AcademySoftwareFoundation/OpenImageIO#2038. While this somewhat validates my fix, I have the same feeling that it is more a workaround for a higher-level issue. |
What's the current thinking on this? Is @meshula's proposed fix helpful, and should it be turned into a PR? |
I have a 2.5.5 release ready to go, do we want to include a fix for this issue? |
My proposed fix does solve the problem of signaling more threads than exist, and attempting to join threads which are not joinable, both of which are potential deadlocking errors. Whether or not it solves all scenarios, I do believe the code is more correct than what's currently checked in. I can't spot or think of any races that would make the fix invalid. So my vote is to land the fix. I'd submit a PR but my CLA is still mulching its way through the EasyCLA grinder :] |
Take a look at #921. The tests pass, but I don't have a way to test on Windows. |
I'm sorry to say that this does not fix all of the inside-OpenEXR deadlocks that the OIIO CI was experiencing on Windows. |
does or does not fix? |
does NOT. Sorry for the typo |
are the deadlocks shutdown related, or elsewhere? This patch is only shutdown related. |
Yes, only happens at shutdown. Only Windows. |
So I finally got around to looking more into this. The reason the proposed fix in #921 does not work comes down to the MSVC std::thread implementation. I have found a fix for this issue, though it changes the (high-level) behavior of the threadpool somewhat. At the moment, when "launching" the threads (see openexr/src/lib/IlmThread/IlmThreadPool.cpp, line 382 at 6337740), the pool does not wait for the workers to actually start running; each worker is expected to post the "started" semaphore on its own.
However, this fails in this case because the OS already killed the thread (without it ever having run, so it never posted to that semaphore), which leads to the deadlock. My proposed fix is to change this behavior by explicitly syncing on that semaphore when launching the threads, so that by the time the launch returns, every worker is known to have started. This however changes the behavior somewhat. So the question is whether this is acceptable; IMHO it is. What is the consensus here: should I continue with that proposed solution, or are the potential changed performance characteristics not acceptable? |
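A rough standalone sketch of that "sync at launch" idea (hypothetical names and plain C++20, not the actual IlmThread code):

```cpp
#include <semaphore>
#include <thread>
#include <vector>

class Pool
{
    std::counting_semaphore<> started {0};
    std::vector<std::thread>  workers;

    void run () { /* task loop elided */ }

public:
    void launch (int n)
    {
        for (int i = 0; i < n; ++i)
            workers.emplace_back ([this] {
                started.release (); // very first action: signal "I'm alive"
                run ();
            });

        // Block until every worker has actually begun executing. After
        // this returns, shutdown can safely wait for exactly n posts,
        // because each one has already happened.
        for (int i = 0; i < n; ++i)
            started.acquire ();
    }

    ~Pool ()
    {
        for (auto& t : workers)
            if (t.joinable ()) t.join ();
    }
};
```

The tradeoff is the one raised above: launching becomes a blocking call proportional to thread startup time, which is the "changed performance characteristics" in question.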
Your proposed change adds an explicit contract that after calling setGlobalThreadCount, the global thread count will reflect the request. I like the idea and your sketch looks reasonable. Does it need any kind of bulletproofing to prevent deadlocking? In other words, are there failure modes (setting zero threads? setting thousands of threads?) that could result in not all the threads being joinable, and thus setGlobalThreadCount looping forever? |
I would probably add an additional … About bulletproofing, I don't think it needs any additional coverage for edge cases (like 0 or 10000 threads). The only real reason I see it failing is if, for example, … |
Ping on this! Anybody feel comfortable passing judgment on this PR? |
Another ping: we recently allowed ctest to run tests in parallel on the Blender CI farm, and that reliably triggers this issue. Haven't tried porting over the mitigation oiio#3582 from oiiotool to idiff yet, but a fix at the root of the problem is likely the way to go here. |
I'm up for testing anything, are you referring to the quick and dirty implementation posted by @debaetsd or is there an actual PR somewhere? |
Yes, debaetsd/openexr@a2ea041. When I made my comment I misunderstood: Larry was asking for comments on the PR against OIIO, while I thought there was a PR associated with debaetsd's patch. If you could give it a shot, that would be great. |
Do you think this will help on the OIIO side? AcademySoftwareFoundation/OpenImageIO#3805 |
Feels like it could. I'll make a build of it and pass it to the devops guys; it could be a few days to get a final verdict. |
Scratch that, I was able to repro a stall even on my local machine; AcademySoftwareFoundation/OpenImageIO#3805 does not appear to be a viable workaround. Given this is more of an OIIO workaround, I'll pick it up in the comment section of 3805 with some more details. Do note that even though we may be able to work around this issue on the OIIO side, I still feel this problem is ultimately best fixed on OpenEXR's side. |
debaetsd@a2ea041 made it through testing on my local machine, where I could generally trigger the issue in a minute or so. I've deployed it to our CI environment; let's give it a few days there to see if the results hold up. |
There is also #1291, which I made based on Dieter's patch but simplified and changed to be how I liked, although I don't remember why it was failing. |
If 1291 works, we can go with that, since you wrote there that you'd refined the patch. |
I can confirm both #1291 and debaetsd@a2ea041 appear to work on my local machine where it was generally easy to repro this problem in under a minute. |
I see that 1291 is failing all the tests. I would like to land 1291, is it possible to resolve the test failures there first? Do we know if those failures have anything to do with the change, or if they are mystery CI timeouts? |
@meshula - I am seeing a consistent abort/segfault on older VFX platform 19 builds under CentOS 7, which I am unable to recreate on my normal dev system and can't begin to explain. I've been continuing to "simplify" the default thread pool by making it more and more "modern" C++ that should be safe from crashes. Next step, I need to spin up a docker container / VM and attempt to recreate. Will get it fixed tomorrow. |
In the spirit of current year + 3, why don't we remove platform 19 (unless that's a requirement for your site?) |
2019 is the last year to "officially" specify Python 2.7, but lots of us still have Python 2.7 in our active pipelines at the moment, so we can't let it go quite yet. |
Fix issue #890, an issue with Windows shutdown when exiting quickly, prior to the threads actually starting. We do this by having a data block that is passed by shared_ptr to the thread, to avoid dereferencing a deleted object. Further, greatly simplify the ThreadPool code by using atomic shared_ptr functions instead of trying to manually implement something similar, and otherwise modernize the thread code. Fix a few potential null-dereference locations, and also fix an issue where, when systems are overloaded, the TaskGroup destructor could destroy its semaphore while code is still using it, causing undefined memory corruption if some other thread immediately allocates that same block. Originally based on a proposed patch by Dieter De Baets @debaetsd. Signed-off-by: Kimball Thurston <kdt3rd@gmail.com>
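A minimal sketch of the shared_ptr-owned data block pattern the commit message describes (hypothetical names, not the actual PR code):

```cpp
#include <atomic>
#include <memory>
#include <thread>

struct PoolData
{
    // In the real pool this would hold semaphores, the task queue, etc.
    std::atomic<bool> stopping {false};
};

class Pool
{
    std::shared_ptr<PoolData> data = std::make_shared<PoolData> ();

public:
    void launch ()
    {
        // The worker captures its own shared_ptr, so the data block stays
        // alive even if the Pool object is destroyed while it still runs.
        std::thread ([d = data] {
            while (!d->stopping.load ())
            {
                // ... pull and run tasks ...
                break; // loop body elided
            }
        }).detach ();
    }

    ~Pool ()
    {
        // Signal shutdown; the data block itself is freed only when the
        // last worker drops its reference.
        data->stopping.store (true);
    }
};
```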
OK, I believe I have merged the PR that fixes this. I ended up making the thread pool much simpler while tracking down what turned out to be a separate (hidden) undefined behavior in TaskGroup, but hopefully I haven't regressed the fix for this. @LazyDodo, if you could confirm my fix that is now on main in your setup that could repro the hang, I would appreciate it. |
Woohoo! Congrats on working through this |
Built 358eec4 and ran it against the repro posted in AcademySoftwareFoundation/OpenImageIO#3805 for a couple of hours, no issues! Just to be sure it was @kdt3rd's commit that fixed it, I went back one commit (67f7d05), and that got stuck in under a minute. So it appears to do the job; the CI env will have to give the final verdict though. I may be pushing my luck here a little bit, but any chance of a backport and release for OpenEXR 3.1.x? |
I agree, this seems like a good thing to have in a 3.1.8... |
It won't have changed the ABI of public objects at all, and is contained. Will cherry-pick it across, but will need to wait for @cary-ilm to actually release |
When using the C `exit` function, the IlmThread ThreadPool deadlocks during shutdown on Windows. This is happening because the worker threads have not run yet (and so have not posted on the "started" semaphore) when they get killed. During shutdown, that semaphore deadlocks as it waits for N threads, even though not all workers may have posted to it. This is especially noticeable when oversubscribing threads (for example, 10 threads on a 2-core machine).

I have never seen it deadlock when calling the exit function from the main thread (though I am not sure whether that means it cannot happen or is just random coincidence), and Windows seems to be the only platform affected.

This was initially discovered when trying to get the OpenImageIO test suite to run on Windows, but the following code can be used to reproduce it (you might have to run it a couple of times because, well, threading :) )
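(The original repro snippet did not survive this extract; the following is a hypothetical reconstruction along the lines described above: oversubscribe the pool, then call exit() from a freshly spawned thread before the workers have had a chance to run. The IlmThread names are from the OpenEXR 2.5-era API.)

```cpp
#include <IlmThreadPool.h>

#include <cstdlib>
#include <thread>

int main ()
{
    // Oversubscribe: ask for many more workers than cores, so some have
    // likely not started running by the time exit() tears the process down.
    IlmThread_2_5::ThreadPool::globalThreadPool ().setNumThreads (10);

    // exit() from a secondary thread kills the workers before they post
    // the "started" semaphore; the pool's static destruction then waits
    // on that semaphore forever.
    std::thread t ([] { std::exit (-1); });
    t.join (); // typically never reached
}
```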