-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] thread_pool shutdown dead lock #3851
Comments
Oh this is a "fun" one, first of I'm seemingly not the first one to run into this #2382 got close to figuring it out, but didn't cross the finish line. It had a good tip though, calling What is happening is
Which leads to the question, why does calling lessons learned: shut down your threads before exiting, you can trust the OS to clean up after you exit, but it will do so with a shotgun approach additional notes: I have tried the same style solution as proposed in #3805 , however by the time the |
This is what I feel needs to be done to resolve this problem once and for all, however given the change seems be too big to pull-off without the CLA I'll leave detailed instructions. (if my assumption is wrong, feel free to give the green light and i'll prepare a pull) 1- Have 2 - land #2382 , it was on the right track. 3- Add a function in some generic namespace called 'shutdown' that calls 4- Update the documentation to inform callers that 'shutdown' shall always be called before main exits (as it's the only way to have some determinism in the shutdown of oiio and any thread-pools it may have created) 5 - Update all tools to call 6 - #3805 needs to be revisited since the 7- Before working on this consult with @lgritz since he may disagree with all of or parts of this plan :) A version of this that only implements 1+5(for idiff only) has been running on the blender CI env for several days now without lockups which used to lock up multiple times a day. |
Sounds like a plan to me, please do submit the PR that cleans all this up. |
Part of solving AcademySoftwareFoundation#3851. When decreasing the number of threads in `thread_pool::Impl::resize` it was possible for the threads marked for termination to still be fully alive at process exit time if the call to `resize` and the exit of the main function were close enough to each other. This change makes `resize` wait for the threads to actually join guaranteeing the threads have been shut down in a orderly manner when the `resize` function exits.
idiff sometimes locks up while shutting down when the CPU is oversubscribed. While blender does not rely on the idiff tool the tests that run on the CI environment do, which causes tests to occasionally fail due to a timeout. The root cause is a bit complex but can be found on the oiio tracker at AcademySoftwareFoundation/OpenImageIO#3851 This change fixes idiff by : 1- Shutting down the thread pool before the main function exits 2- Have the shutdown wait for the pool threads to actually join, to prevent the OS from forcefully terminating them while they could potentially still be holding a lock.
…#3879) Part of solving #3851. When decreasing the number of threads in `thread_pool::Impl::resize` it was possible for the threads marked for termination to still be fully alive at process exit time if the call to `resize` and the exit of the main function were close enough to each other. This change makes `resize` wait for the threads to actually join guaranteeing the threads have been shut down in a orderly manner when the `resize` function exits.
This change adds a `shutdown` method to the API that can be used to perform shutdown of the thread-pool(s) (or other clean-up activities). Applications that utilize OpenImageIO must call it before exiting their main function to prevent possible deadlocks during application shutdown (see AcademySoftwareFoundation#3851) This change also updates the documentation and all oiio-tools to call the new `shutdown` method. This is the final change for fixing AcademySoftwareFoundation#3851
This change adds a `shutdown()` method to the API that can be used to perform shutdown of the thread-pool(s) (or other clean-up activities). Applications that utilize OpenImageIO are advised to call it before exiting their main function to prevent possible deadlocks during application shutdown (see #3851) This change also updates the documentation and all oiio-tools to call the new `shutdown` method. This is the final change for fixing #3851
think i'm politely gonna bow out of the #3805 rework, i dove in this morning and it's a bit of an uphill battle, the only way I could see this work is if we setup a callback system similar to given the fix in place doesn't work, i'd advocate reverting #3805 |
There have been a number of patches and back-and-forth, and it's hard to follow. @LazyDodo, to be sure we're on the same page, can you please remind me, to your best assessment, which problems still persist
Does Blender use TBB? OIIO has the ability to optionally use TBB, but its support is not quite 100% yet -- when enabled, all the parallel_for family of functions will use TBB, but there are still a few other uses of the internal thread pool that don't automatically use TBB when enabled. This shouldn't be hard for me to fix so that all thread pool use relies on TBB when enabled. Also, I didn't know it until recently, but OpenEXR has for a long time had the ability to tell it to use an app's thread pool instead of its own internal one. I don't do that yet, but that's another thing we could try. So, in short, it would be possible without a whole lot of work to make it so that all threading in both OIIO and OpenEXR could use TBB. Whether that's useful depends on (a) if we believe that TBB is likely to not have any of the same problems (I would like to think they got it right?), and (b) whether that would be an acceptable dependency from Blender's point of view. |
Still a problem when you combine OIIO + OpenEXR 3.1.7 or earlier, since #3805 's not a problem for openEXR 3.1.8+
This we fixed in the last few days, no longer an issue in oiio master. We have been running OpenEXR 3.1.7 + AcademySoftwareFoundation/openexr@b18905772e applied on our CI env since may 11th, not a single stall in openexr since then so I am confident their fix is holding up. A preliminary fix (just
Blender uses TBB, but until Blender 3.6 OIIO was build without TBB support, we enabled it hoping it would use the TBB thread pool rather than OIIO's internal one and maybe not have this problem, but as you pointed out there's still a few uses left of the internal pool, so that didn't pan out for us. Starting Blender 3.6 we ship OIIO with TBB support enabled nonetheless.
a) No idea, given essentially the same problem has bitten us twice in a row, I'm unwilling to even guess if they got it right or not. We'd have to test / look at their code to see how well they do there. Regardless the pending Blender 3.6 release is pretty much locked down at this point it will have to ship with the temporary patches in place. |
Please do! We would really love this!
Reminder on my last failed attempt:
Again, double +1 from me on using purely TBB from the number of threads that an application needs standpoint. It should be mentioned though that using TBB does not solve the hang on shutdown problem. After working with the TBB team on this for over 2 releases (ie. even after oneapi-src/oneTBB#71 ), I gave up and made all our executables explicitly shutdown TBB prior to returning from main() or calling exit(). Then when we added OIIO, I had to again make sure to shutdown the started threadpools in the same manner. (Mind you, it wasn't bad though as we have infrastructure to do this in one place.) |
Problem exists on windows, and is detailed here: AcademySoftwareFoundation/OpenImageIO#3851 requires OIIO::shutdown() at the end of executables that utilise openimageio
Describe the bug
Occasionally idiff.exe refuses to shutdown on the blender CI environment, previously openexr was to blame here but with recent developments there that specific problem appears to be solved, however a new one has sprung up in its place. I have a process dump from the CI environment. which shows the following stack with only the main thread still being alive.
To Reproduce
busybee 4
(See fix: Controlled shutdown of IlmThread pool in all apps in which we use it. #3805 for finding the optimal number of threads to use) in one consoleloop.cmd
on another consolePlatform information:
The text was updated successfully, but these errors were encountered: