Resource constraints and work stealing not working as expected, the cluster ends up stuck #4446
Comments
Not sure if this is related, but we've seen deadlocks happening in a few different places recently. xref #4439 #4360 #4413 The deadlock situations I was investigating recently were mostly triggered by some kind of erroneous behaviour resulting in a missing or improperly defined state transition. In particular, comm failures are something I observed as a frequent trigger on our infrastructure as well. If you can somehow produce a minimal example that would be extremely helpful, but these kinds of things tend to be hard to reproduce minimally.
Lastly, we merged a fix addressing some parts of the deadlocks in #4432. You can either test against master or wait for a release, which is scheduled for today; see dask/community#121
We have an O(100s) cluster and have hit some of these same deadlock/stealing issues. Ours is adaptive, and I've noticed issues scaling up from zero -- I almost always need to keep the cluster running with some small number of workers, which mitigates the issue (most of the time). Generally in an initial scale-up, a single worker starts running and gets a ton of tasks. It seems a task needs to get running before the system understands that a lot of workers are needed and more pods get scheduled (then triggering GKE to grab more nodes... so it's a long-ish delay). As more workers come online, the work doesn't really get spread out -- I get a "middle finger" graph, where one worker can have O(1000) tasks while the others have O(10). Often processing may stop altogether -- and I usually then start hunting down the over-scheduled workloads and killing them. I know this is anecdotal and hard to repro, so my apologies. But for people closer to having a repro case, perhaps trying autoscaling clusters can help tickle a bug.
In fact, it needs to finish at least once. That's a current shortcoming of the scaling and connects to the way the scheduler measures and estimates the tasks. See #3627, #3516, and some progress in #4192.
:) I've seen this before myself, but am also not able to reproduce it consistently. Quick question which might help: Which color does your "middle finger" have in these situations? (green or blue; color indicates whether the worker is flagged as saturated, which impacts stealing)
I am pretty sure it's blue. Thanks for the pointers to the other issues/PRs.
That means it is not recognised as saturated. Probably addressing this task runtime estimation thing should help out. As I pointed out in #3627 (comment), you can set some default values for the runtime estimations, which should help with this. Even if the work stealing, scaling, estimation, etc. work as expected, the "getting stuck" / deadlock thing is still an issue and likely unrelated to the other issues I mentioned.
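For reference, those runtime-estimation defaults can be set in the dask config file. A sketch, assuming the standard distributed config keys; the task name and durations below are placeholders, not values from this issue:

```yaml
# ~/.config/dask/distributed.yaml (sketch; tune values for your workload)
distributed:
  scheduler:
    # estimate used for task prefixes the scheduler has never timed
    unknown-task-duration: 1s
    # explicit per-task-prefix estimates
    default-task-durations:
      my-slow-task: 1h   # "my-slow-task" is a hypothetical task prefix
```

With an estimate in place, the scheduler can judge occupancy before the first task of that prefix completes, instead of waiting for a measurement.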
We've recently merged an important PR addressing a few error-handling edge cases that caused unrecoverable deadlocks: Deadlock fix #4784
Closing: as mentioned by @fjetter, several deadlocks have been fixed in recent releases.
Hello,
It is pretty hard to reproduce this at the moment, but the problem is happening often enough that we can no longer ignore it.
Setup:
Observed bug:
At some point the available_resources dictionary in the worker object becomes incorrect. Not sure what leads to this state, but something in the state transition mechanism breaks so that a particular worker appears to have more available resources than it actually does. This causes overscheduling of tasks onto that worker.
Example:
2021-01-21 07:52:02.021 INFO: {'tcp://172.24.0.144:39033': (9,)}, available_resources: {'cpu': 3.0}, used_calculated_outside_dask: 24
2021-01-21 07:52:02.021 INFO: i('A-f29492ccfd701faea20c09e4193c0db9', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-2636fbf9c5cf07db2eb3088985835730', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-28deaee4c9494ef2b376f8246894e5cb', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-2561aa2b50164e905a0582810c91d8c7', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-c75dcc64cdc7a19890fd516fede9fa28', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-75251f721ba1b411c80e7bb08fa79245', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-3d87f41690c6f1a08941628371c6e491', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-2fc85a2cee66cefff07fa4d6db5e38bc', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-7d4de834d8bb02db9a4f097e8815c5cc', 'executing')
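For illustration, the worker-side accounting described above boils down to decrementing a per-resource counter when a task starts and incrementing it when the task finishes. A minimal stdlib-only sketch (the class and method names are invented for illustration, not distributed's internals) shows how a single duplicated or skipped state transition makes the availability counter drift, matching the logs above where available_resources reports {'cpu': 3.0} while far more tasks are executing:

```python
class ResourceLedger:
    """Toy model of per-worker resource bookkeeping."""

    def __init__(self, total):
        self.total = dict(total)        # e.g. {"cpu": 3.0}
        self.available = dict(total)    # decremented while tasks execute

    def acquire(self, needs):
        # Called on the transition into 'executing'.
        if any(self.available[r] < amt for r, amt in needs.items()):
            raise RuntimeError("not enough resources")
        for r, amt in needs.items():
            self.available[r] -= amt

    def release(self, needs):
        # Called on the transition out of 'executing'. If an erroneous
        # transition path calls this twice (or never calls acquire),
        # 'available' drifts away from reality.
        for r, amt in needs.items():
            self.available[r] += amt


ledger = ResourceLedger({"cpu": 3.0})
ledger.acquire({"cpu": 1.0})
ledger.release({"cpu": 1.0})
ledger.release({"cpu": 1.0})   # duplicate release from a bad transition
print(ledger.available["cpu"])  # 4.0 -- above the true total of 3.0
```

Once the counter exceeds the true total, every scheduling decision based on it over-subscribes the worker, which is consistent with 24 tasks running on a 3-cpu resource budget.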
An even bigger problem is that after running this many tasks for some time, the worker becomes unresponsive, and the scheduler starts reporting:
OSError: Timed out during handshake while connecting to tcp://172.24.0.147:35091 after 25 s
while clients get:
OSError: Timed out trying to connect to tcp://172.24.0.114:42141 after 25 s
At this point the whole cluster becomes unusable until we kill the timed-out worker (which seems to be fine load/RAM-wise).
When I logged into the worker and ran strace on the dask-worker process, I got this:
Looks similar to this issue: #2880
or this one: #2446
I will try to work towards a minimal reproducible example, but it might be tough and will certainly take a while.
Maybe you have some suggestions in the meantime?
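Not a fix for the underlying state-machine bug, but one possible stopgap while debugging, assuming the worker is alive and merely slow to respond to handshakes: raising the comm timeouts in the dask config. These are standard distributed config keys; the 60s values are guesses, not recommendations from this thread:

```yaml
# ~/.config/dask/distributed.yaml (sketch)
distributed:
  comm:
    timeouts:
      connect: 60s   # the logs above show handshakes timing out after 25 s
      tcp: 60s
```

This only buys time for a slow worker; a genuinely deadlocked worker will still need to be killed.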
Environment:
Python 3.7.6 (default, Jan 30 2020, 03:53:38)
[GCC 7.3.1 20180712 (Red Hat 7.3.1-6)] on linux
dask.__version__
'2021.01.0'