Resource constraints and work stealing not working as expected, the cluster ends up stuck #4446

Closed
dtenchurin opened this issue Jan 21, 2021 · 8 comments
Labels
bug Something is broken needs info Needs further information from the user

Comments

@dtenchurin

dtenchurin commented Jan 21, 2021

Hello,

It is pretty hard to reproduce this at the moment, but the problem is happening often enough for us to stop ignoring it.

Setup:

  1. 100 16-core boxes, each running a single worker started with --resources 'cpu=16'
  2. 3 task types: A: {'cpu': 16}, B: {'cpu': 4}, C: {'cpu': 1}; tasks B and C are Python subprocess calls, and task A is a function that calls sklearn .fit().
  3. work stealing is enabled
  4. there are about 2000+ tasks in the queue normally
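
A minimal sketch of the kind of setup described above, in case it helps frame the problem (the scheduler address, data, and command names are illustrative, not our production code):

    # each 16-core box runs a single worker started roughly like:
    #   dask-worker tcp://scheduler:8786 --nthreads 16 --resources "cpu=16"

    import subprocess
    import numpy as np
    from dask.distributed import Client

    client = Client("tcp://scheduler:8786")

    def task_a(X, y):
        # heavyweight: an sklearn fit that should own a whole box
        from sklearn.linear_model import LogisticRegression
        return LogisticRegression(max_iter=1000).fit(X, y)

    def task_b():
        # medium-weight subprocess call
        return subprocess.run(["./job_b.sh"], check=True).returncode

    def task_c():
        # lightweight subprocess call
        return subprocess.run(["./job_c.sh"], check=True).returncode

    X, y = np.random.rand(5000, 20), np.random.randint(0, 2, 5000)

    # resource annotations mirror the A/B/C split above
    futures = [client.submit(task_a, X, y, resources={"cpu": 16})]
    futures += [client.submit(task_b, resources={"cpu": 4}, pure=False) for _ in range(100)]
    futures += [client.submit(task_c, resources={"cpu": 1}, pure=False) for _ in range(2000)]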

Observed bug:
At some point the available_resources dictionary in the worker object becomes incorrect. I'm not sure what leads to this state, but something in the state transition mechanism breaks so that a particular worker appears to have more available resources than it should. This causes overscheduling of tasks onto that worker.
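
For reference, one way to inspect each worker's own view of its resources from the client is something like the following (just a sketch; not necessarily how the log line below was produced):

    from dask.distributed import Client

    client = Client("tcp://scheduler:8786")  # assumed scheduler address

    def report_resources(dask_worker):
        # client.run injects the Worker instance when the function
        # takes a parameter named dask_worker
        return {
            "total": dict(dask_worker.total_resources),
            "available": dict(dask_worker.available_resources),
        }

    print(client.run(report_resources))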

Example:
2021-01-21 07:52:02.021 INFO: {'tcp://172.24.0.144:39033': (9,)}, available_resources: {'cpu': 3.0}, used_calculated_outside_dask: 24
2021-01-21 07:52:02.021 INFO: i('A-f29492ccfd701faea20c09e4193c0db9', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-2636fbf9c5cf07db2eb3088985835730', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-28deaee4c9494ef2b376f8246894e5cb', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-2561aa2b50164e905a0582810c91d8c7', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-c75dcc64cdc7a19890fd516fede9fa28', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-75251f721ba1b411c80e7bb08fa79245', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-3d87f41690c6f1a08941628371c6e491', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-2fc85a2cee66cefff07fa4d6db5e38bc', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-7d4de834d8bb02db9a4f097e8815c5cc', 'executing')

An even bigger problem is that after running this many tasks for a while, the worker becomes unresponsive and the scheduler starts reporting:
OSError: Timed out during handshake while connecting to tcp://172.24.0.147:35091 after 25 s
while clients get:
OSError: Timed out trying to connect to tcp://172.24.0.114:42141 after 25 s
At this point the whole cluster becomes unusable until we kill the timed-out worker (which seems to be fine load/RAM-wise).

When I logged in to the worker and ran strace on the dask-worker process, I got this:

This looks similar to this issue: #2880
or this one: #2446

I will try to work towards a minimal reproducible example but it might be tough, and certainly will take a while.

Maybe you have some suggestions in the meanwhile?

Environment:
Python 3.7.6 (default, Jan 30 2020, 03:53:38)
[GCC 7.3.1 20180712 (Red Hat 7.3.1-6)] on linux
dask.__version__: '2021.01.0'

  • Install method (conda, pip, source): pip
@fjetter
Member

fjetter commented Jan 22, 2021

Not sure if this is related but we've seen deadlocks happening in a few different places recently. xref #4439 #4360 #4413

The deadlock situations I was investigating recently were mostly triggered by some kind of erroneous behaviour resulting in a missing or improperly defined state transition. In particular, comm failures are something I observed as a frequent trigger on our infrastructure as well.

If you can somehow produce a minimal example that would be extremely helpful, but these kinds of things tend to be hard to reproduce minimally.

Maybe you have some suggestions in the meanwhile?

  • I would not call this a solution, but if the comm failures are triggering this behaviour, try increasing the connection timeout to a larger value (see the config sketch after this list)
  • You mentioned that work stealing is enabled. I'd be curious to know whether it also happens without work stealing, since that is one major suspect for these kinds of things
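
A sketch of what both suggestions could look like via dask's config system (the values are illustrative; the same keys can also go into a distributed.yaml config file or environment variables, and need to be in place before the scheduler and workers start):

    import dask

    # 1. give comms more headroom before a connect/handshake attempt is abandoned
    #    (equivalently: DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT=60s)
    dask.config.set({"distributed.comm.timeouts.connect": "60s"})

    # 2. temporarily disable work stealing to see whether the problem still occurs
    dask.config.set({"distributed.scheduler.work-stealing": False})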

@fjetter
Member

fjetter commented Jan 22, 2021

Lastly, we merged a fix addressing some parts of the deadlocks in #4432. You can either test against master or wait for a release, which is scheduled for today; see dask/community#121

@chrisroat
Contributor

We have an O(100s) cluster and have hit some of these same deadlock/stealing issues. Ours is adaptive, and I've noticed issues scaling up from zero -- I almost always need to keep the cluster running with a small number of workers, which mitigates the issue (most of the time).
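
That mitigation looks roughly like this (a sketch only; the cluster class and numbers are just examples, any deployment that supports .adapt() works the same way):

    from dask_kubernetes import KubeCluster  # example cluster manager
    from dask.distributed import Client

    cluster = KubeCluster.from_yaml("worker-spec.yaml")  # hypothetical pod spec
    # keep a few workers alive at all times so the scheduler always has
    # recent task timing information instead of scaling up from zero
    cluster.adapt(minimum=4, maximum=200)
    client = Client(cluster)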

Generally in an initial scale-up, a single worker starts running and gets a ton of tasks. It seems a task needs to get running before the system understands that a lot of workers are needed and more pods get scheduled (then triggering GKE to grab more nodes... so it's a long-ish delay).

As more workers come online, the work doesn't really get spread out -- I get a "middle finger" graph, where one worker can have O(1000) tasks while the others have O(10). Often processing stops altogether -- and I usually then start hunting down the over-scheduled workers and killing them.

I know this is anecdotal and hard to repro, so my apologies. But for people closer to having a repro case, perhaps trying autoscaling clusters can help tickle a bug.

@fjetter
Member

fjetter commented Jan 22, 2021

It seems a task needs to get running before the system understands that a lot of workers are needed and more pods get scheduled

In fact, it needs to finish at least once. That's a current shortcoming of the scaling logic and connects to the way the scheduler measures and estimates task runtimes. See #3627, #3516 and some progress in #4192

I get a "middle finger" graph, where one task can have O(1000) tasks, while the others have O(10).

:) I've seen this before myself, but am also not able to reproduce it consistently.

Quick question which might help: Which color does your "middle finger" have in these situations? (green or blue; color indicates whether the worker is flagged as saturated which impacts stealing)

@chrisroat
Contributor

Quick question which might help: Which color does your "middle finger" have in these situations? (green or blue; color indicates whether the worker is flagged as saturated which impacts stealing)

I am pretty sure it's blue.

Thanks for the pointers to the other issues/PRs

@fjetter
Member

fjetter commented Jan 22, 2021

I am pretty sure it's blue.

That means it is not recognised as saturated (saturated is loosely defined as "more tasks than CPUs/threads, and the tasks run long enough"). That's probably because the tasks are not recognised as long-running, and cheap tasks on a non-saturated worker are not stolen.

Addressing this task runtime estimation issue should probably help. As I pointed out in #3627 (comment), you can set default values for the runtime estimations, which should help with this.
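
A sketch of what that could look like for the task types from the original report (the prefix names and durations are illustrative, and the config needs to be set where the scheduler runs):

    import dask

    # declare expected runtimes per task key prefix so the scheduler treats
    # these tasks as long-running even before the first one has finished
    dask.config.set({
        "distributed.scheduler.default-task-durations": {
            "A": "10m",   # sklearn .fit() style tasks
            "B": "1m",    # medium subprocess tasks
            "C": "30s",   # light subprocess tasks
        }
    })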


Even if the work stealing, scaling, estimation, etc. work as expected, the "getting stuck" / deadlock problem is still an issue and likely unrelated to the other issues I mentioned.

@fjetter
Member

fjetter commented Jun 18, 2021

We've recently merged an important PR addressing a few error handling edge cases which caused unrecoverable deadlocks.
These deadlocks were associated with failing workers, connection failures, or workers co-located on the same host. All of these issues could be connected to fetching dependencies, so dense, highly connected task graphs were more likely to be affected. Ultimately, the deadlocks were caused by subtle race conditions, which made them hard to reproduce, and some of them cannot be correlated to any user-facing logs, which is why I cannot say for certain whether your issue is fixed.
I would encourage you to try out the latest changes on main and/or wait for the upcoming release later today. Feedback on whether your issue is resolved is highly appreciated!

Deadlock fix #4784
Upcoming release dask/community#165

@fjetter added the bug (Something is broken) and needs info (Needs further information from the user) labels on Jun 18, 2021
@jrbourbeau
Member

Closing since, as mentioned by @fjetter, several deadlocks have been fixed in distributed since this issue was originally opened. @dtenchurin let us know if you're still experiencing the same behavior with the latest distributed release.
