Resource constraints and work stealing not working as expected, the cluster ends up stuck #4446

Closed
dtenchurin opened this issue Jan 21, 2021 · 8 comments
Labels
bug Something is broken needs info Needs further information from the user

Comments

@dtenchurin

dtenchurin commented Jan 21, 2021

Hello,

It is pretty hard to reproduce this at the moment, but the problem is happening often enough for us to stop ignoring it.

Setup:

  1. 100 16-core boxes, each running a single worker started with --resources 'cpu=16'
  2. 3 task types: A: {'cpu': 16}, B: {'cpu': 4}, C: {'cpu': 1}; tasks B and C are Python subprocess calls, and task A is a function that calls sklearn .fit().
  3. work stealing is enabled
  4. there are about 2000+ tasks in the queue normally
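
A minimal sketch of the kind of setup described above, in case it helps frame the problem (the scheduler address, data, and command names are illustrative, not our production code):

    # each 16-core box runs a single worker started roughly like:
    #   dask-worker tcp://scheduler:8786 --nthreads 16 --resources "cpu=16"

    import subprocess
    import numpy as np
    from dask.distributed import Client

    client = Client("tcp://scheduler:8786")

    def task_a(X, y):
        # heavyweight: an sklearn fit that should own a whole box
        from sklearn.linear_model import LogisticRegression
        return LogisticRegression(max_iter=1000).fit(X, y)

    def task_b():
        # medium-weight subprocess call
        return subprocess.run(["./job_b.sh"], check=True).returncode

    def task_c():
        # lightweight subprocess call
        return subprocess.run(["./job_c.sh"], check=True).returncode

    X, y = np.random.rand(5000, 20), np.random.randint(0, 2, 5000)

    # resource annotations mirror the A/B/C split above
    futures = [client.submit(task_a, X, y, resources={"cpu": 16})]
    futures += [client.submit(task_b, resources={"cpu": 4}, pure=False) for _ in range(100)]
    futures += [client.submit(task_c, resources={"cpu": 1}, pure=False) for _ in range(2000)]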

Observed bug:
At some point the available_resources dictionary in the worker object becomes incorrect. I'm not sure what leads to this state, but something in the state transition mechanism breaks so that a particular worker appears to have more available resources than it should. This causes overscheduling of tasks onto that worker.
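
For reference, one way to inspect each worker's own view of its resources from the client is something like the following (just a sketch; not necessarily how the log line below was produced):

    from dask.distributed import Client

    client = Client("tcp://scheduler:8786")  # assumed scheduler address

    def report_resources(dask_worker):
        # client.run injects the Worker instance when the function
        # takes a parameter named dask_worker
        return {
            "total": dict(dask_worker.total_resources),
            "available": dict(dask_worker.available_resources),
        }

    print(client.run(report_resources))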

Example:
2021-01-21 07:52:02.021 INFO: {'tcp://172.24.0.144:39033': (9,)}, available_resources: {'cpu': 3.0}, used_calculated_outside_dask: 24
2021-01-21 07:52:02.021 INFO: i('A-f29492ccfd701faea20c09e4193c0db9', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-2636fbf9c5cf07db2eb3088985835730', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-28deaee4c9494ef2b376f8246894e5cb', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-2561aa2b50164e905a0582810c91d8c7', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-c75dcc64cdc7a19890fd516fede9fa28', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-75251f721ba1b411c80e7bb08fa79245', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-3d87f41690c6f1a08941628371c6e491', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-2fc85a2cee66cefff07fa4d6db5e38bc', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-7d4de834d8bb02db9a4f097e8815c5cc', 'executing')

An even bigger problem is that after running this many tasks for a while, the worker becomes unresponsive and the scheduler starts reporting:
OSError: Timed out during handshake while connecting to tcp://172.24.0.147:35091 after 25 s
while clients get:
OSError: Timed out trying to connect to tcp://172.24.0.114:42141 after 25 s
At this point the whole cluster becomes unusable until we kill the timed-out worker (which seems to be fine load/RAM-wise).

When I logged in to the worker and ran strace on the dask-worker process, I got this:

This looks similar to this issue: #2880
or this one: #2446

I will try to work towards a minimal reproducible example but it might be tough, and certainly will take a while.

Maybe you have some suggestions in the meanwhile?

Environment:
Python 3.7.6 (default, Jan 30 2020, 03:53:38)
[GCC 7.3.1 20180712 (Red Hat 7.3.1-6)] on linux
dask.__version__: '2021.01.0'

  • Install method (conda, pip, source): pip
@fjetter
Member

fjetter commented Jan 22, 2021

Not sure if this is related but we've seen deadlocks happening in a few different places recently. xref #4439 #4360 #4413

The deadlock situations I was investigating recently were mostly triggered by some kind of erroneous behaviour resulting in a missing or improperly defined state transition. In particular, comm failures are something I observed as a frequent trigger on our infrastructure as well.

If you can somehow produce a minimal example that would be extremely helpful, but these kinds of things tend to be hard to reproduce minimally.

Maybe you have some suggestions in the meanwhile?

  • I would not call this a solution, but if the comm failures are triggering this behaviour, try increasing the connection timeout to a larger value (see the config sketch after this list)
  • You mentioned that work stealing is enabled. I'd be curious to know whether it also happens without work stealing, since that is one major suspect for these kinds of things
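
A sketch of what both suggestions could look like via dask's config system (the values are illustrative; the same keys can also go into a distributed.yaml config file or environment variables, and need to be in place before the scheduler and workers start):

    import dask

    # 1. give comms more headroom before a connect/handshake attempt is abandoned
    #    (equivalently: DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT=60s)
    dask.config.set({"distributed.comm.timeouts.connect": "60s"})

    # 2. temporarily disable work stealing to see whether the problem still occurs
    dask.config.set({"distributed.scheduler.work-stealing": False})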

@fjetter
Member

fjetter commented Jan 22, 2021

Lastly, we merged a fix addressing some parts of the deadlocks in #4432. You can either test against master or wait for a release, which is scheduled for today; see dask/community#121

@chrisroat
Contributor

We have an O(100s) cluster and have hit some of these same deadlock/stealing issues. Ours is adaptive, and I've noticed issues scaling up from zero -- I almost always need to keep the cluster running with a small number of workers, which mitigates the issue (most of the time).
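
That mitigation looks roughly like this (a sketch only; the cluster class and numbers are just examples, any deployment that supports .adapt() works the same way):

    from dask_kubernetes import KubeCluster  # example cluster manager
    from dask.distributed import Client

    cluster = KubeCluster.from_yaml("worker-spec.yaml")  # hypothetical pod spec
    # keep a few workers alive at all times so the scheduler always has
    # recent task timing information instead of scaling up from zero
    cluster.adapt(minimum=4, maximum=200)
    client = Client(cluster)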

Generally in an initial scale-up, a single worker starts running and gets a ton of tasks. It seems a task needs to get running before the system understands that a lot of workers are needed and more pods get scheduled (then triggering GKE to grab more nodes... so it's a long-ish delay).

As more workers come online, the work doesn't really get spread out -- I get a "middle finger" graph, where one worker can have O(1000) tasks while the others have O(10). Often processing stops altogether -- and I usually then start hunting down the over-scheduled workers and killing them.

I know this is anecdotal and hard to repro, so my apologies. But for people closer to having a repro case, perhaps trying autoscaling clusters can help tickle a bug.

@fjetter
Member

fjetter commented Jan 22, 2021

It seems a task needs to get running before the system understands that a lot of workers are needed and more pods get scheduled

In fact, it needs to finish at least once. That's a current shortcoming of the scaling logic and connects to the way the scheduler measures and estimates task runtimes. See #3627, #3516 and some progress in #4192

I get a "middle finger" graph, where one task can have O(1000) tasks, while the others have O(10).

:) I've seen this before myself, but am also not able to reproduce it consistently.

Quick question which might help: Which color does your "middle finger" have in these situations? (green or blue; color indicates whether the worker is flagged as saturated which impacts stealing)

@chrisroat
Contributor

Quick question which might help: Which color does your "middle finger" have in these situations? (green or blue; color indicates whether the worker is flagged as saturated which impacts stealing)

I am pretty sure it's blue.

Thanks for the pointers to the other issues/PRs

@fjetter
Member

fjetter commented Jan 22, 2021

I am pretty sure it's blue.

That means it is not recognised as saturated (saturated is loosely defined as "more tasks than CPUs/threads, and the tasks run long enough"). That's probably because the tasks are not recognised as long-running, and cheap tasks on a non-saturated worker are not stolen.

Addressing this task runtime estimation issue should probably help. As I pointed out in #3627 (comment), you can set default values for the runtime estimations, which should help with this.
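
A sketch of what that could look like for the task types from the original report (the prefix names and durations are illustrative, and the config needs to be set where the scheduler runs):

    import dask

    # declare expected runtimes per task key prefix so the scheduler treats
    # these tasks as long-running even before the first one has finished
    dask.config.set({
        "distributed.scheduler.default-task-durations": {
            "A": "10m",   # sklearn .fit() style tasks
            "B": "1m",    # medium subprocess tasks
            "C": "30s",   # light subprocess tasks
        }
    })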


Even if the work stealing, scaling, estimation, etc. work as expected, the "getting stuck" / deadlock problem is still an issue and likely unrelated to the other issues I mentioned.

@fjetter
Member

fjetter commented Jun 18, 2021

We've recently merged an important PR addressing a few error handling edge cases which caused unrecoverable deadlocks.
These deadlocks were associated with failing workers, connection failures, or workers co-located on the same host. All of these issues could be connected to fetching dependencies, so dense, highly connected task graphs were more likely to be affected. Ultimately, the deadlocks were caused by subtle race conditions, which made them hard to reproduce, and some of them cannot be correlated to any user-facing logs, which is why I cannot say for certain whether your issue is fixed.
I would encourage you to try out the latest changes on main and/or wait for the upcoming release later today. Feedback on whether your issue is resolved is highly appreciated!

Deadlock fix #4784
Upcoming release dask/community#165

@fjetter added the bug (Something is broken) and needs info (Needs further information from the user) labels on Jun 18, 2021
@jrbourbeau
Member

Closing since, as mentioned by @fjetter, several deadlocks have been fixed in distributed since this issue was originally opened. @dtenchurin let us know if you're still experiencing the same behavior with the latest distributed release.
