[DNM] Peer-to-peer shuffle design #5435

Draft: gjoseph92 wants to merge 14 commits into main

Conversation

gjoseph92 (Collaborator):

Here's a work-in-progress proposal for how we can implement the scalable peer-to-peer shuffling prototyped in dask/dask#8223 in a maintainable and stable way.

The primary problem with the code in dask/dask#8223 is that it's brittle and unmaintainable in its current state (and also not reliable). So the primary purpose of this PR is to discuss how we can write something similar, implementation-wise, but in a way that integrates better and that we're happy to maintain long-term.

This is meant to be relatively high-level and architectural, and we'd like to keep the conversation away from implementation details except where necessary.

You can view the rendered markdown at https://github.com/gjoseph92/distributed/blob/p2p-shuffle/proposal/distributed/shuffle/shuffle-design.md.

cc @fjetter @jcrist @crusaderky @jrbourbeau

gjoseph92 (Collaborator, Author):

I'll be out next week, so I've given @fjetter access to https://github.com/gjoseph92/distributed (and therefore this branch) for now so he can push changes.

Alternatively, if folks would prefer just copying it all into a Google doc, I'm perfectly happy with that too.

Review thread on distributed/shuffle/shuffle-design.md:

Because most of the tasks in the shuffle graph are impure and run for their side effects, restarting an in-progress shuffle requires rerunning _every_ task involved, even ones that appear to have successfully transitioned to `memory` and whose "results" are stored on not-yet-dead workers.

Additionally, cleanly stopping a running shuffle takes more than just releasing the shuffle tasks from memory: since there's out-of-band processing going on, the `ShuffleExtension` has to be informed in some way that it needs to stop doing whatever it's doing in the background, and clear out its buffers. Also, executing tasks may be blocking on the `ShuffleExtension` doing something; without a way to tell the extension to shut down, those tasks might block forever, deadlocking the cluster.
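
To make the quoted concern concrete, here is a minimal, purely illustrative sketch of the kind of explicit shutdown signal being described. `ShuffleExtensionSketch`, its buffers, and its methods are invented names for this example, not the actual `ShuffleExtension` API:

```python
import asyncio


class ShuffleExtensionSketch:
    """Hypothetical stand-in for the worker's ShuffleExtension (names invented)."""

    def __init__(self):
        self.input_buffer = []          # shards waiting to be sent to peer workers
        self.output_buffer = []         # shards received from peers, not yet consumed
        self._closed = asyncio.Event()  # set when the shuffle is cancelled/restarted

    def add_partition(self, shard):
        # Called by shuffle "transfer" tasks; refuses new work once shut down.
        if self._closed.is_set():
            raise RuntimeError("shuffle was cancelled")
        self.input_buffer.append(shard)

    async def get_output_partition(self):
        # Called by shuffle "unpack" tasks. Without checking `_closed`, a task
        # waiting here could block forever after a teardown, deadlocking the cluster.
        while not self.output_buffer:
            if self._closed.is_set():
                raise RuntimeError("shuffle was cancelled")
            await asyncio.sleep(0.05)   # placeholder for a real wakeup mechanism
        return self.output_buffer.pop()

    def close(self):
        # The explicit signal described above: stop background work and drop buffers.
        self._closed.set()
        self.input_buffer.clear()
        self.output_buffer.clear()
```
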
Member:

It depends what "releasing" means. Currently releasing a task on the worker means

  • Ensure the task is not executed and no coroutine is running which is trying to fetch that task's result/key/data.
  • Reset any TaskState attributes to a neutral value.
  • Remove the TaskState.key from Worker.data if it is in there.

Whatever the extension does, it might be coupled to any of these mechanisms, so a "release task" action would be sufficient for cleanup.

Member:

There could be a mechanism to release the input buffer immediately, and we could get rid of data in memory/disk, but the output buffer is actually very difficult.

gjoseph92 (Collaborator, Author):

> Remove the `TaskState.key` from `Worker.data` if it is in there.

In the POC we used this as the signal to clean up state (the `ShuffleService` object was a dependency of every task, got copied into every `Worker.data`, and had a `__del__` method to clean things up when it was removed).

But this is unreliable in pathological shuffle cases, for example when some workers either don't send any data or don't receive any data in the shuffle. There are also weird questions around what happens when that object itself is spilled to disk.
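
For reference, a rough reconstruction of the POC mechanism described above; the class and attribute names are illustrative, not the actual POC code:

```python
class ShuffleServiceSketch:
    """Rough stand-in for the POC's ShuffleService (attribute names invented)."""

    def __init__(self):
        self.buffers = {}  # in-flight shards, keyed by output partition

    def __del__(self):
        # Fired when the last copy is dropped from Worker.data, i.e. once the
        # scheduler releases the tasks that depended on this object. As noted
        # above, this signal is unreliable in pathological shuffles and
        # interacts badly with spilling.
        self.buffers.clear()
```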

> Whatever the extension does, it might be coupled to any of these mechanisms, so a "release task" action would be sufficient for cleanup.

I agree that this would be nice, but I don't really like the idea of hooking into `gather_dep` / `TaskState` attribute changes to detect when this has happened. It just feels brittle to me, and it requires really careful reasoning about which signals the scheduler is going to send in which cases. Having an explicit mechanism built into `RerunGroup` seems both easy and reliable.

In the spirit of what I said in the blog post (https://coiled.io/blog/better-shuffling-in-dask-a-proof-of-concept/):

> It may be worth it to adjust the system so that bypassing task graphs isn't a hack, but a supported (and tested and maintained) feature.

I just prefer the idea of making a clear path for how to do this in the API that doesn't require deep understanding of the current implementation of the worker and scheduler.
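
Purely as an illustration of the kind of explicit hook being argued for (none of these names or messages exist in distributed today): the scheduler could notify workers when a `RerunGroup` is released or restarted, and the worker-side extension would react to that single event instead of inferring cleanup from task-state transitions.

```python
class RerunGroupAwareShuffleExtension:
    """Illustrative only: reacts to one explicit scheduler message instead of
    inferring cleanup from task-state transitions."""

    def __init__(self, worker):
        self.worker = worker
        self.shuffles = {}  # RerunGroup name -> in-progress shuffle state
        # "rerun_group_released" is an invented message name, not part of
        # distributed's protocol.
        worker.handlers["rerun_group_released"] = self.rerun_group_released

    def rerun_group_released(self, name=None):
        # One explicit cleanup signal: tear down everything tied to this
        # RerunGroup (background comms, buffers, tasks blocked on the shuffle).
        shuffle = self.shuffles.pop(name, None)
        if shuffle is not None:
            shuffle.close()  # e.g. the explicit close() from the earlier sketch
```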

Review thread on distributed/shuffle/shuffle-design.md (outdated):

Therefore, we propose adding a `RerunGroup` (`ImpureGroup`? `CoExecutionGroup`? `RestartGroup`? `OutOfBandGroup`? name TBD) structure to the scheduler which intertwines the fates of all tasks within it: if any one task is to be rescheduled (due to its worker leaving), all tasks are restarted; if any one is to be prematurely released (due to cancellation), all are released.

Membership in a `RerunGroup` is implemented via task annotations, where each task gives the name of the `RerunGroup` it belongs to. A task can belong to at most one `RerunGroup`. TBD if we will enforce any structural restrictions on `RerunGroup`s to prevent odd/invalid states from emerging—we probably should, such as not allowing disjoint parts of the graph to be in the same `RerunGroup`, etc.
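
As a hypothetical illustration of what that annotation could look like from the user/collection side: only `dask.annotate` itself is an existing API here; the `rerun_group` annotation key and the group name are made up.

```python
import dask
import dask.dataframe as dd
import pandas as pd

df = dd.from_pandas(
    pd.DataFrame({"id": [0, 1, 2, 3], "x": [1.0, 2.0, 3.0, 4.0]}),
    npartitions=2,
)

# Every task created inside this block would carry the annotation and would
# therefore belong to the same RerunGroup on the scheduler.
with dask.annotate(rerun_group="shuffle-1a2b"):
    shuffled = df.shuffle("id")
```
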
Member:

I'm inclined to say this is out of scope and the responsibility of the user.

Member:

How are annotations impacted by graph optimizations?

Member:

Question: what happens if you fuse an annotated blockwise with an unannotated blockwise? The intended behaviour is currently unclear.

The rerunning might actually be a special case, since this specific case is "inheritable", i.e. fused tasks should both be rerun.

The current default behaviour, according to Jim, is that annotated tasks are not fused, but we might be able to special-case this.

gjoseph92 (Collaborator, Author):

> I'm inclined to say this is out of scope and the responsibility of the user.

I think this could create all sorts of headaches for us down the line in the scheduler, where we have to bend ourselves into knots to handle weird cases that are technically possible but nobody should actually do. A more restrictive API up front means less work for us!

gjoseph92 (Collaborator, Author):

> Question: what happens if you fuse an annotated blockwise with an unannotated blockwise? The intended behaviour is currently unclear.

There's an open issue around annotations getting lost during optimization (dask/dask#7036), but I don't think it fully recognizes the problem: we don't have a clear policy on annotation propagation during optimization. Fixing this would also be necessary work here; I've mentioned it in the doc.
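
For context, annotations live on `HighLevelGraph` layers, which is where optimization (e.g. blockwise fusion) can drop or merge them. A quick way to inspect what survives; these are real dask APIs, though exact behaviour varies across versions, and the `rerun_group` key is still hypothetical:

```python
import dask
import dask.array as da

with dask.annotate(rerun_group="shuffle-0"):  # hypothetical annotation key
    x = da.ones((10, 10), chunks=(5, 5)) + 1

# Each layer built inside the annotate block carries the annotation; after
# optimization, layers can disappear or lose it, which is the policy gap
# referenced above.
for name, layer in x.__dask_graph__().layers.items():
    print(name, layer.annotations)
```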

distributed/shuffle/shuffle-design.md: five more review threads (resolved).

mrocklin (Member):

In general, the objectives and constraints laid out here seem sensible to me. Thank you for writing this up, @gjoseph92. I would like to suggest that we think about a plan for resilience, but implement it last. There is value in getting this out quickly so that people can experiment and give us feedback. I expect resilience to be an interesting but time-consuming and ultimately separable problem.
