
Shuffle Service with Scheduler Logic #6007

Merged: 134 commits into dask:main on May 5, 2022

Conversation

mrocklin (Member):

Builds off of #5976, which in turn builds off of #5957.

Commit message of the first novel commit:

Previously we assigned a partition to a worker based on a simple
formula. This was good, but also error prone in a few ways:

  1. We would send data around needlessly if the task wasn't desired
  2. Conflicting shuffles could assign tasks to conflicting workers

Now we ask the scheduler to manage this process, and all of the workers
check in with it to get assignments. Currently the actual logic is the
same, but now things get to live in a single location where we can make
smarter decisions to avoid conflicts.

This removes the previous shuffle setup steps on the worker, simplifying
the code a bit (subjectively anyway). It makes the compute steps a bit
worse because now we have a small pandas join in the middle of things.
It should already avoid excess labor. It does not yet avoid restriction
conflicts.
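For illustration, here is a minimal sketch of the scheme described above. Every name in it (ShuffleSchedulerExtensionSketch, get_worker_for, the round-robin assignment) is an assumption for illustration, not the PR's actual code: the scheduler-side extension owns the partition-to-worker mapping, and workers ask it for assignments instead of recomputing them locally, so overlapping shuffles cannot hand the same partition to different workers.

class ShuffleSchedulerExtensionSketch:
    def __init__(self) -> None:
        # shuffle id -> {output partition -> worker address}
        self.worker_for: dict[str, dict[int, str]] = {}

    def get_worker_for(
        self, id: str, npartitions: int, workers: list[str]
    ) -> dict[int, str]:
        # The first request for a given shuffle fixes the assignment; every
        # later request, from any worker, sees exactly the same mapping.
        if id not in self.worker_for:
            self.worker_for[id] = {
                i: workers[i % len(workers)] for i in range(npartitions)
            }
        return self.worker_for[id]


# All workers converge on one assignment for a given shuffle id.
ext = ShuffleSchedulerExtensionSketch()
first = ext.get_worker_for("shuffle-1", npartitions=4, workers=["tcp://a", "tcp://b"])
assert first == ext.get_worker_for("shuffle-1", 4, ["tcp://a", "tcp://b"])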

Performance is good; we still need to track down memory usage.
This manages memory more smoothly.

We still have issues though in that we're still passing around slices
of arrow tables, which hold onto large references
This helps to reduce lots of extra unmanaged memory

This flows pretty well right now.  I'm finding that it's useful to blend
between the disk and comm buffer sizes.

The abstractions in multi_file and multi_comm are getting a little bit
worn down (it would be awkward to shift back to pandas), but maybe that's ok.
We don't need a lot of comm buffer, and we also don't want more connections
than machines (too much sitting in buffers).

We also improve some printing
To enable better diagnostics, it would be useful to allow worker
extensions to piggy-back on the standard heartbeat.  This adds an
optional "heartbeat" method to extensions, and, if present, calls a
custom method that gets sent to the scheduler and processed by an
extension of the same name.

This also starts to store the extensions on the worker in a named
dictionary.  Previously this was a list, but I'm not sure that it was
actually used anywhere.  This is a breaking change without deprecation,
but in a space that I suspect no one will care about.  I'm happy to
provide a fallback if desired.
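A hedged sketch of the heartbeat piggy-backing described above. The attribute names worker.extensions and scheduler.extensions follow the description in this PR; the helper function names are invented for illustration:

def collect_extension_heartbeats(worker) -> dict:
    # Worker side: extensions now live in a named dict; any extension that
    # defines an optional heartbeat() method contributes a payload.
    payloads = {}
    for name, extension in worker.extensions.items():
        heartbeat = getattr(extension, "heartbeat", None)
        if heartbeat is not None:
            payloads[name] = heartbeat()
    return payloads


def dispatch_extension_heartbeats(scheduler, payloads: dict) -> None:
    # Scheduler side: each payload is routed to the extension of the same name.
    for name, data in payloads.items():
        extension = scheduler.extensions.get(name)
        if extension is not None and hasattr(extension, "heartbeat"):
            extension.heartbeat(data)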
Tests are failing.  I can't reproduce locally.  This is just blind hope
that it fixes the problem.  It should be innocuous.
mrocklin (Member Author):

Tests are green(ish). This could use some review and some help in breaking things.

I haven't gotten too creative in trying to break this, but so far it holds up pretty well (better than what's currently in main) in situations where we don't lose any workers.

mrocklin (Member Author):

This is at a point where, I think, it could be merged. It does not succeed in cases where workers fail, or where outputs shift during execution (although neither does the solution in main), but it feels pretty solid in the common case. There is a proposed plan for a next step in #6105.

This PR hasn't had deep review yet (no one is keeping me honest around docstrings, code cleanliness, and so on), but it's been about a month so far without that review, and I do plan to keep working on this. I'm inclined to strip out the link to my mrocklin/dask fork, merge that in, and then merge this in. I won't do this solo (this is big enough that that doesn't seem right), but I'll probably start pestering people to get this in by early next week.

mrocklin (Member Author):

Merging in one week if there are no further comments.

gjoseph92 (Collaborator) left a comment:

I've made it mostly through shuffle_extension.py so far. These are some preliminary comments. Some are nits/tweaks/asks for documentation, but there are at least 2-3 that I think are serious issues (generally around not handling errors in concurrency #6201) that could result in deadlocks or silently incorrect results.

(resolved thread on .github/workflows/tests.yaml)
ShuffleWorkerExtension,
)

__all__ = [
gjoseph92 (Collaborator):

__all__ usually makes flake8 and typecheckers happier?

mrocklin (Member Author):

I'm curious about this. My understanding is that there isn't anything defined in this module that isn't listed in __all__. Why would __all__ be informative in this case?
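For context on the flake8 point being discussed: in an __init__.py that only re-exports names, flake8 reports the imports as unused (F401) unless they appear in __all__ or carry a noqa comment, and some type checkers only treat listed names as re-exported. A minimal illustration; only ShuffleWorkerExtension from the excerpt above is shown:

# Re-export-only module: without __all__ (or a per-import "# noqa: F401"),
# flake8 flags the import as unused.
from distributed.shuffle.shuffle_extension import ShuffleWorkerExtension

__all__ = [
    "ShuffleWorkerExtension",
]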

(resolved thread on distributed/shuffle/arrow.py)
Comment on lines +20 to +21
if file.tell() == 0:
    file.write(schema.serialize())
gjoseph92 (Collaborator):

This doesn't seem thread-safe. Probably good to mention.

mrocklin (Member Author):

Sure. Done.
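For illustration, one way to make the check-and-write above atomic is to guard it with a per-file lock; the lock and function below are a sketch, not the code added in the PR:

import threading

# Hypothetical guard: two threads appending to the same partition file can
# otherwise both observe tell() == 0 and write the schema header twice.
write_lock = threading.Lock()


def append_with_header(file, schema, batches) -> None:
    with write_lock:
        if file.tell() == 0:
            file.write(schema.serialize())  # first writer emits the header
        for batch in batches:
            file.write(batch)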

bio = io.BytesIO()
bio.write(schema.serialize())
for batch in data:
    bio.write(batch)
gjoseph92 (Collaborator):

This seems like a lot more memory copying than I would expect with Arrow. I assume improving this isn't important here, and would just be in scope for future performance tuning?

mrocklin (Member Author):

You and me both. Yeah, Arrow is great if you use all Arrow primitives, but if you want to compose it with other things it's not that great because of all of the views and the lack of a deserialize function. If we stick with Arrow (maybe?) then we should upstream a bunch of issues for this use case so that it gets cleaner in future releases.

I actually did a bit of performance tuning here. Memory copies aren't significant yet, but I've reduced them to the extent that I can.
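For the future tuning mentioned here, and only under the assumption that the incoming shards were still pyarrow RecordBatch objects rather than pre-serialized bytes, Arrow's IPC stream writer could produce the buffer directly instead of hand-concatenating in a BytesIO; a sketch:

import pyarrow as pa


def serialize_batches(schema: pa.Schema, batches) -> bytes:
    # The IPC stream writer emits the schema header and each record batch
    # straight into the output stream, avoiding extra intermediate copies.
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, schema) as writer:
        for batch in batches:
            writer.write_batch(batch)
    return sink.getvalue().to_pybytes()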


    def shuffle_inputs_done(self, comm: object, shuffle_id: ShuffleId) -> None:
        shuffle = await self._get_shuffle(shuffle_id)
        future = asyncio.ensure_future(shuffle.receive(data))
gjoseph92 (Collaborator):

What if this task fails, but nothing's awaiting it? receive can raise (and even intentionally raises) exceptions. Something should be done to keep track of it. I think that piece of data would just be silently lost, producing incorrect output.

This can also allow memory to build up in the receive tasks, before they've started writing to the multi_file. I don't think there's any limitation on concurrent receive calls? At the beginning of a shuffle with 1000s of peer workers, couldn't you even start backing up behind the offload threadpool?

Those two issues make me wonder if we should always be awaiting something here. Both because if the data can't be successfully written to disk, that should probably be an error that's propagated back to the sender, and because it would give more useful backpressure.

I get not wanting to block sends on data fully writing to disk. That would probably slow us down a bit, especially at the beginning of a large shuffle. But as I understand it, performance tuning is a separate, later step, and not always awaiting shuffle.receive here feels like a tricky optimization that currently risks both incorrectness and blowing up memory.

mrocklin (Member Author):

Yeah, that's a valid concern. I don't have a great answer to this yet. We can raise the exception here early if we want to (that avoids hiding the exceptions that we're intentionally raising, but doesn't handle the unknown ones).

If you're cool with it, I'd like to put in a TODO here so that we don't forget, and then defer this to future work.
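For reference, a hedged sketch of the kind of bookkeeping this thread is asking for: keep references to the receive tasks, surface their exceptions instead of dropping them, and bound how many run at once for backpressure. All names below are illustrative, not the PR's code:

import asyncio


class ReceiveTrackerSketch:
    def __init__(self, max_concurrent_receives: int = 10) -> None:
        self._tasks: set[asyncio.Task] = set()
        self._semaphore = asyncio.Semaphore(max_concurrent_receives)
        self._errors: list[BaseException] = []

    def submit(self, coro) -> None:
        # Keep a reference so the task is neither garbage collected nor
        # silently forgotten; errors are recorded rather than lost.
        task = asyncio.ensure_future(self._run(coro))
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)

    async def _run(self, coro) -> None:
        async with self._semaphore:  # backpressure on concurrent receives
            try:
                await coro
            except Exception as exc:
                self._errors.append(exc)

    async def flush(self) -> None:
        # Wait for everything in flight and re-raise the first failure.
        if self._tasks:
            await asyncio.gather(*self._tasks, return_exceptions=True)
        if self._errors:
            raise self._errors[0]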

(resolved thread on distributed/shuffle/shuffle_extension.py)
    )

    def close(self):
        self.executor.shutdown()
gjoseph92 (Collaborator):

I'd think there's more to do than this? Stop any active shuffles? Clean up disk?

mrocklin (Member Author):

Good thought. I've changed this to close all shuffles (which in turn clean up multi_files (which in turn clean up disk)).

There is now also a test for closing workers mid-shuffle. (with a couple of TODOs as well for the next round)
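A minimal sketch of the cleanup order described here, with all attribute and class names assumed for illustration: close the active shuffles first (each of which closes its MultiFile and removes its on-disk data), and only then shut down the executor.

from concurrent.futures import ThreadPoolExecutor


class ShuffleWorkerExtensionSketch:
    def __init__(self) -> None:
        self.shuffles = {}  # shuffle id -> shuffle instance
        self.executor = ThreadPoolExecutor(1)

    def close(self) -> None:
        # Shuffles clean up their MultiFile buffers, which clean up disk.
        for shuffle in list(self.shuffles.values()):
            shuffle.close()
        self.shuffles.clear()
        # Tear down the thread pool only after the shuffles are gone.
        self.executor.shutdown()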

(resolved thread on distributed/shuffle/shuffle_extension.py)
except KeyError:
    queue = asyncio.Queue()
    for _ in range(MultiComm.max_connections):
        queue.put_nowait(None)
gjoseph92 (Collaborator):

It took me a bit to understand that this queue works the opposite way you'd expect—it's not pushing data in, it's pushing permission to write more back out. Sort of a form of CapacityLimiter? I think this would help to document.

mrocklin (Member Author):

I've added a note in the docstring

Member:

I agree with the comment. Took me a bit to understand this myself.

I think factoring this out should not be in scope for this PR, but I could see value in doing it regardless.

I think this is a case for a follow up ticket. @gjoseph92 would you mind opening one?
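As context for the follow-up ticket discussed here, the queue above works as a pool of send permits rather than a data queue. Below is a sketch of the same pattern factored out as the CapacityLimiter-style helper mentioned in this thread; the class and its interface are illustrative, not the PR's code:

import asyncio


class CapacityLimiterSketch:
    def __init__(self, max_connections: int) -> None:
        # The queue holds permission tokens, not data: take one to start a
        # send, put it back when finished, so at most max_connections sends
        # are in flight at any time.
        self._tokens = asyncio.Queue()
        for _ in range(max_connections):
            self._tokens.put_nowait(None)

    async def __aenter__(self) -> "CapacityLimiterSketch":
        await self._tokens.get()  # wait for a free slot
        return self

    async def __aexit__(self, *exc) -> None:
        self._tokens.put_nowait(None)  # hand the slot back


async def send_with_limit(limiter: CapacityLimiterSketch, send) -> None:
    async with limiter:
        await send()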

mrocklin (Member Author) commented May 2, 2022:

@gjoseph92 I know that you're travelling, so no pressure, but if you have time for another pass that would be welcome.

(resolved threads on distributed/shuffle/shuffle_extension.py)

from distributed.utils import log_errors


class MultiFile:
Member:

I agree with both. I'm not entirely sure if the classes themselves should be the same or if some common functionality can be shared (e.g. factoring out the queue stuff as a CapacityLimiter, xref https://github.com/dask/distributed/pull/6007/files#r862239876).

mrocklin (Member Author) commented May 3, 2022:

Good to go?

fjetter merged commit bc3c891 into dask:main on May 5, 2022.
mrocklin deleted the p2p-shuffle-scheduler branch on August 17, 2022.