Shuffle service resilience #6105

Closed · mrocklin opened this issue Apr 11, 2022 · 7 comments · Fixed by #7508
Labels: discussion (Discussing a topic with no specific actions yet), scheduler, shuffle

@mrocklin (Member)

In #5520, #5976, and #6007 we've started a shuffle service. It has better memory characteristics, but it is not yet resilient. In particular, it can break in a few ways:

  1. A worker holding shuffle outputs can die mid-shuffle
  2. New outputs of a shuffle can be requested by a client after the shuffle has started
  3. Output futures of a shuffle can be unrequested by a client

There are a few ways to solve this problem. One way I'd like to discuss here is opening up scheduler events to extensions and letting them trigger transitions. In particular, scenarios 1 and 2 can both be handled by letting the extension track remove_worker and update_graph events and restart all of a shuffle's tasks if an output-holding worker dies or if an existing shuffle shows up again in a new graph. Scenario 3 can be handled by letting the extension track transition events and clean things up when the barrier task transitions out of memory.
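To make that concrete, here is a minimal sketch of the hook shape I have in mind. All names, attributes, and signatures here are hypothetical (this is not the current distributed API); the point is just that the extension listens to remove_worker / update_graph / transition events and responds through the transitions system rather than by mutating scheduler state directly.

class ShuffleSchedulerExtension:
    """Sketch of the proposed hook shape; names and signatures are hypothetical."""

    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.output_workers = {}   # shuffle id -> set of worker addresses
        self.tasks = {}            # shuffle id -> set of task keys
        self.barriers = {}         # barrier key -> shuffle id

    def remove_worker(self, address):
        # Scenario 1: an output-holding worker died mid-shuffle -> restart.
        for shuffle_id, workers in list(self.output_workers.items()):
            if address in workers:
                self._restart(shuffle_id)

    def update_graph(self, keys):
        # Scenario 2: a client requested new outputs of a shuffle that has
        # already started -> restart that shuffle from scratch.
        for barrier_key, shuffle_id in list(self.barriers.items()):
            if barrier_key in keys:
                self._restart(shuffle_id)

    def transition(self, key, start, finish):
        # Scenario 3: the barrier task left memory (its outputs were released
        # by the client) -> forget the shuffle's bookkeeping.
        if key in self.barriers and start == "memory" and finish != "memory":
            shuffle_id = self.barriers.pop(key)
            self.output_workers.pop(shuffle_id, None)
            self.tasks.pop(shuffle_id, None)

    def _restart(self, shuffle_id):
        # Re-run every task of the shuffle by recommending "released" through
        # the transitions system instead of editing task states by hand
        # (Scheduler.transitions exists, though its exact signature varies by version).
        recommendations = {key: "released" for key in self.tasks.get(shuffle_id, ())}
        self.scheduler.transitions(recommendations, stimulus_id=f"shuffle-restart-{shuffle_id}")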

So far, I think that this can solve all of the resilience issues in shuffling (at least everything I've come across). However, it introduces two possible concerns:

1 - Scheduler performance

Maybe it doesn't make sense for every transition to cycle through every extension to see if they care about transitions.

In practice, this doesn't seem to be that expensive:

In [1]: extensions = [object() for _ in range(10)]

In [2]: %%timeit
   ...: for i in range(1000):
   ...:     for extension in extensions:
   ...:         if hasattr(extension, "transition"):
   ...:             extension.transition()
   ...:
383 µs ± 7.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So that's 383 µs / (1000 iterations × 10 extensions) ≈ 40 ns per extension per transition, well under a microsecond per transition even with all of our extensions combined.

2 - Complexity

Now any extension can inject transitions. Horrible horrible freedom!

My guess is that this is OK as long as we maintain some hygiene, for example by always using the transitions system rather than mucking about with state directly.

This is also a somewhat systemic change for what is, today, a single feature.

cc @gjoseph92 @fjetter for feedback

@gjoseph92 (Collaborator)

Scheduler plugins already support all these events. Why not just use that interface? In practice, there's just as much horrible horrible freedom with plugins as extensions, since they also give you access to the Scheduler object to do whatever you want with. That's why I used a plugin in #5524 instead of an extension: https://github.com/dask/distributed/pull/5524/files#diff-bbcf2e505bf2f9dd0dc25de4582115ee4ed4a6e80997affc7b22122912cc6591R191-R194

The only upside I see to extensions is that they generally have at-most-one semantics, whereas plugins don't assume they're singletons.

An interface change we could consider is allowing plugins to return recommendations after these events. (In fact I once made this change a few years ago in a private fork for similar reasons.) This would be a slightly more well-structured way for plugins to affect things than calling transition or manipulating state directly.
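As a rough illustration of what that interface change could look like (the return-recommendations behaviour is the hypothetical part; today these hooks return None, and exact hook signatures vary a bit between distributed versions):

from distributed.diagnostics.plugin import SchedulerPlugin


class ShuffleResiliencePlugin(SchedulerPlugin):
    """Sketch only: a plugin that reacts to scheduler events.

    The bookkeeping attributes are made up for illustration; the proposed
    interface change is that hooks *return* a recommendations dict instead
    of poking at scheduler state themselves.
    """

    def __init__(self):
        self.shuffle_tasks = {}    # shuffle id -> set of task keys
        self.output_workers = {}   # shuffle id -> set of worker addresses

    def remove_worker(self, scheduler, worker, **kwargs):
        # Hypothetical: return recommendations; today the scheduler ignores
        # the return value of plugin hooks.
        recommendations = {}
        for shuffle_id, workers in self.output_workers.items():
            if worker in workers:
                for key in self.shuffle_tasks.get(shuffle_id, ()):
                    recommendations[key] = "released"
        return recommendations

    def transition(self, key, start, finish, *args, **kwargs):
        # Same idea for cleanup when a barrier task leaves memory.
        return {}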

Also, your benchmark doesn't include the function call (because object() has no attribute 'transition'). Adding that in more than doubles the time on my machine (533µs -> 1.42ms). When you consider that that function might actually... do something (😱) (mostly just to check there's nothing serious to do), I might actually worry a little about adding that overhead on every transition.
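A plain-timeit variant of that check, for anyone who wants to reproduce it with the call included (absolute numbers will differ by machine):

import timeit


class Extension:
    def transition(self):
        pass  # a real hook would at least check whether it cares about the event


extensions = [Extension() for _ in range(10)]


def dispatch():
    # 1000 transitions, each fanned out to 10 extensions, call included.
    for _ in range(1000):
        for extension in extensions:
            if hasattr(extension, "transition"):
                extension.transition()


# Time per batch of 1000 transitions (compare with the %%timeit numbers above).
print(timeit.timeit(dispatch, number=1000) / 1000)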

@mrocklin (Member, Author) commented Apr 12, 2022 via email

@mrocklin (Member, Author)

@gjoseph92 do you see obvious flaws with the plan above? Do you think that it would do the job? (although possibly not as nicely as desired)

@fjetter (Member) commented Oct 24, 2022

A couple of questions/comments, mostly for my understanding:

  1. I think a remove_worker hook for the shuffle extension is straightforward and the easiest approach for now
  2. What would be an example for "new outputs requested"?

Is this something like the following?

  • Cull the graph such that only some output keys are required
  • Later, request the entire graph after all
df = ...
x = df.shuffle(on="x")
y = x.partitions[x.npartitions//2].persist()
sleep(0.1)
z = x.persist()

This currently fails, and there are actually a couple of different error cases depending on how data is distributed on the cluster (sometimes an assertion error, sometimes a KeyError; I wouldn't be surprised if there is something else).

That's basically test_add_some_results

  3. Basically the other way round:
y = x.persist()
z = y.partitions[y.npartitions // 2].persist()
del y

which currently does not clean up state or close the existing shuffle properly. This is already written up in test_delete_some_results

Did I understand these error cases properly?

@mrocklin (Member, Author)

Did I understand these error cases properly?

Based on my (bad) memory, yes.

@mrocklin (Member, Author)

In many cases I think that the short term answer is

  1. Can we identify the situation?
  2. Can we reset everything to a clean state and start over?

If we get good at both, then that gives us an approach that is not ideal but totally workable, I think.

@fjetter added the discussion, scheduler, and shuffle labels on Oct 24, 2022
@fjetter (Member) commented Oct 24, 2022

I would suggest a slightly different approach:

  1. (As I said, a remove_worker hook.)
  2. I think this is a culling problem. The issue we are seeing here originates from the way we are building the graph. Whenever we cull output tasks, we should also do less work on the input side, which basically means rewriting the task graph.

If we look at this example

df = ...
x = df.shuffle(on="x")
y = x.partitions[x.npartitions//2].persist()
sleep(0.1)
z = x.persist()

I suggest that z and y not share any shuffling-related keys at all, and that each performs its own shuffle operation. I think this is much cleaner since it

  • makes culled graphs much faster (although I don't know how relevant this is for shuffle)
  • allows us to cancel both very cleanly
  • means we don't need to hook into the very, very messy update_graph

This is pretty easy if we just define a P2PShuffleLayer; see the sketch below for the culling behaviour I mean.
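This is only a sketch of the culling idea, not a working dask layer: a real P2PShuffleLayer would subclass dask.highlevelgraph.Layer and implement the full mapping protocol, and every name here (token, parts_out, etc.) is made up for illustration. The point is that culling produces a layer with a fresh shuffle token, so the culled graph and the full graph never share shuffle-related keys.

from uuid import uuid4


class P2PShuffleLayerSketch:
    def __init__(self, name, column, npartitions_out, parts_out=None, token=None):
        self.name = name
        self.column = column
        self.npartitions_out = npartitions_out
        # Output partitions this layer actually produces.
        if parts_out is None:
            parts_out = range(npartitions_out)
        self.parts_out = set(parts_out)
        # Unique id of this shuffle operation, baked into every task key.
        self.token = token or uuid4().hex

    def cull(self, keys):
        """Return a layer restricted to the requested output partitions.

        Restricting the outputs means less work on the input side, so we
        rebuild the layer around a fresh token instead of reusing keys of
        the full shuffle; y and z from the example above would then run
        two independent shuffle operations.
        """
        parts_out = {part for (_, part) in keys}
        if parts_out == self.parts_out:
            return self  # nothing was culled
        return P2PShuffleLayerSketch(
            self.name, self.column, self.npartitions_out, parts_out=parts_out
        )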

  3. The only problem I can come up with (after playing with it as well) is that we're not cleaning up state properly. That's unfortunate but not necessarily critical.
    I suggest replacing ShuffleSchedulerExtension.register_complete with an appropriate transition hook, ensuring that we clean up after ourselves (broadcasting to the workers if need be). A sketch of what that hook could look like follows.
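A minimal sketch of that cleanup hook, assuming the extension keeps a barrier-key -> shuffle-id mapping; the "shuffle_cleanup" message and the exact hook signature are assumptions for illustration, not the existing API:

import asyncio


class ShuffleCleanupSketch:
    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.barriers = {}  # barrier key -> shuffle id (hypothetical bookkeeping)

    def transition(self, key, start, finish, *args, **kwargs):
        # Instead of relying on register_complete, watch the barrier task:
        # once it leaves memory (released/forgotten), nobody needs the
        # shuffle's output any more and the workers can drop their buffers.
        if key in self.barriers and start == "memory" and finish in ("released", "forgotten"):
            shuffle_id = self.barriers.pop(key)
            # Runs on the scheduler's event loop, so schedule the async cleanup.
            asyncio.create_task(self._clean_workers(shuffle_id))

    async def _clean_workers(self, shuffle_id):
        # Broadcast the cleanup to all workers if need be; "shuffle_cleanup"
        # is a made-up handler name, not an existing worker RPC.
        await self.scheduler.broadcast(msg={"op": "shuffle_cleanup", "shuffle_id": shuffle_id})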
