
Race conditions from fetch to compute while AMM requests replica #6248

Merged: 8 commits merged into dask:main on May 6, 2022

Conversation

@fjetter (Member) commented Apr 29, 2022:

Closes #6244

There are two things in here:


The order in which we fill the recommendations impacts the order in which we send messages. In this specific case that could mean:

  • Worker A holds key T1
  • Worker B computes key T2 that depends on T1
  • Worker B fetches T1 (not yet in flight, just queued up)
  • Worker A dies
  • Scheduler transitions T1 back to released
  • This transition recommends transitioning T2 back to released which will generate a worker message to B to release T2 again
  • Scheduler transitions T2 to waiting and afterwards to processing

Depending on the order, by the time the handle-compute T1 signal arrives at B, the state will be either

  • T1: waiting
  • T2: waiting (which is strictly speaking false but shouldn't be harmful)

or just

  • T1: waiting, i.e. T1 has no dependents from the POV of the worker

The problem in #6244 is that a task is requested to be fetched and has no dependents on the worker. Therefore, if the worker is asked to release this task, there is no reason to hold on to it, so the worker transitions it to forgotten. There is currently no way out of forgotten other than through another handle_compute or acquire_replica (i.e. another ensure_task_exists).

The particular transition fetch->compute expands to fetch->released->compute, and we do not want to forget the task along the way. There are two options to avoid this:

  • If we encounter a forgotten task while doing these transitions, we could call ensure_task_exists to "revert" this
  • We explicitly block forgotten transitions in such a chain (this is what I am proposing, since I think it's the simplest approach); see the sketch below.
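
A minimal sketch of that second option, assuming a hypothetical worker-side transition loop; run_transition_chain and handle_edge are illustrative names, not the actual distributed.worker API, and the real change may look different:

    # Illustrative only, not the actual distributed.worker code: drain a
    # recommendations dict, but skip any "forgotten" recommendation while an
    # outer chain (e.g. fetch -> released -> compute) is still being expanded.
    def run_transition_chain(tasks, recommendations, handle_edge, block_forgotten=True):
        # handle_edge stands in for the per-edge transition handlers and
        # returns follow-up recommendations, which we keep processing.
        while recommendations:
            key, finish = recommendations.popitem()
            ts = tasks.get(key)
            if ts is None:
                continue
            if block_forgotten and finish == "forgotten":
                # Keep the task around (in released) instead of dropping all
                # of its state, so a follow-up compute-task message can still
                # pick it up.
                continue
            recommendations.update(handle_edge(ts, finish))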

@fjetter fjetter marked this pull request as ready for review April 29, 2022 14:14
@fjetter fjetter changed the title WIP Race conditions from fetch to compute Race conditions from fetch to compute Apr 29, 2022
@fjetter fjetter changed the title Race conditions from fetch to compute Race conditions from fetch to compute while AMM requests replica Apr 29, 2022
@mrocklin (Member) commented:

I took a brief look at this. No objection from me, but I didn't dive deeply into the logic. If tests pass and you're feeling confident about the added value @fjetter I think that it's ok to merge. If you can get someone like @crusaderky or @gjoseph92 to take a look that would be better of course.

@fjetter (Member, Author) commented Apr 29, 2022:

CI seems to get stuck on distributed/tests/test_steal.py::test_steal_when_more_tasks. It crashes immediately with an InvalidTransition locally. I'll have a look

@github-actions bot commented Apr 29, 2022:

Unit Test Results

    16 files ±0   16 suites ±0   duration 7h 17m 0s ⏱️ (−16m 32s)
 2 759 tests  +2:   2 677 passed ✔️  +2     78 skipped 💤 −1   4 failures +1
22 034 runs  +16:  21 011 passed ✔️ +13  1 018 skipped 💤 +2   5 failures +1

For more details on these failures, see this check.

Results for commit e6e4b50. ± Comparison against base commit 2286896.

♻️ This comment has been updated with latest results.

@gjoseph92 (Collaborator) left a comment:

Honestly, I don't fully understand what these changes are doing (especially reordering the transition_memory_released code). But if the tests you've added don't pass without them, that seems like a good sign?

Comment on lines +2226 to +2230
for dts in ts.waiters:
    if dts.state in ("no-worker", "processing"):
        recommendations[dts.key] = "waiting"
    elif dts.state == "waiting":
        dts.waiting_on.add(ts)
A Collaborator commented:

To confirm: the reason for moving this code later is because transitions are insertion-ordered, so this way key gets transitioned to forgotten/waiting before its waiters get transitioned to waiting?

I wonder if this block should even be in transition_memory_released at all? Why shouldn't this part be done in transition_released_waiting and transition_released_forgotten? The fact that we need to make this transition after we make the waiting/forgotten transition makes me think we're overstepping our job in this function, and this should be the job of the other transitions.

I guess I didn't know that the recommendations dict was considered ordered. (Obviously dicts are now ordered in Python, but did it used to be an OrderedDict in py<=3.6?) If there's some dependency structure in the recommendations (transition A has to happen before transition B), I'd think transition A should be responsible for recommending transition B, not that they should be mixed together. That seems easier to reason about.

@fjetter (Member, Author) replied May 1, 2022:

> To confirm: the reason for moving this code later is because transitions are insertion-ordered, […]

Yes, dictionaries are insertion-ordered and popitem pops from the end:

In [1]: adict = {}

In [2]: for x in range(10):
   ...:     adict[x] = x
   ...:

In [3]: adict.popitem()
Out[3]: (9, 9)

> I wonder if this block should even be in transition_memory_released at all?

Yes, it should. If a task that was in memory is no longer in memory and a waiter (i.e. a task to be executed) exists, we need to ensure that this waiter is released. These few lines allow tasks to be resilient to worker failures.

> Why shouldn't this part be done in transition_released_waiting

This transition is not there to reset/forget/release something; it schedules something for compute.

> and transition_released_forgotten

This transition should only be triggered after the scheduler deletes the entire graph. It should only ever have scheduling consequences if there are multiple graphs scheduled that share keys. Otherwise it is simply a sophisticated way to pop the task.

> If there's some dependency structure in the recommendations (transition A has to happen before transition B), I'd think transition A should be responsible for recommending transition B, not that they should be mixed together.

It's not about a dependency but about order.

In this specific case (see the test), the recommendations dict

{'f1': 'waiting', 'f2': 'waiting'}

means:

  1. Transition f2 from processing back to waiting
  2. Then transition f1 from released to waiting (f1 was in memory/released before)

I don't see how we could ever infer "please transition f1 to waiting" after we released f1. From a causality perspective, I don't see how we could map this as a dependency.

Edit: In a previous statement I argued that it's about finishing a chain of a task's transitions, but this was false. It's quite the opposite.
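
For illustration, a tiny standalone snippet (not the scheduler's actual transition loop) showing why {'f1': 'waiting', 'f2': 'waiting'} is handled in that order when the recommendations dict is drained with popitem():

    # dict.popitem() removes the most recently inserted item first, so the
    # recommendations are effectively drained LIFO.
    recommendations = {"f1": "waiting", "f2": "waiting"}
    order = []
    while recommendations:
        order.append(recommendations.popitem())
    print(order)  # [('f2', 'waiting'), ('f1', 'waiting')]: f2 is handled before f1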

@fjetter (Member, Author) commented May 5, 2022:

The OSX 3.10 (not ci1) failure is known: #6233.
The OSX 3.8 ci1 failure is, I think, related to #6211.

@crusaderky (Collaborator) left a comment:

Only two outstanding cosmetic issues for me

@crusaderky (Collaborator) commented:

I think I'm seeing two genuine regressions in the unit tests

@fjetter (Member, Author) commented May 6, 2022:

The remaining test regression is fixed by #6297.

@crusaderky crusaderky merged commit 70e5c90 into dask:main May 6, 2022
@@ -647,6 +648,7 @@ def __init__(
             ("ready", "released"): self.transition_generic_released,
             ("released", "error"): self.transition_generic_error,
             ("released", "fetch"): self.transition_released_fetch,
+            ("released", "missing"): self.transition_released_fetch,
A Collaborator commented:

Typo - this should be transition_generic_missing.
This causes an infinite transition loop (#6305):

  1. The scheduler calls handle_compute_task with an empty who_has. This is a broken use case to begin with, but it is happening and the worker should either cope with it gracefully or crash loudly in handle_compute_task itself.
  2. handle_compute_task creates the dependencies with ensure_task_exists in state=released and requests a transition to fetch
  3. The released->fetch transition, handled by transition_released_fetch, notices that who_has is empty, so instead of transitioning to fetch it returns {ts: missing} and keeps the task in released state
  4. The released->missing transition, erroneously handled by the same method, repeats point 3 forever.
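
For reference, the one-line correction described in this comment would look roughly like this (a sketch only; per the follow-up below, the actual fix landed in #6318):

    # corrected handler for the released -> missing edge, as suggested above
    ("released", "missing"): self.transition_generic_missing,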

@fjetter (Member, Author) commented:

Is this "fixed" by #6318?

A Collaborator replied:

Yes.

Development

Successfully merging this pull request may close these issues.

Deadlock fetching key from retiring worker, when scheduler thinks we already have the key
4 participants