Test retire workers deadlock #6240

gjoseph92 · 2022-04-28T20:18:06Z

Tests for #6234. The test is very timing-sensitive, so I think I've set it up to pytest.skip itself if the timing doesn't work out to test the condition we want.
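A minimal sketch of the self-skipping pattern described above, not the test from this PR. It assumes hypothetical names (`check_window`, `timing_sensitive_test`, `WINDOW`) and uses the standard library's `unittest.SkipTest` as a stand-in for `pytest.skip`. The idea is to measure whether the setup finished inside the timing window and to skip, rather than fail, when it did not.

```python
import time
import unittest

# Assumed window in seconds; mirrors the 0.5s sleep discussed in the review.
WINDOW = 0.5


def check_window(started: float, now: float, window: float = WINDOW) -> None:
    """Raise SkipTest if the timing-sensitive setup overran its window."""
    elapsed = now - started
    if elapsed >= window:
        raise unittest.SkipTest(f"Timing didn't work out: setup took {elapsed:.3f}s")


def timing_sensitive_test(setup_delay: float) -> str:
    """Hypothetical test body: set up, verify the window, then assert."""
    started = time.monotonic()
    time.sleep(setup_delay)  # stands in for the real setup work
    check_window(started, time.monotonic())
    # Only reached when the condition under test was actually exercised.
    return "ran"
```

With this shape, a slow CI machine produces a skip instead of a spurious failure, at the cost of the test sometimes not running at all.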

However, that won't work until #6239 is fixed. So I think we should probably hold off on merging, lest we introduce another flaky test.

Tests added / passed
Passes pre-commit run --all-files

crusaderky · 2022-04-28T22:50:23Z

distributed/tests/test_active_memory_manager.py

@@ -9,13 +9,16 @@

 import pytest

-from distributed import Nanny, wait
+from distributed import Event, Nanny, wait


Suggested change

-from distributed import Event, Nanny, wait
+from distributed import Event, Scheduler, Nanny, Worker, wait

crusaderky · 2022-04-28T22:56:38Z

distributed/tests/test_active_memory_manager.py

+        await asyncio.sleep(0.01)
+
+    # `_track_retire_worker` _should_ now be sleeping for 0.5s, because there were >=200 keys on A.
+    # In this test, everything from here on needs to happen within 0.5s.


Not quite true: everything from the very beginning of the replication needs to happen within 0.5s, which I suspect may cause flakiness on our CI.

crusaderky · 2022-04-28T22:58:53Z

distributed/tests/test_active_memory_manager.py

+    assert isinstance(policy, RetireWorker)
+
+    # This will drop all the `xs` from A (since they're already replicated on B).
+    amm.run_once()


Unnecessary. Dropping is not a precondition for done; it's just to reduce memory pressure in case of spilled keys.

crusaderky · 2022-04-28T22:59:54Z

distributed/tests/test_active_memory_manager.py

+    # This will drop all the `xs` from A (since they're already replicated on B).
+    amm.run_once()
+
+    # The policy has removed itself, because there's no more data in need of replication.


This is not what happens if you don't manually force run_once though.

github-actions · 2022-04-29T01:02:02Z

Unit Test Results

15 files (+3), 15 suites (+3), 6h 35m 29s ⏱️ (+1h 4m 34s)
2 741 tests (+4): 2 656 ✔️ (+9), 80 💤 (-10), 5 ❌ (+5)
20 291 runs (+3 901): 19 335 ✔️ (+3 728), 938 💤 (+155), 18 ❌ (+18)

For more details on these failures, see this check.

Results for commit a0d6f69. ± Comparison against base commit b837003.

gjoseph92

@crusaderky I've incorporated as much of your feedback as I could, but ended up walking most of it back, so the test looks more or less like I originally wrote it. That's the only way I can get it to actually test the condition from #6234.

I'm pretty confident this test won't become flaky:

* If you set `poll_interval` to 0 or a small value in `_track_retire_worker`, the test reliably skips itself ("Timing didn't work out"). This should make us confident it won't become flaky in CI; at worst, it just won't run.
* With `--count=1000 -n10` on my machine it's passed 100% of the time (no skips).
* If you remove the critical change (RetireWorker policy is done if removed #6234) from `RetireWorkerPolicy.done()`, it always fails.

The main downside is that it's extremely reliant on some very specific behavior, so if the retire_workers mechanism or policy is refactored significantly, it might become meaningless. In that case though, I would still expect it to consistently fail, or skip itself.

gjoseph92 · 2022-06-15T04:37:36Z

distributed/tests/test_active_memory_manager.py

+    assert isinstance(policy, RetireWorker)
+
+    # This will drop all the `xs` from A (since they're already replicated on B).
+    amm.run_once()


Calling `run_once` is necessary, because what we want to test is:

1. `_track_retire_worker` is sleeping
2. policy runs and removes itself because all keys have been replicated
3. another key appears on the retiring worker
4. `_track_retire_worker` wakes up

By running it once here, it will remove itself.
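
The sequence above can be sketched as a toy asyncio model. This is an illustration under stated assumptions, not the code in `distributed`: `ToyPolicy`, `track`, `scenario`, and the `fixed` flag are hypothetical names, and the stuck tracker is modelled as giving up after `max_polls` rather than hanging. The `fixed=True` branch mirrors the idea in the title of #6234 (a removed policy counts as done).

```python
import asyncio


class ToyPolicy:
    """Hypothetical stand-in for a retire-worker policy; not distributed's class."""

    def __init__(self, policies: set, unique_keys: set, fixed: bool):
        self.policies = policies        # the manager's set of active policies
        self.unique_keys = unique_keys  # keys that exist only on the retiring worker
        self.fixed = fixed

    def run(self) -> None:
        # Step 2: nothing left to replicate, so the policy removes itself.
        if not self.unique_keys:
            self.policies.discard(self)

    def done(self) -> bool:
        if self.fixed and self not in self.policies:
            return True  # a removed policy counts as done
        return not self.unique_keys


async def track(policy: ToyPolicy, poll_interval: float, max_polls: int) -> bool:
    """Poll until the policy reports done; give up after max_polls polls."""
    for _ in range(max_polls):
        await asyncio.sleep(poll_interval)  # steps 1 and 4: sleep, then wake up
        if policy.done():
            return True
    return False


async def scenario(fixed: bool) -> bool:
    policies: set = set()
    unique_keys: set = set()
    policy = ToyPolicy(policies, unique_keys, fixed)
    policies.add(policy)
    tracker = asyncio.create_task(track(policy, poll_interval=0.01, max_polls=3))
    await asyncio.sleep(0)      # step 1: let the tracker start sleeping
    policy.run()                # step 2: the policy removes itself
    unique_keys.add("new-key")  # step 3: another key lands on the retiring worker
    return await tracker        # step 4: the tracker wakes up and checks done()
```

In this model, `asyncio.run(scenario(fixed=False))` never observes `done()` returning true, because the removed policy no longer runs and the new key is never replicated; with `fixed=True` the tracker finishes on its first wake-up.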

github-actions · 2022-06-15T05:30:16Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files (±0), 15 suites (±0), 6h 36m 10s ⏱️ (-23s)
2 866 tests (+1): 2 784 ✔️ (+30), 80 💤 (±0), 2 ❌ (-26)
21 231 runs (+7): 20 291 ✔️ (+37), 938 💤 (+2), 2 ❌ (-29)

For more details on these failures, see this check.

Results for commit b09b1da. ± Comparison against base commit 344868a.

gjoseph92 · 2022-06-15T14:51:07Z

The test has passed (not skipped) on every CI run here.

Flaky tests are:

Ready for final review.

fjetter

Looks good to me. I'll leave some time for @crusaderky since he's obviously the expert, but I will merge tomorrow morning (my time) if there is no more feedback.

gjoseph92 added 3 commits April 27, 2022 13:17

RetireWorker policy is done if removed

eae1bd4

wip test

1cbee7f

Test

a0d6f69

gjoseph92 requested a review from crusaderky April 28, 2022 20:18

gjoseph92 mentioned this pull request Apr 28, 2022

RetireWorker policy is done if removed #6234

Merged

2 tasks

crusaderky assigned crusaderky and gjoseph92 and unassigned crusaderky Apr 28, 2022

crusaderky reviewed Apr 28, 2022

gjoseph92 and others added 5 commits June 14, 2022 19:07

Suggestions from Guido

585a17e

Co-authored-by: crusaderky <crusaderky@gmail.com>

Main suggestion from Guido

5d5498c

Co-authored-by: crusaderky <crusaderky@gmail.com>

Merge remote-tracking branch 'upstream/main' into test-retire-workers-deadlock

2ccd44a

clean up imports

b09b1da

gjoseph92 commented Jun 15, 2022

fjetter reviewed Jun 15, 2022

fjetter merged commit 33c5cb2 into dask:main Jun 16, 2022
