Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test retire workers deadlock #6240

Merged
merged 8 commits into from
Jun 16, 2022
Merged

Conversation

gjoseph92
Copy link
Collaborator

Tests for #6234. The test is very timing-sensitive, so I think I've set it up to pytest.skip itself if the timing doesn't work out to test the condition we want.

However, that won't work until #6239 is fixed. So I think we should probably hold off on merging, lest we introduce another flaky test.

  • Tests added / passed
  • Passes pre-commit run --all-files

@@ -9,13 +9,16 @@

import pytest

from distributed import Nanny, wait
from distributed import Event, Nanny, wait
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
from distributed import Event, Nanny, wait
from distributed import Event, Scheduler, Nanny, Worker, wait

await asyncio.sleep(0.01)

# `_track_retire_worker` _should_ now be sleeping for 0.5s, because there were >=200 keys on A.
# In this test, everything from here on needs to happen within 0.5s.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite true - everything from the very beginning of the replication needs to happen within 0.5s. Which I suspect it may cause flakiness on our CI.

assert isinstance(policy, RetireWorker)

# This will drop all the `xs` from A (since they're already replicated on B).
amm.run_once()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary. Dropping is not a precondition for done; it's just to reduce memory pressure in case of spilled keys.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling run_once is necessary, because what we wan to test is:

  • _track_retire_worker is sleeping
  • policy runs and removes itself because all keys have been replicated
  • another key appears on the retiring worker
  • _track_retire_worker wakes up

By running it once here, it will remove itself.

# This will drop all the `xs` from A (since they're already replicated on B).
amm.run_once()

# The policy has removed itself, because there's no more data in need of replication.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not what happens if you don't manually force run_once though.

@github-actions
Copy link
Contributor

Unit Test Results

       15 files  +       3         15 suites  +3   6h 35m 29s ⏱️ + 1h 4m 34s
  2 741 tests +       4    2 656 ✔️ +       9    80 💤  -   10    5 +  5 
20 291 runs  +3 901  19 335 ✔️ +3 728  938 💤 +155  18 +18 

For more details on these failures, see this check.

Results for commit a0d6f69. ± Comparison against base commit b837003.

gjoseph92 and others added 5 commits June 14, 2022 19:07
Co-authored-by: crusaderky <crusaderky@gmail.com>
Co-authored-by: crusaderky <crusaderky@gmail.com>
* If you set `poll_interval` to 0 or a small value in `_track_retire_worker`, the test reliably skips itself ("Timing didn't work out"). This should make us confident it won't become flaky in CI; at worst, it just won't run.
* With `--count=1000 -n10` on my machine it's passed 100% of the time (no skips even)
* If you remove the critical change from `RetireWorkerPolicy.done()`, it always fails
Copy link
Collaborator Author

@gjoseph92 gjoseph92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@crusaderky I've incorporated as much of your feedback as I could, but ended up having to walk back to more or less having the test look like I originally wrote it. That's the only way I can get it to actually test the condition from #6234.

I'm pretty confident this test won't become flaky:

  • If you set poll_interval to 0 or a small value in _track_retire_worker, the test reliably skips itself ("Timing didn't work out"). This should make us confident it won't become flaky in CI; at worst, it just won't run.
  • With --count=1000 -n10 on my machine it's passed 100% of the time (no skips)
  • If you remove the critical change (RetireWorker policy is done if removed #6234) from RetireWorkerPolicy.done(), it always fails.

The main downside is that it's extremely reliant on some very specific behavior, so if the retire_workers mechanism or policy is refactored significantly, it might become meaningless. In that case though, I would still expect it to consistently fail, or skip itself.

assert isinstance(policy, RetireWorker)

# This will drop all the `xs` from A (since they're already replicated on B).
amm.run_once()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling run_once is necessary, because what we wan to test is:

  • _track_retire_worker is sleeping
  • policy runs and removes itself because all keys have been replicated
  • another key appears on the retiring worker
  • _track_retire_worker wakes up

By running it once here, it will remove itself.

@github-actions
Copy link
Contributor

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       15 files  ±0         15 suites  ±0   6h 36m 10s ⏱️ -23s
  2 866 tests +1    2 784 ✔️ +30    80 💤 ±0  2  - 26 
21 231 runs  +7  20 291 ✔️ +37  938 💤 +2  2  - 29 

For more details on these failures, see this check.

Results for commit b09b1da. ± Comparison against base commit 344868a.

@gjoseph92
Copy link
Collaborator Author

The test has passed (not skipped) on every CI run here.

Flaky tests are:

Ready for final review.

Copy link
Member

@fjetter fjetter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I'll leave some time for @crusaderky since he's obviously the expert. but will merge tomorrow morning (my time) in case there is no more feedback.

@fjetter fjetter merged commit 33c5cb2 into dask:main Jun 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants