Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure distributed.comm.core.connect can always be cancelled #6064

Closed
wants to merge 1 commit into from

Conversation

fjetter
Copy link
Member

@fjetter fjetter commented Apr 4, 2022

I noticed that tests like test_worker_waits_for_scheduler are not properly running even though the test logic is extremely simple.

The test test_worker_waits_for_scheduler particularly hangs in the asyncio.wait_for statement waiting for the worker to come up.
After some debugging, it appears the worker is stuck in the connect retry implementation trying to connect to the scheduler. This connect should've been cancelled after the wait_for hits its timeout but it wasn't.

It appears there is a race condition when cancelling a task. If the Task is cancelled right after it is finished and the exception is set, it will not raise a CancelledError but the original Exception. I don't know if this is a bug in CPython but it is not the behaviour I anticipated. I wrote the ensure_cancellation wrapper to make sure this behaviour is deterministic

cc @graingert

Closes #5861

@github-actions
Copy link
Contributor

github-actions bot commented Apr 4, 2022

Unit Test Results

       16 files  ±  0         16 suites  ±0   7h 27m 30s ⏱️ - 11m 23s
  2 750 tests +  5    2 663 ✔️ +  2       80 💤 ±0  7 +3 
21 885 runs  +40  20 840 ✔️ +35  1 037 💤 +2  8 +3 

For more details on these failures, see this check.

Results for commit 676bb18. ± Comparison against base commit fc5b460.

♻️ This comment has been updated with latest results.

Comment on lines +1626 to +1645
task.cancel()
await watcher.wait()
raise
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely sure if I need to await the task itself here. There are no warnings

cc @graingert

],
)
@gen_test()
async def test_connection_pool_cancellation(monkeypatch, closing, backend):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests do not capture this race condition but the rewrite is much cleaner and covers more ground so I'd like to keep it

@fjetter
Copy link
Member Author

fjetter commented Apr 19, 2022

There is a CPython issue which looks like the root cause, see python/cpython#86296

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flaky test_worker_waits_for_scheduler
1 participant