Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stimulus ids scheduler #6161

Merged
merged 33 commits into from
Apr 26, 2022
Merged

Stimulus ids scheduler #6161

merged 33 commits into from
Apr 26, 2022

Conversation

fjetter
Copy link
Member

@fjetter fjetter commented Apr 20, 2022

This builds on #6095

High level changes

  • The Client.story now includes a prefix corresponding to scheduler or worker address to distinguish the events
  • I factored out utils_test.assert_valid_story
  • I moved all tests who's sole purpose is to assert how a story looks like, what properties it has, etc. but doesn't test much else in a dedicated test module test_stories.py. I kept trivial story references since I believe this adds value (e.g. simpy compute a task) and shrunk the asserted stories in other cases to what I consider the most valuable part (e.g. task_erred now asserts on a partial story instead of the full one)
  • distributed.stories is now private, i.e. _stories
  • I went over all other uses of assert_story outside of test_stories.py and ensured that their usage is required. I encountered at least one case where a property based test is a better fit. All other usages require the thorough story to catch race conditions
  • I added a warning to the documentation of assert_story to instruct people to use this with care since this low level testing comes with cost

cc @mrocklin @sjperkins

sjperkins and others added 30 commits April 8, 2022 14:12
Comment on lines 140 to 141
w_story = [msg[:-2] for msg in a.story(k) + b.story(k)]
assert_story(task_story, w_story)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an example where I removed the explicit story in favour of a property based approach. we care that the dump includes the same story as the workers would print them if asked directly. We don't care about how the story looks like but we care that the dump and the true server instance provide the same answer

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on this type of change.

Feel free to pass on this suggestion, but is there any way to remove the msg[:-2] bit from each test? Could this be normalized in assert_story for example? This seems like the kind of thing that we might change in the future, and it would be unfortunate if this pattern was picked up and copied by future devs.

Copy link
Member Author

@fjetter fjetter Apr 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be normalized in assert_story for example?

That actually caught me off guard as well and I didn't expect to need this. The documentation of assert_story mentions this but I was not aware of it. It caught be off guard. I'll have a quick look if the assert_story can be easily adapted.

Edit: Done


I also think this is a nice example for the "testing pyramid" I mentioned earlier. This test relies on the assumption that Worker.story and Scheduler.story work as advertised. This test is testing another feature, the cluster dump, that is working on a higher level. It combines multiple features behind the scenes and when testing this composite feature we care that the composition works as intended, not that the individual components work as intended.
To ensure that the individual components work as advertised, we will need dedicated unit tests for the components. Particularly, if there are nasty edge cases in the functionality of these components, they should be tested on unit test level, not on the composite level. The "pyramid" would be made up of one composite test, many low level tests that assert on edge cases (str input vs TaskState input, empty stories, format of stories, etc.). These low level unit tests can now be found in test_stories.py (how to organize tests is yet another conversation I would propose to skip for now)

Comment on lines +126 to +130
async def test_worker_story_with_deps(c, s, a, b):
"""
Assert that the structure of the story does not change unintentionally and
expected subfields are actually filled
"""
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test has been moved from test_worker.py I added it a while ago because the stories where actually broken. Specifically, the recommendations where not there because at some point we're popping off the recommendations dict. In this case, I consider one test that asserts how a worker story looks like specifically to be valuable. Since I introduced it, I don't think we had to touch it often

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And do you care about these specific stories? Or are we just copying our current state?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of testing stories within test_stories.py. This makes it seem to me like we're testing that stories themselves work, rather than using stories to test other things in Dask (which would concern me).

Copy link
Member Author

@fjetter fjetter Apr 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And do you care about these specific stories? Or are we just copying our current state?

Something in between. I think this test serves a few purposes

  1. Make sure that stories of a worker actually work. That's the trivial simple case. Verify it's format, verify how many stimulus IDs are in there (yes, I care about the exact number in this case), etc. I think this test is unique and the only place where we're doing this for the worker.
  2. There was a bug where dependencies were not included in the recommendations part of the transition log. To capture this, we could transition to a property based testing approach where I generate a story, pick the 4th element of a message, if it exists, and assert that it's not empty.

The above two points would better be served by property based testing, e.g. what assert_valid_story does, with a few additions. There are a few more reasons why this test is more explicit

  1. I consider good tests to be part of developer documentation. I consider this test readable and a developer can have a look and see how a transition log actually looks like. What information is in there, what types, what kind of stati do we log, etc. This shows a very simple computation with two keys so every developer should be able to follow this and reason about it to get a better understanding about how a transition log is working.
  2. The last and most important reason why I care about the exact story in this example is admittedly a bit of a code smell / anti-pattern. This example shows the exact happy path evolution of a trivial flow through the worker state machine. Strictly speaking, this is a unit test of the worker state machine as well so we're mixing two concerns into one test. This part of the test should probably rather live in test_worker_state_machine.py but it is obviously very hard to define a sharp boundary.

This makes it seem to me like we're testing that stories themselves work

Yes, that's my intention. There are a few use cases where we need stories to test dask functionalities but they are rare, and should be rare, e.g. to protect us from specific race conditions if there is no other feasible way to test it. In these cases, they are required because we don't have a better tool at the moment. As long as these tests that rely on stories to test functionality remain rare, I'm fine with that. For me, this is a trade off between development velocity, technical debt (as in brittle tests) and protection from regressions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Historically when faced with a challenge of how to test very specific behavior I've always tried to find a way around doing this. Dask has hundreds of tests for very specific situations. I think that it would be unfortunate if these all used transition logs or something similar. I think that we would not be able to move forward today if I had made that choice historically.

Stories are nicer than raw transition logs, I'll give you that. Maybe they're ok-enough? I don't know. I'm concerned because it's so attractive to go down that path, but then developers 1-2 years from now pay the price.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I consider good tests to be part of developer documentation.

I'm fine with that for developers of the story feature. However I'm not ok with that if this test will be sensitive to changes in other parts of the codebase.

If the answer here is "but we want developers to understand stories" then fine, I can accept that, but we should then have excellent documentation on stories and force all developers to go through it. I don't view this feature as being at that level of importance.

Tests as developer documentation can make sense if the tests are isolated. A developer will only run into a test if they're working on that part of the codebase. Presenting developers with failing tests randomly selected from the codebase is not an efficient way to train people.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I'm cool with this test (or any very precise test) if it's unlikely that developers working in other parts of the codebase will trigger this test. Do we think that that's the case today?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a look at how often this test (and why) it needed to be changed so far.

We touched it nine times already. With one very small exception, I consider the test valueable for all changes. We might not need it in its current state forever but right now I think it's more useful than a problem

Functionality around stories themselves

These are changes the test is actually supposed to test

Major refactorings (impacting stories as well)

  • Major refactoring where this test was critical a8d4ffa
  • Another big refactoring where this was important f6b2e03

Functional test that should be tested elsewhere but is visible in this test

These changes fix major bugs and the fix has an impact in the story. The story should not be used as the primary test source and indeed there are dedicated tests for the regressions themselves.

  • Fix a deadlock that can be tested by asserting on unique stim IDs 1670cf8
  • Fix a major bug that changes the story. There is a proper test for the functionality as well 833c5f6

Noisy changes due to refactoring / story format changes, etc.

git log -L1736,+42:distributed/tests/test_worker.py -s
commit f6b2e03cb659f19a516eb14c3270f9430a661af3
Author: crusaderky <crusaderky@gmail.com>
Date:   Fri Apr 8 21:00:31 2022 +0100

    Migrate ensure_executing transitions to new WorkerState event mechanism - part 2 (#6062)

commit 0f2674b61f27a87a779646009b7b1e936d28ef16
Author: crusaderky <crusaderky@gmail.com>
Date:   Mon Apr 4 17:54:24 2022 +0100

    Harden test_abort_execution_to_fetch and more (#6026)
commit f6b2e03cb659f19a516eb14c3270f9430a661af3
Author: crusaderky <crusaderky@gmail.com>
Date:   Fri Apr 8 21:00:31 2022 +0100

    Migrate ensure_executing transitions to new WorkerState event mechanism - part 2 (#6062)

commit 0f2674b61f27a87a779646009b7b1e936d28ef16
Author: crusaderky <crusaderky@gmail.com>
Date:   Mon Apr 4 17:54:24 2022 +0100

    Harden test_abort_execution_to_fetch and more (#6026)
commit f6b2e03cb659f19a516eb14c3270f9430a661af3
Author: crusaderky <crusaderky@gmail.com>
Date:   Fri Apr 8 21:00:31 2022 +0100

    Migrate ensure_executing transitions to new WorkerState event mechanism - part 2 (#6062)

commit 0f2674b61f27a87a779646009b7b1e936d28ef16
Author: crusaderky <crusaderky@gmail.com>
Date:   Mon Apr 4 17:54:24 2022 +0100

    Harden test_abort_execution_to_fetch and more (#6026)

commit c4de83ff35966a1a86f4e23a705fc7dcdbd65683
Author: crusaderky <crusaderky@gmail.com>
Date:   Thu Dec 16 17:08:23 2021 +0000

    assert_worker_story (#5598)

commit 03647c7d946cc47138b2dec223465a7cea77360a
Author: Florian Jetter <fjetter@users.noreply.github.com>
Date:   Thu Nov 18 22:45:16 2021 +0100

    Resolve `KeyError`-related deadlock (#5525)

commit 1670cf85b815a7f547fc040f3f74a24c679dbded
Author: Florian Jetter <fjetter@users.noreply.github.com>
Date:   Fri Oct 22 19:44:35 2021 +0200

    Ensure resumed flight tasks are still fetched (#5426)

commit a8d4ffa156cac22b05b9cef3a0599a08e9945146
Author: Florian Jetter <fjetter@users.noreply.github.com>
Date:   Mon Sep 27 14:05:23 2021 +0200

    Worker state machine refactor (#5046)

    Co-authored-by: crusaderky <crusaderky@gmail.com>

commit 5e5b9835a5e6929bb97acf94f0a2eecccc0b6a2f
Author: Florian Jetter <fjetter@users.noreply.github.com>
Date:   Fri Jun 11 19:12:02 2021 +0200

    Forget erred tasks // Fix deadlocks on worker (#4784)

    * Add a test about expected task states in an exception case

    * Minor refactoring about who_has state transitions

    * Remvoe worker kwarg from gather_dep fetch transition

    * Add special case for inflight tasks without dependents

    * Fix compute deadlocks

    * Code review comments

    * Improve request X keys for task debug log

    * Before calling missing-data ensure key is removed from who_has

    * review comments

    * Add test case for same-host missing deps

    * Add test about superfluous data race condition

commit 833c5f6c040feaa4550fa343d8e6e4feef3f84d5
Author: Florian Jetter <fjetter@users.noreply.github.com>
Date:   Tue May 25 12:52:57 2021 +0200

    Ensure busy workloads properly look up who_has (#4793)

commit 0a014dac4a1edb090ee17027fabcd41cdd015553
Author: Florian Jetter <fjetter@users.noreply.github.com>
Date:   Wed May 5 11:15:32 2021 +0200

    Ensure deps are actually logged in worker (#4753)

    * Ensure deps are actually logged in worker

    * Log only keys for gather-dependency log

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is already in main then I don't need to hold this up. However, I would like for us to pause on adding any more tests like this until we come to some agreement on these style of tests.

@github-actions
Copy link
Contributor

github-actions bot commented Apr 20, 2022

Unit Test Results

       16 files  ±  0         16 suites  ±0   7h 34m 22s ⏱️ - 27m 14s
  2 735 tests +  7    2 655 ✔️ +  8       80 💤  - 1  0 ±0 
21 765 runs  +56  20 731 ✔️ +60  1 034 💤  - 4  0 ±0 

Results for commit fe988af. ± Comparison against base commit b934ae6.

♻️ This comment has been updated with latest results.

Copy link
Member

@mrocklin mrocklin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some thoughts, mostly for discussion

distributed/tests/test_cluster_dump.py Outdated Show resolved Hide resolved
Comment on lines 140 to 141
w_story = [msg[:-2] for msg in a.story(k) + b.story(k)]
assert_story(task_story, w_story)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on this type of change.

Feel free to pass on this suggestion, but is there any way to remove the msg[:-2] bit from each test? Could this be normalized in assert_story for example? This seems like the kind of thing that we might change in the future, and it would be unfortunate if this pattern was picked up and copied by future devs.

Comment on lines +126 to +130
async def test_worker_story_with_deps(c, s, a, b):
"""
Assert that the structure of the story does not change unintentionally and
expected subfields are actually filled
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And do you care about these specific stories? Or are we just copying our current state?

Comment on lines +126 to +130
async def test_worker_story_with_deps(c, s, a, b):
"""
Assert that the structure of the story does not change unintentionally and
expected subfields are actually filled
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of testing stories within test_stories.py. This makes it seem to me like we're testing that stories themselves work, rather than using stories to test other things in Dask (which would concern me).

distributed/tests/test_stories.py Show resolved Hide resolved
with pytest.raises(AssertionError):
assert_story(scheduler_story, worker_story, strict=strict)
with pytest.raises(AssertionError):
assert_story(worker_story, scheduler_story, strict=strict)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love this test :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intent is super clear. It seems very robust to me. It also seems fairly powerful (it's not an easy test to pass)

story = await c.story(f.key)

# Every event should be prefixed with it's origin
# This changes the format compared to default scheduler / worker stories
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that if we're doing something that's pretty specific that we explain why we're doing what we're doing. This helps ...

  1. Give me the reviewer confidence that this test was intentional (rather than just copy-pasting current output)
  2. Gives futures developers context to debug this test when it fails on them

👍

Comment on lines +126 to +130
async def test_worker_story_with_deps(c, s, a, b):
"""
Assert that the structure of the story does not change unintentionally and
expected subfields are actually filled
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Historically when faced with a challenge of how to test very specific behavior I've always tried to find a way around doing this. Dask has hundreds of tests for very specific situations. I think that it would be unfortunate if these all used transition logs or something similar. I think that we would not be able to move forward today if I had made that choice historically.

Stories are nicer than raw transition logs, I'll give you that. Maybe they're ok-enough? I don't know. I'm concerned because it's so attractive to go down that path, but then developers 1-2 years from now pay the price.

Comment on lines +126 to +130
async def test_worker_story_with_deps(c, s, a, b):
"""
Assert that the structure of the story does not change unintentionally and
expected subfields are actually filled
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I consider good tests to be part of developer documentation.

I'm fine with that for developers of the story feature. However I'm not ok with that if this test will be sensitive to changes in other parts of the codebase.

If the answer here is "but we want developers to understand stories" then fine, I can accept that, but we should then have excellent documentation on stories and force all developers to go through it. I don't view this feature as being at that level of importance.

Tests as developer documentation can make sense if the tests are isolated. A developer will only run into a test if they're working on that part of the codebase. Presenting developers with failing tests randomly selected from the codebase is not an efficient way to train people.

Comment on lines +126 to +130
async def test_worker_story_with_deps(c, s, a, b):
"""
Assert that the structure of the story does not change unintentionally and
expected subfields are actually filled
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I'm cool with this test (or any very precise test) if it's unlikely that developers working in other parts of the codebase will trigger this test. Do we think that that's the case today?

distributed/worker.py Outdated Show resolved Hide resolved
@fjetter
Copy link
Member Author

fjetter commented Apr 22, 2022

One of the errors is about the task state machine. I think this has been reported (or even fixed) somewhere else

Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4027, in validate_task
    self.validate_task_fetch(ts)
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 3969, in validate_task_fetch
    assert ts.who_has
AssertionError
2022-04-21 15:55:14,668 - distributed.core - ERROR - Invalid TaskState encountered for <TaskState "('rechunk-merge-b0e8392a02374c12bda4005e8cf8a92d', 0, 83)" fetch>.
Story:
[("('rechunk-merge-b0e8392a02374c12bda4005e8cf8a92d', 0, 83)", 'compute-task', 'compute-task-1650556512.9722297', 1650556513.0147333), ("('rechunk-merge-b0e8392a02374c12bda4005e8cf8a92d', 0, 83)", 'released', 'waiting', 'waiting', {"('rechunk-split-b0e8392a02374c12bda4005e8cf8a92d', 333)": 'fetch', "('rechunk-split-b0e8392a02374c12bda4005e8cf8a92d', 334)": 'fetch'}, 'compute-task-1650556512.9722297', 1650556513.0148942), ("('rechunk-merge-b0e8392a02374c12bda4005e8cf8a92d', 0, 83)", 'waiting', 'ready', 'ready', {}, 'ensure-communicating-1650556513.0294855', 1650556513.1823983), ("('rechunk-merge-b0e8392a02374c12bda4005e8cf8a92d', 0, 83)", 'ready', 'executing', 'executing', {}, 'task-finished-1650556513.1840498', 1650556513.[2027](https://github.com/dask/distributed/runs/6114917291?check_suite_focus=true#step:11:2027)75), ("('rechunk-merge-b0e8392a02374c12bda4005e8cf8a92d', 0, 83)", 'put-in-memory', 'task-finished-1650556513.2260528', 1650556513.2265954), ("('rechunk-merge-b0e8392a02374c12bda4005e8cf8a92d', 0, 83)", 'executing', 'memory', 'memory', {"('rechunk-split-b0e8392a02374c12bda4005e8cf8a92d', 152)": 'executing'}, 'task-finished-1650556513.2260528', 1650556513.2266343), ('free-keys', ("('rechunk-merge-b0e8392a02374c12bda4005e8cf8a92d', 0, 83)",), 'worker-connect-1650556507.2631679', 1650556514.2604678), ("('rechunk-merge-b0e8392a02374c12bda4005e8cf8a92d', 0, 83)", 'release-key', 'worker-connect-1650556507.2631679', 1650556514.2604795), ("('rechunk-merge-b0e8392a02374c12bda4005e8cf8a92d', 0, 83)", 'memory', 'released', 'released', {"('rechunk-merge-b0e8392a02374c12bda4005e8cf8a92d', 0, 83)": 'forgotten'}, 'worker-connect-1650556507.2631679', 1650556514.260621), ("('rechunk-merge-b0e8392a02374c12bda4005e8cf8a92d', 0, 83)", 'released', 'forgotten', 'forgotten', {}, 'worker-connect-1650556507.2631679', 1650556514.260636), ("('rechunk-merge-b0e8392a02374c12bda4005e8cf8a92d', 0, 83)", 'ensure-task-exists', 'released', 'compute-task-1650556514.319492', 1650556514.6215317), ("('rechunk-merge-b0e8392a02374c12bda4005e8cf8a92d', 0, 83)", 'released', 'fetch', 'fetch', {}, 'compute-task-1650556514.319492', 1650556514.621598)]
Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4027, in validate_task
    self.validate_task_fetch(ts)
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 3969, in validate_task_fetch
    assert ts.who_has
AssertionError

@crusaderky
Copy link
Collaborator

Copy link
Member Author

@fjetter fjetter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears test_worker_story_with_deps is the last controversial test. It is not a new test and has been around for a while. As I commented, I think this test is fine the way it is. Particularly with it being moved to test_stories.py it should be easier to work with

@mrocklin are there any objections to merging this PR?

Comment on lines +126 to +130
async def test_worker_story_with_deps(c, s, a, b):
"""
Assert that the structure of the story does not change unintentionally and
expected subfields are actually filled
"""
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a look at how often this test (and why) it needed to be changed so far.

We touched it nine times already. With one very small exception, I consider the test valueable for all changes. We might not need it in its current state forever but right now I think it's more useful than a problem

Functionality around stories themselves

These are changes the test is actually supposed to test

Major refactorings (impacting stories as well)

  • Major refactoring where this test was critical a8d4ffa
  • Another big refactoring where this was important f6b2e03

Functional test that should be tested elsewhere but is visible in this test

These changes fix major bugs and the fix has an impact in the story. The story should not be used as the primary test source and indeed there are dedicated tests for the regressions themselves.

  • Fix a deadlock that can be tested by asserting on unique stim IDs 1670cf8
  • Fix a major bug that changes the story. There is a proper test for the functionality as well 833c5f6

Noisy changes due to refactoring / story format changes, etc.

git log -L1736,+42:distributed/tests/test_worker.py -s
commit f6b2e03cb659f19a516eb14c3270f9430a661af3
Author: crusaderky <crusaderky@gmail.com>
Date:   Fri Apr 8 21:00:31 2022 +0100

    Migrate ensure_executing transitions to new WorkerState event mechanism - part 2 (#6062)

commit 0f2674b61f27a87a779646009b7b1e936d28ef16
Author: crusaderky <crusaderky@gmail.com>
Date:   Mon Apr 4 17:54:24 2022 +0100

    Harden test_abort_execution_to_fetch and more (#6026)
commit f6b2e03cb659f19a516eb14c3270f9430a661af3
Author: crusaderky <crusaderky@gmail.com>
Date:   Fri Apr 8 21:00:31 2022 +0100

    Migrate ensure_executing transitions to new WorkerState event mechanism - part 2 (#6062)

commit 0f2674b61f27a87a779646009b7b1e936d28ef16
Author: crusaderky <crusaderky@gmail.com>
Date:   Mon Apr 4 17:54:24 2022 +0100

    Harden test_abort_execution_to_fetch and more (#6026)
commit f6b2e03cb659f19a516eb14c3270f9430a661af3
Author: crusaderky <crusaderky@gmail.com>
Date:   Fri Apr 8 21:00:31 2022 +0100

    Migrate ensure_executing transitions to new WorkerState event mechanism - part 2 (#6062)

commit 0f2674b61f27a87a779646009b7b1e936d28ef16
Author: crusaderky <crusaderky@gmail.com>
Date:   Mon Apr 4 17:54:24 2022 +0100

    Harden test_abort_execution_to_fetch and more (#6026)

commit c4de83ff35966a1a86f4e23a705fc7dcdbd65683
Author: crusaderky <crusaderky@gmail.com>
Date:   Thu Dec 16 17:08:23 2021 +0000

    assert_worker_story (#5598)

commit 03647c7d946cc47138b2dec223465a7cea77360a
Author: Florian Jetter <fjetter@users.noreply.github.com>
Date:   Thu Nov 18 22:45:16 2021 +0100

    Resolve `KeyError`-related deadlock (#5525)

commit 1670cf85b815a7f547fc040f3f74a24c679dbded
Author: Florian Jetter <fjetter@users.noreply.github.com>
Date:   Fri Oct 22 19:44:35 2021 +0200

    Ensure resumed flight tasks are still fetched (#5426)

commit a8d4ffa156cac22b05b9cef3a0599a08e9945146
Author: Florian Jetter <fjetter@users.noreply.github.com>
Date:   Mon Sep 27 14:05:23 2021 +0200

    Worker state machine refactor (#5046)

    Co-authored-by: crusaderky <crusaderky@gmail.com>

commit 5e5b9835a5e6929bb97acf94f0a2eecccc0b6a2f
Author: Florian Jetter <fjetter@users.noreply.github.com>
Date:   Fri Jun 11 19:12:02 2021 +0200

    Forget erred tasks // Fix deadlocks on worker (#4784)

    * Add a test about expected task states in an exception case

    * Minor refactoring about who_has state transitions

    * Remvoe worker kwarg from gather_dep fetch transition

    * Add special case for inflight tasks without dependents

    * Fix compute deadlocks

    * Code review comments

    * Improve request X keys for task debug log

    * Before calling missing-data ensure key is removed from who_has

    * review comments

    * Add test case for same-host missing deps

    * Add test about superfluous data race condition

commit 833c5f6c040feaa4550fa343d8e6e4feef3f84d5
Author: Florian Jetter <fjetter@users.noreply.github.com>
Date:   Tue May 25 12:52:57 2021 +0200

    Ensure busy workloads properly look up who_has (#4793)

commit 0a014dac4a1edb090ee17027fabcd41cdd015553
Author: Florian Jetter <fjetter@users.noreply.github.com>
Date:   Wed May 5 11:15:32 2021 +0200

    Ensure deps are actually logged in worker (#4753)

    * Ensure deps are actually logged in worker

    * Log only keys for gather-dependency log

@mrocklin mrocklin merged commit 3e0f702 into dask:main Apr 26, 2022
@fjetter fjetter deleted the stimulus_ids_scheduler branch April 26, 2022 12:43
crusaderky added a commit to crusaderky/distributed that referenced this pull request Apr 26, 2022
crusaderky added a commit that referenced this pull request Apr 26, 2022
@fjetter fjetter mentioned this pull request May 17, 2022
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants