Support Stimulus ID's via argument passing #6095

Closed
wants to merge 27 commits into from

Conversation

sjperkins
Member

@sjperkins sjperkins commented Apr 8, 2022

@github-actions
Contributor

github-actions bot commented Apr 8, 2022

Unit Test Results

16 files  ±0    16 suites  ±0    7h 14m 12s ⏱️ -4m 42s
2 743 tests  +5    2 662 ✔️ +6    81 💤 ±0    0 ❌ -1
21 829 runs  +40    20 796 ✔️ +41    1 033 💤 ±0    0 ❌ -1

Results for commit 428dcb9. ± Comparison against base commit 41ecbca.

♻️ This comment has been updated with latest results.

@sjperkins
Member Author

sjperkins commented Apr 11, 2022

The tip 0a0f4bc produces a green suite, while the PR sync event produces unrelated failures:

@mrocklin
Member

Taking a brief look at this PR, things seem OK. Nothing yells "danger" at me. However, at the same time, the value of this isn't immediately apparent. I went to the tests to figure out why this would be valuable and mostly saw that we were just handling the inertia of existing tests (which concerns me a bit, but is unrelated to this PR).

So I then went to the issue that caused this PR, #5849, and read this:

The scheduler generates some of these stimulus_ids and includes them in RPC calls to the worker in a few places. However, it does not trace its own transitions making it very hard to infer why such a stimulus was generated. Introducing the same system on scheduler side and including the appropriate IDs in requests to the worker would allow us to close the circle and reconstruct a cluster wide history and link all transitions which were caused by an event.

OK, so this sounds pretty interesting and definitely valuable. Does the work in this PR accomplish closing the circle here? If so, is there something that we could do to verify success? In general I'm a big fan of testing things by testing user API. In this case, what do stimulus IDs buy us? What kind of workflow would this PR help us to accomplish? Is there a way to do that workflow in a test to both ...

  1. demonstrate to ourselves that we've accomplished our objective
  2. make sure that this functionality remains working as the code here changes

Thoughts?

@sjperkins
Member Author

sjperkins commented Apr 11, 2022

OK, so this sounds pretty interesting and definitely valuable. Does the work in this PR accomplish closing the circle here? If so, is there something that we could do to verify success?

test_stimuli does a basic test of a stimulus flow through the cluster. On the other end of the spectrum we could test the complete set of possible transitions and their associated stimuli, but this is probably not a valuable use of time. When the scheduler is tested with validation on, we at least know that every transition exercised by the test suite sets a stimulus in the transition log.

In general I'm a big fan of testing things by testing user API. In this case, what do stimulus IDs buy us? What kind of workflow would this PR help us to accomplish? Is there a way to do that workflow in a test to both ...

1. demonstrate to ourselves that we've accomplished our objective
2. make sure that this functionality remains working as the code here changes

test_stimuli could be changed to test with dask.array, for example. I think this would still involve testing each key individually in a loop to introduce determinism in the transition log. However, stimuli are diagnostic tools used in distributed tracing (called spans in the linked article) -- I'm not convinced that testing via the user API is going to buy us much in terms of validating scheduler internals.

"update-graph-hlg",
"task-finished",
"client-releases-keys",
]
Member


I'm curious, why are there two update-graph-hlg stimuli here?

Member Author

@sjperkins sjperkins Apr 12, 2022


They're actually the same stimulus_id. Stimuli are of the form f"{name}-{time()}", so only the prefix is tested here. The stimulus_id is duplicated because there are 4 events in the scheduler transition log story.

        [  # Expected subset of transition log
            (key, "released", "waiting", {key: "processing"}),
            (key, "waiting", "processing", {}),
            (key, "processing", "memory", {}),
            (key, "memory", "forgotten", {} ),
        ],

In the above case, a single update-graph-{time()} stimulus is associated with (1) the transition from "released" to "waiting" and (2) from "waiting" to "processing". Once the task successfully completes on a worker, a new task-finished-{time()} stimulus_id is generated and sent to the scheduler, where (3) the task is transitioned from "processing" to "memory". Once the client has received the result, the scheduler releases the keys (client-releases-keys-{time()}) and (4) transitions the task from "memory" to "forgotten".
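
As an illustration of that format (a minimal sketch, not code from this PR; the helper names are hypothetical), a stimulus_id is just a name with a timestamp appended, and tests compare only the stable prefix:

    import re
    from time import time

    def new_stimulus_id(name: str) -> str:
        # Hypothetical helper: a stimulus_id is a name plus a timestamp,
        # e.g. "update-graph-hlg-1649680000.123".
        return f"{name}-{time()}"

    def stimulus_prefix(stimulus_id: str) -> str:
        # Strip everything from the first digit onwards, leaving only the
        # stable name prefix (the same trick used in tests later in this thread).
        return stimulus_id[: re.search(r"\d", stimulus_id).start()]

    sid = new_stimulus_id("update-graph-hlg")
    assert stimulus_prefix(sid) == "update-graph-hlg-"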

{},
),
],
)
Member


This test seems pretty brittle to me. Any time we need to change the state machine we'll need to change all tests that look like this. I think that tests like this add a fair amount of inertia to the test suite.

Member Author


I think this statement is broadly true for complex workflows. As discussed elsewhere, the transition_logs (and thus stories) get non-deterministic very quickly for complex workflows.

This is why the actual graph for this test case is very simple: c.submit(inc, 1).

@mrocklin
Member

test_stimuli could be changed to test with dask.array, for example. I think this would still involve testing each key individually in a loop to introduce determinism in the transition log. However, stimuli are diagnostic tools used in distributed tracing (called spans in the linked article) -- I'm not convinced that testing via the user API is going to buy us much in terms of validating scheduler internals.

Yeah, I don't need us to have a user API for this just yet, but I do think that we should think about the invariants that we would expect here. For example, maybe we run a mildly complex computation (like a dask array computation), and then track that every transition goes back to the same specific update-graph call? Then maybe we remove a worker and track all subsequent transitions to that?

Stepping back for a moment, if I go and add some code in the scheduler or in the worker and I route stimulus IDs incorrectly, are there tests that will tell me that I've made a mistake?
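
For context, a rough sketch of the kind of invariant test being suggested (assuming this PR's scheduler log entry format of (key, start, finish, recommendations, stimulus_id, timestamp); this is illustrative, not a test from the PR):

    from distributed.utils_test import gen_cluster, inc

    @gen_cluster(client=True)
    async def test_transitions_share_update_graph_stimulus(c, s, a, b):
        fut = c.submit(inc, 1)
        await fut

        # Assumes this PR's log entry format, i.e. the stimulus_id sits
        # second-to-last in each scheduler transition log entry.
        story = s.story(fut.key)
        submission_sids = {
            entry[-2] for entry in story if entry[2] in ("waiting", "processing")
        }
        # Invariant: the transitions driven by graph submission all trace back
        # to the single stimulus generated by the update-graph call.
        assert len(submission_sids) == 1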

@mrocklin
Member

If there is some way to verify things in the validation system that would be great. The link you pointed to didn't include line numbers. Did you mean to point to something specific there?

@mrocklin
Member

Ah, I think that we're just verifying that there exists some ID in validation. Is that correct?

Also, just to lower expectations here, it may be that you don't actually have all of the information to test this properly. It may be that the main use case here was from someone like @fjetter and that he actually needs to engage in the design of the full test.

I might also be wrong here and am happy to be overruled. In general though, if we're going to make a feature then it would be good to see that feature used in the way in which it was intended to be used in some test. My sense right now is that this was intended to be used to fuse scheduler and worker stories together to trace what caused some event. I think that we should demonstrate that ability at a high level in some test before we claim victory here. Again, happy to be overruled.

@sjperkins
Member Author

sjperkins commented Apr 12, 2022

Yeah, I don't need us to have a user API for this just yet, but I do think that we should think about the invariants that we would expect here. For example, maybe we run a mildly complex computation (like a dask array computation), and then track that every transition goes back to the same specific update-graph call? Then maybe we remove a worker and track all subsequent transitions to that?

Stepping back for a moment, if I go and add some code in the scheduler or in the worker and I route stimulus IDs incorrectly, are there tests that will tell me that I've made a mistake?

No, not at this point. I did think a bit about more complex test cases, but one thing I noted when implementing #5987 is that the transition logs for multiple keys and complex workflows quickly become non-deterministic. In fact, I tried to make one of the existing story test cases a bit more representative in this PR https://github.com/dask/distributed/pull/6095/files#diff-ee66fe9141ad4e11edb332284757324f3026e7126c39b881aa7d637224b4c92fR1691

    assert {sid[: re.search(r"\d", sid).start()] for sid in stimulus_ids} == {
        "compute-task-",
        "ensure-communicating-",
        "task-finished-",
    }

If there is some way to verify things in the validation system that would be great. The link you pointed to didn't include line numbers. Did you mean to point to something specific there?

Yeah, I think GitHub didn't expand scheduler.py, which made the link fail: https://github.com/dask/distributed/pull/6095/files#diff-bbcf2e505bf2f9dd0dc25de4582115ee4ed4a6e80997affc7b22122912cc6591R2387

            if not stimulus_id:
                stimulus_id = STIMULUS_ID_UNSET

            finish2 = ts._state
            # FIXME downcast antipattern
            scheduler = pep484_cast(Scheduler, self)
            scheduler.transition_log.append(
                (key, start, finish2, recommendations, stimulus_id, time())
            )
            if parent._validate:
                if stimulus_id == STIMULUS_ID_UNSET:
                    raise RuntimeError(
                        "stimulus_id not set during Scheduler transition"
                    )

Ah, I think that we're just verifying that there exists some ID in validation. Is that correct?

Correct

Also, just to lower expectations here, it may be that you don't actually have all of the information to test this properly. It may be that the main use case here was from someone like @fjetter and that he actually needs to engage in the design of the full test.

The strongest invariant I can think of at the moment is a set of stimulus prefixes. I think what concerns me is non-determinism in the transition_log. Testing that the transition_log adheres to a particular structure would become very difficult very quickly in a complex workflow.

@sjperkins
Member Author

I added a test case illustrating stimulus_ids in the case of a retry operation. The scheduler story is still deterministic so the test works from that perspective. I've placed the stimulus_id prefixes in a set (as opposed to the list in test_stimulus_success). I think as long as the story is deterministic we can test lists, but this will prove more difficult once we start querying multiple keys in a story -- in these cases, testing with a set of stimulus_id prefixes may be the best we can do.

@mrocklin
Member

is that the transition logs for multiple keys and complex workflows quickly become non-deterministic

Yeah, I'm actually pretty against the assert_story style tests. Any test that tried to assert a very specific ordering of things would be, I think, very brittle, and so probably wouldn't provide that much value over time. It's useful now, to verify that we've done things correctly, but I think that it'll get deleted after only modest changes to the codebase.

Testing invariants sounds much nicer to me. It sounds like right now you're thinking of invariants like the names update-graph-hlg, but even that is pretty brittle (I plan to remove the -hlg prefix soon, for example). I don't think we should be using exactly these names in tests (unless we're expecting me to be able to understand all of these tests, and thus this system, while I change names around in the future). What are other invariants?

To me, the right invariant is in this sentence from before.

The scheduler generates some of these stimulus_ids and includes them in RPC calls to the worker in a few places. However, it does not trace its own transitions making it very hard to infer why such a stimulus was generated. Introducing the same system on scheduler side and including the appropriate IDs in requests to the worker would allow us to close the circle and reconstruct a cluster wide history and link all transitions which were caused by an event

I'd like to be able to see that we can link all transitions that were caused by an event. I think that this means that we need to merge the transition log of the scheduler and workers and see that everything flows smoothly. I absolutely do not care that the name is update-graph-hlg vs update-graph for example, and so I wouldn't want that to be codified in a test. I do care that we're providing a consistent history across the scheduler and workers such that we can trace any event on a worker back to a client call. I don't see anything in this PR that gives confidence that we're achieving that. Thoughts?

@sjperkins
Member Author

I do care that we're providing a consistent history across the scheduler and workers such that we can trace any event on a worker back to a client call. I don't see anything in this PR that gives confidence that we're achieving that. Thoughts?

A couple of comments on this point:

@mrocklin
Member

This implementation specifically tries not to identify Client calls, but instead generates most Client-related stimuli on the Scheduler. @fjetter and I have been proceeding on this assumption and the original issue specified this #5849

Yeah sorry, this is what I meant. We want to track back some event on the worker to something that happened elsewhere. I used the client as an example here, but didn't intend for it to be the only thing that we care about.

#5872 implies a merged view of Scheduler and Worker stories. I've got a branch based on this PR (it relies on the Scheduler transition_log format change). Perhaps the decision to make is whether the close-the-circle test should happen in the Client Story PR?

Yes, I could imagine that a story function that combined transitions across scheduler+workers would be helpful in testing. I don't think that it's the only way to do this, but it would probably help a lot. Even then, we might ask how we would verify that an event on a worker can be traced back to something on the scheduler, and that everything is accounted for. We don't necessarily need the client.story method to do this, especially if we just have one or two workers. We could do this manually for now.
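
To sketch what that manual check might look like (assuming, as above, that both scheduler and worker log entries end with (..., stimulus_id, timestamp); illustrative rather than a test from the PR), the two stories can be merged into a single cluster-wide history and grouped by stimulus:

    from collections import defaultdict

    from distributed.utils_test import gen_cluster, inc

    @gen_cluster(client=True)
    async def test_cluster_wide_story(c, s, a, b):
        fut = c.submit(inc, 1, workers=[a.address])
        await fut
        key = fut.key

        # Merge scheduler and worker stories into one history, tagging each
        # entry with its origin and ordering by the trailing timestamp.
        merged = [("scheduler", *entry) for entry in s.story(key)]
        merged += [("worker", *entry) for entry in a.story(key)]
        merged.sort(key=lambda entry: entry[-1])

        # Group transitions by the stimulus that caused them; each group links
        # scheduler- and worker-side transitions back to one originating event.
        by_stimulus = defaultdict(list)
        for entry in merged:
            by_stimulus[entry[-2]].append(entry)

        # A weak but stable invariant: far fewer distinct stimuli than transitions.
        assert len(by_stimulus) < len(merged)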

@mrocklin
Member

Anyway, to summarize my thoughts on this PR, the implementation seems ok.

This isn't tested in a way that I would like or that gives me confidence that it does what it's trying to do. I'm coming into this conversation late, though; it could be that folks already have a plan for this. It makes me nervous to merge code like this without it being obviously useful. That being said, I'm happy to be overruled.

sjperkins added a commit to sjperkins/distributed that referenced this pull request Apr 12, 2022
@fjetter
Member

fjetter commented Apr 13, 2022

I reiterated that I mostly care about building a test that combines worker and scheduler state and verifies that we "close the circle".

I think we are actually doing this as part of the assert_story tests which, from my POV is just fine for now

I'd like to see something that doesn't rely on assert_story, doesn't rely on an exact transition history, but instead tests pretty much only the "close the circle" objective.

That's basically what we're looking for, but I'm not sure we can deliver it just now. I am hoping to hide transitions behind a more high-level interface. I'm not even convinced anymore that the transitions system is a good fit for the worker, so I'd be happy to have tests at a higher level to let us move more freely there.
I'd rather just log high level decisions / events and assert on them. This is less verbose, easier to read, easier to reason about, etc.

see #6092 for a small step towards this

@sjperkins
Member Author

sjperkins commented Apr 13, 2022

I've added an assert_stimulus_flow in 0913001 that asserts that a cluster story adheres to a tree-like flow. This adds its own complexity and will need tests in its own right, but I thought it would be a good starting point for discussion.

@sjperkins
Member Author

The stimulus flow tests have been removed and existing stimulus tests have been modified to only test for the number of unique stimuli.

I think this is ready for another review round.

distributed/client.py
distributed/tests/test_client.py

story = s.story(f.key)
stimulus_ids = {s[-2] for s in story}
assert len(stimulus_ids) == 4
Member


Why four exactly? If a future version of Dask made this 5 or 3, would that be incorrect? I think instead you're looking for a statement like "there are far fewer stimuli than there are events in the log". Maybe assert len(stimulus_ids) < len(story) / 3?

Member Author


My understanding of yesterday's conversation with @fjetter was that testing the number of unique stimulus_ids was the strategy to use at present.

Member


Right, but we don't need to be exact. What if a future change in the code changed this number so that it became 3? Would that be a sign of an error? What if it was 5?

A change of that sort is likely to occur in the moderate future. When it occurs whoever is making the change now needs to look at this test and understand it and the intent. They've got to figure out "ok, why was it so important that there were exactly four kinds of stimuli here" and it'll take them a few hours to learn, I think, that actually this isn't critical at all, but that instead we're just looking to assert that there are only a few stimuli.

Is my thinking here correct? We don't actually care that the number is 4, we just care that it's something like "probably more than 1, but also way less than the length of the story". If that's true, then let's assert that, rather than assert a very specific value here.

We shouldn't assert things that we don't really care about. I don't think that we really care that this number is four.

Member Author


A change of that sort is likely to occur in the moderate future. When it occurs whoever is making the change now needs to look at this test and understand it and the intent. They've got to figure out "ok, why was it so important that there were exactly four kinds of stimuli here" and it'll take them a few hours to learn, I think, that actually this isn't critical at all, but that instead we're just looking to assert that there are only a few stimuli.

Is my thinking here correct? We don't actually care that the number is 4, we just care that it's something like "probably more than 1, but also way less than the length of the story". If that's true, then let's assert that, rather than assert a very specific value here.

We shouldn't assert things that we don't really care about. I don't think that we really care that this number is four.

I've thought about this further after watching the testing philosophy video. I personally do care about changes of this sort and want to know about them, even at the cost of spending time understanding why they occurred. And I would want other developers to be aware that their changes modify things because the distributed cluster is a complex stateful system. IMO there are places for testing via interfaces, but not at the lowest level.

assert len(stimulus_ids) < len(story) / 3, for example, provokes a strong code-smell response in me. It might protect a developer from having to understand a failed test, but IMO it makes it more difficult to understand what's being tested and, more importantly, if there is a problem it might:

  1. Hide it from the developer, which I think is a big deal in the context of distributed stability and deadlocks.
  2. Incur even more cognitive overhead in understanding the abstraction that has been built into the test and whether it's wrong/right.

At least with an explicit test I know what the expectation was before the change and can update expectations going forward. I personally think that given the current focus on stability, the tradeoff between

  1. a higher dev velocity and,
  2. developers having to spend time developing a deeper understanding of distributed

would lead me to choose (2)

Member


This was historically my choice as well. After doing this for a while, though, I found that modest changes to the codebase required me to change hundreds of different tests. It was not a productive use of time, and only I was capable of changing the codebase. As a result of this experience we now prefer tests that avoid requiring developers to learn about unrelated parts of the codebase.

If we all have to learn about the entire codebase then I don't think that we'll be able to move at all.

I'd be happy to engage further in this conversation, but for the moment I'm -1 on this PR for as long as it's testing in this way. I think that the inertia that it adds to the codebase does not outweigh the benefit of the feature.

Member


Phrasing this differently, the only reason you're able to write this PR without touching several thousand lines of code and spending a few months studying is this policy.

Member


len(stimulus_ids) < len(story) / 3 is a property that should be true all the time but it is not the most interesting property for what we're looking at here.

These IDs we are asserting on are incredibly high level.

  • "compute something"
  • "something is done"
  • "retry something"
  • "something is done, again"

If I had a reasonable way of encoding this information without hardcoding the actual names (like update-graph-hlg), I would prefer writing the explicit four events out. I considered using uniqueness to be the compromise we agreed on.

If we want to relax this, we can set it to assert len(stimulus_ids) >= 4, but I would be perfectly fine keeping it the way it is. Anybody who intentionally changes this system will not require more than a few seconds to debug this if it ever fails, and we gain
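
As an aside, one way to encode those high-level events without hardcoding exact names (a sketch with a purely hypothetical prefix-to-event mapping, not something this PR provides) would be to translate stimulus prefixes into coarse categories and assert on those:

    # Hypothetical mapping from stimulus name prefixes to high-level events;
    # the prefixes listed here are illustrative and would have to track the
    # real names used by the scheduler and worker.
    EVENT_FOR_PREFIX = {
        "update-graph": "compute something",
        "compute-task": "compute something",
        "task-finished": "something is done",
        "task-retried": "retry something",
    }

    def high_level_event(stimulus_id: str) -> str:
        for prefix, event in EVENT_FOR_PREFIX.items():
            if stimulus_id.startswith(prefix):
                return event
        return "other"

    # A retry story could then be asserted as a sequence such as
    # ["compute something", "something is done", "retry something",
    #  "something is done"] without ever naming update-graph-hlg directly.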

Member


Unresolving this conversation

distributed/tests/test_scheduler.py
@sjperkins
Member Author

Test suites are all green and this is good to merge.

@mrocklin
Member

Just in case it was missed, I stated a -1 above. It is atypical in Dask development to merge when one of the core maintainers has a -1 on a PR.

@sjperkins I appreciate that you're trying to get this in quickly (I love the sense of urgency, and the desire to be complete). Given the broader testing discussion I recommend that we allow ourselves to fail on the timeline here. Let @fjetter and I sync up on broader testing practices in Dask. For now I think that you should probably move on to other things.

@mrocklin
Member

Or, if you did want to meet the timeline, then probably you would accept my comments above and remove the precise tests above, adding just enough to also make @fjetter happy (I suspect that he's less picky on this one than I am). I think that waiting is fine though, especially given the holiday.

@fjetter fjetter mentioned this pull request Apr 20, 2022
@sjperkins
Member Author

Closed in favour of #6161

@sjperkins sjperkins closed this Apr 29, 2022
@sjperkins sjperkins deleted the stimulus-id-kwargs branch April 29, 2022 15:31