Support Stimulus ID's via argument passing #6095
Conversation
The tip 0a0f4bc produces a green suite, while the PR sync event produces unrelated failures.
Taking a brief look at this PR things seem ok. Nothing yells "danger" at me. However, at the same time, the value of this isn't immediately apparent. I went to the tests to figure out why this would be valuable and mostly saw that we were just handling inertia of existing tests (which concerns me a bit, but is unrelated to this PR). So I then went to the issue that caused this PR, #5849, and read this:
OK, so this sounds pretty interesting and definitely valuable. Does the work in this PR accomplish closing the circle here? If so, is there something that we could do to verify success? In general I'm a big fan of testing things by testing user API. In this case, what do stimulus IDs buy us? What kind of workflow would this PR help us to accomplish? Is there a way to do that workflow in a test to both ...
Thoughts?
distributed/tests/test_scheduler.py
Outdated
    "update-graph-hlg",
    "task-finished",
    "client-releases-keys",
]
I'm curious, why are there two update-graph-hlg stimuli here?
They're actually the same stimulus_id. Stimuli are of the form f"{name}-{time()}", so only the prefix is tested here. The stimulus_id is duplicated because there are four events in the scheduler transition log story.
[  # Expected subset of transition log
    (key, "released", "waiting", {key: "processing"}),
    (key, "waiting", "processing", {}),
    (key, "processing", "memory", {}),
    (key, "memory", "forgotten", {}),
],
In the above case, a single update-graph-{time()} stimulus is associated with (1) the transition from "released" to "waiting" and (2) "waiting" to "processing". Once the task successfully completes on a worker, a new task-finished-{time()} stimulus_id is generated and sent to the scheduler, where (3) the task is transitioned from "processing" to "memory". Once the client has received the result, the scheduler releases the keys (client-releases-keys-{time()}) and (4) transitions the task from "memory" to "forgotten".
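As a standalone illustration of the prefix convention described above (the helper names make_stimulus_id and stimulus_prefix are hypothetical, not part of distributed):

```python
import time

def make_stimulus_id(name):
    # Stimuli are of the form f"{name}-{time()}" (illustrative helper)
    return f"{name}-{time.time()}"

def stimulus_prefix(stimulus_id):
    # Strip the trailing timestamp to recover the stimulus name
    return stimulus_id.rsplit("-", 1)[0]

sid = make_stimulus_id("update-graph-hlg")
assert stimulus_prefix(sid) == "update-graph-hlg"
```

Tests can then compare prefixes while ignoring the unstable timestamp suffix.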
        {},
    ),
],
)
This test seems pretty brittle to me. Any time we need to change the state machine we'll need to change all tests that look like this. I think that tests like this add a fair amount of inertia to the test suite.
I think this statement is broadly true for complex workflows. As discussed elsewhere, the transition_logs (and thus stories) get non-deterministic very quickly for complex workflows.
This is why the actual graph for this test case is very simple: c.submit(inc, 1).
Yeah, I don't need us to have a user API for this just yet, but I do think that we should think about the invariants that we would expect here. For example, maybe we run a mildly complex computation (like a dask array computation), and then track that every transition goes back to the same specific update-graph call? Then maybe we remove a worker and track all subsequent transitions to that? Stepping back for a moment, if I go and add some code in the scheduler or in the worker and I route stimulus IDs incorrectly, are there tests that will tell me that I've made a mistake?
If there is some way to verify things in the validation system that would be great. The link you pointed to didn't include line numbers. Did you mean to point to something specific there?
Ah, I think that we're just verifying that there exists some ID in validation. Is that correct? Also, just to lower expectations here, it may be that you don't actually have all of the information to test this properly. It may be that the main use case here was from someone like @fjetter and that he actually needs to engage in the design of the full test. I might also be wrong here and am happy to be overruled. In general though, if we're going to make a feature then it would be good to see that feature used in the way in which it was intended to be used in some test. My sense right now is that this was intended to be used to fuse scheduler and worker stories together to trace what caused some event. I think that we should demonstrate that ability at a high level in some test before we claim victory here. Again, happy to be overruled.
No, not at this point. I did think a bit about more complex test cases, but one thing I noted when implementing #5987 is that the transition logs for multiple keys and complex workflows quickly become non-deterministic. In fact, I tried to make one of the existing story test cases a bit more representative in this PR: https://github.com/dask/distributed/pull/6095/files#diff-ee66fe9141ad4e11edb332284757324f3026e7126c39b881aa7d637224b4c92fR1691

assert {sid[: re.search(r"\d", sid).start()] for sid in stimulus_ids} == {
    "compute-task-",
    "ensure-communicating-",
    "task-finished-",
}
Yeah, I think GitHub didn't expand scheduler.py, which made the link fail: https://github.com/dask/distributed/pull/6095/files#diff-bbcf2e505bf2f9dd0dc25de4582115ee4ed4a6e80997affc7b22122912cc6591R2387

if not stimulus_id:
    stimulus_id = STIMULUS_ID_UNSET
finish2 = ts._state
# FIXME downcast antipattern
scheduler = pep484_cast(Scheduler, self)
scheduler.transition_log.append(
    (key, start, finish2, recommendations, stimulus_id, time())
)
if parent._validate:
    if stimulus_id == STIMULUS_ID_UNSET:
        raise RuntimeError(
            "stimulus_id not set during Scheduler transition"
        )
Correct
The strongest invariant I can think of at the moment is a set of stimulus prefixes. I think what concerns me is non-determinism in the transition_log. Testing that the transition_log adheres to a particular structure would become very difficult very quickly in a complex workflow.
I added a test case illustrating stimulus_ids in the case of a retry operation. The scheduler story is still deterministic, so the test works from that perspective. I've placed the stimulus_id prefixes in a set (as opposed to the list in test_stimulus_success). I think as long as the story is deterministic we can test lists, but this will prove more difficult once we start querying multiple keys in a story -- in these cases, testing with a set of stimulus_id prefixes may be the best we can do.
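A minimal sketch of the list-vs-set trade-off, using made-up story data shaped like the (key, start, finish, recommendations, stimulus_id, timestamp) tuples discussed above:

```python
import re

# Hypothetical scheduler story for a single key
story = [
    ("x", "released", "waiting", {}, "update-graph-1.0", 1.0),
    ("x", "waiting", "processing", {}, "update-graph-1.0", 1.1),
    ("x", "processing", "memory", {}, "task-finished-2.0", 2.0),
    ("x", "memory", "forgotten", {}, "client-releases-keys-3.0", 3.0),
]

# Strip the trailing timestamp from each stimulus_id to get its prefix
prefixes = [re.sub(r"[\d.]+$", "", ev[-2]) for ev in story]

# Deterministic story: the ordered list of prefixes can be asserted exactly
assert prefixes == [
    "update-graph-",
    "update-graph-",
    "task-finished-",
    "client-releases-keys-",
]

# Non-deterministic ordering: fall back to asserting only the set of prefixes
assert set(prefixes) == {"update-graph-", "task-finished-", "client-releases-keys-"}
```

The set assertion survives reordering of events, at the cost of no longer checking how many times each stimulus appears.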
Yeah, I'm actually pretty against that. Testing invariants sounds much nicer to me. It sounds like right now you're thinking of invariants like the specific stimulus names. To me, the right invariant is in this sentence from before.
I'd like to be able to see that we can link all transitions that were caused by an event. I think that this means that we need to merge the transition log of the scheduler and workers and see that everything flows smoothly. I absolutely do not care what the name is.
A couple of comments on this point:
Yeah sorry, this is what I meant. We want to track back some event on the worker to something that happened elsewhere. I used the client as an example here, but didn't intend for it to be the only thing that we care about.
Yes, I could imagine that a story function that combined transitions across scheduler+workers would be helpful in testing. I don't think that it's the only way to do this, but it would probably help a lot. Even then, we might ask how we can trace an event on a worker back to something on the scheduler, and check that everything was accounted for. We don't necessarily need the client.story method to do this, especially if we just have one or two workers. We could do this manually for now.
Anyway, to summarize my thoughts on this PR, the implementation seems ok. This isn't tested in a way that I would like or that gives me confidence that it does what it's trying to do. I'm coming into this conversation late though, it could be that folks already have a plan for this. It makes me nervous to merge code like this without it being obviously useful. That being said, I'm happy to be overruled.
I think we are actually doing this as part of the
That's basically what we're looking for, but I'm not sure if we can deliver this just now. I am hoping to hide transitions behind a more high-level interface. I'm not even convinced anymore that the transitions system is a good fit for the worker, so I'd be happy to have tests on a higher level for us to move more freely there. See #6092 for a small step towards this.
I've added an |
The stimulus flow tests have been removed and existing stimulus tests have been modified to only test for the number of unique stimuli. I think this is ready for another review round.
story = s.story(f.key)
stimulus_ids = {s[-2] for s in story}
assert len(stimulus_ids) == 4
Why four exactly? If a future version of Dask made this 5 or 3, would that be incorrect? I think instead you're looking for a statement like "there are far fewer stimuli than there are events in the log". Maybe assert len(stimulus_ids) < len(story) / 3?
My understanding of yesterday's conversation with @fjetter was that testing the number of unique stimulus_ids was the strategy to use at present.
Right, but we don't need to be exact. What if a future change in the code changed this number so that it became 3? Would that be a sign of an error? What if it was 5?
A change of that sort is likely to occur in the moderate future. When it occurs whoever is making the change now needs to look at this test and understand it and the intent. They've got to figure out "ok, why was it so important that there were exactly four kinds of stimuli here" and it'll take them a few hours to learn, I think, that actually this isn't critical at all, but that instead we're just looking to assert that there are only a few stimuli.
Is my thinking here correct? We don't actually care that the number is 4, we just care that it's something like "probably more than 1, but also way less than the length of the story". If that's true, then let's assert that, rather than assert a very specific value here.
We shouldn't assert things that we don't really care about. I don't think that we really care that this number is four.
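For illustration, the two assertion styles under discussion, applied to made-up retry story data (a real test would obtain story from s.story(f.key); the events and IDs below are invented):

```python
# Hypothetical story events: (key, start, finish, recommendations, stimulus_id, ts)
story = [
    ("x", "released", "waiting", {}, "update-graph-1.0", 1.0),
    ("x", "waiting", "processing", {}, "update-graph-1.0", 1.1),
    ("x", "processing", "erred", {}, "task-erred-2.0", 2.0),
    ("x", "erred", "waiting", {}, "retry-3.0", 3.0),
    ("x", "waiting", "processing", {}, "retry-3.0", 3.1),
    ("x", "processing", "memory", {}, "task-finished-4.0", 4.0),
]
stimulus_ids = {ev[-2] for ev in story}

# Exact style: pins the precise number of unique stimuli (explicit but brittle)
assert len(stimulus_ids) == 4

# Loose style: only asserts that there are fewer unique stimuli than events;
# the choice of threshold is exactly what is being debated here
assert len(stimulus_ids) < len(story)
```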
I've thought about this further after watching the testing philosophy video. I personally do care about changes of this sort and want to know about them, even at the cost of spending time understanding why they occurred. And I would want other developers to be aware that their changes modify things because the distributed cluster is a complex stateful system. IMO there are places for testing via interfaces, but not at the lowest level.
assert len(stimulus_ids) < len(story) / 3, for example, provokes a strong code smell response in me. It might protect a developer from having to understand a failed test, but IMO it makes it more difficult to understand what's being tested, and more importantly, if there is a problem it might:
- Hide it from the developer, which I think is a big deal in the context of distributed stability and deadlocks.
- Incur even more cognitive overhead in understanding the abstraction that has been built into the test and whether it's wrong/right.
At least with an explicit test I know what the expectation was before the change and can update expectations going forward. I personally think that given the current focus on stability, the tradeoff between
- a higher dev velocity, and
- developers having to spend time developing a deeper understanding of distributed
would lead me to choose (2).
This was historically my choice as well. After doing this for a while though, I found that modest changes to the codebase required me to change hundreds of different tests. It was not a productive use of time, and only I was capable of changing the codebase. As a result of this experience, we now prefer tests that avoid requiring developers to learn about unrelated parts of the codebase.
If we all have to learn about the entire codebase then I don't think that we'll be able to move at all.
I'd be happy to engage further in this conversation, but for the moment I'm -1 on this PR for as long as it's testing in this way. I think that the inertia that it adds to the codebase does not outweigh the benefit of the feature.
Phrasing this differently, the only reason you're able to write this PR without touching several thousand lines of code and spending a few months studying is due to this policy.
len(stimulus_ids) < len(story) / 3 is a property that should be true all the time, but it is not the most interesting property for what we're looking at here.
These IDs we are asserting on are incredibly high level.
- "compute something"
- "something is done"
- "retry something"
- "something is done, again"
If I had a reasonable way of encoding this information without hardcoding the actual names (like update-graph-hlg), I would prefer writing the explicit four events out. I considered testing uniqueness to be the compromise we agreed on.
If we want to relax this, we can set it to assert len(stimulus_ids) >= 4, but I would be perfectly fine keeping it the way it is. Anybody who is making a change to this system intentionally will not require more than a few seconds debugging this if it ever fails, and we gain
Unresolving this conversation
Test suites are all green and this is good for merge.
Just in case it was missed, I stated a -1 above. It is atypical in Dask development to merge when one of the core maintainers has a -1 on a PR. @sjperkins I appreciate that you're trying to get this in quickly (I love the sense of urgency, and the desire to be complete). Given the broader testing discussion I recommend that we allow ourselves to fail on the timeline here. Let @fjetter and I sync up on broader testing practices in Dask. For now I think that you should probably move on to other things.
Or, if you did want to meet the timeline, then probably you would accept my comments above and remove the precise tests above, adding just enough to also make @fjetter happy (I suspect that he's less picky on this one than I am). I think that waiting is fine though, especially given the holiday.
Closed in favour of #6161 (Client.story)
- Support collecting cluster-wide story for a key or stimulus ID #5872