
Making AMM ReduceReplicas less aggressive towards widely-shared dependencies #6056

Status: Open · gjoseph92 opened this issue on Apr 1, 2022 · 2 comments
Labels: discussion, enhancement, memory, performance

Comments

gjoseph92 (Collaborator) commented on Apr 1, 2022

Corollary to #6038. In that issue, I described a situation where workers thought a key (which most tasks depended on) had 82 replicas, but in reality it only had 1.

This issue is about the fact that ReduceReplicas maybe shouldn't try to delete copies of that critical key so aggressively.

* * * * * *
\ \ \ / / /
    x y

In this case x and y are going to be reused by every task, so they will end up having replicas on most workers. Constantly deleting those replicas is inefficient: as soon as one is deleted, the next task that wants to run on that worker has to transfer it back again.

(Of course, once most of the * tasks are done, you should start reducing replicas. But while the cluster is fully saturated with * tasks, there's no benefit to doing so.)
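To make the shape concrete, here is a minimal sketch of such a workload using distributed's public API; the function names and array sizes are illustrative, not taken from the issue:

    # Sketch of the graph above: x and y are consumed by every downstream task,
    # so their replicas naturally spread to most workers while tasks are running.
    import numpy as np
    from distributed import Client

    def make_chunk(seed: int) -> np.ndarray:
        return np.random.default_rng(seed).random(10_000_000)  # ~80 MiB

    def combine(a: np.ndarray, b: np.ndarray, i: int) -> float:
        return float((a + b).sum()) + i

    client = Client()  # starts a LocalCluster if no address is given

    x = client.submit(make_chunk, 0)  # the shared dependency "x"
    y = client.submit(make_chunk, 1)  # the shared dependency "y"

    # Every "*" task depends on both x and y. With AMM's ReduceReplicas enabled,
    # extra replicas of x and y would be trimmed even while many of these tasks
    # are still waiting to run, forcing them to be transferred again.
    futures = [client.submit(combine, x, y, i) for i in range(100)]
    results = client.gather(futures)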

I'm not sure what metric to use for this. Ideas explored in #4967, #5325, #5326 could be interesting here.

Really, this issue is just about how to calculate a smarter target for this desired_replicas count automatically based on the task's waiters, number of current workers, etc.:

desired_replicas = 1 # TODO have a marker on TaskState
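One direction this could take (purely a sketch; nothing like this exists in ReduceReplicas today, and the divisor is arbitrary) is to derive the target from the waiter count and the current cluster size rather than hard-coding 1:

    # Hypothetical heuristic for a smarter target; `ts` is a scheduler TaskState
    # and `nworkers` the current number of workers.
    def smarter_desired_replicas(ts, nworkers: int) -> int:
        nwaiters = len(ts.waiters)
        # Keep roughly one replica per few waiting dependents, never more than
        # one per worker and never fewer than one.
        return max(1, min(nworkers, -(-nwaiters // 4)))  # ceil(nwaiters / 4)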

gjoseph92 added the enhancement, performance, and discussion labels on Apr 1, 2022
crusaderky (Collaborator) commented on May 4, 2022

> This issue is about the fact that ReduceReplicas maybe shouldn't try to delete copies of that critical key so aggressively.
> [...]
> Really, this issue is just about how to calculate a smarter target for this desired_replicas count automatically based on the task's waiters, number of current workers, etc.

This is already the case:

# If a dependent task has not been assigned to a worker yet, err on the side
# of caution and preserve an additional replica for it.
# However, if two dependent tasks have been already assigned to the same
# worker, don't double count them.
nwaiters = len({waiter.processing_on or waiter for waiter in ts.waiters})
ndrop_key = len(ts.who_has) - max(desired_replicas, nwaiters)
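To see how the deduplication in that snippet plays out, here is a small worked example with stand-in objects (the replica counts and worker names are hypothetical, not from a real scheduler):

    # Five waiter tasks: two already assigned to worker "w1", one to "w2",
    # and two not assigned to any worker yet.
    class FakeWaiter:
        def __init__(self, processing_on=None):
            self.processing_on = processing_on

    waiters = [
        FakeWaiter("w1"), FakeWaiter("w1"),  # same worker: counted once
        FakeWaiter("w2"),
        FakeWaiter(), FakeWaiter(),          # unassigned: counted individually
    ]
    nwaiters = len({w.processing_on or w for w in waiters})
    assert nwaiters == 4  # {"w1", "w2", <waiter>, <waiter>}

    # With 8 current replicas and desired_replicas = 1, only 4 may be dropped:
    ndrop_key = 8 - max(1, nwaiters)
    assert ndrop_key == 4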

crusaderky (Collaborator) commented

@gjoseph92 is the clarification above sufficient? Can we close this issue?
