
O(1) rebalance #4774

Merged: 44 commits merged into dask:main on Jun 1, 2021

Conversation

@crusaderky (Collaborator) commented Apr 30, 2021

  • DONE implementation
  • DONE stress test
  • DONE unit tests

In scope

  • Make rebalance() O(1) with respect to the total number of tasks on the cluster.
  • Introduce hysteresis and thresholds to minimise the overall cost of rebalancing.
  • Consider unmanaged memory (leaks, fragmentation, etc.) while rebalancing.
  • Change the algorithm that picks which tasks to move from largest-first to least-recently-inserted-first.

Out of scope, left for future PRs

  • This PR does not make rebalance safe to run while computations are running (nothing changes on that front).
  • This PR does not harmonize / deduplicate the code that moves data around used by rebalance(), replicate(), and retire_workers().
  • High-level documentation with recommendations on the use of malloc_trim, jemalloc, etc. will follow in a later PR.

Demo

https://gist.github.com/crusaderky/c1ccf5fd0107b13c8d24bbed5197d5f6
(also works on master)
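
For readers who want a quick hand-rolled exercise instead of the gist, something like the following sketch would do (the scheduler address is a placeholder; the gist above is the authoritative demo):

```python
# Minimal sketch: pile data onto a single worker, then ask the scheduler to
# rebalance it across the cluster. Address and data sizes are placeholders.
from distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder address

# Pick one worker and scatter everything onto it, creating an imbalance.
first_worker = sorted(client.scheduler_info()["workers"])[0]
futures = client.scatter(list(range(1_000)), workers=[first_worker])

# Ask the scheduler to even out memory usage across the workers.
client.rebalance()
```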

@crusaderky (Collaborator, Author)

CC @jrbourbeau @mrocklin

@crusaderky (Collaborator, Author)

rebalance() runtime when there is nothing to do:

| # keys on cluster | # workers | master         | this PR      |
|-------------------|-----------|----------------|--------------|
| 10k               | 32        | 18.3 ~ 25.3 ms | 1.1 ~ 1.9 ms |
| 100k              | 32        | 232 ~ 260 ms   | 1.0 ~ 2.6 ms |

@crusaderky (Collaborator, Author) commented Apr 30, 2021

Important

I noticed that, once the size of a single minimal Python object (read: the buffer of a numpy array) drops below 2 MiB, CPython / Linux x64 no longer releases RAM promptly when the object is deallocated; instead it hogs it indefinitely. gc.collect() does nothing. malloc_trim() typically, but not always, fixes the problem. Creating new Python objects simply reuses said memory.

This is the behaviour already in master - nothing changes there - and it already impacts all logic that relies on measuring process memory (spilling, pausing, and restarting). What does change with this PR is that now this unused but allocated memory is considered by rebalance() too, which in turn means that keys may be evicted too aggressively from particularly heavy nodes, as there is currently no way to tell apart trimmable memory from memory leaks or genuine fragmentation.

The current workaround is that I let the user pick which measure rebalance() uses; by default it is the optimistic memory (managed by dask in RAM + unmanaged older than 30s), but it can be changed to managed memory only, thus reverting to the behaviour of master. Needless to say, this is not great, as it puts extra configuration burden on the user.
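
For illustration, the measure is selected through the dask config; the key name below is the one used by the tests in this PR, and the snippet is only a sketch:

```python
import dask
import distributed  # noqa: F401  # imported so the distributed config defaults are registered

# Revert rebalance() to master's behaviour by counting managed memory only;
# the default for this key in the PR is the "optimistic" measure described above.
with dask.config.set({"distributed.worker.memory.rebalance.measure": "managed"}):
    ...  # start the scheduler / call rebalance() inside this context
```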

An alternative, cleaner solution would be to run malloc_trim() periodically (e.g. 2~5s) on the workers, but problems will ensue if someone compiled the Python interpreter with an alternative alloc/free pair of primitives.
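
For illustration, a minimal sketch of what such a periodic trim could look like, assuming a glibc-based Linux and the stock CPython allocator (not part of this PR):

```python
# Hedged sketch: periodically ask glibc to return free heap pages to the OS.
# Assumes Linux with glibc; on non-glibc platforms malloc_trim is missing and
# the call raises, which is exactly the concern raised above.
import ctypes
import ctypes.util
import threading

libc = ctypes.CDLL(ctypes.util.find_library("c"))

def trim_periodically(interval: float = 5.0) -> None:
    """Release free heap memory back to the OS, then reschedule."""
    libc.malloc_trim(0)
    timer = threading.Timer(interval, trim_periodically, args=(interval,))
    timer.daemon = True
    timer.start()

trim_periodically()
```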

I have not tested the behaviour on macOS and Windows yet, but from what I saw in my previous PR I expect similar headaches there too.

@mrocklin (Member) commented May 3, 2021

> An alternative, cleaner solution would be to run malloc_trim() periodically (e.g. 2~5s) on the workers, but problems will ensue if someone compiled the Python interpreter with an alternative alloc/free pair of primitives.

Can I ask you to raise an issue about this and tag @pitrou and @jakirkham? I don't know much about this, but it seems like this might be helpful in the common case. "Leaking" memory is a very common pain point today.

@fjetter (Member) commented May 5, 2021

> An alternative, cleaner solution would be to run malloc_trim() periodically (e.g. 2~5s) on the workers, but problems will ensue if someone compiled the Python interpreter with an alternative alloc/free pair of primitives.

cc @xhochy

@xhochy commented May 5, 2021

> An alternative, cleaner solution would be to run malloc_trim() periodically (e.g. 2~5s) on the workers, but problems will ensue if someone compiled the Python interpreter with an alternative alloc/free pair of primitives.
>
> cc @xhochy

Thanks for the ping. I probably should comment on the (to-be-opened) issue instead of here?

@fjetter (Member) left a comment

I like the simplicity and elegance of the algorithm; in particular, there are no opaque heuristics in there. Other than configuring the different thresholds, I think we have a nice lever to control different policies via the heap sort key (or rather the who_has sorting / insertion order / etc.).

I am a bit concerned about the dynamic case where this decision might become much more complicated. I would like to avoid the complexity we currently have in dask.order.
Also, I'm wondering how we can manage "data ownership", i.e. is a worker actually supposed to hold a replica, or is it just a temporary copy needed for a dependency, etc.? I believe we do not have a data model for this yet.

These concerns should not stop us here but rather should be kept in mind once we move on to the next step(s).

Comment on lines 5641 to 5644 of distributed/scheduler.py:

```python
memory_by_worker = [
    (ws, getattr(ws.memory, MEMORY_REBALANCE_MEASURE)) for ws in workers
]
mean_memory = sum(m for _, m in memory_by_worker) // len(memory_by_worker)
```
Member:

If performance is super critical, this could be done in one loop; as written, we iterate over the workers three times. For the sake of simplicity, it's probably fine to keep as is for now.

Collaborator (Author):

I need to store the memory measure somewhere for later (calling WorkerState.memory is mildly expensive), and a list comprehension tends to be a lot faster than an explicit for loop with appends.
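
For comparison, a sketch of the single-pass variant being discussed, reusing the names from the snippet above (illustrative only; the PR keeps the comprehension):

```python
# Hypothetical single-loop variant: one pass over the workers, accumulating the
# total while collecting the per-worker measure. Not what the PR does.
memory_by_worker = []
total_memory = 0
for ws in workers:
    m = getattr(ws.memory, MEMORY_REBALANCE_MEASURE)
    memory_by_worker.append((ws, m))
    total_memory += m
mean_memory = total_memory // len(memory_by_worker)
```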

Comment on lines +106 to +110:

```
# If this is your problem on Linux, you should alternatively consider
# setting the MALLOC_TRIM_THRESHOLD_ environment variable (note the final
# underscore) to a low value; refer to the mallopt man page and to the
# comments about M_TRIM_THRESHOLD on
# https://sourceware.org/git/?p=glibc.git;a=blob;f=malloc/malloc.c
```
@crusaderky (Collaborator, Author) commented May 10, 2021:

TODO: as part of a later PR, I'm going to move this advice to a Sphinx page for visibility and leave just a note here pointing to it.
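
For illustration only (values and setup below are placeholders, not advice from this PR): because the nanny spawns worker processes that inherit the parent's environment, the variable can be set before a local cluster is created.

```python
import os

from distributed import Client, LocalCluster

# Must be in the environment before the worker processes are spawned; glibc
# reads it at process startup. 64 KiB is an illustrative placeholder value.
os.environ["MALLOC_TRIM_THRESHOLD_"] = "65536"

cluster = LocalCluster(n_workers=2)  # worker subprocesses inherit the variable
client = Client(cluster)
```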

@crusaderky closed this May 17, 2021
@crusaderky reopened this May 17, 2021
@crusaderky closed this May 17, 2021
@crusaderky reopened this May 17, 2021
@crusaderky (Collaborator, Author)

All tests successful over 4 consecutive runs

@fjetter (Member) commented May 26, 2021

I left many comments, but overall the code itself looks great. I also like the clearly written tests; the expected behaviour is quite clear.

The two biggest issues I consider worth discussing are:

  • Global constants / config read at import time
  • Factoring out _rebalance_find_msgs into dedicated functions to allow for scheduler/worker-independent testing. This one is a bit of work, but it could pay off nicely in the long run.

@@ -161,14 +163,6 @@ def nogil(func):

```python
DEFAULT_DATA_SIZE = declare(
    Py_ssize_t, parse_bytes(dask.config.get("distributed.scheduler.default-data-size"))
)
```
@crusaderky (Collaborator, Author) commented May 27, 2021:

I could not move this, as it is required by TaskState, and IMHO it would be too expensive to reload it on every TaskState.__init__.

@crusaderky (Collaborator, Author)

@fjetter

> Global constants / config on import time

I redesigned this part. Now dask.config is read in Scheduler.__init__ and I think it looks a lot better.
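
An illustrative sketch of the pattern described, with invented names (the PR's actual attribute names may differ):

```python
import dask
import distributed  # noqa: F401  # so the key below exists in the config defaults

# Before (sketch): read once at import time; changing the config later has no effect.
# MEMORY_REBALANCE_MEASURE = dask.config.get("distributed.worker.memory.rebalance.measure")

# After (sketch): read when the scheduler is instantiated, so per-cluster config applies.
class SchedulerSketch:
    def __init__(self):
        self.memory_rebalance_measure = dask.config.get(
            "distributed.worker.memory.rebalance.measure"
        )
```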

> Factor out _rebalance_find_msgs into dedicated functions to allow for scheduler/worker independent testing. I think this one is a bit of work but this could pay off nicely in the long run

While it is certainly possible to move _rebalance_find_msgs to a top-level function, it would require as input a SchedulerState object and a list of WorkerState objects, neither of which is easy (and, most importantly, robust) to craft synthetically in a unit test. I'd much rather keep the tests black-box; otherwise it's all too easy for the real-life Scheduler and WorkerState objects to deviate from those crafted for the tests.

@fjetter (Member) commented May 27, 2021

> While it is certainly possible to move _rebalance_find_msgs to a top-level function, it would require in input a SchedulerState object and a list of WorkerState objects, neither of which are easy - and most importantly, robust - to craft synthetically in a unit test.

We wouldn't need to pass the entire scheduler state in there, would we? I agree that would be nuts. From what I can see, we're using four constants and the log_event method of the SchedulerState object. My point is that these constants can be function inputs, and we could simply not log this particular event; the event we're logging here is in any case the "empty/nothing to do" event. I don't see a reason why this should be part of the core algorithm.

Passing in a list of WorkerState objects is exactly what I am proposing here, and I believe it is more robust since we're not relying on futures, servers, comms, heartbeats, actual memory measurements or any timing-relevant pieces. The tests would not even need to be sync/async since they would only test the actual algorithm.
Most attributes of the WorkerState object are irrelevant and we can run with the default values. The only actual coupling is introduced by the attributes we're using in the algorithm anyhow. Below is an example of how the test could look:

```python
def test_rebalance_simple():
    A = WorkerState(name="A", memory_limit="1GB")
    B = WorkerState(name="B", memory_limit="1GB")

    def put_task_on_worker(ws, key, nbytes):
        ws._has_what[key] = None
        ws._nbytes += nbytes

    for ix in range(4):
        put_task_on_worker(A, f"k{ix}", 200)

    result = _rebalance_find_msgs(workers=[A, B])

    expected = [
        (A, B, "k0"),
        (A, B, "k1"),
    ]

    assert result == expected
```

@fjetter (Member) left a comment

Ooops, had a few comments in 'pending', sorry

"""keys exist but belong to unfinished futures"""
futures = c.map(slowinc, range(10), delay=0.05, workers=a.address)
await asyncio.sleep(0.1)
out = await s.rebalance(keys=[f.key for f in futures])
Member:

There is another test for the internal mechanics which waits for the future. Why is this different for the user API?

Collaborator (Author):

The tests in test_client.py test specifically the wrapper in client.py. Previously, all tests for rebalance were exclusively in test_client.py. However, very soon rebalance will be chiefly invoked internally by the scheduler, bypassing the client entirely. Hence, the tests should invoke Scheduler.rebalance and not rely on the client wrapper.

As to why Client.rebalance and Scheduler.rebalance behave differently: it would be non-trivial to move the logic upstream (Client uses Futures, Scheduler uses keys, for which I couldn't find a straightforward wait method) and I could not justify the effort. If in the future the Scheduler calls rebalance internally on unfinished keys we can revisit this, but I cannot see such a use case in the plan that we laid down.

Collaborator (Author):

Added a clarification in the test that Client.rebalance and Scheduler.rebalance behave differently here.

Member:

Thanks. The scheduler indeed does not have a way to wait for keys. The closest thing would be to allow registration of "transition callbacks", such that when a key transitions into a given state, a callback is invoked. However, let's not do that; the state machine is complex enough as it is.

```python
@set_config_and_reload({"distributed.worker.memory.rebalance.measure": "managed"})
@gen_cluster(client=True, worker_kwargs={"memory_limit": 0})
async def test_rebalance_skip_all_recipients(c, s, a, b):
    """All recipients are skipped because they already hold copies"""
```
Member:

More a warning than an actual comment about this test. As soon as we hit a dynamic scheduler, replicas become much, much more complex, since whether or not a task is actually replicated cannot be inferred from who_has. While who_has tracks copies of the data, this includes "temporary" copies on workers: the lifecycle of a task's data is different for data which was calculated on a worker and data which is only there because it is a dependency of another task. Simply "skipping" these tasks might no longer be possible, and the definition of the state (i.e. simply counting data) will become more complex.

See also
#4784
#4772


```python
# Fast exit in case no transfers are necessary or possible
if not senders or not recipients:
    self.log_event(
```
Member:

This is actually the only place in this method where we're using SchedulerState / self. Why not factor this out into an independent function? That would make testing possible without invoking any scheduler/worker interaction and would allow us to write tests by defining measurements (memory per worker) and asserting the result, without playing with actual memory measurements, scheduler<->worker heartbeats, comms, etc.

This would also further decouple the movement of the data from the decision making and might allow for easier testing of the latter.

Collaborator (Author):

Refactoring the function is straightforward, but I don't agree with doing it in the first place. See the discussion above.

@crusaderky (Collaborator, Author)

#4853 breaks this. Temporarily reverting the other PR to prove that tests pass.

@crusaderky (Collaborator, Author)

> Passing in a list of WorkerState objects is exactly what I am proposing here and I believe it is more robust

I strongly disagree. WorkerState objects are complex, and in your code snippet you made very strong assumptions about their internal behaviour. If these internals change in the future, I don't want a (potentially newbie) developer to have to hunt down unit tests that mock internal behaviour. What you're proposing carries a real risk that, in a future PR completely unrelated to rebalance, rebalance breaks without any of the unit tests noticing.

@mrocklin (Member)

> Passing in a list of WorkerState objects is exactly what I am proposing here and I believe it is more robust
>
> I strongly disagree. WorkerState objects are complex and in your code snippet you made very strong assumptions on their internal behaviour. If these internals changed in the future, I don't want a (potentially newbie) developer to have to hunt down unit tests that mock internal behaviour. What you're proposing has the big potential that, in a future PR completely unrelated to rebalance, rebalance will break without any of the unit tests noticing.

Just to jump in here with historical context. Historically, distributed had a lot of fine-grained unit tests like this. I've heard them called "white box" tests because they expose the internals, unlike "black box" tests. In hindsight this ended up being a bad idea, especially as the scheduler changed internally. Small changes to scheduler internals required changing dozens of tests. The tests became a liability and added inertia to development.

At one point I spent a week rewriting white-box tests so that we generally only touched user-level API. This was hard because we needed to craft very specific situations where the desired behavior would arise. In hindsight, though, this approach has provided a ton of value. We're able to make large changes to internal scheduling state and the test suite remains valid.

@fjetter (Member) commented May 28, 2021

> WorkerState objects are complex and in your code snippet you made very strong assumptions on their internal behaviour.

As you said, this is a snippet I haven't put much thought into. The only relevant bit is that the actual modifications are encapsulated in a dedicated function, so that there is one place to change things. I didn't want to argue about the very specific way to do this until we reached a conclusion about whether this is something we want to have at all. Please do not reject the idea based on that snippet.

> What you're proposing has the big potential that, in a future PR completely unrelated to rebalance, rebalance will break without any of the unit tests noticing.

This assumes that there are no tests which cover the entire system, which is not something I propose. A bunch of the tests you wrote are still very important, exactly the way you wrote them, to know that everything works when put together.
I'm proposing to implement tests covering the particular logic of the rebalancing algorithm as just that. The rebalancing algorithm is the solution to the problem "given X workers with keys Y and weights W and hardware measurements Z, calculate the transfers needed so that the variance of weights per worker is minimal". This is formally not 100% correct, but my point is that this is a purely mathematical problem, and as such already a hard problem worth thorough test coverage on its own.

How we get the system into the proposed state is a hard problem in itself, but less of a mathematical one. It is where we need to deal with all the problems distributed systems introduce, like latencies, comm failures, dead nodes, race conditions, etc. It is also a problem space where I think we should have enough edge-case tests for non-happy paths.

Then there is the "plugging both pieces together" problem, which doesn't require huge amounts of tests if both of the above sub-problems are well tested.

> At one point I spent a week rewriting white box tests so that we generally only touched user-level API. This was hard because we needed to craft very specific situations where the desired behavior would arise. In hindsight though this approach has provided a ton of value. We're able to make large changes to internal scheduling state and the test suite remains valid.

I get that. That's the classical test hierarchy problem, and both from my experience and from the literature I believe a healthy balance is good.
This flexibility may have given us the opportunity to rewrite a lot of code, but we also dropped the ball on a few things we haven't had the chance to clean up yet. Most of my time with the distributed code base I have been trying very hard to construct situations like you described; it took me much more than a week, and we're still having trouble actually reproducing some cases, in particular (but not exclusively) the various race conditions and deadlocks associated with the worker state machine.

The typical approach to this problem is to work with appropriate internal abstractions and (private) interfaces. If existing abstractions are not useful, let's create new ones or change existing ones. If WorkerState is not a suitable interface, let's not use it. Maybe WorkerState is indeed too complex and we should just provide raw data (e.g. a dict per worker with the necessary measurements) to the algorithm. After all, we definitely don't want the algorithm part to actually mutate the state. Ideally, we wouldn't want to touch the algorithm at all if we were to change WorkerState, so there is no reason why it needs to be exposed to the actual objects. If we wanted to stick with WorkerState for performance and complexity reasons, we might want to look into whether the operations I proposed above are universal enough that they could become a method of the WorkerState object, making them less fragile to use in tests since they would then be used all over the scheduler. Maybe an entirely different approach is better, but I'm very certain we can find a robust interface if we want to.

I'm actually more concerned with arguing that we want more of these thin internal interfaces than with how exactly they will look. In general I believe we should aim for having more of them, since this allows for a less monolithic software architecture and lets us, if need be, prepare a test system more easily for non-happy edge cases. It also helps with test runtime and flakiness, which are growing, but that's only a secondary point.
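
To make the "raw data" idea concrete, here is a hypothetical shape such a thin interface could take; every name below is invented for illustration and nothing like it exists in this PR:

```python
# Hypothetical sketch of a thin "raw data" interface for the decision step:
# the algorithm sees plain measurements and returns proposed moves, never
# touching SchedulerState or WorkerState directly. All names are invented.
from typing import Dict, List, NamedTuple, Tuple

class WorkerMeasurement(NamedTuple):
    address: str
    memory: int            # bytes, according to the chosen measure
    memory_limit: int      # bytes
    keys: Dict[str, int]   # key -> nbytes, in insertion order

def find_rebalance_moves(
    workers: List[WorkerMeasurement],
) -> List[Tuple[str, str, str]]:
    """Return (sender_address, recipient_address, key) tuples.

    A pure function of its inputs, so it could be unit-tested without futures,
    comms, heartbeats or real memory measurements.
    """
    raise NotImplementedError  # decision logic would go here
```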

@crusaderky (Collaborator, Author)

Merged again from main after #4853 was fixed.
All tests pass.
All review comments, with the exception of the refactoring, have been addressed; @fjetter please let me know if anything else is outstanding after my last few commits.

@fjetter (Member) commented May 28, 2021

Current test failures are (for reference):

  • #4859
  • #4839
  • #4862 (new but very likely unrelated)

I'll pass over the code once more, but I don't expect anything major to pop up. Let's have a chat about the possible refactoring this afternoon.

@fjetter (Member) left a comment

I had another look over the code and everything seems in order. Great job.

Regarding the refactoring, we couldn't reach a conclusion yet, but I'm fine with merging as is. If we decide in favour of a refactoring, it will be additional work on top of this anyhow.

@fjetter merged commit 9d4f0bf into dask:main on Jun 1, 2021
@crusaderky deleted the rebalance branch on Jun 1, 2021 at 14:21
@mrocklin (Member) commented Jun 1, 2021 via email
