
Race conditions in implicit creation of worker clients when serializing futures resulting in distributed.CancelledErrors #7498

Closed
fjetter opened this issue Jan 25, 2023 · 9 comments · Fixed by #8827



fjetter commented Jan 25, 2023

TLDR

Serializing futures as part of a dask collection can cause race conditions leading to spurious distributed.CancelledErrors or other entirely unrelated exceptions.

These can be particularly hard to debug since cause and effect are decoupled, i.e. task X can raise a CancelledError because task Y was lost, even though no relation between X and Y is apparent from the task graph.

I see a certain appeal to this feature, but I am wondering how widely it is used. I am tempted to remove it (at least temporarily) from the code base entirely.

This issue came up during a closer review of #6028. Besides the consistency issues this feature currently has, it also adds complexity to how we serialize graphs.

cc @jrbourbeau @mrocklin @rjzamora do any of you have more context about how commonly this feature is used and whether there are other cases I am missing? Feel free to skip the detailed explanation; I hope the example is sufficient to understand what I am talking about.

Example

(See test_client.py::test_serialize_collections for runnable code)

import dask.array as da

# `c` is an asynchronous distributed.Client, as in the referenced test
x = da.arange(10, chunks=(5,)).persist()

def f(x):
    assert isinstance(x, da.Array)
    return x.sum().compute()

future = c.submit(f, x)
result = await future
assert result == sum(range(10))

Alternative

This code can be replaced with the Client.{publish|get}_dataset functionality. This would require the user to take explicit ownership of the futures/dataset and the client and to manage their lifecycle, thereby avoiding all the problems introduced by the current magic.

dataset_name = "foo"
await c.publish_dataset(**{dataset_name: da.arange(10, chunks=(5,))})

def f():
    from distributed import worker_client
    with worker_client() as c:
        x = c.get_dataset(dataset_name)
        assert isinstance(x, da.Array)
        return x.sum().compute()

future = c.submit(f)
result = await future
assert result == sum(range(10))
await c.unpublish_dataset(dataset_name)

Detailed explanation

Dask allows the serialization of client.Future objects iff they are embedded in a collection [1]. An example of this (e.g. test_client.py::test_serialize_collections):

x = da.arange(10, chunks=(5,)).persist()

def f(x):
    assert isinstance(x, da.Array)
    return x.sum().compute()

future = c.submit(f, x)
result = await future
assert result == sum(range(10))

In this example, a collection is persisted, which materializes the array collection's task mapping as a mapping from key to Future.
If we then submit the collection as an argument to a task, the futures are serialized.

Since a future is owned by a client, we need to find a new owner for such a future upon deserialization. This is currently handled in __setstate__, see

def __setstate__(self, state):
    key, address = state
    try:
        c = Client.current(allow_global=False)
    except ValueError:
        c = get_client(address)
    self.__init__(key, c)
    c._send_to_scheduler(
        {
            "op": "update-graph",
            "tasks": {},
            "keys": [stringify(self.key)],
            "client": c.id,
        }
    )

This deserialization code first attempts to use a "current" client and falls back to worker.get_client, which tries to reuse an existing client but, if none is available, initializes a new one. [2]
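
For orientation, the fallback order works roughly like the sketch below. This is an illustrative paraphrase, not the verbatim implementation; resolve_client is a made-up name, and Worker._get_client is the private method whose body is quoted further down.

from distributed import Client
from distributed.worker import get_worker

def resolve_client(address=None):
    # Illustrative helper, not distributed's API: sketches the order in which
    # a client is resolved for a deserialized future.
    try:
        worker = get_worker()  # raises ValueError when not inside a worker
    except ValueError:
        worker = None
    if worker is not None:
        # Reuse the worker's client, creating it lazily if it does not exist
        # yet; this is where the implicit, never-awaited Client comes from.
        return worker._get_client()
    try:
        # Otherwise fall back to an already-registered default client.
        return Client.current()
    except ValueError:
        # Last resort: connect a brand-new client to the given address.
        return Client(address)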

So much for context. If all of this worked flawlessly it would just be a bit of complex magic, but the devil is in the details.

Specifically, the way the client is implicitly initialized can cause spurious CancelledErrors:

asynchronous = in_async_call(self.loop)
self._client = Client(
    self.scheduler,
    loop=self.loop,
    security=self.security,
    set_as_default=True,
    asynchronous=asynchronous,
    direct_to_workers=True,
    name="worker",
    timeout=timeout,
)
Worker._initialized_clients.add(self._client)

The initialization here starts by inferring whether or not we're in an async call,

def in_async_call(loop, default=False):
    """Whether this call is currently within an async call"""
    try:
        return loop.asyncio_loop is asyncio.get_running_loop()
    except RuntimeError:
        # No *running* loop in thread. If the event loop isn't running, it
        # _could_ be started later in this thread though. Return the default.
        if not loop.asyncio_loop.is_running():
            return default
        return False

This in_async_call is misleading since it does not actually tell you whether we are awaiting the callable but rather whether or not we are in the main thread. Therefore, the client will always be initialized asynchronously (i.e. I believe this check is superfluous).

This Client is never awaited. It schedules a future on the event loop that is never awaited but will eventually connect the client to the scheduler. If the connection attempt fails, this results in a CancelledError being raised as soon as the user attempts to fetch the future's result.

self._started = asyncio.ensure_future(self._start(**kwargs))

Another failure case is if the future being serialized is forgotten prematurely. Since the worker only tells the scheduler about its interest in these keys upon deserialization, there is a significant time window in which the initial client could have released the tasks. The simplest example is taking the original code and deleting the reference after submission:

    x = da.arange(10, chunks=(5,)).persist()

    def f(x):
        assert isinstance(x, da.Array)
        return x.sum().compute()

    future = c.submit(f, x)
    del x
    result = await future

I expected this to cause a CancelledError (and indeed I can see a FutureState.cancel call with a debugger), but instead I get an entirely unrelated exception from the array code: numpy.AxisError: axis 0 is out of bounds for array of dimension 0.
My suspicion is that cancelling an internal future of a collection causes weird side effects.

[1] Futures that are not embedded in a collection are not serialized. Instead, their key is tracked as a dependency and the materialized result is passed to the user function (see the short sketch after these footnotes).

[2] Note: the first try/except to get a current(allow_global=False) client is not very helpful for a couple of reasons. Firstly, during deserialization it is almost impossible to actually have a current client running (as in with client.as_current()), i.e. this is almost guaranteed to fail. Secondly, the get_client that is called as an exception handler is allowed to access a global/default client, i.e. disabling this in the initial attempt is misleading.
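
To make the contrast in [1] concrete, here is a minimal sketch of the bare-future case, assuming the same asynchronous client c as in the examples above:

# A Future passed directly to submit is not pickled; the scheduler treats it
# as a dependency and the task receives the materialized value instead.
fut = c.submit(lambda: 42)

def g(value):
    assert value == 42  # `value` is the result, not a Future
    return value + 1

result = await c.submit(g, fut)
assert result == 43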


fjetter commented Jan 25, 2023

I am a bit disappointed to see that publish_dataset also relies on the serializability of the futures. I expected something like __dask_postpersist__ to provide a static method to construct the object from a set of futures, but unfortunately that is not the case.

@rjzamora commented:

cc @madsbk (for viz)


fjetter commented Jan 25, 2023

FWIW I believe the original __dask_postpersist__ output as introduced in dask/dask#2748 was static, e.g.

    def __dask_postpersist__(self):
        return Array, (self.name, self.chunks, self.dtype)

which does not contain any references to the instance. This is basically a type and some literals, similar to what __reduce__ would return.

Nowadays, it looks like:

    def __dask_postpersist__(self):
        return self._rebuild, ()

    def _rebuild(self, dsk, *, rename=None):
        name = self._name
        if rename:
            name = rename.get(name, name)
        return Array(dsk, name, self.chunks, self.dtype, self._meta)

which directly references the instance via self._rebuild. I don't think that's truly necessary.
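
For illustration, a variant that avoids the bound method might look like the following sketch. _rebuild_static is a hypothetical staticmethod, not dask's actual API; it follows the __dask_postpersist__ convention of being called as rebuild(dsk, *extra_args, rename=...).

    def __dask_postpersist__(self):
        # no bound method: a plain callable plus literals
        return Array._rebuild_static, (self.name, self.chunks, self.dtype, self._meta)

    @staticmethod
    def _rebuild_static(dsk, name, chunks, dtype, meta, *, rename=None):
        # hypothetical helper for illustration only
        if rename:
            name = rename.get(name, name)
        return Array(dsk, name, chunks, dtype, meta)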


fjetter commented Jan 25, 2023

Looks like this self-reference was introduced in dask/dask#7142.


fjetter commented Jan 25, 2023

FWIW, to avoid the race above it is not necessary to forbid serialization entirely; we just have to disallow the implicit initialization of a new client.
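
For illustration only, a hypothetical variant of Future.__setstate__ that reuses an existing client but refuses to create a new one could look roughly like this (the lookup and the error message are made up, not a description of the actual change):

def __setstate__(self, state):
    key, address = state
    try:
        # Reuse a client that already exists in this process...
        c = Client.current()
    except ValueError:
        # ...but never start one implicitly (hypothetical behavior).
        raise RuntimeError(
            f"Cannot deserialize Future {key!r}: no existing Client is "
            "available and implicit client creation is disallowed."
        ) from None
    self.__init__(key, c)
    c._send_to_scheduler(
        {
            "op": "update-graph",
            "tasks": {},
            "keys": [stringify(self.key)],  # same helper as the original code
            "client": c.id,
        }
    )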


fjetter commented Jan 25, 2023

There is a similar pattern with scatter, e.g.

    # `df` is any pandas DataFrame; `dd` is dask.dataframe
    ddf = dd.from_pandas(df, npartitions=2).persist()
    future = await c.scatter(ddf)

I don't fully understand what this is intended to do. Why would we want to scatter the futures / a persisted collection?


fjetter commented Jan 25, 2023

Looks like this functionality was introduced in #1201. This was a major PR in the ancient past that changed how clients were initialized or reused. From the discussion there I gather that these tests (test_serialize_collections_of_futures_sync and test_serialize_collections_of_futures) were mostly introduced as sanity checks, not necessarily to support a particular user behavior.


fjetter commented Jan 25, 2023

I opened #7500, which would remove the possibility of passing collections of futures as arguments by removing the implicit client instantiation.

@jrbourbeau commented:

> do any of you have more context about how commonly this feature is used and whether there are other cases I am missing?

I'm not sure how widely this is used. I know xgboost.dask does some interesting things with persisted collections. It's probably worth running a toy xgboost.dask example against a PR that removes this functionality just to confirm it doesn't break things.
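
A toy check along those lines might look like the following sketch (the data, hyperparameters, and local Client are illustrative assumptions, not a specific reproducer):

import dask.array as da
from distributed import Client
from xgboost import dask as dxgb

client = Client()  # assumes a local or existing cluster is fine for the test

# Persisted collections are the case under discussion
X = da.random.random((1000, 10), chunks=(100, 10)).persist()
y = da.random.random(1000, chunks=100).persist()

dtrain = dxgb.DaskDMatrix(client, X, y)
output = dxgb.train(
    client,
    {"objective": "reg:squarederror", "tree_method": "hist"},
    dtrain,
    num_boost_round=10,
)
booster = output["booster"]  # train a few rounds and confirm nothing breaks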

@mrocklin are you aware of other use cases where persisted collections are used as inputs to a client.submit call?

It does look a bit off. We don't support dask-in-dask well today, so it seems strange that we are intentionally supporting it in this one case (where the collection has been persisted).
