
Client and worker_client refactor #1201

Merged (5 commits) · Jul 5, 2017
Conversation

@mrocklin commented Jun 23, 2017

This is a major rewrite of how we handle tasks that create clients for complex dynamic workloads.

Previously we would start up a new client every time someone used worker_client

def func():
    with worker_client() as c:
        ...

We change this in two large ways:

  1. In the API, we split this into two functions, get_client and secede (see the sketch below).
  2. In the implementation, we keep the same client around forever.
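
A minimal sketch of the new pattern (get_client and secede are the names this PR introduces; inc and func are illustrative):

from dask.distributed import get_client, secede

def inc(x):
    return x + 1

def func(n):
    client = get_client()    # reuse the worker's long-lived client
    futures = client.map(inc, range(n))
    secede()                 # step out of the worker's thread pool while we block
    return client.gather(futures)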

Reusing the same client, and requesting it from many threads, required us to be a bit clever with locks, which is probably error prone. This also exposed some thread-unsafety issues in our ThreadPoolExecutor fork.

This also removes channels. It was a pain to maintain the same behavior when all of the worker_clients were in fact the same client.
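
For what replaces channels, the commit notes later in this thread point to Queues. A hedged sketch, assuming the distributed.Queue API of this era:

from dask.distributed import Client, Queue

client = Client(processes=False)
q = Queue('results')    # a named queue, visible to every client on the cluster

q.put(123)              # values (and futures) can be passed through it
assert q.get() == 123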

@mrocklin changed the title from "Merge WorkerClient into Client" to "Client and worker_client refactor" on Jun 23, 2017
@mrocklin force-pushed the client-multi-thread branch 4 times, most recently from 9cd47a0 to c482831 on June 29, 2017
@mrocklin

Rebased to more sensible commits. Going to merge this soon.

@adamklein commented Jun 30, 2017

Can I nest future objects into, say, a dictionary? E.g., instead of client.submit(f, x) where x is a future, can I do client.submit(f, {'x': x})? I tried this approach and got:

Exception ignored in: <object repr() failed>
Traceback (most recent call last):
  File "/home/aklein/src/distributed/distributed/client.py", line 282, in __del__
    self.release()
  File "/home/aklein/src/distributed/distributed/client.py", line 267, in release
    with self.client._lock:
AttributeError: 'Future' object has no attribute 'client'

@mrocklin

This brings up a good question: when do we want to materialize futures to track dependencies, and when do we want to serialize them without tracking dependencies?

In the case you describe above I would actually expect that to create a dependency on x, not serialize it.

In the case you brought up on Stack Overflow it's actually not clear to me what the best behavior is: do we want to finalize the result and create a dependency, or move around the collection holding the futures?
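
To make the two options concrete, a small illustrative sketch (not asserting what the current behavior is):

from dask.distributed import Client

client = Client(processes=False)
x = client.scatter(123)

# Passing the future directly: the scheduler records a dependency
# and the task receives the concrete value 123.
f1 = client.submit(lambda v: v, x)

# Passing the future inside a collection: the open question is whether
# to materialize it (a dependency on x) or to ship the dict holding the
# raw future for the task to resolve itself.
f2 = client.submit(lambda d: d, {'x': x})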

@mrocklin

Here is the behavior I get when trying your example:

In [1]: from dask.distributed import Client

In [2]: client = Client()
 
In [3]: future = client.scatter(123)

In [4]: future2 = client.submit(lambda x: x, {'x': future})

In [5]: future2.result()
Out[5]: {'x': 123}

@adamklein

I guess I'm a little confused by what we mean when we say we support serialization of dask collections or futures. For instance, this doesn't work for a dask dataframe, but it does for a concrete value:

import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

def task1():
    return pd.DataFrame(np.random.rand(10, 10))

def task2(df):
    return df

def testme():
    with Client() as client:
        ddf = dd.from_delayed([client.submit(task1)])
        ddf = client.persist(ddf)
        fut = client.scatter(ddf)
        # fut = client.scatter(123)  # this concrete value works
        fut2 = client.submit(task2, fut)
        print(client.compute(fut2, sync=True))

because I get

ValueError: No global client found

What I want to do is send the handle to the ddf or future to the worker client and manipulate it there. Is this still not possible at the moment? I'd be happy with a way to wrap ddf so it's like client.submit(task2, serialize(ddf)) - do I have access to such a magical serialize function?

@adamklein

Also, why does this fail?

# same imports as above
def task1():
    return pd.DataFrame(np.random.rand(10, 10))

def task2(df):
    return df

def testme():
    with Client() as client:
        ddf = dd.from_delayed([client.submit(task1)])
        fut2 = client.submit(task2, ddf)
        print(client.compute(fut2, sync=True))

@adamklein

I guess even more fundamentally, I thought I was replicating your test here, but this fails with the same ValueError: No global client found:

from dask.distributed import Client
import dask.array as da

with Client() as client:
    x = da.arange(10, chunks=(5,)).persist()

    def f(x):
        assert isinstance(x, da.Array)
        return x.sum().compute()

    future = client.submit(f, x)
    result = client.compute(future, sync=True)
    print(result)

@adamklein commented Jun 30, 2017

This almost works:

# client 1:
ddf = dd.from_delayed(dfs)
ddf = client.persist(ddf)

# hold a reference to the future so it isn't ejected from memory
from distributed.protocol.pickle import dumps
storage[ddf_name] = (ddf, dumps(ddf))
....

# worker client:
from distributed.protocol.pickle import loads
ddf = loads(ddf_string)
....

Sometimes it falls over with:

Traceback (most recent call last):
  ....
  File "/home/aklein/src/distributed/distributed/protocol/pickle.py", line 59, in loads
    return pickle.loads(x)
AttributeError: Can't get attribute 'apply_and_enforce' on <module 'dask.dataframe.core' from '/home/aklein/src/dask/dask/dataframe/core.py'>

I know I'm doing some weird stuff, but it feels very close :)

EDIT: if I throw in strategically placed time.sleep(0.2) calls, I can get it pretty stable. Ha. This is so hacky.

@mrocklin

It looks like clients aren't being created on demand. I'll need to take a look. This probably won't happen any time today though. I appreciate the failing tests.

@adamklein

No problem, I appreciate the work, it's very promising! Have a great pre-4th-of-July weekend.

@mrocklin commented Jul 1, 2017

@adamklein the recent commits may interest you

@mrocklin commented Jul 3, 2017

This now feels ready to me. @adamklein, if you have a chance to try to break this, or @pitrou, if you have a chance to review, I would appreciate it.

@mrocklin commented Jul 3, 2017

I'm considering where to put documentation for this. I may want to put this in the dask/dask repository.

@adamklein

Giving it a look this morning!

@adamklein commented Jul 3, 2017

This is looking excellent. I'm running into timeout issues on my large task return values, but I think that's the problem I hacked around before and simply lost because of the worker_client refactor; I just have to throw those timeout hacks back in. I think it's a result of my scheduler being overworked and unable to communicate in a timely fashion: running locally I see it taking a lot of CPU and memory. Here is the stack trace just for information purposes, but again, I think it has nothing to do with your changes (or, e.g., the changes regarding de/serialization on a separate thread).

Traceback (most recent call last):                                                                                                                                  
  File "/lib/python3.5/site-packages/distributed-1.17.1+36.g007cbf73-py3.5.egg/distributed/comm/core.py", line 181, in connect         
	quiet_exceptions=EnvironmentError)                                                                                                                               
  File "/lib/python3.5/site-packages/tornado/gen.py", line 1055, in run                                                                
	value = future.result()                                                                                                                                          
  File "/lib/python3.5/site-packages/tornado/concurrent.py", line 238, in result                                                                                        
	raise_exc_info(self._exc_info)                                                                                                                                                                          
  File "<string>", line 4, in raise_exc_info                                                                                                                                                                         
tornado.gen.TimeoutError: Timeout

During handling of the above exception, another exception occurred:

Traceback (most recent call last):                                                                                                                                                                                                                                                                                           
	result = f.result()                                                                                                                                            
  File "lib/python3.5/site-packages/distributed-1.17.1+36.g007cbf73-py3.5.egg/distributed/client.py", line 152, in result             
	self._result, raiseit=False, callback_timeout=timeout)                                                                                                         
  File "lib/python3.5/site-packages/distributed-1.17.1+36.g007cbf73-py3.5.egg/distributed/utils.py", line 234, in sync               
	six.reraise(*error[0])                                                                                                                                           
  File "lib/python3.5/site-packages/six.py", line 686, in reraise                                                                     
	raise value                                                                                                                                                    
  File "lib/python3.5/site-packages/distributed-1.17.1+36.g007cbf73-py3.5.egg/distributed/utils.py", line 223, in f                 
	result[0] = yield make_coro()                                                                                                                                   
  File "lib/python3.5/site-packages/tornado/gen.py", line 1055, in run                                                               
	value = future.result()                                                                                                                                         
  File "lib/python3.5/site-packages/tornado/concurrent.py", line 238, in result                                                      
	raise_exc_info(self._exc_info)                                                                                                                                  
  File "<string>", line 4, in raise_exc_info                                                                                                                         
  File "lib/python3.5/site-packages/tornado/gen.py", line 1063, in run                                                               
	yielded = self.gen.throw(*exc_info)                                                                                                                             
  File "lib/python3.5/site-packages/distributed-1.17.1+36.g007cbf73-py3.5.egg/distributed/client.py", line 177, in _result           
	result = yield self.client._gather([self])                                                                                                                       
  File "lib/python3.5/site-packages/tornado/gen.py", line 1055, in run                                                               
	value = future.result()                                                                                                                                       
  File "lib/python3.5/site-packages/tornado/concurrent.py", line 238, in result                                                      
	raise_exc_info(self._exc_info)                                                                                                                                  
  File "<string>", line 4, in raise_exc_info                                                                                                                         
  File "lib/python3.5/site-packages/tornado/gen.py", line 1063, in run                                                                
	yielded = self.gen.throw(*exc_info)                                                                                                                              
  File "lib/python3.5/site-packages/distributed-1.17.1+36.g007cbf73-py3.5.egg/distributed/client.py", line 1245, in _gather           
	response = yield self.scheduler.gather(keys=keys)                                                                                                                                                 
  File "lib/python3.5/site-packages/tornado/gen.py", line 1055, in run                                                                                                       
	value = future.result()                                                                                                                                                                                          
  File "lib/python3.5/site-packages/tornado/concurrent.py", line 238, in result                                                     
	raise_exc_info(self._exc_info)                                                                                                                                   
  File "<string>", line 4, in raise_exc_info                                                                                                                         
  File "lib/python3.5/site-packages/tornado/gen.py", line 1063, in run                                                                
	yielded = self.gen.throw(*exc_info)                                                                                                                              
  File "lib/python3.5/site-packages/distributed-1.17.1+36.g007cbf73-py3.5.egg/distributed/core.py", line 423, in send_recv_from_rpc    
	comm = yield self.live_comm()                                                                                                                                     
  File "lib/python3.5/site-packages/tornado/gen.py", line 1055, in run                                                                
	value = future.result()                                                                                                                                           
  File "lib/python3.5/site-packages/tornado/concurrent.py", line 238, in result                                                        
	raise_exc_info(self._exc_info)                                                                                                                                   
  File "<string>", line 4, in raise_exc_info                                                                                                                         
  File "lib/python3.5/site-packages/tornado/gen.py", line 1063, in run                                                                
	yielded = self.gen.throw(*exc_info)                                                                                                                              
  File "lib/python3.5/site-packages/distributed-1.17.1+36.g007cbf73-py3.5.egg/distributed/core.py", line 399, in live_comm            
	connection_args=self.connection_args)                                                                                                                           
  File "lib/python3.5/site-packages/tornado/gen.py", line 1055, in run                                                               
	value = future.result()                                                                                                                                          
  File "lib/python3.5/site-packages/tornado/concurrent.py", line 238, in result                                                      
	raise_exc_info(self._exc_info)                                                                                                                                  
  File "<string>", line 4, in raise_exc_info                                                                                                                         
  File "lib/python3.5/site-packages/tornado/gen.py", line 1063, in run                                                                
	yielded = self.gen.throw(*exc_info)                                                                                                                            
  File "lib/python3.5/site-packages/distributed-1.17.1+36.g007cbf73-py3.5.egg/distributed/comm/core.py", line 190, in connect         
	_raise(error)                                                                                                                                                   
  File "lib/python3.5/site-packages/distributed-1.17.1+36.g007cbf73-py3.5.egg/distributed/comm/core.py", line 173, in _raise         
	raise IOError(msg)                                                                                                                                              
OSError: Timed out trying to connect to 'tcp://...:8786' after 5 s: connect() didn't finish in time                                    

@mrocklin commented Jul 3, 2017

Do you have a sense for what is slowing down your scheduler? It seems odd to me to see 5s of delay, even under very heavy load (thousands of workers).

You may be interested in this comment.

@mrocklin commented Jul 3, 2017

Documentation added in dask/dask#2501

Review thread on the new asynchronous property:

    from distributed.recreate_exceptions import ReplayExceptionClient
    ReplayExceptionClient(self)

    @property
    def asynchronous(self):

@pitrou

I think either the name of this property and/or its semantics and/or its docstring should be changed, since all three seem to be saying different things :-)

@mrocklin

I've expanded the docstring. Maybe this helps? Suggestions welcome.

Review thread on:

    if asynchronous:
        self.status = 'connecting'

    if self.asynchronous:

@pitrou

I must admit I fail to understand the intent of the .asynchronous property. If the user passed asynchronous=True, we may still start an event loop in a separate thread?

@mrocklin

The problem arises because we reuse the same client in both contexts. Consider the following:

from dask.distributed import Client, get_client

def inc(x):
    return x + 1

def g():
    # I run in a worker thread
    client = get_client()
    future = client.submit(inc, 1)
    return future.result()

async def f():
    async with Client(asynchronous=True, processes=False) as c:
        await c.submit(g)

The client in both cases is the same. We originally tell the client that it is operating in asynchronous mode. However, when it finds itself inside the g task it realizes that it definitely can't be in asynchronous mode, and so it switches over within that context.

Ideally we would have a function that could tell us if we "were in the event loop", but this isn't doable. So instead we take a signal from the user (that is, asynchronous=True) but disregard it when it's obviously false.

This approach would currently fail if we were to start a synchronous client and then try to use it asynchronously, but it seems to work in all other settings.
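
Not the actual implementation, but a minimal sketch of the heuristic described above (names are illustrative): honor the user's asynchronous=True only when the caller is actually on the event loop's thread.

import threading

class ClientSketch:
    def __init__(self, asynchronous=False):
        self._asynchronous = asynchronous  # the user's stated intent
        self._loop_thread_id = None        # recorded when the IOLoop thread starts

    @property
    def asynchronous(self):
        # Disregard the user's signal when it is obviously false: a call
        # made from a worker thread, rather than the event loop's thread,
        # cannot be awaiting us, so treat it as synchronous in that context.
        return (self._asynchronous
                and threading.get_ident() == self._loop_thread_id)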

@pitrou

> The client in both cases is the same.

Does it have to? :-) I don't think it would be terribly ruinous to restrict things so that asynchronous clients cannot be reused in a synchronous context, and vice versa.

@mrocklin

There is some value to reducing the number of active clients. I agree though that, especially given the failure of synchronous-intended clients to work in asynchronous environments, we do need to resolve this.

I think I'm still hoping for a way to help identify whether we should or should not act in an asynchronous manner that is separate from defining a client as asynchronous or not.

My inclination here is to merge things roughly as they are now. I think that the user-facing bits of this are decent. We have a known failure in one case but it's unlikely to be encountered by most users. This PR is large and I'd like to get it in soon-ish. I don't have strong confidence that I can find a good solution in the next day or two with other things going on.

@mrocklin

One issue that arises is that when we deserialize futures or queues or variables we don't know if the client should be asynchronous or synchronous. The user doesn't have a chance to tell us in this case.

@mrocklin commented Jul 3, 2017

@pitrou any response on the asynchronous comments? In general I agree that this isn't entirely clean. I do think it's significantly cleaner than before. I'm not sure yet how to improve on the situation.

@mrocklin mentioned this pull request on Jul 3, 2017
@adamklein

@mrocklin I am actually seeing a major memory leak in the scheduler on my example in which I am beating up the scheduler. That seems to be what is causing things to fall over. I rolled back to 1.17.1 and it disappears. I'll try to bisect to figure out which commit it happens on. I don't think I'll be able to get to it today, though - most likely 7/5.

@mrocklin commented Jul 3, 2017

I appreciate you digging into it. Let us know what you find.

@adamklein commented Jul 4, 2017

I did a git bisect and 068c571 is the commit where my dask stress test starts to make the scheduler blow out memory (and stress the CPU to boot).

@mrocklin commented Jul 4, 2017

@adamklein can you say a bit more about what your stress test is doing? For example, it would be useful to know if it is scatter/gathering a large amount of data between worker-clients.

@mrocklin commented Jul 4, 2017 via email

@adamklein

@mrocklin Unfortunately my example still seems to leak memory and run at high CPU with your latest commit. I'm seeing about a 10x drop in performance, in both CPU and memory usage, so that I can't do what I used to. I have tasks launching tasks and passing 100s of MBs to GBs as task results between workers. This is certainly an example I'd like to show you offline when everything is in place.

@mrocklin commented Jul 5, 2017

@adamklein it would be interesting to see if the scheduler is reporting high communication volume. There is a plot at http://scheduler-address:8787/system that might be of interest.

@adamklein

@mrocklin The answer is yes, there is high communication volume. Here are my snapshots.

Off current dask master:

[screenshot: system bandwidth plot]

Off the latest commit of #1201:

[screenshot: system bandwidth plot]

@adamklein

PS: I am using on_completed, which I see calls into self.client._gather([self]) around line 177 of client.py; this has the default parameter direct=False, and probably local_worker=None as well. I assume we want either direct=True or local_worker=(something). Granted, this is just from reading code; I haven't done any tracing, so I could be totally off base here.

@adamklein

PPS: I just confirmed that if I put direct=True into self.client._gather([self], direct=True) at the above-mentioned line in client.py, the problem disappears. I don't know if that has any negative consequences, but it reverts the behavior to what happens on the master branch in terms of memory and bandwidth utilization.
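
For context, a hedged sketch of what that flag changes, assuming the public gather API also accepts direct (the workload here is illustrative): direct=True makes the client pull results straight from the workers that hold them instead of routing the bytes through the scheduler.

from dask.distributed import Client

client = Client()
futures = client.map(lambda x: x ** 2, range(8))

# Default path: result bytes may be routed through the scheduler,
# which can overload it when results are large.
results = client.gather(futures)

# Direct path: the client connects to the workers and pulls the
# data itself, sparing the scheduler the bandwidth.
results = client.gather(futures, direct=True)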

Commit messages from the rebased branch:

- These [channels] weren't heavily used, and where they were used they are now generally replaceable with Queues, which are cleaner.
- This caused unpleasant warnings in tests otherwise.
- These [get_client and secede] separate and optionally replace worker_client. Additionally, this required extra logic for handling default clients and workers, and serialization of futures, queues, and variables. This created functionality and tests that also triggered changes to how the scheduler tracks long-running tasks.
@mrocklin commented Jul 5, 2017

I have rebased the commits to be somewhat cleaner. Merging once tests pass.

@mrocklin commented Jul 5, 2017

> PPS: I just confirmed that if I put direct=True into self.client._gather([self], direct=True) at the above-mentioned line in client.py, the problem disappears. I don't know if that has any negative consequences, but it reverts the behavior to what happens on the master branch in terms of memory and bandwidth utilization.

It looks like Future.result would suffer from this as well. I'll try to resolve it in another PR after this gets merged.

@mrocklin commented Jul 5, 2017

@adamklein thanks for tracking that down.
