Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Retrospective] Use of ContextVars for passing stimulus_id's within the Scheduler #6107

Open
sjperkins opened this issue Apr 12, 2022 · 0 comments

Comments

@sjperkins
Copy link
Member

sjperkins commented Apr 12, 2022

This issue is a retrospective describing lesson's learnt while implementing the transmission of Stimulus ID's within the Scheduler using Context Variables. Context Variables were introduced in Python 3.7. Quoting PEP-567:

This PEP proposes a new contextvars module and a set of new CPython C APIs to support context variables. This concept is similar to thread-local storage (TLS), but, unlike TLS, it also allows correctly keeping track of values per asynchronous task, e.g. asyncio.Task.

The potential of ContextVar's here is that there is less need to pass state around the Scheduler internals and thereby avoid invasive code changes. In particular, they seem to be a good fit for implementing stimulus_id transmission through the scheduler. The idea is that there would be a global STIMULUS_ID ContextVar set in strategic locations within the scheduler (i.e. RPC Handler functions). There would then be no need to pass stimuli as function arguments throughout the scheduler, leading to less invasive changes -- scheduler functions could just call STIMULUS_ID.get().

  • Stimulus ID's are a method of tracking which event generated a series of rpc calls within a dask/distributed cluster.
  • They form part of a larger distributed tracing effort within dask/distributed.
  • They are frequently generated by the client and passed to and fro between Scheduler and Workers via RPC Handlers.
  • In the context of Transition tracing for scheduler task transitions #5849, it is not important to distinguish between Client and Scheduler for an update_graph_hlg event -- stimuli are in practice generated on the Scheduler although in principle the event transmitting the graph occurred on the Client.

Issues

  • distributed scheduler handlers often call other handlers internally. update_graph_hlg calls update_graph, which in turn calls client_releases_keys. This implies that stimuli should not be overridden.
  • This can result in the following sort of coding style:
    def client_releases_keys(..., stimulus_id=None):
      try:
        token = None
        STIMULUS_ID.get()  # Perhaps the STIMULUS_ID ContextVar was set in update_graph
      except LookupError:
       # Perhaps it's was passed by an external entity, but perhaps it wasn't?
        token = STIMULUS_ID.set(stimulus_id or f"client-releases-keys")
    
      try:
        # Actual client_releases_keys implementation, now heavily indented
        ...
    
      finally:
        # We should try to reset the STIMULUS_ID once we leave the function
        # because client_releases_keys is called frequently in our test suite and we
        # want other functions to be able to set the STIMULUS_ID
        if token:
          STIMULUS_ID.reset(token)
  • As STIMULUS_ID is a global variable, it was important to reset the token due to the extensive use of rpc handler calls within the test suite. Without this, STIMULUS_ID would be permanently set on the first call to client_releases_keys, for example.
  • Writing Handlers, contextmanagers and decorators to wrap async handler functions was complicated by asyncio.iscoroutinefunction returns false for Cython async function objects cython/cython#2273

Approaches

  • A variety of approaches were attempted to handle this complexity
  • Support stimulus_id's via Wrapped RPC handlers and ContextVars #6010. Here, RPCHandler classes wrapped Scheduler functions. STIMULUS_ID was set if it was provided as a kwarg during the RPC call, otherwise it was automatically created based on the function name. However, wrapped RPCHandler classes were deemed to be an excessive change at this point in time.
  • Support Stimulus ID's in Scheduler with ContextVars #6046 tried substituting a default_stimulus_id contextmanager for RPCHandlers, but this introduced too much indentation into the existing code-base
  • Stimulus id contextvars explicit handlers #6083 attempted to create a very clear boundary between RPC handler functions and internal scheduler functionality by creating for e.g. handle_update_graph and handle_update_graph_hlg handlers which merely called the update_graph and update_graph_hlg functions respectively. This removed much of the complexity about deciding if a previously called handler had set the stimulus_id, but required explicitly calling handlers and setting stimulus_id's in the test suite.
  • A common necessity for all these changes was the need to call self.loop.addcallback(copy_context().run, handler). This is necessary because of the possibility of crossing a thread boundary and the need to copy contexts. In particular, Tornado doesn't always handle ContextVars resets correctly in versions prior to 6.1.

Result

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant