Store text error messages and tracebacks in scheduler #5126

mrocklin · 2021-07-27T14:33:43Z

Currently the worker captures all exceptions and tracebacks and stores them on its local TaskStates. It also ships them up to the scheduler in serialized form so that they can be passed down to the client and replayed. This helps users understand remote errors.

However, relying on the client to see these exceptions doesn't always work well. For example, a computation may get stuck somewhere. In these situations it would be nice to rely on the Dask dashboard to present the errors instead. Unfortunately this is hard because the exceptions are in serialized form, and even if we felt comfortable deserializing them, they may rely on libraries or hardware that we don't have (like GPUs).

As a solution I suggest that we push up and store text versions of the exception and traceback along with their serialized forms.

Implementation plan

Fortunately we use the distributed.core.error_message function fairly consistently to create exceptions and tracebacks. This function already includes a "text" field that is a string representation of the exception. We should probably add a "traceback-text" field for the traceback, maybe renaming "text" to "exception-text". TODO: figure out how to stringify the frames in the traceback (see distributed.profile.call_stack).
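As a rough sketch of what the text fields could look like: the standard library's `traceback.format_tb` already turns a traceback's frames into strings, so it may be enough for the TODO above. `error_message_sketch` is a hypothetical stand-in, not the real `distributed.core.error_message`. The key names follow the ones floated above, and the serialized fields are left out.

```python
import traceback


def error_message_sketch(e):
    """Hypothetical stand-in for distributed.core.error_message.

    Only the text fields are built here; the real function also
    returns serialized "exception" and "traceback" entries.
    """
    tb = e.__traceback__
    return {
        "status": "error",
        "exception-text": repr(e),
        # format_tb returns one string per frame; join them into one blob
        "traceback-text": "".join(traceback.format_tb(tb)),
    }


def failing_task():
    return 1 / 0


try:
    failing_task()
except Exception as exc:
    msg = error_message_sketch(exc)
```

Both values are plain strings, so they can be stored and displayed without importing whatever libraries the original exception depends on.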

Then, in a few places like Worker.execute and Worker.transition_executing_done, we call this function and assign the exception and traceback attributes:

    ts.exception = result["exception"]
    ts.traceback = result["traceback"]

Then, in Worker.send_task_state_to_scheduler, we ship the worker's task state up to the scheduler:

    elif ts.exception is not None:
        d = {
            "op": "task-erred",
            "status": "error",
            "key": ts.key,
            "thread": self.threads.get(ts.key),
            "exception": ts.exception,
            "traceback": ts.traceback,
        }

In these cases we should probably also include the text variants.
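A minimal sketch of that message with the text variants added. Everything here is illustrative: `WorkerTaskStateSketch`, `task_erred_message`, and the `exception_text` / `traceback_text` names are assumptions, not existing attributes or keys, and the byte strings are placeholders for real serialized payloads.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkerTaskStateSketch:
    """Hypothetical, trimmed-down worker-side task state."""

    key: str
    exception: Optional[bytes] = None  # serialized exception (placeholder)
    traceback: Optional[bytes] = None  # serialized traceback (placeholder)
    exception_text: str = ""           # assumed new attribute
    traceback_text: str = ""           # assumed new attribute


def task_erred_message(ts, thread=None):
    """Build a "task-erred" message carrying both forms of the error."""
    return {
        "op": "task-erred",
        "status": "error",
        "key": ts.key,
        "thread": thread,
        "exception": ts.exception,
        "traceback": ts.traceback,
        # text variants ride along with the serialized ones
        "exception_text": ts.exception_text,
        "traceback_text": ts.traceback_text,
    }


ts = WorkerTaskStateSketch(
    key="inc-1",
    exception=b"<serialized>",
    traceback=b"<serialized>",
    exception_text="ZeroDivisionError('division by zero')",
    traceback_text='  File "example.py", line 2, in inc\n',
)
message = task_erred_message(ts, thread=1234)
```

Sending both forms keeps the client replay path unchanged while giving the scheduler something it can read directly.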

This message goes to the "task-erred" handler, which ends up in Scheduler.stimulus_task_erred. That calls transition, which probably ends up in transition_processing_erred, which eventually stores the exception on the TaskState:

    ts._erred_on.add(w or worker)
    if exception is not None:
        ts._exception = exception
    if traceback is not None:
        ts._traceback = traceback

Anyway, we need to follow the thread of the exception information through the worker-scheduler machinery, and then store the text versions on the TaskState.
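On the scheduler side, the end of that thread might look roughly like the sketch below. `SchedulerTaskStateSketch` and `record_error` are hypothetical names that mirror the snippet above rather than the real Scheduler code, and the text attributes are assumed additions.

```python
class SchedulerTaskStateSketch:
    """Hypothetical, trimmed-down scheduler-side task state."""

    def __init__(self, key):
        self.key = key
        self.erred_on = set()
        self.exception = None        # serialized, opaque to the scheduler
        self.traceback = None        # serialized, opaque to the scheduler
        self.exception_text = ""     # assumed new attribute
        self.traceback_text = ""     # assumed new attribute


def record_error(ts, worker, exception=None, traceback=None,
                 exception_text=None, traceback_text=None):
    """Store whichever error fields the worker sent."""
    ts.erred_on.add(worker)
    if exception is not None:
        ts.exception = exception
    if traceback is not None:
        ts.traceback = traceback
    if exception_text is not None:
        ts.exception_text = exception_text
    if traceback_text is not None:
        ts.traceback_text = traceback_text


sched_ts = SchedulerTaskStateSketch("inc-1")
record_error(
    sched_ts,
    worker="tcp://127.0.0.1:40000",
    exception=b"<serialized>",
    exception_text="ZeroDivisionError('division by zero')",
    traceback_text='  File "example.py", line 2, in inc\n',
)
```

With this in place, anything running on the scheduler can read `sched_ts.exception_text` directly and never needs to deserialize `sched_ts.exception`.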

After that's done

We should probably just merge that in. Once it's in, we should start thinking about how to present that information to the user. There are a variety of places in the dashboard where we could put it.


mrocklin mentioned this issue Jul 27, 2021

Highlight exceptions in Task Groups visualization #4979

Open

mrocklin added a commit to mrocklin/distributed that referenced this issue Jul 30, 2021

Add text exceptions to the Scheduler

0527db0

Fixes dask#5126

mrocklin mentioned this issue Jul 30, 2021

Add text exceptions to the Scheduler #5148

Merged

mrocklin closed this as completed in #5148 Aug 10, 2021

mrocklin added a commit that referenced this issue Aug 10, 2021

Add text exceptions to the Scheduler (#5148)

a742de4

Fixes #5126

fjetter mentioned this issue Aug 18, 2021

Worker has no exceptions attribute anymore #5225

Closed
