Currently the worker captures all exceptions and tracebacks and stores them on its local TaskStates. It also ships them up to the scheduler in serialized form so that they can be passed down to the client and replayed. This is very helpful for letting users understand remote errors.
However, relying on the client to see these exceptions sometimes doesn't work well. For example, a computation may get stuck somewhere. In these situations it would be nice to rely on the Dask dashboard to present these errors to us. Unfortunately this is hard because the exceptions are in serialized form, and even if we felt comfortable deserializing them they may rely on libraries or hardware that we don't have (like GPUs).
As a solution I suggest that we push up and store text versions of the exception and traceback along with their serialized forms.
Implementation plan
Fortunately we use the distributed.core.error_message function fairly consistently to build the exception and traceback messages. This function already includes a "text" field that is a string representation of the exception. We should probably add a "traceback-text" field for the traceback, and maybe rename "text" to "exception-text". TODO: figure out how to stringify the frames in the traceback (see distributed.profile.call_stack).
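As a rough sketch of what the extra fields could look like, the snippet below builds a message with both text variants using only the standard library. The function name error_message_with_text is hypothetical and is not the real distributed.core.error_message; the field names follow the suggestion above. Serializing the traceback itself is left out here, and traceback.format_tb is just one possible way to stringify the frames.

```python
import pickle
import traceback


def error_message_with_text(exc: BaseException) -> dict:
    """Hypothetical stand-in for an error_message-style helper.

    Returns the serialized exception plus plain-text versions of the
    exception and its traceback, so a reader never has to unpickle anything.
    """
    return {
        # Serialized form, for the client to replay later.
        "exception": pickle.dumps(exc),
        # Text forms, safe to display anywhere.
        "exception-text": repr(exc),
        "traceback-text": "".join(traceback.format_tb(exc.__traceback__)),
    }


def failing_task():
    return 1 / 0


try:
    failing_task()
except ZeroDivisionError as e:
    msg = error_message_with_text(e)

print(msg["exception-text"])
print(msg["traceback-text"])
```

The text fields are ordinary strings, so they can be logged or rendered without importing whatever library raised the error.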
Then, in a few places like Worker.execute and Worker.transition_executing_done, we call this function and assign the exception and traceback attributes.
And then in Worker.send_task_state_to_scheduler we ship the information of a worker task state up to the scheduler.
Probably in these cases we should also include the text variants.
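The worker side might look roughly like the following. This is a toy model, not the real Worker code: ToyTaskState, run_task, task_erred_message and the exception_text / traceback_text attribute names are all assumptions made for illustration.

```python
import pickle
import traceback
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ToyTaskState:
    """Hypothetical, stripped-down worker-side task state."""
    key: str
    state: str = "executing"
    exception: Optional[bytes] = None  # serialized form
    exception_text: str = ""           # text variant
    traceback_text: str = ""           # text variant


def run_task(ts: ToyTaskState, func: Callable[..., Any], *args: Any) -> Any:
    """Run func and record both serialized and text forms on failure."""
    try:
        result = func(*args)
    except Exception as exc:
        ts.state = "error"
        ts.exception = pickle.dumps(exc)
        ts.exception_text = repr(exc)
        ts.traceback_text = "".join(traceback.format_tb(exc.__traceback__))
        return None
    ts.state = "memory"
    return result


def task_erred_message(ts: ToyTaskState) -> dict:
    """Build the message shipped to the scheduler, text variants included."""
    return {
        "op": "task-erred",
        "key": ts.key,
        "exception": ts.exception,
        "exception_text": ts.exception_text,
        "traceback_text": ts.traceback_text,
    }


def parse_port(value: str) -> int:
    return int(value)


ts = ToyTaskState(key="parse-1")
run_task(ts, parse_port, "not-a-number")
msg = task_erred_message(ts)
print(msg["exception_text"])
```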
This sends a message to the "task-erred" handler, which ends up in Scheduler.stimulus_task_erred, which calls transition, which probably ends up in transition_processing_erred, which eventually stores the exception on the TaskState.
Anyway, we need to follow the thread of the exception information through the worker-scheduler machinery, and then store the text versions on the TaskState.
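On the scheduler side, the point is that the text versions can be stored and shown without ever deserializing the exception. A minimal sketch, assuming a hypothetical ToySchedulerTaskState and handler (these are not the real Scheduler methods, and the message keys are made up for the example):

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ToySchedulerTaskState:
    """Hypothetical, stripped-down scheduler-side task state."""
    key: str
    state: str = "processing"
    exception: Optional[bytes] = None  # kept opaque, never unpickled here
    exception_text: str = ""
    traceback_text: str = ""


def handle_task_erred(tasks: Dict[str, ToySchedulerTaskState], msg: dict) -> ToySchedulerTaskState:
    """Store the serialized and text forms from a task-erred style message."""
    ts = tasks[msg["key"]]
    ts.state = "erred"
    ts.exception = msg.get("exception")
    ts.exception_text = msg.get("exception_text", "")
    ts.traceback_text = msg.get("traceback_text", "")
    return ts


def describe_error(ts: ToySchedulerTaskState) -> str:
    """Render something a dashboard could display, using only the text fields."""
    return f"{ts.key}: {ts.exception_text}\n{ts.traceback_text}"


tasks = {"parse-1": ToySchedulerTaskState(key="parse-1")}
msg = {
    "op": "task-erred",
    "key": "parse-1",
    "exception": b"<opaque serialized bytes>",
    "exception_text": "ValueError('bad input')",
    "traceback_text": '  File "job.py", line 3, in parse\n',
}
ts = handle_task_erred(tasks, msg)
print(describe_error(ts))
```

Because the serialized payload stays as opaque bytes, this path works even when the scheduler lacks the libraries or hardware the exception depends on.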
After that's done
We should probably just merge that in. After that, we should start thinking about how to present this information to the user. There are a variety of places in the dashboard where we could put it.