Store text error messages and tracebacks in scheduler #5126

Closed
mrocklin opened this issue Jul 27, 2021 · 0 comments · Fixed by #5148

Currently the worker captures all exceptions and tracebacks, stores them on the local TaskStates, and also ships them up to the scheduler in serialized form so that they can be passed down to the client to be replayed. This is very helpful for letting users understand remote errors.

However, relying on the client to see these exceptions sometimes doesn't work well. For example, a computation may get stuck somewhere. In these situations it would be nice to rely on the Dask dashboard to present these errors to us. Unfortunately this is hard because the exceptions are in serialized form, and even if we felt comfortable deserializing them, they may rely on libraries or hardware that we don't have (like GPUs).

As a solution, I suggest that we push up and store text versions of the exception and traceback alongside their serialized forms.

Implementation plan

Fortunately, we use the distributed.core.error_message function fairly consistently to create exceptions and tracebacks. This function already includes a "text" field that is a string representation of the exception. We should probably add a "traceback-text" field for the traceback, and maybe rename "text" to "exception-text". TODO: figure out how to stringify the frames in the traceback (see distributed.profile.call_stack).
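For the TODO above, one option is the standard library's `traceback` module. Here is a minimal sketch, assuming the "exception-text" and "traceback-text" field names proposed above. `error_message_sketch` is a hypothetical stand-in and not the real distributed.core.error_message, and plain pickle stands in for whatever serialization the real function uses.

```python
import pickle
import traceback


def error_message_sketch(e):
    """Hypothetical stand-in for distributed.core.error_message."""
    return {
        "status": "error",
        # Serialized form for the client. Plain pickle is an assumption here,
        # and the serialized traceback is left out because pickle cannot
        # handle traceback objects.
        "exception": pickle.dumps(e),
        # Plain-text forms that can be shown without deserializing anything.
        "exception-text": repr(e),
        "traceback-text": "".join(traceback.format_tb(e.__traceback__)),
    }


def fail():
    raise ValueError("bad input")


try:
    fail()
except Exception as exc:
    msg = error_message_sketch(exc)

print(msg["exception-text"])
print(msg["traceback-text"])
```

The text fields are ordinary strings, so anything that receives the message can display them even if the libraries needed to rebuild the original exception are missing.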

Then, in a few places like Worker.execute and Worker.transition_executing_done, we call this function and assign the exception and traceback attributes:

                ts.exception = result["exception"]
                ts.traceback = result["traceback"]

Then, in Worker.send_task_state_to_scheduler, we ship the worker's task state information up to the scheduler:

        elif ts.exception is not None:
            d = {
                "op": "task-erred",
                "status": "error",
                "key": ts.key,
                "thread": self.threads.get(ts.key),
                "exception": ts.exception,
                "traceback": ts.traceback,
            }

In these cases we should probably also include the text variants.
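A rough sketch of what that could look like on the worker side. Everything here is illustrative: `WorkerTaskState`, `record_error`, and `task_erred_message` are hypothetical helpers, and the `exception_text` / `traceback_text` attribute and message key names are assumptions, chosen with underscores so they could double as Python keyword arguments.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkerTaskState:
    """Hypothetical, trimmed-down stand-in for the worker's TaskState."""

    key: str
    exception: Optional[bytes] = None
    traceback: Optional[bytes] = None
    exception_text: str = ""
    traceback_text: str = ""


def record_error(ts, result):
    """Copy both the serialized and the text forms onto the task state."""
    ts.exception = result["exception"]
    ts.traceback = result["traceback"]
    ts.exception_text = result["exception-text"]
    ts.traceback_text = result["traceback-text"]


def task_erred_message(ts, thread=None):
    """Build the "task-erred" message, now carrying the text variants too."""
    return {
        "op": "task-erred",
        "status": "error",
        "key": ts.key,
        "thread": thread,
        "exception": ts.exception,
        "traceback": ts.traceback,
        "exception_text": ts.exception_text,
        "traceback_text": ts.traceback_text,
    }


ts = WorkerTaskState(key="inc-123")
record_error(
    ts,
    {
        "exception": b"<serialized exception>",
        "traceback": b"<serialized traceback>",
        "exception-text": "ValueError('bad input')",
        "traceback-text": '  File "example.py", line 1, in inc\n',
    },
)
message = task_erred_message(ts, thread=1234)
```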

This sends a message to the "task-erred" handler, which ends up in Scheduler.stimulus_task_erred, which calls transition, which probably ends up in transition_processing_erred, which eventually stores the exception on the TaskState:

            ts._erred_on.add(w or worker)
            if exception is not None:
                ts._exception = exception
            if traceback is not None:
                ts._traceback = traceback

Anyway, we need to follow the thread of the exception information through the worker-scheduler machinery and then store the text versions on the scheduler's TaskState.
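The scheduler end of that thread might look roughly like the following. This is a sketch under stated assumptions and not the actual scheduler code: `SchedulerTaskState` and `store_error` are hypothetical, as are the `exception_text` and `traceback_text` names.

```python
class SchedulerTaskState:
    """Hypothetical, trimmed-down stand-in for the scheduler's TaskState."""

    def __init__(self, key):
        self.key = key
        self.erred_on = set()
        self.exception = None       # serialized, opaque to the scheduler
        self.traceback = None       # serialized, opaque to the scheduler
        self.exception_text = ""    # plain string, safe to display
        self.traceback_text = ""    # plain string, safe to display


def store_error(ts, worker, exception=None, traceback=None,
                exception_text=None, traceback_text=None):
    """Store error information the same way the existing fields are stored."""
    ts.erred_on.add(worker)
    if exception is not None:
        ts.exception = exception
    if traceback is not None:
        ts.traceback = traceback
    if exception_text is not None:
        ts.exception_text = exception_text
    if traceback_text is not None:
        ts.traceback_text = traceback_text


sched_ts = SchedulerTaskState("inc-123")
store_error(
    sched_ts,
    worker="tcp://127.0.0.1:40000",
    exception=b"<serialized exception>",
    traceback=b"<serialized traceback>",
    exception_text="ValueError('bad input')",
    traceback_text='  File "example.py", line 1, in inc\n',
)

# Anything running next to the scheduler can read the text directly,
# without deserializing the exception.
print(sched_ts.exception_text)
```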

After that's done

We should probably just merge that in. Once it's in, we should start thinking about how to present that information to the user. There are a variety of places in the dashboard where we could put this.
