Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the busy/idle execution state tracking for kernels. #1429

Merged
merged 22 commits into from
Jul 18, 2024

Conversation

ojarjur
Copy link
Contributor

@ojarjur ojarjur commented Jun 5, 2024

This PR makes two adjustments/enhancements for the way kernel execution state is tracked:

  1. Distinguish between "busy"/"idle" status messages with different parent IDs.

    A "busy" or "idle" status message is not sent in isolation, but rather in response to a specific parent message, and
    the kernel might still be busy with one parent message even after reporting that it is done (e.g. has an "idle" status)
    with a different parent message.

    Accordingly, the overall kernel execution state is only switched to "idle" once all previously seen "busy" status
    messages have had a corresponding "idle" message sent with the same parent message ID.

  2. Distinguish between different types of parent message when determining whether or not the overall kernel is "busy".

    Not every message type corresponds to a user action, so not all of them should be included in the logic for determining
    that a kernel is "busy". This is especially true for the case of kernel culling where the distinction between "idle" and
    "busy" is used to decide whether or not to cull the kernels.

    Since there might be different setups where the server admin might want to include/exclude different types of
    messages from this calculation, the set of message types used for status tracking is configurable.

Fixes #1360

This commit changes the way Jupyter server tracks both activity and
execution state so that both of those are based solely on user actions
rather than also incorporating control messages and other channels.

This is important for both correctly tracking the state of the kernel,
and also for culling kernels.

Previously, the last activity timestamp was updated on every message
on the iopub channel regardless of the type of message, and execution
state was based on every `status` message on that channel regardless
of what other message the `status` was in response to.

That behavior causes incorrect behavior for both kernel culling and
kernel status. In particular, it can result in both kernels being culled
too early and too late (depending on what the user is doing).

For example, if a user ran a long running code cell, and then switched
tabs within the JupyterLab UI, then the JupyterLab UI would send a
control message to the kernel and the kernel would respond with a status
message referencing that control message as its parent. As a result of
that, the Jupyter server would update its execution state to be `idle`
even though the long running code cell was still executing. This could
cause the kernel to be culled too soon.

Alternatively, if the user was not running anything and just switched
tabs within the JupyterLab UI, then the last activity timestamp would
be updated even though the user didn't run any code cells. That would
cause the kernel to be culled too late.

This change fixes both of those scenarios by making sure that it is
only user actions (and kernel messages in response to those user
actions) that cause the execution state and last activity timestamps
to be updated.

This initial commit contains just the code change so that I can
solicit feedback on this proposed change early.

The core of the change will have to be new tests, which will come
in a subsequent commit as part of the same pull request.
@krassowski krassowski added the bug label Jun 6, 2024
"execute_input",
"execute_reply",
"execute_request",
"inspect_request",

This comment was marked as resolved.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see it was also discussed in depth in #1360 (comment). I am not sure if there was a resolution though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if there was a resolution though.

We didn't come to a consensus on the set of message types to include, but I think we did get a pretty clear consensus that the set of message types should be configurable so that server admins can override whatever default we choose.

I don't know if we can get a consensus on the set of types to leave in the default value for the config, but if we do get a consensus that something should be added to (or removed from) this list, then please let me know.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed on the call today, it might be more conservative to change this list to a not_tracked_message_types, as it would be less risk of accidentally forgetting that a certain message type should be counted as kernel activity. It would also mean that the risk changes from "culling a kernel that shouldn't be culled" (bad because of potential data loss) to "not culling a kernel that can be culled" (bad because of potential cost / resource use), where the former seems more disruptive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've put together a tentative list of what I think would be the default set of not-tracked message types.

I got this by going through this page, and selecting all of the message types that were either:

  1. Not sent on the shell channel, or
  2. Of the form *_info_(request|reply).

This is almost disjoint from the proposed list for tracked message types. The difference is that the execute_input message that was going to be tracked is included here in the not-tracked list because it is sent on the IOPub channel instead of the shell channel.

The full list is:

comm_info_request,
comm_info_reply,
kernel_info_request,
kernel_info_reply,
shutdown_request,
shutdown_reply,
interrupt_request,
interrupt_reply,
debug_request,
debug_reply,
stream,
display_data,
update_display_data,
execute_input,
execute_result,
error,
status,
clear_output,
debug_event,
input_request,
input_reply

@krassowski @vidartf @Zsailer Can you all please help me double check this?

In particular:

  1. Is my methodology described above the right approach to take?
  2. Does the list I came up with look right?

Thanks, in advance!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ojarjur. I'll give this a closer look later today.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This list looks right to me.

Thanks, @ojarjur. I believe this is good to go.

@ojarjur
Copy link
Contributor Author

ojarjur commented Jun 6, 2024

It looks like the newly added test is flaky and I was able to reproduce that on my local machine; 6 out of 100 runs failed with the same error.

I'll try to eliminate the flakyness and then update this PR

@ojarjur
Copy link
Contributor Author

ojarjur commented Jun 6, 2024

It looks like the newly added test is flaky and I was able to reproduce that on my local machine; 6 out of 100 runs failed with the same error.

I'll try to eliminate the flakyness and then update this PR

Fixed now; I ran the test 100 times locally and it passed every time.

@ojarjur
Copy link
Contributor Author

ojarjur commented Jun 7, 2024

It looks like the newly added test is flaky and I was able to reproduce that on my local machine; 6 out of 100 runs failed with the same error.
I'll try to eliminate the flakyness and then update this PR

Fixed now; I ran the test 100 times locally and it passed every time.

I think there's a second race condition contributing to the test flakiness; the second one isn't an issue in the test but rather a preexisting race condition in the code that the test uncovered.

Specifically, this code can wind up being run after this code if the kernel starts up quickly enough...

That, in turn, can result in a correctly set "busy" and/or "idle" kernel execution state being overwritten with an erroneous state of "starting".

I'm not sure if the kernel execution state should be set at all in the _async_start_kernel method (it seems to only matter if the kernel manager's ready method raises an exception), but I'm quite certain that if it is set then it must be set before the _finish_kernel_start method is invoked.

…ad of retrying the call to get the kernel state just mark the whole test as flaky so it gets retried
@Zsailer
Copy link
Member

Zsailer commented Jul 18, 2024

Amazing work, @ojarjur! Merging 🚀

@Zsailer Zsailer merged commit a6d2d35 into jupyter-server:main Jul 18, 2024
36 checks passed
@ojarjur ojarjur deleted the ojarjur/fix-kernel-status branch July 30, 2024 16:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Navigating in the JupyterLab UI can prevent idle kernels from being culled.
4 participants