From caa072ead7529a74b55d4bd53209e6f34717fa78 Mon Sep 17 00:00:00 2001 From: Lee Verberne Date: Wed, 25 Oct 2017 17:13:22 +0200 Subject: [PATCH 1/4] Update API of Troubleshooting Ephemeral Containers - Report both v1.Container & v1.ContainerStatus in PodStatus - Persist v1.Container as a container runtime label - Start ephemeral containers from the kubelet pod worker --- .../node/troubleshoot-running-pods.md | 441 +++++++++--------- 1 file changed, 229 insertions(+), 212 deletions(-) diff --git a/contributors/design-proposals/node/troubleshoot-running-pods.md b/contributors/design-proposals/node/troubleshoot-running-pods.md index 72c1cb77321..2707a2b4b39 100644 --- a/contributors/design-proposals/node/troubleshoot-running-pods.md +++ b/contributors/design-proposals/node/troubleshoot-running-pods.md @@ -50,20 +50,21 @@ A solution to troubleshoot arbitrary container images MUST: * require no administrative access to the node * have an excellent user experience (i.e. should be a feature of the platform rather than config-time trickery) -* have no *inherent* side effects to the running container image +* have no _inherent_ side effects to the running container image +* v1.Container must be available for inspection by admission controllers ## Feature Summary Any new debugging functionality will require training users. We can ease the transition by building on an existing usage pattern. We will create a new command, `kubectl debug`, which parallels an existing command, `kubectl exec`. -Whereas `kubectl exec` runs a *process* in a *container*, `kubectl debug` will -be similar but run a *container* in a *pod*. +Whereas `kubectl exec` runs a _process_ in a _container_, `kubectl debug` will +be similar but run a _container_ in a _pod_. -A container created by `kubectl debug` is a *Debug Container*. Just like a +A container created by `kubectl debug` is a _Debug Container_. 
Just like a process run by `kubectl exec`, a Debug Container is not part of the pod spec and has no resource stored in the API. Unlike `kubectl exec`, a Debug Container -*does* have status that is reported in `v1.PodStatus` and displayed by `kubectl +_does_ have status that is reported in `v1.PodStatus` and displayed by `kubectl describe pod`. For example, the following command would attach to a newly created container in @@ -82,22 +83,16 @@ kubectl debug target-pod This creates an interactive shell in a pod which can examine and signal other processes in the pod. It has access to the same network and IPC as processes in -the pod. It can access the filesystem of other processes by `/proc/$PID/root`. -As is already the case with regular containers, Debug Containers can enter -arbitrary namespaces of another container via `nsenter` when run with -`CAP_SYS_ADMIN`. +the pod. When [process namespace sharing](https://features.k8s.io/495) is +enabled, it can access the filesystem of other processes by `/proc/$PID/root`. +Debug Containers can enter arbitrary namespaces of another visible container via +`nsenter` when run with `CAP_SYS_ADMIN`. -*Please see the User Stories section for additional examples and Alternatives -Considered for the considerable list of other solutions we considered.* +_Please see the User Stories section for additional examples and Alternatives +Considered for the considerable list of other solutions we considered._ ## Implementation Details -The implementation of `kubectl debug` closely mirrors the implementation of -`kubectl exec`, with most of the complexity implemented in the `kubelet`. How -functionality like this best fits into Kubernetes API has been contentious. In -order to make progress, we will start with the smallest possible API change, -extending `/exec` to support Debug Containers, and iterate. - From the perspective of the user, there's a new command, `kubectl debug`, that creates a Debug Container and attaches to its console. 
We believe a new command will be less confusing for users than overloading `kubectl exec` with a new @@ -106,13 +101,88 @@ subsequently be used to reattach and is reported by `kubectl describe`. ### Kubernetes API Changes -#### Chosen Solution: "exec++" +There has been much discussion about how this fits best into the Kubernetes API. +The consensus is for an imperative "debug this pod" action that's implemented +mostly in the kubelet. In order to avoid new dependencies in the kubelet, this +will be implemented in the Core API. Three possible implementations follow, and +additional implementations that were evaluated and dismissed are at the end of +this document. + +All of the proposed solutions implement the user-level concept of a _Debug +Container_ using the API-level concept of an _Ephemeral Container_. The API +doesn't prescribe how an Ephemeral Container is used. It could conceivably see +use other than Debug Containers, but we don't currently have other use cases. + +#### Chosen Solution: POST to an Existing Pod + +We're modifying an existing pod, so this fits as a subresource of the target +pod. We will create a new top-level object that contains a `v1.Container` and +`POST` to the subresource to create the Debug Container. Since this is a `POST`, +we cannot upgrade the connection to streaming if we want to continue supporting +web socket clients. + +Rather than using an existing subresource like `/exec` and conditionally stream +based on `PodExecOptions`, we will create a new subresource with consistent +streaming behavior. A new subresource has the added benefit of being able to +entirely hide interface entirely behind a feature flag by conditionally +registering the new subresource. + +A `v1.Container` by itself lacks type and object metadata, so we will create a +new type: + +``` +// EphemeralContainer describes a container to attach to a running pod for troubleshooting. 
+type EphemeralContainer struct { + metav1.TypeMeta `json:",inline"` + + // Spec describes the Ephemeral Container to be created. + Spec *Container `json:"spec,omitempty" protobuf:"bytes,2,opt,name=spec"` + + // Most recently observed status of the container. + // This data may not be up to date. + // Populated by the system. + // Read-only. + // +optional + Status *ContainerStatus `json:"status,omitempty" protobuf:"bytes,3,opt,name=status"` + + // If set, the name of the container from PodSpec that this ephemeral container targets. + // If not set then the ephemeral container is run in whatever namespaces are shared + // for the pod. + TargetContainerName string `json:"targetContainerName,omitempty" protobuf:"bytes,4,opt,name=targetContainerName"` +} +``` -We will extend `v1.Pod`'s `/exec` subresource to support "executing" container -images. The current `/exec` endpoint must implement `GET` to support streaming -for all clients. We don't want to encode a (potentially large) `v1.Container` as -an HTTP parameter, so we must extend `v1.PodExecOptions` with the specific -fields required for creating a Debug Container: +**Note that Ephemeral Containers are not regular containers and should not be +used to build services.** They lack guarantees for resources or execution, and +many of the fields of `v1.Container` will not be allowed for Debug Containers. A +request will be rejected if any field is set other than the following +whitelisted fields: `Name`, `Image`, `Command`, `Args`, `WorkingDir`, `Env`, +`EnvFrom`, `ImagePullPolicy`, `SecurityContext`. `TTY` and `Stdin` are always +enabled for Debug Containers and will be ignored. + +The new `/ephemeralcontainers` subresource allows the following: + +1. A `POST` of a `EphemeralContainer` to + `/api/v1/namespaces/$NS/pods/$POD_NAME/ephemeralcontainers` to create an + Ephemeral Container running in pod `$POD_NAME`. +1. 
Support for stopping an Ephemeral Container **could be supported in the + future** by a `DELETE` of + `/api/v1/namespaces/$NS/pods/$POD_NAME/ephemeralcontainers/$NAME`. + +Once created, it is the responsibility of the client to watch for the +`EphemeralContainer` to appear in the `PodStatus` and then attach to the console +of a debug container using the existing attach endpoint, +`/api/v1/namespaces/$NS/pods/$POD_NAME/attach`. Note that any output of the new +container between its creation and subsequent attach will not be replayed and +can only be viewed using `kubectl log`. + +#### Alternative 1: "exec++" + +A simpler change is to extend `v1.Pod`'s `/exec` subresource to support +"executing" container images. The current `/exec` endpoint must implement `GET` +to support streaming for all clients. We don't want to encode a (potentially +large) `v1.Container` into a query string, so we must extend `v1.PodExecOptions` +with the specific fields required for creating a Debug Container: ``` // PodExecOptions is the query options to a Pod's remote exec call @@ -130,53 +200,16 @@ type PodExecOptions struct { } ``` -After creating the Debug Container, the kubelet will upgrade the connection to -streaming and perform an attach to the container's console. If disconnected, the -Debug Container can be reattached using the pod's `/attach` endpoint with -`EphemeralContainerName`. +After creating the Ephemeral Container, the kubelet would upgrade the connection +to streaming and perform an attach to the container's console. If disconnected, +the Ephemeral Container could be reattached using the pod's `/attach` endpoint +with `EphemeralContainerName`. -Debug Containers cannot be removed via the API and instead the process must -terminate. While not ideal, this parallels existing behavior of `kubectl exec`. -To kill a Debug Container one would `attach` and exit the process interactively -or create a new Debug Container to send a signal with `kill(1)` to the original -process. 
- -#### Alternative 1: Debug Subresource - -Rather than extending an existing subresource, we could create a new, -non-streaming `debug` subresource. We would create a new API Object: - -``` -// DebugContainer describes a container to attach to a running pod for troubleshooting. -type DebugContainer struct { - metav1.TypeMeta - metav1.ObjectMeta - - // Name is the name of the Debug Container. Its presence will cause - // exec to create a Debug Container rather than performing a runtime exec. - Name string `json:"name,omitempty" ...` - - // Image is an optional container image name that will be used to for the Debug - // Container in the specified Pod with Command as ENTRYPOINT. If omitted a - // default image will be used. - Image string `json:"image,omitempty" ...` -} -``` - -The pod would gain a new `/debug` subresource that allows the following: - -1. A `POST` of a `PodDebugContainer` to - `/api/v1/namespaces/$NS/pods/$POD_NAME/debug/$NAME` to create Debug - Container named `$NAME` running in pod `$POD_NAME`. -1. A `DELETE` of `/api/v1/namespaces/$NS/pods/$POD_NAME/debug/$NAME` will stop - the Debug Container `$NAME` in pod `$POD_NAME`. - -Once created, a client would attach to the console of a debug container using -the existing attach endpoint, `/api/v1/namespaces/$NS/pods/$POD_NAME/attach`. - -However, this pattern does not resemble any other current usage of the API, so -we prefer to start with exec++ and reevaluate if we discover a compelling -reason. +Ephemeral Containers could not be removed via the API and instead the process +must terminate. While not ideal, this parallels existing behavior of `kubectl +exec`. To kill an Ephemeral Container one would `attach` and exit the process +interactively or create a new Ephemeral Container to send a signal with +`kill(1)` to the original process. 
 #### Alternative 2: Declarative Configuration @@ -192,29 +225,11 @@ type EphemeralContainer struct { metav1.TypeMeta metav1.ObjectMeta - Spec EphemeralContainerSpec + Spec v1.Container Status v1.ContainerStatus } ``` -`EphemeralContainerSpec` is similar to `v1.Container`, but contains only fields -relevant to Debug Containers: - -``` -type EphemeralContainerSpec struct { - // Target is the pod in which to run the EphemeralContainer - // Required. - Target v1.ObjectReference - - Name string - Image String - Command []string - Args []string - ImagePullPolicy PullPolicy - SecurityContext *SecurityContext -} -``` - A new controller in the kubelet would watch for EphemeralContainers and create/delete debug containers. `EphemeralContainer.Status` would be updated by the kubelet at the same time it updates `ContainerStatus` for regular and init @@ -222,66 +237,104 @@ containers. Clients would create a new `EphemeralContainer` object, wait for it to be started and then attach using the pod's attach subresource and the name of the `EphemeralContainer`. -Debugging is inherently imperative, however, rather than a state for Kubernetes -to enforce. Once a Debug Container is started it should not be automatically -restarted, for example. This solution imposes additionally complexity and -dependencies on the kubelet, but it's not yet clear if the complexity is -justified. +Debugging is inherently imperative, however, and not a desired state to +describe. Once a Debug Container is started it should not be automatically +restarted, for example. A declarative API adds new states for the kubelet to +enforce, and SIG Node strongly prefers to minimize kubelet complexity. -### Debug Container Status +### Ephemeral Container Status -The status of a Debug Container is reported in a new field in `v1.PodStatus`: +`EphemeralContainer` is included in a new field in `PodStatus`: ``` type PodStatus struct { ... 
- EphemeralContainerStatuses []v1.ContainerStatus + // List of user-initiated ephemeral containers that have been run in this pod. + // +optional + EphemeralContainers []EphemeralContainer `json:"commands,omitempty" protobuf:"bytes,11,rep,name=ephemeralContainers"` + } ``` -This status is only populated for Debug Containers, but there's interest in -tracking status for traditional exec in a similar manner. - -Note that `Command` and `Args` would have to be tracked in the status object -because there is no spec for Debug Containers or exec. These must either be made -available by the runtime or tracked by the kubelet. For Debug Containers this -could be stored as runtime labels, but the kubelet currently has no method of -storing state across restarts for exec. Solving this problem for exec is out of -scope for Debug Containers, but we will look for a solution as we implement this -feature. - -`EphemeralContainerStatuses` is populated by the kubelet in the same way as -regular and init container statuses. This is sent to the API server and -displayed by `kubectl describe pod`. +The kubelet should be able to construct a complete `PodStatus` with no prior +state using information stored in the container runtime. +`EphemeralContainer.Status` introduces no new data, but the kubelet must also +now populate `EphemeralContainer.Spec` & +`EphemeralContainer.TargetContainerName`. + +The kubelet already persists container metadata as CRI +[labels](https://github.com/kubernetes/kubernetes/blob/v1.10.0-alpha.0/pkg/kubelet/apis/cri/v1alpha1/runtime/api.proto#L606) +and +[annotations](https://github.com/kubernetes/kubernetes/blob/v1.10.0-alpha.0/pkg/kubelet/apis/cri/v1alpha1/runtime/api.proto#L613). +The entire v1.Container used in the request will be serialized and stored as a +runtime annotation. The value of `TargetContainerName` will be stored as a +runtime label. Persisting this data in the runtime means it will survive kubelet +restarts. 
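A minimal sketch of the persistence scheme described above, assuming JSON serialization: the container spec is stored as an annotation and the target container name as a label, and both are then recovered from that metadata alone. The `Container` struct is a trimmed-down, hypothetical stand-in for `v1.Container`, and the two key names are illustrative assumptions; the proposal does not specify them.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Container is a hypothetical, trimmed-down stand-in for v1.Container.
type Container struct {
	Name    string   `json:"name"`
	Image   string   `json:"image,omitempty"`
	Command []string `json:"command,omitempty"`
}

// Hypothetical metadata keys; the proposal does not name them.
const (
	specAnnotationKey = "example.io/ephemeral-container-spec"
	targetLabelKey    = "example.io/target-container-name"
)

// persist serializes the container spec into an annotation map and
// records the target container name in a label map.
func persist(c Container, target string) (map[string]string, map[string]string, error) {
	raw, err := json.Marshal(c)
	if err != nil {
		return nil, nil, err
	}
	annotations := map[string]string{specAnnotationKey: string(raw)}
	labels := map[string]string{targetLabelKey: target}
	return annotations, labels, nil
}

// restore rebuilds the spec and target name from runtime metadata only,
// which is what allows the data to survive a kubelet restart.
func restore(annotations, labels map[string]string) (Container, string, error) {
	var c Container
	raw, ok := annotations[specAnnotationKey]
	if !ok {
		return c, "", fmt.Errorf("annotation %q not found", specAnnotationKey)
	}
	if err := json.Unmarshal([]byte(raw), &c); err != nil {
		return c, "", err
	}
	return c, labels[targetLabelKey], nil
}

func main() {
	spec := Container{Name: "debugger", Image: "busybox", Command: []string{"sh"}}
	annotations, labels, err := persist(spec, "app")
	if err != nil {
		panic(err)
	}
	restored, target, err := restore(annotations, labels)
	if err != nil {
		panic(err)
	}
	fmt.Println(restored.Name, restored.Image, restored.Command, target)
}
```

Keeping the full spec in an annotation rather than a label is the split the text describes; a real implementation would also need to bound the serialized size.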
+ +At least for the Docker runtime, this is [an intended use of docker +labels](https://docs.docker.com/engine/userguide/labels-custom-metadata/#value-guidelines). +Docker does not document the maximum length of labels in its API. Empirically, +it supports up to the 64K constraint of the docker client's `bufio.Scanner` +size. Because the container spec may be examined in security sensitive contexts +like admission control, we will conservatively limit the size of the spec to 32K +and add a 32K minimum label length test to runtime qualification. + +`EphemeralContainer` is populated by the kubelet in the same way as regular +container statuses. This is sent to the API server and displayed by `kubectl +describe pod`. ### Creating Debug Containers -1. `kubectl` invokes the exec API as described in the preceding section. -1. The API server checks for name collisions with existing containers, performs - admission control and proxies the connection to the kubelet's - `/exec/$NS/$POD_NAME/$CONTAINER_NAME` endpoint. -1. The kubelet instructs the Runtime Manager to create a Debug Container. -1. The runtime manager uses the existing `startContainer()` method to create a - container in an existing pod. `startContainer()` has one modification for - Debug Containers: it creates a new runtime label (e.g. a docker label) that - identifies this container as a Debug Container. -1. After creating the container, the kubelet schedules an asynchronous update - of `PodStatus`. The update publishes the debug container status to the API - server at which point the Debug Container becomes visible via `kubectl - describe pod`. -1. The kubelet will upgrade the connection to streaming and attach to the - container's console. - -Rather than performing the implicit attach the kubelet could return success to -the client and require the client to perform an explicit attach, but the -implicit attach maintains consistent semantics across `/exec` rather than -varying behavior based on parameters. +1. 
 `kubectl` invokes the new API as described in the preceding section. +1. The API server checks for name collisions with existing running containers + (in both `PodSpec` and `PodStatus.EphemeralContainers`), performs + validation, admission control and proxies the connection to the kubelet's + `/ephemeralContainers/$NS/$POD_NAME` endpoint. + 1. Since a name collision could happen in the interval between container + creation and PodStatus being published to the API server, the kubelet + will perform an additional check for name collision. + 1. It is permissible to replace an exited container with one of the same + name. +1. The kubelet request handler opens an error channel and signals the pod's + sync worker with `UpdatePodOptions` that include the `EphemeralContainer` in + a new field and a callback in the existing + `UpdatePodOptions.OnCompleteFunc`. + 1. The pod sync worker runs the existing `syncPod()` with a new + `SyncPodType` of `SyncPodDebug`. + 1. The request handler blocks (with timeout) on receiving an error from + `syncPod()` via the callback. During this time, `syncPod()` starts the + ephemeral container, including fetching an image if necessary, and + publishes a new `PodStatus`. + 1. Timeout is configured by the cluster administrator and defaults to 2 + minutes. +1. `syncPod()` again checks for container name collision and starts an + ephemeral container via the new `kuberuntime.StartEphemeralContainer()`. + 1. `StartEphemeralContainer()` uses the existing `startContainer()` + method, which gains support for targeting the namespaces of a container + by name. + 1. `syncPod()` runs only from a dedicated pod worker, resolving any races + for container creation. + 1. After initial creation, future invocations of `syncPod()` will publish + its ContainerStatus but otherwise ignore the Ephemeral Container. It + will exist for the life of the pod sandbox or until it exits and is + garbage collected. In no event will it be restarted. +1.
 `syncPod()` then finishes a regular sync, publishing an updated PodStatus + (which includes the new `EphemeralContainer`) by its normal, existing means. + The pod worker sends its exit status to the request worker. +1. The request worker receives a (hopefully `nil`) `error` and returns it to + the client. + 1. `OnCompleteFunc` is not guaranteed to be called, so if the request + worker times out it will check the pod's `PodStatus` to see if the Debug + Container was started prior to returning an error. If this happens the + `EphemeralContainer.Spec` must be compared to verify it was the same one + as requested. +1. The client performs an attach to the debug container's console. The apiserver detects container name collisions with both containers in the pod spec and other running Debug Containers by checking -`EphemeralContainerStatuses`. In a race to create two Debug Containers with the -same name, the API server will pass both requests and the kubelet must return an -error to all but one request. +`PodStatus.EphemeralContainers`. In a race to create two Debug Containers with +the same name, the API server will pass both requests and the kubelet will +reject one in the synchronized pod worker. There are no limits on the number of Debug Containers that can be created in a pod, but exceeding a pod's resource allocation may cause the pod to be evicted. @@ -336,13 +389,13 @@ It's worth noting some things that do not change: Debug Containers have no additional privileges above what is available to any `v1.Container`. It's the equivalent of configuring a shell container in a pod -spec but created on demand. +spec except that it is created on demand. -Admission plugins that guard `/exec` must be updated for the new parameters. In +Admission plugins must be updated to guard `/ephemeralcontainers`. In particular, they should enforce the same container image policy on the `Image` parameter as is enforced for regular containers. 
During the alpha phase we will additionally support a container image whitelist as a kubelet flag to allow -cluster administrators to easily constraint debug container images. +cluster administrators to easily constrain debug container images. ### Additional Consideration @@ -356,15 +409,11 @@ cluster administrators to easily constraint debug container images. and exists because Kubernetes has no mechanism to attach a container prior to starting it. This larger issue will not be addressed by Debug Containers, but Debug Containers would benefit from future improvements or work arounds. -1. We do not want to describe Debug Containers using `v1.Container`. This is to - reinforce that Debug Containers are not general purpose containers by - limiting their configurability. Debug Containers should not be used to build - services. -1. Debug Containers are of limited usefulness without a shared PID namespace. - If a pod is configured with isolated PID namespaces, the Debug Container +1. Debug Containers should not be used to build services, which we've attempted + to reflect in the API. +1. If a pod is configured with isolated PID namespaces, the Debug Container will join the PID namespace of the target container. Debug Containers will - not be available with runtimes that do not implement PID namespace sharing - in some form. + not be available with runtimes that do not implement PID namespace sharing. ## Implementation Plan @@ -372,22 +421,32 @@ cluster administrators to easily constraint debug container images. 
#### Goals and Non-Goals for Alpha Release -We're targeting an alpha release in Kubernetes 1.9 that includes the following +We're targeting an alpha release in Kubernetes 1.10 that includes the following basic functionality: * Support in the kubelet for creating debug containers in a running pod -* A `kubectl debug` command to initiate a debug container +* A `kubectl alpha debug` command to initiate a debug container * `kubectl describe pod` will list status of debug containers running in a pod Functionality will be hidden behind an alpha feature flag and disabled by -default. The following are explicitly out of scope for the 1.9 alpha release: +default. The following are explicitly out of scope for the alpha release, but +must be resolved prior to beta release: -* Exited Debug Containers will be garbage collected as regular containers and - may disappear from the list of Debug Container Statuses. -* Security Context for the Debug Container is not configurable. It will always - be run with `CAP_SYS_PTRACE` and `CAP_SYS_ADMIN`. -* Image pull policy for the Debug Container is not configurable. It will - always be run with `PullAlways`. +* There's no guarantee that exited Debug Containers won't be garbage collected + as regular containers, so they may disappear from the list of + `EphemeralContainers`. +* We could improve reliability of `UpdatePodOptions.OnCompleteFunc` by + prioritizing based on `SyncPodType`. + +#### Kubernetes API Changes + +The following changes must be implemented in the API: + +1. `v1.EphemeralContainer` will be added and `v1.PodStatus` will be extended as + described above. +1. The new subresource will be added to the pods API, validation added and + proxied to the kubelet. +1. The API server must check for Ephemeral Containers when validating `attach`. #### kubelet Implementation @@ -396,71 +455,30 @@ Performing this operation with a legacy (non-CRI) runtime will result in a not implemented error. 
Implementation in the kubelet will be split into the following steps: -##### Step 1: Container Type - -The first step is to add a feature gate to ensure all changes are off by -default. This will be added in the `pkg/features` `DefaultFeatureGate`. - -The runtime manager stores metadata about containers in the runtime via labels -(e.g. docker labels). These labels are used to populate the fields of -`kubecontainer.ContainerStatus`. Since the runtime manager needs to handle Debug -Containers differently in a few situations, we must add a new piece of metadata -to distinguish Debug Containers from regular containers. - -`startContainer()` will be updated to write a new label -`io.kubernetes.container.type` to the runtime. Existing containers will be -started with a type of `REGULAR` or `INIT`. When added in a subsequent step, -Debug Containers will start with the type `EPHEMERAL`. - -##### Step 2: Creation and Handling of Debug Containers - -This step adds methods for creating debug containers, but doesn't yet modify the -kubelet API. Since the runtime manager discards runtime (e.g. docker) labels -after populating `kubecontainer.ContainerStatus`, the label value will be stored -in a the new field `ContainerStatus.Type` so it can be used by `SyncPod()`. - -The kubelet gains a `RunDebugContainer()` method which accepts a `v1.Container` -and passes it on to the Runtime Manager's `RunDebugContainer()` if implemented. -Currently only the Generic Runtime Manager (i.e. the CRI) implements the -`DebugContainerRunner` interface. - -The Generic Runtime Manager's `RunDebugContainer()` calls `startContainer()` to -create the Debug Container. Additionally, `SyncPod()` is modified to skip Debug -Containers unless the sandbox is restarted. - -##### Step 3: kubelet API changes - -The kubelet exposes the new functionality in its existing `/exec/` endpoint. -`ServeExec()` constructs a `v1.Container` based on `PodExecOptions`, calls -`RunDebugContainer()`, and performs the attach. 
 - -##### Step 4: Reporting EphemeralContainerStatus - -The last major change to the kubelet is to populate -v1.`PodStatus.EphemeralContainerStatuses` based on the -`kubecontainer.ContainerStatus` for the Debug Container. - -#### Kubernetes API Changes - -There are two changes to be made to the Kubernetes, which will be made -independently: - -1. `v1.PodExecOptions` must be extended with new fields. -1. `v1.PodStatus` gains a new field to hold Debug Container statuses. - -In all cases, new fields will be prepended with `Alpha` for the duration of this -feature's alpha status. +1. New container metadata `ContainerType`, `ContainerSpec` & + `TargetContainerName` is stored using CRI labels and annotations. + `kubecontainer.ContainerStatus` will be extended with a `ContainerType` + field (possible values: `REGULAR`, `INIT` & `EPHEMERAL`) so a container can + be identified as a debug container. +1. `kuberuntimemanager` gains a new `StartEphemeralContainer()` which calls the + existing `startContainer()`. +1. The kubelet gains a `RunDebugContainer()` method which accepts a + `v1.EphemeralContainer` and triggers a pod sync to create the debug + container. The existing `generateAPIPodStatus()` will be updated to also + populate `EphemeralContainers`. +1. The kubelet API gains the new `/ephemeralContainers/` endpoint to create the + Debug Container. #### kubectl changes In anticipation of this change, [#46151](https://pr.k8s.io/46151) added a `kubectl alpha` command to contain alpha features. We will add `kubectl alpha debug` to invoke Debug Containers. `kubectl` does not use feature gates, so -`kubectl alpha debug` will be visible by default in `kubectl` 1.9 and return an +`kubectl alpha debug` will be visible by default in `kubectl` 1.10 and return an error when used on a cluster with the feature disabled. -`kubectl describe pod` will report the contents of `EphemeralContainerStatuses` -when not empty as it means the feature is enabled. 
The field will be hidden when +`kubectl describe pod` will report the contents of `EphemeralContainers` when +not empty as it means the feature is enabled. The field will be hidden when empty. ## Appendices @@ -729,8 +747,7 @@ coupling it with container images. * [Pod Troubleshooting Tracking Issue](https://issues.k8s.io/27140) * [CRI Tracking Issue](https://issues.k8s.io/28789) * [CRI: expose optional runtime features](https://issues.k8s.io/32803) -* [Resource QoS in - Kubernetes](resource-qos.md) +* [Resource QoS in Kubernetes](resource-qos.md) * Related Features * [#1615](https://issues.k8s.io/1615) - Shared PID Namespace across containers in a pod From 946532db82be7a68847df161b2535909919a2763 Mon Sep 17 00:00:00 2001 From: Lee Verberne Date: Wed, 21 Mar 2018 15:54:52 +0100 Subject: [PATCH 2/4] Update EphemeralContainers in apiserver not kubelet --- .../node/troubleshoot-running-pods.md | 224 +++++++----------- 1 file changed, 85 insertions(+), 139 deletions(-) diff --git a/contributors/design-proposals/node/troubleshoot-running-pods.md b/contributors/design-proposals/node/troubleshoot-running-pods.md index 2707a2b4b39..df6dc97ca59 100644 --- a/contributors/design-proposals/node/troubleshoot-running-pods.md +++ b/contributors/design-proposals/node/troubleshoot-running-pods.md @@ -1,6 +1,6 @@ # Troubleshoot Running Pods -* Status: Pending +* Status: Pending Implementation * Version: Alpha * Implementation Owner: @verb @@ -102,33 +102,22 @@ subsequently be used to reattach and is reported by `kubectl describe`. ### Kubernetes API Changes There has been much discussion about how this fits best into the Kubernetes API. -The consensus is for an imperative "debug this pod" action that's implemented -mostly in the kubelet. In order to avoid new dependencies in the kubelet, this -will be implemented in the Core API. Three possible implementations follow, and -additional implementations that were evaluated and dismissed are at the end of -this document. 
+The consensus is for an imperative "debug this pod" action whereby Kubernetes +creates a new, temporary container in a pod on command. In order to avoid new +dependencies in the kubelet, this will be implemented in the Core API. Three +possible implementations follow, and additional implementations that were +evaluated and dismissed are at the end of this document. All of the proposed solutions implement the user-level concept of a _Debug Container_ using the API-level concept of an _Ephemeral Container_. The API doesn't prescribe how an Ephemeral Container is used. It could conceivably see use other than Debug Containers, but we don't currently have other use cases. -#### Chosen Solution: POST to an Existing Pod +#### Chosen Solution: Subresource to Update PodStatus -We're modifying an existing pod, so this fits as a subresource of the target -pod. We will create a new top-level object that contains a `v1.Container` and -`POST` to the subresource to create the Debug Container. Since this is a `POST`, -we cannot upgrade the connection to streaming if we want to continue supporting -web socket clients. - -Rather than using an existing subresource like `/exec` and conditionally stream -based on `PodExecOptions`, we will create a new subresource with consistent -streaming behavior. A new subresource has the added benefit of being able to -entirely hide interface entirely behind a feature flag by conditionally -registering the new subresource. - -A `v1.Container` by itself lacks type and object metadata, so we will create a -new type: +An Ephemeral Container is not part of the pod specification as it's not part of +the intended state of the pod, but we describe it using the same primitives as +`PodSpec`. An `EphemeralContainer` contains a Spec, a Status and a Target: ``` // EphemeralContainer describes a container to attach to a running pod for troubleshooting. 
@@ -152,26 +141,37 @@ type EphemeralContainer struct { } ``` +All Ephemeral Containers that have been created in a pod are listed in the pod's +status: + +``` +type PodStatus struct { + ... + // List of user-initiated ephemeral containers that have been run in this pod. + // +optional + EphemeralContainers []EphemeralContainer `json:"commands,omitempty" protobuf:"bytes,11,rep,name=ephemeralContainers"` + +} +``` + +To create a new Ephemeral Container, one appends a new `EphemeralContainer` with +the desired `v1.Container` as `Spec` in `Pod.Status` and updates the `Pod` in +the API. This is accomplished via a new subresource, `/ephemeralcontainers`, +which enforces the append-only semantics and authorization. This is similar to +the `/status` subresource used by the kubelet to modify pod status. + **Note that Ephemeral Containers are not regular containers and should not be used to build services.** They lack guarantees for resources or execution, and many of the fields of `v1.Container` will not be allowed for Debug Containers. A -request will be rejected if any field is set other than the following +pod update will fail validation if any field is set other than the following whitelisted fields: `Name`, `Image`, `Command`, `Args`, `WorkingDir`, `Env`, `EnvFrom`, `ImagePullPolicy`, `SecurityContext`. `TTY` and `Stdin` are always enabled for Debug Containers and will be ignored. -The new `/ephemeralcontainers` subresource allows the following: - -1. A `POST` of a `EphemeralContainer` to - `/api/v1/namespaces/$NS/pods/$POD_NAME/ephemeralcontainers` to create an - Ephemeral Container running in pod `$POD_NAME`. -1. Support for stopping an Ephemeral Container **could be supported in the - future** by a `DELETE` of - `/api/v1/namespaces/$NS/pods/$POD_NAME/ephemeralcontainers/$NAME`. 
- -Once created, it is the responsibility of the client to watch for the -`EphemeralContainer` to appear in the `PodStatus` and then attach to the console -of a debug container using the existing attach endpoint, +Once the pod object is updated, the kubelet worker watching this pod will launch +the Ephemeral Container and update its status. The client is expected to watch +for the creation of the container status and then attach to the console of a +debug container using the existing attach endpoint, `/api/v1/namespaces/$NS/pods/$POD_NAME/attach`. Note that any output of the new container between its creation and subsequent attach will not be replayed and can only be viewed using `kubectl log`. @@ -211,7 +211,7 @@ exec`. To kill an Ephemeral Container one would `attach` and exit the process interactively or create a new Ephemeral Container to send a signal with `kill(1)` to the original process. -#### Alternative 2: Declarative Configuration +#### Alternative 2: Ephemeral Container Controller Using subresources is an imperative style API where the client instructs the kubelet to perform an action, but in general Kubernetes prefers declarative APIs @@ -244,31 +244,16 @@ enforce, and SIG Node strongly prefers to minimize kubelet complexity. ### Ephemeral Container Status -`EphemeralContainer` is included in a new field in `PodStatus`: - -``` -type PodStatus struct { - ... - // List of user-initiated ephemeral containers that have been run in this pod. - // +optional - EphemeralContainers []EphemeralContainer `json:"commands,omitempty" protobuf:"bytes,11,rep,name=ephemeralContainers"` - -} -``` - -The kubelet should be able to construct a complete `PodStatus` with no prior -state using information stored in the container runtime. -`EphemeralContainer.Status` introduces no new data, but the kubelet must also -now populate `EphemeralContainer.Spec` & -`EphemeralContainer.TargetContainerName`. 
- -The kubelet already persists container metadata as CRI +We wish for the kubelet to be able to construct `PodStatus` without relying on +prior state, so we will store `EphemeralContainer.Spec` & +`EphemeralContainer.TargetContainerName` as runtime metadata. The kubelet +currently persists container metadata as CRI [labels](https://github.com/kubernetes/kubernetes/blob/v1.10.0-alpha.0/pkg/kubelet/apis/cri/v1alpha1/runtime/api.proto#L606) and [annotations](https://github.com/kubernetes/kubernetes/blob/v1.10.0-alpha.0/pkg/kubelet/apis/cri/v1alpha1/runtime/api.proto#L613). The entire v1.Container used in the request will be serialized and stored as a runtime annotation. The value of `TargetContainerName` will be stored as a -runtime label. Persisting this data in the runtime means it will survive kubelet +runtime label. Persisting this data in the runtime means it survives kubelet restarts. At least for the Docker runtime, this is [an intended use of docker @@ -279,72 +264,45 @@ size. Because the container spec may be examined in security sensitive contexts like admission control, we will conservatively limit the size of the spec to 32K and add a 32K minimum label length test to runtime qualification. -`EphemeralContainer` is populated by the kubelet in the same way as regular -container statuses. This is sent to the API server and displayed by `kubectl -describe pod`. +`EphemeralContainer.Status` is populated by the kubelet in the same way as +regular container statuses. This is sent to the API server and displayed by +`kubectl describe pod`. ### Creating Debug Containers -1. `kubectl` invokes the new API as described in the preceding section. -1. The API server checks for name collisions with existing running containers - (in both `PodSpec` and `PodStatus.EphemeralContainers`), performs - validation, admission control and proxies the connection to the kubelet's - `/ephemeralContainers/$NS/$POD_NAME` endpoint. - 1. 
Since a name collision could happen in the interval between container - creation and PodStatus being published to the API server, the kubelet - will perform an additional check for name collision. - 1. It is permissible to replace an exited container with one of the same - name. -1. The kubelet request handler opens an error channel and signals the pod's - sync worker with `UpdatePodOptions` that include the `EphemeralContainer` in - a new field and a callback in the existing - `UpdatePodOptions.OnCompleteFunc`. - 1. The pod sync worker runs the existing `syncPod()` with a new - `SyncPodType` of `SyncPodDebug`. - 1. The request handler blocks (with timeout) on receiving an error from - `syncPod()` via the callback. During this time, `syncPod()` starts the - ephemeral container, including fetching an image if necessary, and - publishes a new `PodStatus`. - 1. Timeout is configured by the cluster administrator and defaults to 2 - minutes. -1. `syncPod()` again checks for container name collision and starts an - ephemeral container via the new `kuberuntime.StartEphemeralContainer()`. - 1. `StartEphemeralContainer()` call uses the existing `startContainer()` - method, which gains support for targeting the namespaces of a container - by name. - 1. `syncPod()` runs only from a dedicated pod worker, resolving any races - for container creation. +1. `kubectl` constructs `EphemeralContainer.Spec` and + `EphemeralContainer.TargetContainerName` based on command line arguments. It + `PUT`s the modified pod to the pod's `/ephemeralcontainers`. +1. The apiserver discards changes to fields other than + `Pod.Status.EphemeralContainers` and validates the pod update. + 1. Update validation fails if existing Ephemeral Containers are removed or + changed, or if the new Ephemeral Container has a non-empty status. + 1. 
Pod validation fails if container spec contains fields disallowed for + Ephemeral Containers, has the same name as a container in the spec, or + has the same name as another running Ephemeral Container. (see below) + 1. API resource versioning resolves update races. +1. The kubelet's pod watcher notices the update and triggers a `syncPod()`. + During the sync, the kubelet calls `kuberuntime.StartEphemeralContainer()` + for any Ephemeral Container with an empty status. + 1. `StartEphemeralContainer()` uses the existing `startContainer()` method, + which gains support for targeting the namespaces of a container by name. 1. After initial creation, future invocations of `syncPod()` will publish its ContainerStatus but otherwise ignore the Ephemeral Container. It will exist for the life of the pod sandbox or it exits and is garbage collected. In no event will it be restarted. -1. `syncPod()` then finishes a regular sync, publishing an updated PodStatus - (which includes the new `EphemeralContainer`) by its normal, existing means. - The pod worker sends its exit status to the request worker. -1. The request worker receives a (hopefully `nil`) `error` and returns it to - the client. - 1. `OnCompleteFunc` is not guaranteed to be called, so if the request - worker times out it will check the pod's `PodStatus` to see if the Debug - Container was started prior to returning an error. If this happens the - `EphemeralContainer.Spec` must be compared to verify it was the same one - as requested. +1. `syncPod()` finishes a regular sync, publishing an updated PodStatus (which + includes the new `EphemeralContainer`) by its normal, existing means. 1. The client performs an attach to the debug container's console. -The apiserver detects container name collisions with both containers in the pod -spec and other running Debug Containers by checking -`PodStatus.EphemeralContainers`. 
In a race to create two Debug Containers with -the same name, the API server will pass both requests and the kubelet will -reject one in the synchronized pod worker. - There are no limits on the number of Debug Containers that can be created in a pod, but exceeding a pod's resource allocation may cause the pod to be evicted. ### Restarting and Reattaching Debug Containers Debug Containers will never be restarted automatically. It is possible to -replace a Debug Container that has exited by re-using a Debug Container name. It -is an error to attempt to replace a Debug Container that is still running, which -is detected by both the API server and the kubelet. +"restart" a Debug Container by by re-using the name of a Debug Container that +has exited. It is an error to re-use the name of a Debug Container that is still +running, which is detected by API server validation. One can reattach to a Debug Container using `kubectl attach`. When supported by a runtime, multiple clients can attach to a single debug container and share the @@ -352,7 +310,7 @@ terminal. This is supported by Docker. ### Killing Debug Containers -Debug containers will not be killed automatically until the pod (specifically, +Debug containers will not be killed automatically unless the pod (specifically, the pod sandbox) is destroyed. Debug Containers will stop when their command exits, such as exiting a shell. Unlike `kubectl exec`, processes in Debug Containers will not receive an EOF if their connection is interrupted. @@ -367,23 +325,16 @@ are necessary in the kubelet: the pod spec. 1. As an exception to the above, `SyncPod()` will kill Debug Containers when the pod sandbox changes since a lone Debug Container in an abandoned sandbox - is not useful. Debug Containers are not automatically started in the new + is not useful. Debug Containers are not started automatically in the new sandbox. 1. 
`convertStatusToAPIStatus()` must sort Debug Containers status into - `EphemeralContainerStatuses` similar to as it does for + `EphemeralContainer.Status` similar to how it does for `InitContainerStatuses` -1. The kubelet must preserve `ContainerStatus` on debug containers for - reporting. 1. Debug Containers must be excluded from calculation of pod phase and condition -It's worth noting some things that do not change: - -1. `KillPod()` already operates on all running containers returned by the - runtime. -1. Containers created prior to this feature being enabled will have a - `containerType` of `""`. Since this does not match `"EPHEMERAL"` the special - handling of Debug Containers is backwards compatible. +`KillPod()` already operates on all running containers returned by the runtime +and requires no changes. ### Security Considerations @@ -392,10 +343,8 @@ Debug Containers have no additional privileges above what is available to any spec except that it is created on demand. Admission plugins must be updated to guard `/ephemeralcontainers`. In -particular, they should enforce the same container image policy on the `Image` -parameter as is enforced for regular containers. During the alpha phase we will -additionally support a container image whitelist as a kubelet flag to allow -cluster administrators to easily constrain debug container images. 
#### Goals and Non-Goals for Alpha Release -We're targeting an alpha release in Kubernetes 1.10 that includes the following +We're targeting an alpha release in Kubernetes 1.11 that includes the following basic functionality: * Support in the kubelet for creating debug containers in a running pod @@ -429,14 +378,15 @@ basic functionality: * `kubectl describe pod` will list status of debug containers running in a pod Functionality will be hidden behind an alpha feature flag and disabled by -default. The following are explicitly out of scope for the alpha release, but -must be resolved prior to beta release: +default. -* There's no guarantee that exited Debug Containers won't be garbage collected - as regular containers, so they may disappear from the list of - `EphemeralContainers`. -* We could improve reliability of `UpdatePodOptions.OnCompleteFunc` by - prioritizing based on `SyncPodType`. +Since the kubelet stores Debug Container metadata as runtime labels, it's lost +when Debug Containers are garbage collected. For the alpha release we will rely +on the apiserver to store the `EphemeralContainer` for garbage collected +containers. The kubelet will preserve any `EphemeralContainer` it doesn't +recognize when updating status. In the event that a `PodStatus` is lost and we +need to regenerate it from scratch, `EphemeralContainers` will only contain +Debug Containers that have not been garbage collected. #### Kubernetes API Changes @@ -444,8 +394,7 @@ The following changes must be implemented in the API: 1. `v1.EphemeralContainer` will be added and `v1.PodStatus` will be extended as described above. -1. The new subresource will be added to the pods API, validation added and - proxied to the kubelet. +1. The new subresource will be added to the pods API. 1. The API server must check for Ephemeral Containers when validating `attach`. #### kubelet Implementation @@ -462,19 +411,16 @@ following steps: be identified as a debug container. 1. 
`kuberuntimemanager` gains a new `StartEphemeralContainer()` which calls the existing `startContainer()`. -1. The kubelet gains a `RunDebugContainer()` method which accepts a - `v1.EphemeralContainer` and triggers a pod sync to create the debug - container. The existing `generateAPIPodStatus()` will be update to also - populate `EphemeralContainers`. -1. The kubelet API gains the new `/ephemeralContainers/` endpoint to create the - Debug Container. +1. `syncPod()` will call `StartEphemeralContainer()` to start the Debug + Container. The existing `generateAPIPodStatus()` will be updated to also + populate `EphemeralContainers.Status`. #### kubectl changes In anticipation of this change, [#46151](https://pr.k8s.io/46151) added a `kubectl alpha` command to contain alpha features. We will add `kubectl alpha debug` to invoke Debug Containers. `kubectl` does not use feature gates, so -`kubectl alpha debug` will be visible by default in `kubectl` 1.10 and return an +`kubectl alpha debug` will be visible by default in `kubectl` 1.11 and return an error when used on a cluster with the feature disabled. 
`kubectl describe pod` will report the contents of `EphemeralContainers` when From 660f409cdd9979782455984f9df2c14b76cf1985 Mon Sep 17 00:00:00 2001 From: Lee Verberne Date: Mon, 23 Apr 2018 21:51:48 +0200 Subject: [PATCH 3/4] Pod Troubleshooting: remove the requirement for an in-PodStatus log of all Debug Containers --- .../node/troubleshoot-running-pods.md | 124 +++++++++--------- 1 file changed, 59 insertions(+), 65 deletions(-) diff --git a/contributors/design-proposals/node/troubleshoot-running-pods.md b/contributors/design-proposals/node/troubleshoot-running-pods.md index df6dc97ca59..cb86c35b8a0 100644 --- a/contributors/design-proposals/node/troubleshoot-running-pods.md +++ b/contributors/design-proposals/node/troubleshoot-running-pods.md @@ -1,6 +1,6 @@ # Troubleshoot Running Pods -* Status: Pending Implementation +* Status: Implementing * Version: Alpha * Implementation Owner: @verb @@ -45,7 +45,7 @@ A solution to troubleshoot arbitrary container images MUST: * fetch troubleshooting utilities at debug time rather than at the time of pod creation * be compatible with admission controllers and audit logging -* allow discovery of debugging status +* allow discovery of current debugging status * support arbitrary runtimes via the CRI (possibly with reduced feature set) * require no administrative access to the node * have an excellent user experience (i.e. should be a feature of the platform @@ -62,10 +62,9 @@ Whereas `kubectl exec` runs a _process_ in a _container_, `kubectl debug` will be similar but run a _container_ in a _pod_. A container created by `kubectl debug` is a _Debug Container_. Just like a -process run by `kubectl exec`, a Debug Container is not part of the pod spec and -has no resource stored in the API. Unlike `kubectl exec`, a Debug Container -_does_ have status that is reported in `v1.PodStatus` and displayed by `kubectl -describe pod`. +process run by `kubectl exec`, a Debug Container is not part of the pod spec. 
+Unlike `kubectl exec`, a Debug Container _does_ have status that is reported in +`v1.PodStatus` and displayed by `kubectl describe pod`. For example, the following command would attach to a newly created container in a pod: @@ -102,11 +101,11 @@ subsequently be used to reattach and is reported by `kubectl describe`. ### Kubernetes API Changes There has been much discussion about how this fits best into the Kubernetes API. -The consensus is for an imperative "debug this pod" action whereby Kubernetes -creates a new, temporary container in a pod on command. In order to avoid new -dependencies in the kubelet, this will be implemented in the Core API. Three -possible implementations follow, and additional implementations that were -evaluated and dismissed are at the end of this document. +The consensus is for an imperative "debug this pod" action whereby the kubelet +creates a new, temporary container in a pod on command. SIG Node would like to +avoid new dependencies in the kubelet, so this will be implemented in the Core +API. Three possible implementations follow, and additional implementations that +were evaluated and dismissed are at the end of this document. All of the proposed solutions implement the user-level concept of a _Debug Container_ using the API-level concept of an _Ephemeral Container_. The API @@ -116,8 +115,8 @@ use other than Debug Containers, but we don't currently have other use cases. #### Chosen Solution: Subresource to Update PodStatus An Ephemeral Container is not part of the pod specification as it's not part of -the intended state of the pod, but we describe it using the same primitives as -`PodSpec`. An `EphemeralContainer` contains a Spec, a Status and a Target: +the declared state of the pod, but we describe it using the same primitives as +in `PodSpec`. An `EphemeralContainer` contains a Spec, a Status and a Target: ``` // EphemeralContainer describes a container to attach to a running pod for troubleshooting. 
@@ -141,8 +140,7 @@ type EphemeralContainer struct { } ``` -All Ephemeral Containers that have been created in a pod are listed in the pod's -status: +Ephemeral Containers for a pod are listed in the pod's status: ``` type PodStatus struct { @@ -156,25 +154,29 @@ type PodStatus struct { To create a new Ephemeral Container, one appends a new `EphemeralContainer` with the desired `v1.Container` as `Spec` in `Pod.Status` and updates the `Pod` in -the API. This is accomplished via a new subresource, `/ephemeralcontainers`, -which enforces the append-only semantics and authorization. This is similar to -the `/status` subresource used by the kubelet to modify pod status. +the API. Users cannot normally modify the pod status, so we'll create a new +subresource `/ephemeralcontainers` that allows an update of solely +`EphemeralContainers` and enforces append-only semantics. **Note that Ephemeral Containers are not regular containers and should not be -used to build services.** They lack guarantees for resources or execution, and -many of the fields of `v1.Container` will not be allowed for Debug Containers. A -pod update will fail validation if any field is set other than the following -whitelisted fields: `Name`, `Image`, `Command`, `Args`, `WorkingDir`, `Env`, -`EnvFrom`, `ImagePullPolicy`, `SecurityContext`. `TTY` and `Stdin` are always -enabled for Debug Containers and will be ignored. - -Once the pod object is updated, the kubelet worker watching this pod will launch -the Ephemeral Container and update its status. The client is expected to watch -for the creation of the container status and then attach to the console of a -debug container using the existing attach endpoint, -`/api/v1/namespaces/$NS/pods/$POD_NAME/attach`. Note that any output of the new -container between its creation and subsequent attach will not be replayed and -can only be viewed using `kubectl log`. 
+used to build services.** They lack guarantees for resources or execution, they +will never be automatically restarted, and many of the fields of `v1.Container` +will not be allowed for Debug Containers. In particular, the following fields +are explicitly disallowed by API validation: `resources`, `ports`, +`livenessProbe`, `readinessProbe`, and `lifecycle`. + +The subresources `attach`, `exec`, `log`, and `portforward` are available for +Ephemeral Containers and will be forwarded by the apiserver. This means `kubectl +attach`, `kubectl exec`, `kubectl log`, and `kubectl port-forward` will work for +Ephemeral Containers. + +Once the pod is updated, the kubelet worker watching this pod will launch the +Ephemeral Container and update its status. The client is expected to watch for +the creation of the container status and then attach to the console of a debug +container using the existing attach endpoint, +`/api/v1/namespaces/$NS/pods/$POD_NAME/attach`. Note that output of the new +container occurring between its creation and attach will not be replayed, but it +can be viewed using `kubectl log`. #### Alternative 1: "exec++" @@ -244,14 +246,14 @@ enforce, and SIG Node strongly prefers to minimize kubelet complexity. ### Ephemeral Container Status -We wish for the kubelet to be able to construct `PodStatus` without relying on -prior state, so we will store `EphemeralContainer.Spec` & -`EphemeralContainer.TargetContainerName` as runtime metadata. The kubelet -currently persists container metadata as CRI +The kubelet should be able to construct `PodStatus` without relying on prior +state, so we will store the Ephemeral Container's `Spec` and +`TargetContainerName` as runtime metadata. 
The kubelet persists container +metadata as CRI [labels](https://github.com/kubernetes/kubernetes/blob/v1.10.0-alpha.0/pkg/kubelet/apis/cri/v1alpha1/runtime/api.proto#L606) and [annotations](https://github.com/kubernetes/kubernetes/blob/v1.10.0-alpha.0/pkg/kubelet/apis/cri/v1alpha1/runtime/api.proto#L613). -The entire v1.Container used in the request will be serialized and stored as a +The entire `v1.Container` used in the request will be serialized and stored as a runtime annotation. The value of `TargetContainerName` will be stored as a runtime label. Persisting this data in the runtime means it survives kubelet restarts. @@ -260,36 +262,36 @@ At least for the Docker runtime, this is [an intended use of docker labels](https://docs.docker.com/engine/userguide/labels-custom-metadata/#value-guidelines). Docker does not document the maximum length of labels in its API. Empirically, it supports up to the 64K constraint of the docker client's `bufio.Scanner` -size. Because the container spec may be examined in security sensitive contexts -like admission control, we will conservatively limit the size of the spec to 32K -and add a 32K minimum label length test to runtime qualification. +size. We will conservatively limit the size of the spec to 32K and add a 32K +minimum label length test to runtime qualification. `EphemeralContainer.Status` is populated by the kubelet in the same way as -regular container statuses. This is sent to the API server and displayed by -`kubectl describe pod`. +regular container statuses. The kubelet then updates the pod's status in the API +server using the pod's `/status` endpoint, which imposes no restrictions on +updates to `ephemeralContainers`. ### Creating Debug Containers -1. `kubectl` constructs `EphemeralContainer.Spec` and - `EphemeralContainer.TargetContainerName` based on command line arguments. It - `PUT`s the modified pod to the pod's `/ephemeralcontainers`. -1. The apiserver discards changes to fields other than +1. 
`kubectl` constructs an `EphemeralContainer` based on command line + arguments and appends it to `Pod.Status.EphemeralContainers`. It `PUT`s the + modified pod to the pod's `/ephemeralcontainers`. +1. The apiserver discards changes other than additions to `Pod.Status.EphemeralContainers` and validates the pod update. - 1. Update validation fails if existing Ephemeral Containers are removed or - changed, or if the new Ephemeral Container has a non-empty status. + 1. Update discards `EphemeralContainer.Status` for new Ephemeral + Containers. 1. Pod validation fails if container spec contains fields disallowed for - Ephemeral Containers, has the same name as a container in the spec, or - has the same name as another running Ephemeral Container. (see below) + Ephemeral Containers or has the same name as a container in the spec or + `EphemeralContainers`. 1. API resource versioning resolves update races. 1. The kubelet's pod watcher notices the update and triggers a `syncPod()`. During the sync, the kubelet calls `kuberuntime.StartEphemeralContainer()` - for any Ephemeral Container with an empty status. + for any new Ephemeral Container. 1. `StartEphemeralContainer()` uses the existing `startContainer()` method, which gains support for targeting the namespaces of a container by name. 1. After initial creation, future invocations of `syncPod()` will publish its ContainerStatus but otherwise ignore the Ephemeral Container. It - will exist for the life of the pod sandbox or it exits and is garbage - collected. In no event will it be restarted. + will exist for the life of the pod sandbox or until it exits. In no event will + it be restarted. 1. `syncPod()` finishes a regular sync, publishing an updated PodStatus (which includes the new `EphemeralContainer`) by its normal, existing means. 1. The client performs an attach to the debug container's console. @@ -299,10 +301,10 @@ pod, but exceeding a pod's resource allocation may cause the pod to be evicted. 
### Restarting and Reattaching Debug Containers -Debug Containers will never be restarted automatically. It is possible to -"restart" a Debug Container by by re-using the name of a Debug Container that -has exited. It is an error to re-use the name of a Debug Container that is still -running, which is detected by API server validation. +Debug Containers will not be restarted. + +We want to be more user friendly by allowing re-use of the name of an exited +debug container, but this will be left for a future improvement. One can reattach to a Debug Container using `kubectl attach`. When supported by a runtime, multiple clients can attach to a single debug container and share the @@ -380,14 +382,6 @@ basic functionality: Functionality will be hidden behind an alpha feature flag and disabled by default. -Since the kubelet stores Debug Container metadata as runtime labels, it's lost -when Debug Containers are garbage collected. For the alpha release we will rely -on the apiserver to store the `EphemeralContainer` for garbage collected -containers. The kubelet will preserve any `EphemeralContainer` it doesn't -recognize when updating status. In the event that a `PodStatus` is lost and we -need to regenerate it from scratch, `EphemeralContainers` will only contain -Debug Containers that have not been garbage collected. - #### Kubernetes API Changes The following changes must be implemented in the API: From 502b75727129e0aba965f3955ff3db6526a5593d Mon Sep 17 00:00:00 2001 From: Lee Verberne Date: Wed, 15 Aug 2018 14:29:55 +0200 Subject: [PATCH 4/4] Move Ephemeral Containers into pod.Spec After discussing with API reviewers and relevant SIG leads, we've agreed that the configuration for Ephemeral Containers should live in the pod spec. 
--- .../node/troubleshoot-running-pods.md | 546 +++++++++--------- 1 file changed, 277 insertions(+), 269 deletions(-) diff --git a/contributors/design-proposals/node/troubleshoot-running-pods.md b/contributors/design-proposals/node/troubleshoot-running-pods.md index cb86c35b8a0..3fcc0223125 100644 --- a/contributors/design-proposals/node/troubleshoot-running-pods.md +++ b/contributors/design-proposals/node/troubleshoot-running-pods.md @@ -16,9 +16,9 @@ Many developers of native Kubernetes applications wish to treat Kubernetes as an execution platform for custom binaries produced by a build system. These users can forgo the scripted OS install of traditional Dockerfiles and instead `COPY` the output of their build system into a container image built `FROM scratch` or -a [distroless container -image](https://github.com/GoogleCloudPlatform/distroless). This confers several -advantages: +a +[distroless container image](https://github.com/GoogleCloudPlatform/distroless). +This confers several advantages: 1. **Minimal images** lower operational burden and reduce attack vectors. 1. **Immutable images** improve correctness and reliability. @@ -61,10 +61,9 @@ command, `kubectl debug`, which parallels an existing command, `kubectl exec`. Whereas `kubectl exec` runs a _process_ in a _container_, `kubectl debug` will be similar but run a _container_ in a _pod_. -A container created by `kubectl debug` is a _Debug Container_. Just like a -process run by `kubectl exec`, a Debug Container is not part of the pod spec. -Unlike `kubectl exec`, a Debug Container _does_ have status that is reported in -`v1.PodStatus` and displayed by `kubectl describe pod`. +A container created by `kubectl debug` is a _Debug Container_. Unlike `kubectl +exec`, Debug Containers have status that is reported in `PodStatus` and +displayed by `kubectl describe pod`. 
For example, the following command would attach to a newly created container in a pod: @@ -100,70 +99,90 @@ subsequently be used to reattach and is reported by `kubectl describe`. ### Kubernetes API Changes -There has been much discussion about how this fits best into the Kubernetes API. -The consensus is for an imperative "debug this pod" action whereby the kubelet -creates a new, temporary container in a pod on command. SIG Node would like to -avoid new dependencies in the kubelet, so this will be implemented in the Core -API. Three possible implementations follow, and additional implementations that -were evaluated and dismissed are at the end of this document. +This will be implemented in the Core API to avoid new dependencies in the +kubelet. The user-level concept of a _Debug Container_ is implemented with the +API-level concept of an _Ephemeral Container_. The API doesn't require an +Ephemeral Container to be used as a Debug Container. It's intended as a general +purpose construct for running a short-lived process in a pod. -All of the proposed solutions implement the user-level concept of a _Debug -Container_ using the API-level concept of an _Ephemeral Container_. The API -doesn't prescribe how an Ephemeral Container is used. It could conceivably see -use other than Debug Containers, but we don't currently have other use cases. +#### Pod Changes -#### Chosen Solution: Subresource to Update PodStatus - -An Ephemeral Container is not part of the pod specification as it's not part of -the declared state of the pod, but we describe it using the same primitives as -in `PodSpec`. An `EphemeralContainer` contains a Spec, a Status and a Target: +Ephemeral Containers are represented in `PodSpec` and `PodStatus`: ``` -// EphemeralContainer describes a container to attach to a running pod for troubleshooting. -type EphemeralContainer struct { - metav1.TypeMeta `json:",inline"` - - // Spec describes the Ephemeral Container to be created.
- Spec *Container `json:"spec,omitempty" protobuf:"bytes,2,opt,name=spec"` - - // Most recently observed status of the container. - // This data may not be up to date. - // Populated by the system. - // Read-only. - // +optional - Status *ContainerStatus `json:"status,omitempty" protobuf:"bytes,3,opt,name=status"` +type PodSpec struct { + ... + // List of user-initiated ephemeral containers to run in this pod. + // This field is alpha-level and is only honored by servers that enable the EphemeralContainers feature. + // +optional + EphemeralContainers []EphemeralContainer `json:"ephemeralContainers,omitempty" protobuf:"bytes,29,opt,name=ephemeralContainers"` +} - // If set, the name of the container from PodSpec that this ephemeral container targets. - // If not set then the ephemeral container is run in whatever namespaces are shared - // for the pod. - TargetContainerName string `json:"targetContainerName,omitempty" protobuf:"bytes,4,opt,name=targetContainerName"` +type PodStatus struct { + ... + // Status for any Ephemeral Containers that are running in this pod. + // This field is alpha-level and is only honored by servers that enable the EphemeralContainers feature. + // +optional + EphemeralContainerStatuses []ContainerStatus `json:"ephemeralContainerStatuses,omitempty" protobuf:"bytes,12,rep,name=ephemeralContainerStatuses"` } ``` -Ephemeral Containers for a pod are listed in the pod's status: +`EphemeralContainerStatuses` resembles the existing `ContainerStatuses` and +`InitContainerStatuses`, but `EphemeralContainers` introduces a new type: ``` -type PodStatus struct { - ... - // List of user-initiated ephemeral containers that have been run in this pod. - // +optional - EphemeralContainers []EphemeralContainer `json:"commands,omitempty" protobuf:"bytes,11,rep,name=ephemeralContainers"` - +// An EphemeralContainer is a container which runs temporarily in a pod for human-initiated actions +// such as troubleshooting.
This is an alpha feature enabled by the EphemeralContainers feature flag. +type EphemeralContainer struct { + // Spec describes the Ephemeral Container to be created. + Spec Container `json:"spec,omitempty" protobuf:"bytes,1,opt,name=spec"` + + // If set, the name of the container from PodSpec that this ephemeral container targets. + // The ephemeral container will be run in the namespaces (IPC, PID, etc) of this container. + // If not set then the ephemeral container is run in whatever namespaces are shared + // for the pod. + // +optional + TargetContainerName string `json:"targetContainerName,omitempty" protobuf:"bytes,2,opt,name=targetContainerName"` } ``` -To create a new Ephemeral Container, one appends a new `EphemeralContainer` with -the desired `v1.Container` as `Spec` in `Pod.Status` and updates the `Pod` in -the API. Users cannot normally modify the pod status, so we'll create a new -subresource `/ephemeralcontainers` that allows an update of solely -`EphemeralContainers` and enforces append-only semantics. +Much of the utility of Ephemeral Containers comes from the ability to run a +container within the PID namespace of another container. `TargetContainerName` +allows targeting a container that doesn't share its PID namespace with the rest +of the pod. We must modify the CRI to enable this functionality (see below). + +##### Alternative Considered: Omitting TargetContainerName + +It would be simpler for the API, kubelet and kubectl if `EphemeralContainers` +was a `[]Container`, but as isolated PID namespaces will be the default for some +time, being able to target a container will provide a better user experience. -**Note that Ephemeral Containers are not regular containers and should not be -used to build services.** They lack guarantees for resources or execution, they -will never be automatically restarted, and many of the fields of `v1.Container` -will not be allowed for Debug Containers. 
In particular, the following fields -are explicitly disallowed by API validation: `resources`, `ports`, -`livenessProbe`, `readinessProbe`, and `lifecycle`. +#### Updates + +Most fields of `Pod.Spec` are immutable once created. There is a short whitelist +of fields which may be updated, and we could extend this to include +`EphemeralContainers`. The ability to add new containers is a large change for +Pod, however, and we'd like to begin conservatively by enforcing the following +best practices: + +1. Ephemeral Containers lack guarantees for resources or execution, and they + will never be automatically restarted. To avoid pods that depend on + Ephemeral Containers, we allow their addition only in pod updates and + disallow them during pod creation. +1. Some fields of `v1.Container` imply a fundamental role in a pod. We will + disallow the following fields in Ephemeral Containers: `resources`, `ports`, + `livenessProbe`, `readinessProbe`, and `lifecycle`. +1. Cluster administrators may want to restrict access to Ephemeral Containers + independently of other pod updates. + +To enforce these restrictions and new permissions, we will introduce a new Pod +subresource, `/ephemeralcontainers`. `EphemeralContainers` can only be modified +via this subresource. `EphemeralContainerStatuses` is updated with everything +else in `Pod.Status` via `/status`. + +To create a new Ephemeral Container, one appends a new `EphemeralContainer` with +the desired `v1.Container` as `Spec` in `Pod.Spec.EphemeralContainers` and +`PUT`s the pod to `/ephemeralcontainers`. The subresources `attach`, `exec`, `log`, and `portforward` are available for Ephemeral Containers and will be forwarded by the apiserver. This means `kubectl @@ -174,111 +193,34 @@ Once the pod is updated, the kubelet worker watching this pod will launch the Ephemeral Container and update its status.
The client is expected to watch for the creation of the container status and then attach to the console of a debug container using the existing attach endpoint, -`/api/v1/namespaces/$NS/pods/$POD_NAME/attach`. Note that output of the new +`/api/v1/namespaces/$NS/pods/$POD_NAME/attach`. Note that any output of the new container occurring between its creation and attach will not be replayed, but it can be viewed using `kubectl log`. -#### Alternative 1: "exec++" - -A simpler change is to extend `v1.Pod`'s `/exec` subresource to support -"executing" container images. The current `/exec` endpoint must implement `GET` -to support streaming for all clients. We don't want to encode a (potentially -large) `v1.Container` into a query string, so we must extend `v1.PodExecOptions` -with the specific fields required for creating a Debug Container: - -``` -// PodExecOptions is the query options to a Pod's remote exec call -type PodExecOptions struct { - ... - // EphemeralContainerName is the name of an ephemeral container in which the - // command ought to be run. Either both EphemeralContainerName and - // EphemeralContainerImage fields must be set, or neither. - EphemeralContainerName *string `json:"ephemeralContainerName,omitempty" ...` - - // EphemeralContainerImage is the image of an ephemeral container in which the command - // ought to be run. Either both EphemeralContainerName and EphemeralContainerImage - // fields must be set, or neither. - EphemeralContainerImage *string `json:"ephemeralContainerImage,omitempty" ...` -} -``` +##### Alternative Considered: Standard Pod Updates -After creating the Ephemeral Container, the kubelet would upgrade the connection -to streaming and perform an attach to the container's console. If disconnected, -the Ephemeral Container could be reattached using the pod's `/attach` endpoint -with `EphemeralContainerName`. 
+It would simplify initial implementation if we updated the pod spec via the +normal means, and switched to a new update subresource if required at a future +date. It's easier to begin with a too-restrictive policy than a too-permissive +one on which users come to rely, and we expect to be able to remove the +`/ephemeralcontainers` subresource prior to exiting alpha should it prove +unnecessary. -Ephemeral Containers could not be removed via the API and instead the process -must terminate. While not ideal, this parallels existing behavior of `kubectl -exec`. To kill an Ephemeral Container one would `attach` and exit the process -interactively or create a new Ephemeral Container to send a signal with -`kill(1)` to the original process. - -#### Alternative 2: Ephemeral Container Controller +### Container Runtime Interface (CRI) changes -Using subresources is an imperative style API where the client instructs the -kubelet to perform an action, but in general Kubernetes prefers declarative APIs -where the client declares a state for Kubernetes to enact. - -We could implement this in a declarative manner by creating a new -`EphemeralContainer` type: - -``` -type EphemeralContainer struct { - metav1.TypeMeta - metav1.ObjectMeta - - Spec v1.Container - Status v1.ContainerStatus -} -``` - -A new controller in the kubelet would watch for EphemeralContainers and -create/delete debug containers. `EphemeralContainer.Status` would be updated by -the kubelet at the same time it updates `ContainerStatus` for regular and init -containers. Clients would create a new `EphemeralContainer` object, wait for it -to be started and then attach using the pod's attach subresource and the name of -the `EphemeralContainer`. - -Debugging is inherently imperative, however, and not the a desired state to -describe. Once a Debug Container is started it should not be automatically -restarted, for example. 
A declarative API adds new states for the kubelet to -enforce, and SIG Node strongly prefers to minimize kubelet complexity. - -### Ephemeral Container Status - -The kubelet should be able to construct `PodStatus` without relying on prior -state, so we will store the Ephemeral Container's `Spec` and -`TargetContainerName` as runtime metadata. The kubelet persists container -metadata as CRI -[labels](https://github.com/kubernetes/kubernetes/blob/v1.10.0-alpha.0/pkg/kubelet/apis/cri/v1alpha1/runtime/api.proto#L606) -and -[annotations](https://github.com/kubernetes/kubernetes/blob/v1.10.0-alpha.0/pkg/kubelet/apis/cri/v1alpha1/runtime/api.proto#L613). -The entire `v1.Container` used in the request will be serialized and stored as a -runtime annotation. The value of `TargetContainerName` will be stored as a -runtime label. Persisting this data in the runtime means it survives kubelet -restarts. - -At least for the Docker runtime, this is [an intended use of docker -labels](https://docs.docker.com/engine/userguide/labels-custom-metadata/#value-guidelines). -Docker does not document the maximum length of labels in its API. Empirically, -it supports up to the 64K constraint of the docker client's `bufio.Scanner` -size. We will conservatively limit the size of the spec to 32K and add a 32K -minimum label length test to runtime qualification. - -`EphemeralContainer.Status` is populated by the kubelet in the same way as -regular container statuses. The kubelet then updates the pod's status in the API -server using the pod's `/status` endpoint, which imposes no restrictions on -updates to `ephemeralContainers`. +The CRI requires no changes for basic functionality, but it will need to be +updated to support container namespace targeting, as described in the +[Shared PID Namespace Proposal](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/pod-pid-namespace.md#targeting-a-specific-containers-namespace). ### Creating Debug Containers -1. 
`kubectl` constructs and `EphemeralContainer` based on command line - arguments and appends it to `Pod.Status.EphemeralContainers`. It `PUT`s the - modified pod to the pod's `/ephemeralcontainers`. +To create a debug container, kubectl will take the following steps: + +1. `kubectl` constructs an `EphemeralContainer` based on command line arguments + and appends it to `Pod.Spec.EphemeralContainers`. It `PUT`s the modified pod + to the pod's `/ephemeralcontainers`. 1. The apiserver discards changes other than additions to - `Pod.Status.EphemeralContainers` and validates the pod update. - 1. Update discards `EphemeralContainer.Status` for new Ephemeral - Containers. + `Pod.Spec.EphemeralContainers` and validates the pod update. 1. Pod validation fails if container spec contains fields disallowed for Ephemeral Containers or the same name as a container in the spec or `EphemeralContainers`. @@ -286,8 +228,8 @@ updates to `ephemeralContainers`. 1. The kubelet's pod watcher notices the update and triggers a `syncPod()`. During the sync, the kubelet calls `kuberuntime.StartEphemeralContainer()` for any new Ephemeral Container. - 1. `StartEphemeralContainer()` uses the existing `startContainer()` method, - which gains support for targeting the namespaces of a container by name. + 1. `StartEphemeralContainer()` uses the existing `startContainer()` to + start the Ephemeral Container. 1. After initial creation, future invocations of `syncPod()` will publish its ContainerStatus but otherwise ignore the Ephemeral Container. It will exist for the life of the pod sandbox or it exits. In no event will @@ -312,31 +254,16 @@ terminal. This is supported by Docker. ### Killing Debug Containers -Debug containers will not be killed automatically unless the pod (specifically, -the pod sandbox) is destroyed. Debug Containers will stop when their command -exits, such as exiting a shell. 
Unlike `kubectl exec`, processes in Debug -Containers will not receive an EOF if their connection is interrupted. - -### Container Lifecycle Changes +Debug containers will not be killed automatically unless the pod is destroyed. +Debug Containers will stop when their command exits, such as exiting a shell. +Unlike `kubectl exec`, processes in Debug Containers will not receive an EOF if +their connection is interrupted. -Implementing debug requires no changes to the Container Runtime Interface as -it's the same operation as creating a regular container. The following changes -are necessary in the kubelet: - -1. `SyncPod()` must not kill any Debug Container even though it is not part of - the pod spec. -1. As an exception to the above, `SyncPod()` will kill Debug Containers when - the pod sandbox changes since a lone Debug Container in an abandoned sandbox - is not useful. Debug Containers are not started automatically in the new - sandbox. -1. `convertStatusToAPIStatus()` must sort Debug Containers status into - `EphemeralContainer.Status` similar to as it does for - `InitContainerStatuses` -1. Debug Containers must be excluded from calculation of pod phase and - condition - -`KillPod()` already operates on all running containers returned by the runtime -and requires no changes +A future improvement to Ephemeral Containers could allow killing Debug +Containers when they're removed from `EphemeralContainers`, but it's not clear +that we want to allow this. Removing an Ephemeral Container spec makes it +unavailable for future authorization decisions (e.g. whether to authorize exec +in a pod that had a privileged Ephemeral Container). ### Security Considerations @@ -344,9 +271,8 @@ Debug Containers have no additional privileges above what is available to any `v1.Container`. It's the equivalent of configuring an shell container in a pod spec except that it is created on demand. -Admission plugins must be updated to guard `/ephemeralcontainers`.
In -particular, they should enforce the same container image policy on the -`EphemeralContainer.Spec` parameter as is enforced for regular containers. +Admission plugins must be updated to guard `/ephemeralcontainers`. They should +apply the same container image and security policy as for regular containers. ### Additional Consideration @@ -356,70 +282,33 @@ particular, they should enforce the same container image policy on the troubleshooting causes a pod to exceed its resource limit it may be evicted. 1. There's an output stream race inherent to creating then attaching a container which causes output generated between the start and attach to go - to the log rather than the client. This is not specific to Debug Containers - and exists because Kubernetes has no mechanism to attach a container prior - to starting it. This larger issue will not be addressed by Debug Containers, - but Debug Containers would benefit from future improvements or work arounds. -1. Debug Containers should not be used to build services, which we've attempted - to reflect in the API. -1. If a pod is configured with isolated PID namespaces, the Debug Container - will join the PID namespace of the target container. Debug Containers will - not be available with runtimes that do not implement PID namespace sharing. + to the log rather than the client. This is not specific to Ephemeral + Containers and exists because Kubernetes has no mechanism to attach a + container prior to starting it. This larger issue will not be addressed by + Ephemeral Containers, but Ephemeral Containers would benefit from future + improvements or work arounds. +1. Ephemeral Containers should not be used to build services, which we've + attempted to reflect in the API. 
## Implementation Plan -### Alpha Release - -#### Goals and Non-Goals for Alpha Release +### 1.12: Initial Alpha Release -We're targeting an alpha release in Kubernetes 1.11 that includes the following +We're targeting an alpha release in Kubernetes 1.12 that includes the following basic functionality: -* Support in the kubelet for creating debug containers in a running pod -* A `kubectl alpha debug` command to initiate a debug container -* `kubectl describe pod` will list status of debug containers running in a pod - -Functionality will be hidden behind an alpha feature flag and disabled by -default. - -#### Kubernetes API Changes - -The following changes must be implemented in the API: - -1. `v1.EphemeralContainer` will be added and `v1.PodStatus` will be extended as - described above. -1. The new subresource will be added to the pods API. -1. The API server must check for Ephemeral Containers when validating `attach`. +1. Approval for basic core API changes to Pod +1. Basic support in the kubelet for creating Ephemeral Containers -#### kubelet Implementation +Functionality out of scope for 1.12: -Debug Containers are implemented in the kubelet's generic runtime manager. -Performing this operation with a legacy (non-CRI) runtime will result in a not -implemented error. Implementation in the kubelet will be split into the -following steps: +* Killing running Ephemeral Containers by removing them from the Pod Spec. +* Updating `pod.Spec.EphemeralContainers` when containers are garbage + collected. +* `kubectl` commands for creating Ephemeral Containers -1. New container metadata `ContainerType`, `ContainerSpec` & - `TargetContainerName` is stored using CRI labels and annotations. - `kubecontainer.ContainerStatus` will be extended with a `ContainerType` - field (possible values: `REGULAR`, `INIT` & `EPHEMERAL`) so a container can - be identified as a debug container. -1. 
`kuberuntimemanager` gains a new `StartEphemeralContainer()` which calls the - existing `startContainer()`. -1. `syncPod()` will call `StartEphemeralContainer()` to start the Debug - Container. The existing `generateAPIPodStatus()` will be updated to also - populate `EphemeralContainers.Status`. - -#### kubectl changes - -In anticipation of this change, [#46151](https://pr.k8s.io/46151) added a -`kubectl alpha` command to contain alpha features. We will add `kubectl alpha -debug` to invoke Debug Containers. `kubectl` does not use feature gates, so -`kubectl alpha debug` will be visible by default in `kubectl` 1.11 and return an -error when used on a cluster with the feature disabled. - -`kubectl describe pod` will report the contents of `EphemeralContainers` when -not empty as it means the feature is enabled. The field will be hidden when -empty. +Functionality will be hidden behind an alpha feature flag and disabled by +default. ## Appendices @@ -550,10 +439,10 @@ container image distribution mechanisms to fetch images when the debug command is run. **Respect admission restrictions.** Requests from kubectl are proxied through -the apiserver and so are available to existing [admission -controllers](https://kubernetes.io/docs/admin/admission-controllers/). Plugins -already exist to intercept `exec` and `attach` calls, but extending this to -support `debug` has not yet been scoped. +the apiserver and so are available to existing +[admission controllers](https://kubernetes.io/docs/admin/admission-controllers/). +Plugins already exist to intercept `exec` and `attach` calls, but extending this +to support `debug` has not yet been scoped. **Allow introspection of pod state using existing tools**. The list of `EphemeralContainerStatuses` is never truncated. If a debug container has run in @@ -587,26 +476,146 @@ active debug container. 
### Appendix 3: Alternatives Considered -#### Mutable Pod Spec +#### Container Spec in PodStatus + +Originally there was a desire to keep the pod spec immutable, so we explored +modifying only the pod status. An `EphemeralContainer` would contain a Spec, a +Status and a Target: + +``` +// EphemeralContainer describes a container to attach to a running pod for troubleshooting. +type EphemeralContainer struct { + metav1.TypeMeta `json:",inline"` + + // Spec describes the Ephemeral Container to be created. + Spec *Container `json:"spec,omitempty" protobuf:"bytes,2,opt,name=spec"` + + // Most recently observed status of the container. + // This data may not be up to date. + // Populated by the system. + // Read-only. + // +optional + Status *ContainerStatus `json:"status,omitempty" protobuf:"bytes,3,opt,name=status"` + + // If set, the name of the container from PodSpec that this ephemeral container targets. + // If not set then the ephemeral container is run in whatever namespaces are shared + // for the pod. + TargetContainerName string `json:"targetContainerName,omitempty" protobuf:"bytes,4,opt,name=targetContainerName"` +} +``` + +Ephemeral Containers for a pod would be listed in the pod's status: + +``` +type PodStatus struct { + ... + // List of user-initiated ephemeral containers that have been run in this pod. + // +optional + EphemeralContainers []EphemeralContainer `json:"ephemeralContainers,omitempty" protobuf:"bytes,11,rep,name=ephemeralContainers"` + +} +``` + +To create a new Ephemeral Container, one would append a new `EphemeralContainer` +with the desired `v1.Container` as `Spec` in `Pod.Status` and update the `Pod` +in the API. Users cannot normally modify the pod status, so we'd create a new +subresource `/ephemeralcontainers` that allows an update of solely +`EphemeralContainers` and enforces append-only semantics.
+ +Since we have a requirement to describe the Ephemeral Container with a +`v1.Container`, this led to a "spec in status" that seemed to violate API best +practices. It was confusing, and it required added complexity in the kubelet to +persist and publish user intent, which is rightfully the job of the apiserver. + +#### Extend the Existing Exec API ("exec++") + +A simpler change is to extend `v1.Pod`'s `/exec` subresource to support +"executing" container images. The current `/exec` endpoint must implement `GET` +to support streaming for all clients. We don't want to encode a (potentially +large) `v1.Container` into a query string, so we must extend `v1.PodExecOptions` +with the specific fields required for creating a Debug Container: + +``` +// PodExecOptions is the query options to a Pod's remote exec call +type PodExecOptions struct { + ... + // EphemeralContainerName is the name of an ephemeral container in which the + // command ought to be run. Either both EphemeralContainerName and + // EphemeralContainerImage fields must be set, or neither. + EphemeralContainerName *string `json:"ephemeralContainerName,omitempty" ...` + + // EphemeralContainerImage is the image of an ephemeral container in which the command + // ought to be run. Either both EphemeralContainerName and EphemeralContainerImage + // fields must be set, or neither. + EphemeralContainerImage *string `json:"ephemeralContainerImage,omitempty" ...` +} +``` + +After creating the Ephemeral Container, the kubelet would upgrade the connection +to streaming and perform an attach to the container's console. If disconnected, +the Ephemeral Container could be reattached using the pod's `/attach` endpoint +with `EphemeralContainerName`. + +Ephemeral Containers could not be removed via the API and instead the process +must terminate. While not ideal, this parallels existing behavior of `kubectl +exec`.
To kill an Ephemeral Container one would `attach` and exit the process +interactively or create a new Ephemeral Container to send a signal with +`kill(1)` to the original process. + +Since the user cannot specify the `v1.Container`, this approach sacrifices a +great deal of flexibility. This solution still requires the kubelet to publish a +`Container` spec in the `PodStatus` that can be examined for future admission +decisions and so retains many of the downsides of the Container Spec in +PodStatus approach. + +#### Ephemeral Container Controller + +Kubernetes prefers declarative APIs where the client declares a state for +Kubernetes to enact. We could implement this in a declarative manner by creating +a new `EphemeralContainer` type: + +``` +type EphemeralContainer struct { + metav1.TypeMeta + metav1.ObjectMeta + + Spec v1.Container + Status v1.ContainerStatus +} +``` + +A new controller in the kubelet would watch for EphemeralContainers and +create/delete debug containers. `EphemeralContainer.Status` would be updated by +the kubelet at the same time it updates `ContainerStatus` for regular and init +containers. Clients would create a new `EphemeralContainer` object, wait for it +to be started and then attach using the pod's attach subresource and the name of +the `EphemeralContainer`. + +A new controller is a significant amount of complexity to add to the kubelet, +especially considering that the kubelet is already watching for changes to pods. +The kubelet would have to be modified to create containers in a pod from +multiple config sources. SIG Node strongly prefers to minimize kubelet +complexity. + +#### Mutable Pod Spec Containers -Rather than adding an operation to have Kubernetes attach a pod we could instead -make the pod spec mutable so the client can generate an update adding a -container. 
`SyncPod()` has no issues adding the container to the pod at that -point, but an immutable pod spec has been a basic assumption in Kubernetes thus -far and changing it carries risk. It's preferable to keep the pod spec immutable -as a best practice. +Rather than adding to the pod API, we could instead make the pod spec mutable so +the client can generate an update adding a container. `SyncPod()` has no issues +adding the container to the pod at that point, but an immutable pod spec has +been a basic assumption and best practice in Kubernetes. Changing this +assumption complicates the requirements of the kubelet state machine. Since the +kubelet was not written with this in mind, we should expect such a change would +create bugs we cannot predict. -#### Ephemeral container +#### Image Exec -An earlier version of this proposal suggested running an ephemeral container in -the pod namespaces. The container would not be added to the pod spec and would -exist only as long as the process it ran. This has the advantage of behaving -similarly to the current kubectl exec, but it is opaque and likely violates -design assumptions. We could add constructs to track and report on both -traditional exec process and exec containers, but this would probably be more -work than adding to the pod spec. Both are generally useful, and neither -precludes the other in the future, so we chose mutating the pod spec for -expedience. +An earlier version of this proposal suggested simply adding an `Image` parameter to +the exec API. This would run an ephemeral container in the pod namespaces +without adding it to the pod spec or status. This container would exist only as +long as the process it ran. This parallels the current kubectl exec, including +its lack of transparency. We could add constructs to track and report on both +traditional exec processes and exec containers. In the end this failed to meet our +transparency requirements.
#### Attaching Container Type Volume @@ -627,9 +636,8 @@ this simplifies the solution by working within the existing constraints of If Kubernetes supported the concept of an "inactive" container, we could configure it as part of a pod and activate it at debug time. In order to avoid coupling the debug tool versions with those of the running containers, we would -need to ensure the debug image was pulled at debug time. The container could -then be run with a TTY and attached using kubectl. We would need to figure out a -solution that allows access the filesystem of other containers. +want to ensure the debug image was pulled at debug time. The container could +then be run with a TTY and attached using kubectl. The downside of this approach is that it requires prior configuration. In addition to requiring prior consideration, it would increase boilerplate config. @@ -639,14 +647,14 @@ than a feature of the platform. #### Implicit Empty Volume Kubernetes could implicitly create an EmptyDir volume for every pod which would -then be available as target for either the kubelet or a sidecar to extract a +then be available as a target for either the kubelet or a sidecar to extract a package of binaries. Users would have to be responsible for hosting a package build and distribution infrastructure or rely on a public one. The complexity of this solution makes it undesirable. -#### Standalone Pod in Shared Namespace +#### Standalone Pod in Shared Namespace ("Debug Pod") Rather than inserting a new container into a pod namespace, Kubernetes could instead support creating a new pod with container namespaces shared with @@ -656,21 +664,21 @@ useful, the containers in this "Debug Pod" should be run inside the namespaces (network, pid, etc) of the target pod but remain in a separate resource group (e.g. cgroup for container-based runtimes). -This would be a rather fundamental change to pod, which is currently treated as -an atomic unit. 
The Container Runtime Interface has no provisions for sharing +This would be a rather large change for pod, which is currently treated as an +atomic unit. The Container Runtime Interface has no provisions for sharing outside of a pod sandbox and would need a refactor. This could be a complicated change for non-container runtimes (e.g. hypervisor runtimes) which have more rigid boundaries between pods. -Effectively, Debug Pod must be implemented by the runtimes while Debug -Containers are implemented by the kubelet. Minimizing change to the Kubernetes -API is not worth the increased complexity for the kubelet and runtimes. +This is pushing the complexity of the solution from the kubelet to the runtimes. +Minimizing change to the Kubernetes API is not worth the increased complexity +for the kubelet and runtimes. It could also be possible to implement a Debug Pod as a privileged pod that runs in the host namespace and interacts with the runtime directly to run a new container in the appropriate namespace. This solution would be runtime-specific -and effectively pushes the complexity of debugging to the user. Additionally, -requiring node-level access to debug a pod does not meet our requirements. +and pushes the complexity of debugging to the user. Additionally, requiring +node-level access to debug a pod does not meet our requirements. #### Exec from Node