diff --git a/contributors/design-proposals/node/troubleshoot-running-pods.md b/contributors/design-proposals/node/troubleshoot-running-pods.md index 72c1cb77321..3fcc0223125 100644 --- a/contributors/design-proposals/node/troubleshoot-running-pods.md +++ b/contributors/design-proposals/node/troubleshoot-running-pods.md @@ -1,6 +1,6 @@ # Troubleshoot Running Pods -* Status: Pending +* Status: Implementing * Version: Alpha * Implementation Owner: @verb @@ -16,9 +16,9 @@ Many developers of native Kubernetes applications wish to treat Kubernetes as an execution platform for custom binaries produced by a build system. These users can forgo the scripted OS install of traditional Dockerfiles and instead `COPY` the output of their build system into a container image built `FROM scratch` or -a [distroless container -image](https://github.com/GoogleCloudPlatform/distroless). This confers several -advantages: +a +[distroless container image](https://github.com/GoogleCloudPlatform/distroless). +This confers several advantages: 1. **Minimal images** lower operational burden and reduce attack vectors. 1. **Immutable images** improve correctness and reliability. @@ -45,26 +45,25 @@ A solution to troubleshoot arbitrary container images MUST: * fetch troubleshooting utilities at debug time rather than at the time of pod creation * be compatible with admission controllers and audit logging -* allow discovery of debugging status +* allow discovery of current debugging status * support arbitrary runtimes via the CRI (possibly with reduced feature set) * require no administrative access to the node * have an excellent user experience (i.e. 
should be a feature of the platform rather than config-time trickery) -* have no *inherent* side effects to the running container image +* have no _inherent_ side effects to the running container image +* v1.Container must be available for inspection by admission controllers ## Feature Summary Any new debugging functionality will require training users. We can ease the transition by building on an existing usage pattern. We will create a new command, `kubectl debug`, which parallels an existing command, `kubectl exec`. -Whereas `kubectl exec` runs a *process* in a *container*, `kubectl debug` will -be similar but run a *container* in a *pod*. +Whereas `kubectl exec` runs a _process_ in a _container_, `kubectl debug` will +be similar but run a _container_ in a _pod_. -A container created by `kubectl debug` is a *Debug Container*. Just like a -process run by `kubectl exec`, a Debug Container is not part of the pod spec and -has no resource stored in the API. Unlike `kubectl exec`, a Debug Container -*does* have status that is reported in `v1.PodStatus` and displayed by `kubectl -describe pod`. +A container created by `kubectl debug` is a _Debug Container_. Unlike `kubectl +exec`, Debug Containers have status that is reported in `PodStatus` and +displayed by `kubectl describe pod`. For example, the following command would attach to a newly created container in a pod: @@ -82,22 +81,16 @@ kubectl debug target-pod This creates an interactive shell in a pod which can examine and signal other processes in the pod. It has access to the same network and IPC as processes in -the pod. It can access the filesystem of other processes by `/proc/$PID/root`. -As is already the case with regular containers, Debug Containers can enter -arbitrary namespaces of another container via `nsenter` when run with -`CAP_SYS_ADMIN`. +the pod. When [process namespace sharing](https://features.k8s.io/495) is +enabled, it can access the filesystem of other processes by `/proc/$PID/root`. 
+Debug Containers can enter arbitrary namespaces of another visible container via +`nsenter` when run with `CAP_SYS_ADMIN`. -*Please see the User Stories section for additional examples and Alternatives -Considered for the considerable list of other solutions we considered.* +_Please see the User Stories section for additional examples and Alternatives +Considered for the considerable list of other solutions we considered._ ## Implementation Details -The implementation of `kubectl debug` closely mirrors the implementation of -`kubectl exec`, with most of the complexity implemented in the `kubelet`. How -functionality like this best fits into Kubernetes API has been contentious. In -order to make progress, we will start with the smallest possible API change, -extending `/exec` to support Debug Containers, and iterate. - From the perspective of the user, there's a new command, `kubectl debug`, that creates a Debug Container and attaches to its console. We believe a new command will be less confusing for users than overloading `kubectl exec` with a new @@ -106,192 +99,154 @@ subsequently be used to reattach and is reported by `kubectl describe`. ### Kubernetes API Changes -#### Chosen Solution: "exec++" - -We will extend `v1.Pod`'s `/exec` subresource to support "executing" container -images. The current `/exec` endpoint must implement `GET` to support streaming -for all clients. We don't want to encode a (potentially large) `v1.Container` as -an HTTP parameter, so we must extend `v1.PodExecOptions` with the specific -fields required for creating a Debug Container: - -``` -// PodExecOptions is the query options to a Pod's remote exec call -type PodExecOptions struct { - ... - // EphemeralContainerName is the name of an ephemeral container in which the - // command ought to be run. Either both EphemeralContainerName and - // EphemeralContainerImage fields must be set, or neither. 
-  EphemeralContainerName *string `json:"ephemeralContainerName,omitempty" ...`
-
-  // EphemeralContainerImage is the image of an ephemeral container in which the command
-  // ought to be run. Either both EphemeralContainerName and EphemeralContainerImage
-  // fields must be set, or neither.
-  EphemeralContainerImage *string `json:"ephemeralContainerImage,omitempty" ...`
-}
-```
-
-After creating the Debug Container, the kubelet will upgrade the connection to
-streaming and perform an attach to the container's console. If disconnected, the
-Debug Container can be reattached using the pod's `/attach` endpoint with
-`EphemeralContainerName`.
-
-Debug Containers cannot be removed via the API and instead the process must
-terminate. While not ideal, this parallels existing behavior of `kubectl exec`.
-To kill a Debug Container one would `attach` and exit the process interactively
-or create a new Debug Container to send a signal with `kill(1)` to the original
-process.
+This will be implemented in the Core API to avoid new dependencies in the
+kubelet. The user-level concept of a _Debug Container_ is implemented with the
+API-level concept of an _Ephemeral Container_. The API doesn't require an
+Ephemeral Container to be used as a Debug Container. It's intended as a
+general-purpose construct for running a short-lived process in a pod.
 
-#### Alternative 1: Debug Subresource
+#### Pod Changes
 
-Rather than extending an existing subresource, we could create a new,
-non-streaming `debug` subresource. We would create a new API Object:
+Ephemeral Containers are represented in `PodSpec` and `PodStatus`:
 
 ```
-// DebugContainer describes a container to attach to a running pod for troubleshooting.
-type DebugContainer struct {
-  metav1.TypeMeta
-  metav1.ObjectMeta
-
-  // Name is the name of the Debug Container. Its presence will cause
-  // exec to create a Debug Container rather than performing a runtime exec.
- Name string `json:"name,omitempty" ...` - - // Image is an optional container image name that will be used to for the Debug - // Container in the specified Pod with Command as ENTRYPOINT. If omitted a - // default image will be used. - Image string `json:"image,omitempty" ...` +type PodSpec struct { + ... + // List of user-initiated ephemeral containers to run in this pod. + // This field is alpha-level and is only honored by servers that enable the EphemeralContainers feature. + // +optional + EphemeralContainers []EphemeralContainer `json:"ephemeralContainers,omitempty" protobuf:"bytes,29,opt,name=ephemeralContainers"` } -``` - -The pod would gain a new `/debug` subresource that allows the following: -1. A `POST` of a `PodDebugContainer` to - `/api/v1/namespaces/$NS/pods/$POD_NAME/debug/$NAME` to create Debug - Container named `$NAME` running in pod `$POD_NAME`. -1. A `DELETE` of `/api/v1/namespaces/$NS/pods/$POD_NAME/debug/$NAME` will stop - the Debug Container `$NAME` in pod `$POD_NAME`. - -Once created, a client would attach to the console of a debug container using -the existing attach endpoint, `/api/v1/namespaces/$NS/pods/$POD_NAME/attach`. - -However, this pattern does not resemble any other current usage of the API, so -we prefer to start with exec++ and reevaluate if we discover a compelling -reason. - -#### Alternative 2: Declarative Configuration - -Using subresources is an imperative style API where the client instructs the -kubelet to perform an action, but in general Kubernetes prefers declarative APIs -where the client declares a state for Kubernetes to enact. 
-
-We could implement this in a declarative manner by creating a new
-`EphemeralContainer` type:
-
-```
-type EphemeralContainer struct {
-  metav1.TypeMeta
-  metav1.ObjectMeta
-
-  Spec EphemeralContainerSpec
-  Status v1.ContainerStatus
-}
-```
-
-`EphemeralContainerSpec` is similar to `v1.Container`, but contains only fields
-relevant to Debug Containers:
-
-```
-type EphemeralContainerSpec struct {
-  // Target is the pod in which to run the EphemeralContainer
-  // Required.
-  Target v1.ObjectReference
-
-  Name string
-  Image String
-  Command []string
-  Args []string
-  ImagePullPolicy PullPolicy
-  SecurityContext *SecurityContext
+type PodStatus struct {
+  ...
+  // Status for any Ephemeral Containers that are running in this pod.
+  // This field is alpha-level and is only honored by servers that enable the EphemeralContainers feature.
+  // +optional
+  EphemeralContainerStatuses []ContainerStatus `json:"ephemeralContainerStatuses,omitempty" protobuf:"bytes,12,rep,name=ephemeralContainerStatuses"`
 }
 ```
 
-A new controller in the kubelet would watch for EphemeralContainers and
-create/delete debug containers. `EphemeralContainer.Status` would be updated by
-the kubelet at the same time it updates `ContainerStatus` for regular and init
-containers. Clients would create a new `EphemeralContainer` object, wait for it
-to be started and then attach using the pod's attach subresource and the name of
-the `EphemeralContainer`.
-
-Debugging is inherently imperative, however, rather than a state for Kubernetes
-to enforce. Once a Debug Container is started it should not be automatically
-restarted, for example. This solution imposes additionally complexity and
-dependencies on the kubelet, but it's not yet clear if the complexity is
-justified.
- -### Debug Container Status - -The status of a Debug Container is reported in a new field in `v1.PodStatus`: +`EphemeralContainerStatuses` resembles the existing `ContainerStatuses` and +`InitContainerStatuses`, but `EphemeralContainers` introduces a new type: ``` -type PodStatus struct { - ... - EphemeralContainerStatuses []v1.ContainerStatus +// An EphemeralContainer is a container which runs temporarily in a pod for human-initiated actions +// such as troubleshooting. This is an alpha feature enabled by the EphemeralContainers feature flag. +type EphemeralContainer struct { + // Spec describes the Ephemeral Container to be created. + Spec Container `json:"spec,omitempty" protobuf:"bytes,1,opt,name=spec"` + + // If set, the name of the container from PodSpec that this ephemeral container targets. + // The ephemeral container will be run in the namespaces (IPC, PID, etc) of this container. + // If not set then the ephemeral container is run in whatever namespaces are shared + // for the pod. + // +optional + TargetContainerName string `json:"targetContainerName,omitempty" protobuf:"bytes,2,opt,name=targetContainerName"` } ``` -This status is only populated for Debug Containers, but there's interest in -tracking status for traditional exec in a similar manner. - -Note that `Command` and `Args` would have to be tracked in the status object -because there is no spec for Debug Containers or exec. These must either be made -available by the runtime or tracked by the kubelet. For Debug Containers this -could be stored as runtime labels, but the kubelet currently has no method of -storing state across restarts for exec. Solving this problem for exec is out of -scope for Debug Containers, but we will look for a solution as we implement this -feature. - -`EphemeralContainerStatuses` is populated by the kubelet in the same way as -regular and init container statuses. This is sent to the API server and -displayed by `kubectl describe pod`. 
+Much of the utility of Ephemeral Containers comes from the ability to run a
+container within the PID namespace of another container. `TargetContainerName`
+allows targeting a container that doesn't share its PID namespace with the rest
+of the pod. We must modify the CRI to enable this functionality (see below).
+
+##### Alternative Considered: Omitting TargetContainerName
+
+It would be simpler for the API, kubelet and kubectl if `EphemeralContainers`
+were a `[]Container`, but as isolated PID namespaces will be the default for
+some time, being able to target a container will provide a better user
+experience.
+
+#### Updates
+
+Most fields of `Pod.Spec` are immutable once created. There is a short whitelist
+of fields which may be updated, and we could extend this to include
+`EphemeralContainers`. The ability to add new containers is a large change for
+Pod, however, and we'd like to begin conservatively by enforcing the following
+best practices:
+
+1. Ephemeral Containers lack guarantees for resources or execution, and they
+   will never be automatically restarted. To avoid pods that depend on
+   Ephemeral Containers, we allow their addition only in pod updates and
+   disallow them during pod create.
+1. Some fields of `v1.Container` imply a fundamental role in a pod. We will
+   disallow the following fields in Ephemeral Containers: `resources`, `ports`,
+   `livenessProbe`, `readinessProbe`, and `lifecycle`.
+1. Cluster administrators may want to restrict access to Ephemeral Containers
+   independent of other pod updates.
+
+To enforce these restrictions and new permissions, we will introduce a new Pod
+subresource, `/ephemeralcontainers`. `EphemeralContainers` can only be modified
+via this subresource. `EphemeralContainerStatuses` is updated with everything
+else in `Pod.Status` via `/status`.
+
+To create a new Ephemeral Container, one appends a new `EphemeralContainer` with
+the desired `v1.Container` as `Spec` in `Pod.Spec.EphemeralContainers` and
+`PUT`s the pod to `/ephemeralcontainers`.
+
+The subresources `attach`, `exec`, `log`, and `portforward` are available for
+Ephemeral Containers and will be forwarded by the apiserver. This means `kubectl
+attach`, `kubectl exec`, `kubectl logs`, and `kubectl port-forward` will work
+for Ephemeral Containers.
+
+Once the pod is updated, the kubelet worker watching this pod will launch the
+Ephemeral Container and update its status. The client is expected to watch for
+the creation of the container status and then attach to the console of a debug
+container using the existing attach endpoint,
+`/api/v1/namespaces/$NS/pods/$POD_NAME/attach`. Note that any output of the new
+container occurring between its creation and attach will not be replayed, but it
+can be viewed using `kubectl logs`.
+
+##### Alternative Considered: Standard Pod Updates
+
+It would simplify initial implementation if we updated the pod spec via the
+normal means, and switched to a new update subresource if required at a future
+date. It's easier to begin with a too-restrictive policy than a too-permissive
+one on which users come to rely, and we expect to be able to remove the
+`/ephemeralcontainers` subresource prior to exiting alpha should it prove
+unnecessary.
+
+### Container Runtime Interface (CRI) changes
+
+The CRI requires no changes for basic functionality, but it will need to be
+updated to support container namespace targeting, as described in the
+[Shared PID Namespace Proposal](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/pod-pid-namespace.md#targeting-a-specific-containers-namespace).
 
 ### Creating Debug Containers
 
-1. `kubectl` invokes the exec API as described in the preceding section.
-1. 
The API server checks for name collisions with existing containers, performs - admission control and proxies the connection to the kubelet's - `/exec/$NS/$POD_NAME/$CONTAINER_NAME` endpoint. -1. The kubelet instructs the Runtime Manager to create a Debug Container. -1. The runtime manager uses the existing `startContainer()` method to create a - container in an existing pod. `startContainer()` has one modification for - Debug Containers: it creates a new runtime label (e.g. a docker label) that - identifies this container as a Debug Container. -1. After creating the container, the kubelet schedules an asynchronous update - of `PodStatus`. The update publishes the debug container status to the API - server at which point the Debug Container becomes visible via `kubectl - describe pod`. -1. The kubelet will upgrade the connection to streaming and attach to the - container's console. - -Rather than performing the implicit attach the kubelet could return success to -the client and require the client to perform an explicit attach, but the -implicit attach maintains consistent semantics across `/exec` rather than -varying behavior based on parameters. - -The apiserver detects container name collisions with both containers in the pod -spec and other running Debug Containers by checking -`EphemeralContainerStatuses`. In a race to create two Debug Containers with the -same name, the API server will pass both requests and the kubelet must return an -error to all but one request. +To create a debug container, kubectl will take the following steps: + +1. `kubectl` constructs an `EphemeralContainer` based on command line arguments + and appends it to `Pod.Spec.EphemeralContainers`. It `PUT`s the modified pod + to the pod's `/ephemeralcontainers`. +1. The apiserver discards changes other than additions to + `Pod.Spec.EphemeralContainers` and validates the pod update. + 1. 
Pod validation fails if the container spec contains fields disallowed for
+      Ephemeral Containers or a name that duplicates a container in the spec
+      or in `EphemeralContainers`.
+   1. API resource versioning resolves update races.
+1. The kubelet's pod watcher notices the update and triggers a `syncPod()`.
+   During the sync, the kubelet calls `kuberuntime.StartEphemeralContainer()`
+   for any new Ephemeral Container.
+   1. `StartEphemeralContainer()` uses the existing `startContainer()` to
+      start the Ephemeral Container.
+   1. After initial creation, future invocations of `syncPod()` will publish
+      its ContainerStatus but otherwise ignore the Ephemeral Container. It
+      will exist for the life of the pod sandbox or until it exits. In no
+      event will it be restarted.
+1. `syncPod()` finishes a regular sync, publishing an updated PodStatus (which
+   includes the new `EphemeralContainer`) by its normal, existing means.
+1. The client performs an attach to the debug container's console.
 
 There are no limits on the number of Debug Containers that can be created in a
 pod, but exceeding a pod's resource allocation may cause the pod to be evicted.
 
 ### Restarting and Reattaching Debug Containers
 
-Debug Containers will never be restarted automatically. It is possible to
-replace a Debug Container that has exited by re-using a Debug Container name. It
-is an error to attempt to replace a Debug Container that is still running, which
-is detected by both the API server and the kubelet.
+Debug Containers will not be restarted.
+
+We want to be more user-friendly by allowing re-use of the name of an exited
+debug container, but this will be left for a future improvement.
 
 One can reattach to a Debug Container using `kubectl attach`. When supported by
 a runtime, multiple clients can attach to a single debug container and share the
 terminal. This is supported by Docker.
### Killing Debug Containers -Debug containers will not be killed automatically until the pod (specifically, -the pod sandbox) is destroyed. Debug Containers will stop when their command -exits, such as exiting a shell. Unlike `kubectl exec`, processes in Debug -Containers will not receive an EOF if their connection is interrupted. - -### Container Lifecycle Changes - -Implementing debug requires no changes to the Container Runtime Interface as -it's the same operation as creating a regular container. The following changes -are necessary in the kubelet: - -1. `SyncPod()` must not kill any Debug Container even though it is not part of - the pod spec. -1. As an exception to the above, `SyncPod()` will kill Debug Containers when - the pod sandbox changes since a lone Debug Container in an abandoned sandbox - is not useful. Debug Containers are not automatically started in the new - sandbox. -1. `convertStatusToAPIStatus()` must sort Debug Containers status into - `EphemeralContainerStatuses` similar to as it does for - `InitContainerStatuses` -1. The kubelet must preserve `ContainerStatus` on debug containers for - reporting. -1. Debug Containers must be excluded from calculation of pod phase and - condition - -It's worth noting some things that do not change: - -1. `KillPod()` already operates on all running containers returned by the - runtime. -1. Containers created prior to this feature being enabled will have a - `containerType` of `""`. Since this does not match `"EPHEMERAL"` the special - handling of Debug Containers is backwards compatible. +Debug containers will not be killed automatically unless the pod is destroyed. +Debug Containers will stop when their command exits, such as exiting a shell. +Unlike `kubectl exec`, processes in Debug Containers will not receive an EOF if +their connection is interrupted. 
+
+A future improvement to Ephemeral Containers could allow killing Debug
+Containers when they're removed from `EphemeralContainers`, but it's not clear
+that we want to allow this. Removing an Ephemeral Container spec makes it
+unavailable for future authorization decisions (e.g. whether to authorize exec
+in a pod that had a privileged Ephemeral Container).
 
 ### Security Considerations
 
 Debug Containers have no additional privileges above what is available to any
-`v1.Container`. It's the equivalent of configuring an shell container in a pod
-spec but created on demand.
+`v1.Container`. It's the equivalent of configuring a shell container in a pod
+spec except that it is created on demand.
 
-Admission plugins that guard `/exec` must be updated for the new parameters. In
-particular, they should enforce the same container image policy on the `Image`
-parameter as is enforced for regular containers. During the alpha phase we will
-additionally support a container image whitelist as a kubelet flag to allow
-cluster administrators to easily constraint debug container images.
+Admission plugins must be updated to guard `/ephemeralcontainers`. They should
+apply the same container image and security policy as for regular containers.
 
 ### Additional Consideration
 
@@ -352,116 +282,33 @@ cluster administrators to easily constraint debug container images.
    troubleshooting causes a pod to exceed its resource limit it may be evicted.
 1. There's an output stream race inherent to creating then attaching a
    container which causes output generated between the start and attach to go
-   to the log rather than the client. This is not specific to Debug Containers
-   and exists because Kubernetes has no mechanism to attach a container prior
-   to starting it. This larger issue will not be addressed by Debug Containers,
-   but Debug Containers would benefit from future improvements or work arounds.
-1. We do not want to describe Debug Containers using `v1.Container`.
This is to - reinforce that Debug Containers are not general purpose containers by - limiting their configurability. Debug Containers should not be used to build - services. -1. Debug Containers are of limited usefulness without a shared PID namespace. - If a pod is configured with isolated PID namespaces, the Debug Container - will join the PID namespace of the target container. Debug Containers will - not be available with runtimes that do not implement PID namespace sharing - in some form. + to the log rather than the client. This is not specific to Ephemeral + Containers and exists because Kubernetes has no mechanism to attach a + container prior to starting it. This larger issue will not be addressed by + Ephemeral Containers, but Ephemeral Containers would benefit from future + improvements or work arounds. +1. Ephemeral Containers should not be used to build services, which we've + attempted to reflect in the API. ## Implementation Plan -### Alpha Release - -#### Goals and Non-Goals for Alpha Release +### 1.12: Initial Alpha Release -We're targeting an alpha release in Kubernetes 1.9 that includes the following +We're targeting an alpha release in Kubernetes 1.12 that includes the following basic functionality: -* Support in the kubelet for creating debug containers in a running pod -* A `kubectl debug` command to initiate a debug container -* `kubectl describe pod` will list status of debug containers running in a pod - -Functionality will be hidden behind an alpha feature flag and disabled by -default. The following are explicitly out of scope for the 1.9 alpha release: - -* Exited Debug Containers will be garbage collected as regular containers and - may disappear from the list of Debug Container Statuses. -* Security Context for the Debug Container is not configurable. It will always - be run with `CAP_SYS_PTRACE` and `CAP_SYS_ADMIN`. -* Image pull policy for the Debug Container is not configurable. It will - always be run with `PullAlways`. 
- -#### kubelet Implementation - -Debug Containers are implemented in the kubelet's generic runtime manager. -Performing this operation with a legacy (non-CRI) runtime will result in a not -implemented error. Implementation in the kubelet will be split into the -following steps: - -##### Step 1: Container Type - -The first step is to add a feature gate to ensure all changes are off by -default. This will be added in the `pkg/features` `DefaultFeatureGate`. - -The runtime manager stores metadata about containers in the runtime via labels -(e.g. docker labels). These labels are used to populate the fields of -`kubecontainer.ContainerStatus`. Since the runtime manager needs to handle Debug -Containers differently in a few situations, we must add a new piece of metadata -to distinguish Debug Containers from regular containers. +1. Approval for basic core API changes to Pod +1. Basic support in the kubelet for creating Ephemeral Containers -`startContainer()` will be updated to write a new label -`io.kubernetes.container.type` to the runtime. Existing containers will be -started with a type of `REGULAR` or `INIT`. When added in a subsequent step, -Debug Containers will start with the type `EPHEMERAL`. +Functionality out of scope for 1.12: -##### Step 2: Creation and Handling of Debug Containers +* Killing running Ephemeral Containers by removing them from the Pod Spec. +* Updating `pod.Spec.EphemeralContainers` when containers are garbage + collected. +* `kubectl` commands for creating Ephemeral Containers -This step adds methods for creating debug containers, but doesn't yet modify the -kubelet API. Since the runtime manager discards runtime (e.g. docker) labels -after populating `kubecontainer.ContainerStatus`, the label value will be stored -in a the new field `ContainerStatus.Type` so it can be used by `SyncPod()`. 
- -The kubelet gains a `RunDebugContainer()` method which accepts a `v1.Container` -and passes it on to the Runtime Manager's `RunDebugContainer()` if implemented. -Currently only the Generic Runtime Manager (i.e. the CRI) implements the -`DebugContainerRunner` interface. - -The Generic Runtime Manager's `RunDebugContainer()` calls `startContainer()` to -create the Debug Container. Additionally, `SyncPod()` is modified to skip Debug -Containers unless the sandbox is restarted. - -##### Step 3: kubelet API changes - -The kubelet exposes the new functionality in its existing `/exec/` endpoint. -`ServeExec()` constructs a `v1.Container` based on `PodExecOptions`, calls -`RunDebugContainer()`, and performs the attach. - -##### Step 4: Reporting EphemeralContainerStatus - -The last major change to the kubelet is to populate -v1.`PodStatus.EphemeralContainerStatuses` based on the -`kubecontainer.ContainerStatus` for the Debug Container. - -#### Kubernetes API Changes - -There are two changes to be made to the Kubernetes, which will be made -independently: - -1. `v1.PodExecOptions` must be extended with new fields. -1. `v1.PodStatus` gains a new field to hold Debug Container statuses. - -In all cases, new fields will be prepended with `Alpha` for the duration of this -feature's alpha status. - -#### kubectl changes - -In anticipation of this change, [#46151](https://pr.k8s.io/46151) added a -`kubectl alpha` command to contain alpha features. We will add `kubectl alpha -debug` to invoke Debug Containers. `kubectl` does not use feature gates, so -`kubectl alpha debug` will be visible by default in `kubectl` 1.9 and return an -error when used on a cluster with the feature disabled. - -`kubectl describe pod` will report the contents of `EphemeralContainerStatuses` -when not empty as it means the feature is enabled. The field will be hidden when -empty. +Functionality will be hidden behind an alpha feature flag and disabled by +default. 
## Appendices @@ -592,10 +439,10 @@ container image distribution mechanisms to fetch images when the debug command is run. **Respect admission restrictions.** Requests from kubectl are proxied through -the apiserver and so are available to existing [admission -controllers](https://kubernetes.io/docs/admin/admission-controllers/). Plugins -already exist to intercept `exec` and `attach` calls, but extending this to -support `debug` has not yet been scoped. +the apiserver and so are available to existing +[admission controllers](https://kubernetes.io/docs/admin/admission-controllers/). +Plugins already exist to intercept `exec` and `attach` calls, but extending this +to support `debug` has not yet been scoped. **Allow introspection of pod state using existing tools**. The list of `EphemeralContainerStatuses` is never truncated. If a debug container has run in @@ -629,26 +476,146 @@ active debug container. ### Appendix 3: Alternatives Considered -#### Mutable Pod Spec +#### Container Spec in PodStatus + +Originally there was a desire to keep the pod spec immutable, so we explored +modifying only the pod status. An `EphemeralContainer` would contain a Spec, a +Status and a Target: + +``` +// EphemeralContainer describes a container to attach to a running pod for troubleshooting. +type EphemeralContainer struct { + metav1.TypeMeta `json:",inline"` + + // Spec describes the Ephemeral Container to be created. + Spec *Container `json:"spec,omitempty" protobuf:"bytes,2,opt,name=spec"` + + // Most recently observed status of the container. + // This data may not be up to date. + // Populated by the system. + // Read-only. + // +optional + Status *ContainerStatus `json:"status,omitempty" protobuf:"bytes,3,opt,name=status"` + + // If set, the name of the container from PodSpec that this ephemeral container targets. + // If not set then the ephemeral container is run in whatever namespaces are shared + // for the pod. 
+	TargetContainerName string `json:"targetContainerName,omitempty" protobuf:"bytes,4,opt,name=targetContainerName"`
+}
+```
+
+Ephemeral Containers for a pod would be listed in the pod's status:
+
+```
+type PodStatus struct {
+	...
+	// List of user-initiated ephemeral containers that have been run in this pod.
+	// +optional
+	EphemeralContainers []EphemeralContainer `json:"ephemeralContainers,omitempty" protobuf:"bytes,11,rep,name=ephemeralContainers"`
+}
+```
+
+To create a new Ephemeral Container, one would append a new `EphemeralContainer`
+with the desired `v1.Container` as `Spec` in `Pod.Status` and update the `Pod`
+in the API. Users cannot normally modify the pod status, so we'd create a new
+subresource `/ephemeralcontainers` that allows an update of solely
+`EphemeralContainers` and enforces append-only semantics.
+
+Since we have a requirement to describe the Ephemeral Container with a
+`v1.Container`, this led to a "spec in status" that seemed to violate API best
+practices. It was confusing, and it required added complexity in the kubelet to
+persist and publish user intent, which is rightfully the job of the apiserver.
+
+#### Extend the Existing Exec API ("exec++")
+
+A simpler change is to extend `v1.Pod`'s `/exec` subresource to support
+"executing" container images. The current `/exec` endpoint must implement `GET`
+to support streaming for all clients. We don't want to encode a (potentially
+large) `v1.Container` into a query string, so we must extend `v1.PodExecOptions`
+with the specific fields required for creating a Debug Container:
+
+```
+// PodExecOptions is the query options to a Pod's remote exec call
+type PodExecOptions struct {
+	...
+	// EphemeralContainerName is the name of an ephemeral container in which the
+	// command ought to be run. Either both EphemeralContainerName and
+	// EphemeralContainerImage fields must be set, or neither.
+ EphemeralContainerName *string `json:"ephemeralContainerName,omitempty" ...` + + // EphemeralContainerImage is the image of an ephemeral container in which the command + // ought to be run. Either both EphemeralContainerName and EphemeralContainerImage + // fields must be set, or neither. + EphemeralContainerImage *string `json:"ephemeralContainerImage,omitempty" ...` +} +``` + +After creating the Ephemeral Container, the kubelet would upgrade the connection +to streaming and perform an attach to the container's console. If disconnected, +the Ephemeral Container could be reattached using the pod's `/attach` endpoint +with `EphemeralContainerName`. + +Ephemeral Containers could not be removed via the API and instead the process +must terminate. While not ideal, this parallels existing behavior of `kubectl +exec`. To kill an Ephemeral Container one would `attach` and exit the process +interactively or create a new Ephemeral Container to send a signal with +`kill(1)` to the original process. + +Since the user cannot specify the `v1.Container`, this approach sacrifices a +great deal of flexibility. This solution still requires the kubelet to publish a +`Container` spec in the `PodStatus` that can be examined for future admission +decisions and so retains many of the downsides of the Container Spec in +PodStatus approach. + +#### Ephemeral Container Controller + +Kubernetes prefers declarative APIs where the client declares a state for +Kubernetes to enact. We could implement this in a declarative manner by creating +a new `EphemeralContainer` type: + +``` +type EphemeralContainer struct { + metav1.TypeMeta + metav1.ObjectMeta + + Spec v1.Container + Status v1.ContainerStatus +} +``` + +A new controller in the kubelet would watch for EphemeralContainers and +create/delete debug containers. `EphemeralContainer.Status` would be updated by +the kubelet at the same time it updates `ContainerStatus` for regular and init +containers. 
Clients would create a new `EphemeralContainer` object, wait for it +to be started and then attach using the pod's attach subresource and the name of +the `EphemeralContainer`. + +A new controller is a significant amount of complexity to add to the kubelet, +especially considering that the kubelet is already watching for changes to pods. +The kubelet would have to be modified to create containers in a pod from +multiple config sources. SIG Node strongly prefers to minimize kubelet +complexity. + +#### Mutable Pod Spec Containers -Rather than adding an operation to have Kubernetes attach a pod we could instead -make the pod spec mutable so the client can generate an update adding a -container. `SyncPod()` has no issues adding the container to the pod at that -point, but an immutable pod spec has been a basic assumption in Kubernetes thus -far and changing it carries risk. It's preferable to keep the pod spec immutable -as a best practice. +Rather than adding to the pod API, we could instead make the pod spec mutable so +the client can generate an update adding a container. `SyncPod()` has no issues +adding the container to the pod at that point, but an immutable pod spec has +been a basic assumption and best practice in Kubernetes. Changing this +assumption complicates the requirements of the kubelet state machine. Since the +kubelet was not written with this in mind, we should expect such a change would +create bugs we cannot predict. -#### Ephemeral container +#### Image Exec -An earlier version of this proposal suggested running an ephemeral container in -the pod namespaces. The container would not be added to the pod spec and would -exist only as long as the process it ran. This has the advantage of behaving -similarly to the current kubectl exec, but it is opaque and likely violates -design assumptions. 
We could add constructs to track and report on both
-traditional exec process and exec containers, but this would probably be more
-work than adding to the pod spec. Both are generally useful, and neither
-precludes the other in the future, so we chose mutating the pod spec for
-expedience.
+An earlier version of this proposal suggested simply adding an `Image` parameter
+to the exec API. This would run an ephemeral container in the pod namespaces
+without adding it to the pod spec or status. This container would exist only as
+long as the process it ran. This parallels the current kubectl exec, including
+its lack of transparency. We could add constructs to track and report on both
+traditional exec processes and exec containers. In the end this failed to meet
+our transparency requirements.
 
 #### Attaching Container Type Volume
 
@@ -669,9 +636,8 @@ this simplifies the solution by working within the existing constraints of
 
 If Kubernetes supported the concept of an "inactive" container, we could
 configure it as part of a pod and activate it at debug time. In order to avoid
 coupling the debug tool versions with those of the running containers, we would
-need to ensure the debug image was pulled at debug time. The container could
-then be run with a TTY and attached using kubectl. We would need to figure out a
-solution that allows access the filesystem of other containers.
+want to ensure the debug image was pulled at debug time. The container could
+then be run with a TTY and attached using kubectl.
 
 The downside of this approach is that it requires prior configuration. In
 addition to requiring prior consideration, it would increase boilerplate config.
@@ -681,14 +647,14 @@ than a feature of the platform.
#### Implicit Empty Volume
 
 Kubernetes could implicitly create an EmptyDir volume for every pod which would
-then be available as target for either the kubelet or a sidecar to extract a
+then be available as a target for either the kubelet or a sidecar to extract a
 package of binaries.
 
 Users would have to be responsible for hosting a package build and distribution
 infrastructure or rely on a public one. The complexity of this solution makes it
 undesirable.
 
-#### Standalone Pod in Shared Namespace
+#### Standalone Pod in Shared Namespace ("Debug Pod")
 
 Rather than inserting a new container into a pod namespace, Kubernetes could
 instead support creating a new pod with container namespaces shared with
@@ -698,21 +664,21 @@ useful, the containers in this "Debug Pod" should be run inside the namespaces
 (network, pid, etc) of the target pod but remain in a separate resource group
 (e.g. cgroup for container-based runtimes).
 
-This would be a rather fundamental change to pod, which is currently treated as
-an atomic unit. The Container Runtime Interface has no provisions for sharing
+This would be a rather large change for pod, which is currently treated as an
+atomic unit. The Container Runtime Interface has no provisions for sharing
 outside of a pod sandbox and would need a refactor. This could be a complicated
 change for non-container runtimes (e.g. hypervisor runtimes) which have more
 rigid boundaries between pods.
 
-Effectively, Debug Pod must be implemented by the runtimes while Debug
-Containers are implemented by the kubelet. Minimizing change to the Kubernetes
-API is not worth the increased complexity for the kubelet and runtimes.
+This pushes the complexity of the solution from the kubelet to the runtimes.
+Minimizing change to the Kubernetes API is not worth the increased complexity
+for the kubelet and runtimes.
It could also be possible to implement a Debug Pod as a privileged pod that runs in the host namespace and interacts with the runtime directly to run a new container in the appropriate namespace. This solution would be runtime-specific -and effectively pushes the complexity of debugging to the user. Additionally, -requiring node-level access to debug a pod does not meet our requirements. +and pushes the complexity of debugging to the user. Additionally, requiring +node-level access to debug a pod does not meet our requirements. #### Exec from Node @@ -729,8 +695,7 @@ coupling it with container images. * [Pod Troubleshooting Tracking Issue](https://issues.k8s.io/27140) * [CRI Tracking Issue](https://issues.k8s.io/28789) * [CRI: expose optional runtime features](https://issues.k8s.io/32803) -* [Resource QoS in - Kubernetes](resource-qos.md) +* [Resource QoS in Kubernetes](resource-qos.md) * Related Features * [#1615](https://issues.k8s.io/1615) - Shared PID Namespace across containers in a pod