
containerd registry certificates configured on regular EKS node AMI, but not GPU AMI #1154

Closed
Secretions opened this issue Jan 25, 2023 · 2 comments · Fixed by #1165
Labels
bug Something isn't working

Comments

@Secretions

What happened:

  • Started two EKS 1.24 nodes:
    • one based on the standard EKS node AMI
    • the other based on the EKS GPU node AMI
  • Got a terminal on each node
  • Checked the active config in /etc/containerd/config.toml
  • A CRI registry plugin block is present on the standard EKS node AMI
  • No such block is present on the EKS GPU node AMI
  • A DaemonSet propagates the CA cert into /etc/docker/certs.d identically on both nodes
  • Attempts by pods to pull images from a private registry backed by the configured CA cert...
    • Worked on the standard EKS node
    • Failed on the GPU node due to certificate validation failures

What you expected to happen:

  • The CRI registry plugin should be configured on the GPU node with the same paths as the standard EKS node
  • With certs propagated identically across nodes, both nodes should successfully pull from the same registry

How to reproduce it (as minimally and precisely as possible):

  • Start a GPU node
  • Get a terminal on the node
  • Check /etc/containerd/config.toml
  • Notice the lack of a CRI registry plugin block (a quick check is sketched below)
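A quick way to check (a sketch; assumes shell access to the node via SSM or SSH):

sudo grep -A1 'registry]' /etc/containerd/config.toml \
  || echo "no CRI registry config block present"

On the standard AMI this prints the registry block and its config_path line; on the GPU AMI it falls through to the echo.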

Anything else we need to know?:

A previous PR added containerd configuration to support registry certificates in both the classic Docker path and the containerd one: #1049

This is the config block that was added:

[plugins."io.containerd.grpc.v1.cri".registry]
config_path = "/etc/containerd/certs.d:/etc/docker/certs.d"

This configuration is present on the standard EKS node AMI, but not in any form on the GPU node AMI. My understanding is that the GPU AMI doesn't use the containerd configs in this repo, and that some Amazon-internal repo is the source of truth for the GPU version of the containerd config. If that's correct, that internal repo also needs a corresponding change to reach feature parity with the standard EKS AMI.
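For context on how that setting is consumed (my understanding of the containerd CRI plugin, not something verified here): each colon-separated directory in config_path is searched for a subdirectory named after the registry host, which can hold a Docker-style ca.crt or a hosts.toml. The registry hostname below is only a placeholder:

/etc/docker/certs.d/
└── my-private-registry.example.com/
    └── ca.crt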

Environment:

  • AWS Region: us-west-2
  • Instance Type(s): g4dn.xlarge
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.3
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.24
  • AMI Version: ami-037335e05dd722817
  • Kernel (e.g. uname -a): Linux ip-10-0-81-145.us-west-2.compute.internal 5.4.226-129.415.amzn2.x86_64 #1 SMP Fri Dec 9 12:54:21 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
  • Release information (run cat /etc/eks/release on a node):
[ec2-user@ip-10-0-81-145 ~]$ cat /etc/eks/release
BASE_AMI_ID="ami-0a5eade221148fcb0"
BUILD_TIME="Thu Jan  5 02:19:24 UTC 2023"
BUILD_KERNEL="5.4.226-129.415.amzn2.x86_64"
ARCH="x86_64"
@cartermckinnon
Member

Thanks for bringing this to our attention. I've addressed the issue and will update here when an AMI with the fix is publicly available.

@Secretions
Author

@cartermckinnon Can you reopen this? There is an issue with the new GPU AMI that still prevents this from working.

The cert config block was added, but the section name is exactly the one used on the standard compute AMI:

[plugins."io.containerd.grpc.v1.cri".registry]

However, on the GPU AMI it seems that it needs to be as follows:

[plugins.cri.registry]

Similar to the sandbox entry:

# Standard EKS AMI
[plugins."io.containerd.grpc.v1.cri"]
sandbox_image = "602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.5"

# GPU AMI
[plugins.cri]
sandbox_image = "602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.5"
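
As far as I can tell, the difference is the containerd config schema version: the standard AMI ships a version 2 config, where plugin sections use fully-qualified IDs, while the GPU AMI's config has no version field and is therefore parsed with the older version 1 schema and its short plugin names. Roughly (config_path value taken from #1049):

# version 2 schema (standard EKS AMI)
version = 2
[plugins."io.containerd.grpc.v1.cri".registry]
config_path = "/etc/containerd/certs.d:/etc/docker/certs.d"

# version 1 schema (GPU AMI, no version field)
[plugins.cri.registry]
config_path = "/etc/containerd/certs.d:/etc/docker/certs.d"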

I've tested this by modifying the config as follows and running systemctl restart containerd, and this modified config worked for me:

root = "/var/lib/containerd"
state = "/run/containerd"

[grpc]
address = "/run/containerd/containerd.sock"

[plugins.cri]
sandbox_image = "602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.5"

# [plugins."io.containerd.grpc.v1.cri".registry]
[plugins.cri.registry]
config_path = "/etc/containerd/certs.d:/etc/docker/certs.d"

[plugins.cri.containerd.default_runtime]
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runtime.v1.linux"

[plugins.cri.containerd.default_runtime.options]
Runtime = "/etc/docker-runtimes.d/nvidia"

[plugins.cri.containerd.runtimes.nvidia]
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runtime.v1.linux"

[plugins.cri.containerd.runtimes.nvidia.options]
Runtime = "/etc/docker-runtimes.d/nvidia"
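
If anyone wants to repeat the verification, something along these lines should do it (the image name is a placeholder, and it assumes crictl is available on the node):

sudo systemctl restart containerd
sudo crictl pull my-private-registry.example.com/team/app:latest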

Thanks for hopping on this so quickly earlier, btw.
