
containerd registry certificates configured on regular EKS node AMI, but not GPU AMI #1154

Closed
Secretions opened this issue Jan 25, 2023 · 2 comments · Fixed by #1165
Labels
bug Something isn't working

Comments

@Secretions

What happened:

  • Started two EKS 1.24 nodes:
    • one based on the standard EKS node AMI
    • the other based on the EKS GPU node AMI
  • Got a terminal on each node
  • Checked the active config in /etc/containerd/config.toml
  • A CRI registry plugin block is present on the standard EKS node AMI
  • No such block is present on the EKS GPU node AMI
  • A DaemonSet propagates the CA cert into /etc/docker/certs.d identically on both nodes
  • Attempts by pods to pull images from a private registry backed by the configured CA cert...
    • Worked on the standard EKS node
    • Failed on the GPU node due to certificate validation failures

What you expected to happen:

  • The CRI registry plugin should be configured on the GPU node with the same paths as the standard EKS node
  • With certs propagated identically across nodes, both nodes should successfully pull from the same registry

How to reproduce it (as minimally and precisely as possible):

  • Start a GPU node
  • Get a terminal on the node
  • Check /etc/containerd/config.toml
  • Notice the lack of a CRI registry plugin block (a quick check is sketched below)
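A quick way to check (a sketch; assumes shell access to the node via SSM or SSH):

sudo grep -A1 'registry]' /etc/containerd/config.toml \
  || echo "no CRI registry config block present"

On the standard AMI this prints the registry block and its config_path line; on the GPU AMI it falls through to the echo.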

Anything else we need to know?:

A previous PR added containerd configuration to support registry certificates in both the classic Docker path and the containerd one: #1049

This is the config block that was added:

[plugins."io.containerd.grpc.v1.cri".registry]
config_path = "/etc/containerd/certs.d:/etc/docker/certs.d"

This configuration is present on the standard EKS node AMI, but not in any form on the GPU node AMI. My understanding is that the GPU AMI doesn't use the containerd configs in this repo, and that some Amazon-internal repo is the source of truth for the GPU version of the containerd config. If that's correct, that internal repo also needs a corresponding change to reach feature parity with the standard EKS AMI.
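For context on how that setting is consumed (my understanding of the containerd CRI plugin, not something verified here): each colon-separated directory in config_path is searched for a subdirectory named after the registry host, which can hold a Docker-style ca.crt or a hosts.toml. The registry hostname below is only a placeholder:

/etc/docker/certs.d/
└── my-private-registry.example.com/
    └── ca.crt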

Environment:

  • AWS Region: us-west-2
  • Instance Type(s): g4dn.xlarge
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.3
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.24
  • AMI Version: ami-037335e05dd722817
  • Kernel (e.g. uname -a): Linux ip-10-0-81-145.us-west-2.compute.internal 5.4.226-129.415.amzn2.x86_64 #1 SMP Fri Dec 9 12:54:21 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
  • Release information (run cat /etc/eks/release on a node):
[ec2-user@ip-10-0-81-145 ~]$ cat /etc/eks/release
BASE_AMI_ID="ami-0a5eade221148fcb0"
BUILD_TIME="Thu Jan  5 02:19:24 UTC 2023"
BUILD_KERNEL="5.4.226-129.415.amzn2.x86_64"
ARCH="x86_64"
@cartermckinnon
Member

Thanks for bringing this to our attention. I've addressed the issue and will update here when an AMI with the fix is publicly available.

@Secretions
Author

@cartermckinnon Can you reopen this? There is an issue with the new GPU AMI that still prevents this from working.

The cert config block was added, but the section name is exactly the one used on the standard compute AMI:

[plugins."io.containerd.grpc.v1.cri".registry]

However, on the GPU AMI it seems that it needs to be as follows:

[plugins.cri.registry]

Similar to the sandbox entry:

# Standard EKS AMI
[plugins."io.containerd.grpc.v1.cri"]
sandbox_image = "602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.5"

# GPU AMI
[plugins.cri]
sandbox_image = "602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.5"
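
As far as I can tell, the difference is the containerd config schema version: the standard AMI ships a version 2 config, where plugin sections use fully-qualified IDs, while the GPU AMI's config has no version field and is therefore parsed with the older version 1 schema and its short plugin names. Roughly (config_path value taken from #1049):

# version 2 schema (standard EKS AMI)
version = 2
[plugins."io.containerd.grpc.v1.cri".registry]
config_path = "/etc/containerd/certs.d:/etc/docker/certs.d"

# version 1 schema (GPU AMI, no version field)
[plugins.cri.registry]
config_path = "/etc/containerd/certs.d:/etc/docker/certs.d"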

I've tested this by modifying the config as follows and running systemctl restart containerd, and this modified config worked for me:

root = "/var/lib/containerd"
state = "/run/containerd"

[grpc]
address = "/run/containerd/containerd.sock"

[plugins.cri]
sandbox_image = "602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.5"

# [plugins."io.containerd.grpc.v1.cri".registry]
[plugins.cri.registry]
config_path = "/etc/containerd/certs.d:/etc/docker/certs.d"

[plugins.cri.containerd.default_runtime]
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runtime.v1.linux"

[plugins.cri.containerd.default_runtime.options]
Runtime = "/etc/docker-runtimes.d/nvidia"

[plugins.cri.containerd.runtimes.nvidia]
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runtime.v1.linux"

[plugins.cri.containerd.runtimes.nvidia.options]
Runtime = "/etc/docker-runtimes.d/nvidia"
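
If anyone wants to repeat the verification, something along these lines should do it (the image name is a placeholder, and it assumes crictl is available on the node):

sudo systemctl restart containerd
sudo crictl pull my-private-registry.example.com/team/app:latest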

Thanks for hopping on this so quickly earlier, btw.
