
Cannot use userns=keep-id together with nvidia-container-toolkit on rootless podman #35

Closed
Clockwork-Muse opened this issue Sep 20, 2022 · 5 comments


@Clockwork-Muse

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

Attempting to keep the external user id with a rootless container while using the nvidia-container-toolkit fails.

Steps to reproduce the issue:

  1. Install the nvidia-container-toolkit

  2. Run podman run --rm --userns keep-id docker.io/nvidia/cudagl:11.4.2-runtime-ubuntu20.04 nvidia-smi

Describe the results you received:

Error: OCI runtime error: error executing hook /usr/bin/nvidia-container-toolkit (exit code: 1)

Describe the results you expected:
The normal output of nvidia-smi

Additional information you deem important (e.g. issue happens only occasionally):

  • The use of --security-opt=label=disable (as in the nvidia install documentation) does not appear to make a difference.
  • Running the same command without --userns keep-id succeeds (generates expected output).
  • Deleting the hook and running id -u shows the remapped uid.

The real deployment situation is a devcontainer containing graphical tools that need to access the X11 port, so mapping the uid to the host user is a requirement.
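As a rough sketch, the intended invocation looks something like the following (the X11 socket mount and DISPLAY variable are illustrative assumptions about the devcontainer setup, not the exact configuration):

podman run --rm -it \
  --userns keep-id \
  --security-opt label=disable \
  -e DISPLAY \
  -v /tmp/.X11-unix:/tmp/.X11-unix \
  docker.io/nvidia/cudagl:11.4.2-runtime-ubuntu20.04 \
  nvidia-smi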

Output of podman version:

Version:      3.4.4
API Version:  3.4.4
Go Version:   go1.17.3
Built:        Wed Dec 31 16:00:00 1969
OS/Arch:      linux/amd64

Output of podman info:

host:
  arch: amd64
  buildahVersion: 1.23.1
  cgroupControllers:
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: 'conmon: /usr/bin/conmon'
    path: /usr/bin/conmon
    version: 'conmon version 2.0.25, commit: unknown'
  cpus: 12
  distribution:
    codename: jammy
    distribution: ubuntu
    version: "22.04"
  eventLogger: journald
  hostname: bumblebee
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1001
      size: 1
    - container_id: 1
      host_id: 165536
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1001
      size: 1
    - container_id: 1
      host_id: 165536
      size: 65536
  kernel: 5.15.0-47-generic
  linkmode: dynamic
  logDriver: journald
  memFree: 9103364096
  memTotal: 33545084928
  ociRuntime:
    name: crun
    package: 'crun: /usr/bin/crun'
    path: /usr/bin/crun
    version: |-
      crun version 0.17
      commit: 0e9229ae34caaebcb86f1fde18de3acaf18c6d9a
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  os: linux
  remoteSocket:
    path: /run/user/1001/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: 'slirp4netns: /usr/bin/slirp4netns'
    version: |-
      slirp4netns version 1.0.1
      commit: 6a7b16babc95b6a3056b33fb45b74a6f62262dd4
      libslirp: 4.6.1
  swapFree: 1023406080
  swapTotal: 1023406080
  uptime: 4h 16m 33.83s (Approximately 0.17 days)
plugins:
  log:
  - k8s-file
  - none
  - journald
  network:
  - bridge
  - macvlan
  volume:
  - local
registries: {}
store:
  configFile: /home/simhoff/.config/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/simhoff/.local/share/containers/storage
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 2
  runRoot: /run/user/1001/containers
  volumePath: /home/simhoff/.local/share/containers/storage/volumes
version:
  APIVersion: 3.4.4
  Built: 0
  BuiltTime: Wed Dec 31 16:00:00 1969
  GitCommit: ""
  GoVersion: go1.17.3
  OsArch: linux/amd64
  Version: 3.4.4

Package info (e.g. output of rpm -q podman or apt list podman):

podman/jammy,now 3.4.4+ds1-1ubuntu1 amd64 [installed]

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/main/troubleshooting.md)

  • Have you tested with the latest version of Podman: No - latest in Ubuntu repo
  • Have you checked the Podman Troubleshooting Guide: Yes.

Additional environment details (AWS, VirtualBox, physical, etc.):
N/A

Does the hook leave any logs in the journal indicating why it failed?

Sep 20 07:41:16 bumblebee podman[201489]: 
Sep 20 07:41:16 bumblebee podman[201489]: 2022-09-20 07:41:16.819559074 -0700 PDT m=+0.069589907 container create e50788bf60ccf1b5cfa3bac3b3ac2fdb7f023c7cefc166f50a5f17489fcd7a89 (image=docker.io/nvidia/cudagl:11.4.2-runtime-ubuntu20.04, name=silly_mestorf, maintainer=NVIDIA CORPORATION <cudatools@nvidia.com>)
Sep 20 07:41:16 bumblebee kernel: overlayfs: fs on '/home/simhoff/.local/share/containers/storage/overlay/l/32ZW2IWOGZA5N5WV53SBPQUOYO' does not support file handles, falling back to xino=off.
Sep 20 07:41:16 bumblebee systemd[2119]: Started libpod-conmon-e50788bf60ccf1b5cfa3bac3b3ac2fdb7f023c7cefc166f50a5f17489fcd7a89.scope.
Sep 20 07:41:16 bumblebee systemd[2119]: Started libcrun container.
Sep 20 07:41:16 bumblebee podman[201489]: 2022-09-20 07:41:16.773977313 -0700 PDT m=+0.024008635 image pull  docker.io/nvidia/cudagl:11.4.2-runtime-ubuntu20.04
Sep 20 07:41:16 bumblebee podman[201489]: 2022-09-20 07:41:16.884806613 -0700 PDT m=+0.134837446 container remove e50788bf60ccf1b5cfa3bac3b3ac2fdb7f023c7cefc166f50a5f17489fcd7a89 (image=docker.io/nvidia/cudagl:11.4.2-runtime-ubuntu20.04, name=silly_mestorf, maintainer=NVIDIA CORPORATION <cudatools@nvidia.com>)

I got nuffin'.
Debug logging is turned on in /etc/nvidia-container-runtime/config.toml (to a directory/file I have permission to write to), but it doesn't generate a file at all.

Moving from containers/podman#15863

@elezar
Member

elezar commented Oct 10, 2022

@Clockwork-Muse in order to better leverage features such as userns=keep-id in low-level runtimes (runc, crun) without requiring that these be reimplemented in our NVIDIA Container Library, we are looking at moving to CDI as the recommended mechanism for injecting NVIDIA devices for podman.

We have just released v1.12.0-rc.1 of the NVIDIA Container Toolkit to our experimental repositories. It includes tooling to generate a CDI specification for use with CDI-enabled container engines (containerd, cri-o) or CLI clients like podman. Once this version of the nvidia-container-toolkit is installed, the following command will output a CDI specification for all available NVIDIA devices to STDOUT:

nvidia-ctk info generate-cdi
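For example, the output can be redirected into the default CDI location (a sketch; writing under /etc/cdi typically requires root privileges):

sudo mkdir -p /etc/cdi
nvidia-ctk info generate-cdi | sudo tee /etc/cdi/nvidia.yaml > /dev/null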

If the generated spec is copied to /etc/cdi/nvidia.yaml (or with json output to /etc/cdi/nvidia.json) and a podman version that supports CDI (at least v4.1.0) is used, a container can be started with GPU support as follows:

podman run --rm -ti --device=nvidia.com/gpu=gpu0 ubuntu nvidia-smi -L

Note that the generated spec also contains a definition for the device nvidia.com/gpu=all, which includes all available devices.
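For the userns=keep-id case from this issue, the expectation is that something like the following would then work (a sketch, assuming the generated spec has been installed as described above):

podman run --rm --userns keep-id --device nvidia.com/gpu=all docker.io/nvidia/cudagl:11.4.2-runtime-ubuntu20.04 nvidia-smi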

Any feedback or comments on the new functionality will be appreciated.

@Clockwork-Muse
Author

I haven't installed the latest RC yet, but manually creating what I believe should be the relevant CDI spec file still throws the same error:

Error: OCI runtime error: crun: error executing hook /usr/bin/nvidia-container-toolkit (exit code: 1)

The manually created file is this (taken from the podman PR enabling CDI support):

{
  "cdiVersion": "0.2.0",
  "kind": "nvidia.com/gpu",
  "devices": [
    {
      "name": "gpu0",
      "containerEdits": {
        "env": [
          "NVIDIA_VISIBLE_DEVICES=0"
        ]
      }
    }
  ],
  "containerEdits": {
    "hooks": [
      {
        "hookName": "prestart",
        "path": "/usr/bin/nvidia-container-toolkit",
        "args": [
          "nvidia-container-toolkit",
          "prestart"
        ]
      }
    ]
  }
}

@elezar
Member

elezar commented Oct 10, 2022

I would not use that spec, as it still relies on the NVIDIA Container CLI to modify the container namespace. Please see https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/issues/8 and NVIDIA/nvidia-container-runtime#85 (comment) for a more up-to-date spec.

@Clockwork-Muse
Author

Yes!
That (or a modified version) seems to get things working.

@elezar
Member

elezar commented Oct 10, 2022

Yes! That (or a modified version) seems to get things working.

Great! The nvidia-ctk info generate-cdi command should generate the specification for your installation (i.e. for your driver version and available devices).
