sysbox k8s directory mounted as nobody #800

Open
raphaelfff opened this issue May 6, 2024 · 9 comments

Comments

@raphaelfff

raphaelfff commented May 6, 2024

Here is the situation: we are running Sysbox in GKE (to run Coder), and we have a mount for Docker backed by a PVC. Sometimes when a pod restarts, /var/lib/docker ends up being owned by nobody:nogroup in the pod:

root@coder:/# ls -lah /var/lib
drwx--x--- 12 nobody nogroup 4.0K May  6 12:30 docker

Restarting the pod a bunch of times ends up fixing the issue, but I'm not able to figure out why or how.
I suspect that this issue happens when the pod gets scheduled on a different node?

This is quite disruptive, as the only way out is to delete that pod and make a new one, losing the PVC and the data associated with it...

pod.yaml
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      coder.workspace_id: e832bafe-2d57-4d56-8e53-a807a86d0869
  strategy:
    type: Recreate
  template:
    metadata:
      annotations:
        io.kubernetes.cri-o.userns-mode: auto:size=65536
      creationTimestamp: null
      labels:
        coder.workspace_id: e832bafe-2d57-4d56-8e53-a807a86d0869
    spec:
      automountServiceAccountToken: true
      containers:
      - command:
        - sh
        - -c
        - "            set -e\n\n            W_USER=MYUSER\n\n            # Add a
          user so that you're not developing as the `root` user\n            useradd
          $W_USER \\\n              --create-home \\\n              --shell=/bin/bash
          \\\n              --groups=docker \\\n              --uid=1000 \\\n              --user-group\n
          \           echo \"$W_USER ALL=(ALL) NOPASSWD:ALL\" >>/etc/sudoers.d/nopasswd\n\n
          \           # Start the Coder agent as the user once systemd has started
          up\n            # /!\\ The space before EOT must match the current indenting
          of the terminating one!\n            sudo -u $W_USER --preserve-env=CODER_AGENT_TOKEN
          /bin/bash -- <<-'            EOT' &\n            while [[ ! $(systemctl
          is-system-running) =~ ^(running|degraded) ]]\n            do\n              echo
          \"Waiting for system to start... $(systemctl is-system-running)\"\n              sleep
          2\n            done\n            #!/usr/bin/env sh\nset -eux\n# Sleep for
          a good long while before exiting.\n# This is to allow folks to exec into
          a failed workspace and poke around to\n# troubleshoot.\nwaitonexit() {\n\techo
          \"=== Agent script exited with non-zero code. Sleeping 24h to preserve logs...\"\n\tsleep
          86400\n}\ntrap waitonexit EXIT\nBINARY_DIR=\"${BINARY_DIR:-$(mktemp -d -t
          coder.XXXXXX)}\"\nBINARY_NAME=coder\nBINARY_URL=https://coder.company.com/bin/coder-linux-amd64\ncd
          \"$BINARY_DIR\"\n# Attempt to download the coder agent.\n# This could fail
          for a number of reasons, many of which are likely transient.\n# So just
          keep trying!\nwhile :; do\n\t# Try a number of different download tools,
          as we don not know what we\n\t# will have available.\n\tstatus=\"\"\n\tif
          command -v curl >/dev/null 2>&1; then\n\t\tcurl -fsSL --compressed \"${BINARY_URL}\"
          -o \"${BINARY_NAME}\" && break\n\t\tstatus=$?\n\telif command -v wget >/dev/null
          2>&1; then\n\t\twget -q \"${BINARY_URL}\" -O \"${BINARY_NAME}\" && break\n\t\tstatus=$?\n\telif
          command -v busybox >/dev/null 2>&1; then\n\t\tbusybox wget -q \"${BINARY_URL}\"
          -O \"${BINARY_NAME}\" && break\n\t\tstatus=$?\n\telse\n\t\techo \"error:
          no download tool found, please install curl, wget or busybox wget\"\n\t\texit
          127\n\tfi\n\techo \"error: failed to download coder agent\"\n\techo \"       command
          returned: ${status}\"\n\techo \"Trying again in 30 seconds...\"\n\tsleep
          30\ndone\n\nif ! chmod +x $BINARY_NAME; then\n\techo \"Failed to make $BINARY_NAME
          executable\"\n\texit 1\nfi\n\nhaslibcap2() {\n\tcommand -v setcap /dev/null
          2>&1\n\tcommand -v capsh /dev/null 2>&1\n}\nprintnetadminmissing() {\n\techo
          \"The root user does not have CAP_NET_ADMIN permission. \" + \\\n\t\t\"If
          running in Docker, add the capability to the container for \" + \\\n\t\t\"improved
          network performance.\"\n\techo \"This has security implications. See https://man7.org/linux/man-pages/man7/capabilities.7.html\"\n}\n\n#
          Attempt to add CAP_NET_ADMIN to the agent binary. This allows us to increase\n#
          network buffers which improves network transfer speeds.\nif [ -n \"${USE_CAP_NET_ADMIN:-}\"
          ]; then\n\t# If running as root, we do not need to do anything.\n\tif [
          \"$(id -u)\" -eq 0 ]; then\n\t\techo \"Running as root, skipping setcap\"\n\t\t#
          Warn the user if root does not have CAP_NET_ADMIN.\n\t\tif ! capsh --has-p=CAP_NET_ADMIN;
          then\n\t\t\tprintnetadminmissing\n\t\tfi\n\n\t# If not running as root,
          make sure we have sudo perms and the \"setcap\" +\n\t# \"capsh\" binaries
          exist.\n\telif sudo -nl && haslibcap2; then\n\t\t# Make sure the root user
          has CAP_NET_ADMIN.\n\t\tif sudo -n capsh --has-p=CAP_NET_ADMIN; then\n\t\t\tsudo
          -n setcap CAP_NET_ADMIN=+ep ./$BINARY_NAME || true\n\t\telse\n\t\t\tprintnetadminmissing\n\t\tfi\n\n\t#
          If we are not running as root, cant sudo, and \"setcap\" does not exist,
          we\n\t# cannot do anything.\n\telse\n\t\techo \"Unable to setcap agent binary.
          To enable improved network performance, \" + \\\n\t\t\t\"give the agent
          passwordless sudo permissions and the \\\"setcap\\\" + \\\"capsh\\\" binaries.\"\n\t\techo
          \"This has security implications. See https://man7.org/linux/man-pages/man7/capabilities.7.html\"\n\tfi\nfi\n\nexport
          CODER_AGENT_AUTH=\"token\"\nexport CODER_AGENT_URL=\"https://coder.company.com/\"\nexec
          ./$BINARY_NAME agent\n\n            EOT\n\n            exec /sbin/init\n"
        env:
        - name: CODER_AGENT_TOKEN
          value: XXXXX
        - name: SYSBOX_ALLOW_TRUSTED_XATTR
          value: "FALSE"
        image: us.gcr.io/XXX/docker-image-systemd
        imagePullPolicy: IfNotPresent
        name: coder-MYUSER-0
        resources:
          limits:
            cpu: "1"
            memory: 4Gi
          requests:
            cpu: "1"
            memory: 4Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /ws
          mountPropagation: None
          name: data
          subPath: workspaces
        - mountPath: /home
          mountPropagation: None
          name: data
          subPath: home
        - mountPath: /var/lib/docker
          mountPropagation: None
          name: data
          subPath: var/lib/docker
      dnsPolicy: ClusterFirst
      enableServiceLinks: true
      hostname: coder-MYUSER-0
      restartPolicy: Always
      runtimeClassName: sysbox-runc
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 1000
        fsGroupChangePolicy: OnRootMismatch
        runAsNonRoot: false
        runAsUser: 0
      shareProcessNamespace: false
      terminationGracePeriodSeconds: 30
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: coder-e832bafe-2d57-4d56-8e53-a807a86d0869
@ctalledo
Member

ctalledo commented May 8, 2024

Hi @raphaelfff,

Happy to help, though you should also reach out to Coder.

we have a mount for Docker backed by a PVC. Sometimes when a pod restarts, /var/lib/docker ends up being owned by nobody:nogroup in the pod

What type of PVC is it?

Also, how does findmnt look inside the pod when things work and when they don't?

I ask because the PVC is bind-mounted into the Sysbox pod, and Sysbox uses "ID-mapped-mounts" or "shiftfs" (see here) on top of that bind-mount in order for the files to show up with proper ownership inside the rootless Sysbox container. If files show up as nobody:nogroup, it means the ID-mapping or shiftfs mounts are not taking effect.
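As a rough illustration of the check described above, the sketch below classifies a mount from the FSTYPE and OPTIONS columns that findmnt prints. The function name `mount_shift_mode` is hypothetical (it is not part of Sysbox), and the sketch assumes that an ID-mapped mount shows the `idmapped` option and a shiftfs mount shows `shiftfs` as its filesystem type, as in typical findmnt output.

```shell
# Classify how a mount gets its in-container ownership, based on the
# FSTYPE and OPTIONS columns that findmnt prints.
# Hypothetical helper for illustration; not part of Sysbox.
mount_shift_mode() {
    fstype=$1
    options=$2
    if [ "$fstype" = "shiftfs" ]; then
        echo "shiftfs"
        return
    fi
    # Wrap the option list in commas so "idmapped" only matches a whole option.
    case ",$options," in
        *,idmapped,*) echo "idmapped" ;;
        *)            echo "none" ;;
    esac
}

# Usage inside the pod (not executed here):
#   set -- $(findmnt -n -o FSTYPE,OPTIONS /var/lib/docker)
#   mount_shift_mode "$1" "$2"
```

A result of `none` for a bind-mounted volume would suggest that neither mechanism is in effect for that path.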

Restarting the pod a bunch of times ends up fixing the issue, but I'm not able to figure out why or how.

Interesting ... not sure what's going on. But if you can pin it to specific K8s nodes, that's a good clue.

@raphaelfff
Author

Happy to help, though you should also reach out to Coder.

I think Coder is out of the picture here; it's a pure Sysbox problem imo, Coder was just general context.
It's something to do with Sysbox ID mapping and the owner Docker sets on its files (something like that).

What type of PVC is it?

It's a GKE PD

Also, how does findmnt look inside the pod when things work and when they don't?

At the moment I don't have a broken env at hand... I will update when I have one; this issue was about starting the investigation... Do you have any commands you would recommend running?

@raphaelfff
Author

Okay, another workspace broke:

$ findmnt
TARGET                             SOURCE                                                       FSTYPE   OPTIONS
/                                  overlay                                                      overlay  rw,relatime,lowerdir=/var/lib/containers/storage/overlay/l/4G5KAIEXKOIRDK2Q2IM7PGZ3QX:/var/lib/containers/st
├─/run                             tmpfs                                                        tmpfs    rw,nosuid,nodev,size=13168052k,nr_inodes=819200,mode=755,uid=165536,gid=165536,inode64
│ └─/run/lock                      tmpfs                                                        tmpfs    rw,nosuid,nodev,noexec,relatime,size=5120k,uid=165536,gid=165536,inode64
├─/sys                             sysfs                                                        sysfs    rw,nosuid,nodev,noexec,relatime
│ ├─/sys/firmware                  tmpfs                                                        tmpfs    ro,relatime,uid=165536,gid=165536,inode64
│ ├─/sys/fs/cgroup                 cgroup                                                       cgroup2  rw,nosuid,nodev,noexec,relatime
│ ├─/sys/devices/virtual           sysboxfs[/sys/devices/virtual]                               fuse     rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other
│ ├─/sys/kernel                    sysboxfs[/sys/kernel]                                        fuse     rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other
│ └─/sys/module/nf_conntrack/parameters
│                                  sysboxfs[/sys/module/nf_conntrack/parameters]                fuse     rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other
├─/proc                            proc                                                         proc     rw,nosuid,nodev,noexec,relatime
│ ├─/proc/bus                      proc[/bus]                                                   proc     ro,nosuid,nodev,noexec,relatime
│ ├─/proc/fs                       proc[/fs]                                                    proc     ro,nosuid,nodev,noexec,relatime
│ ├─/proc/irq                      proc[/irq]                                                   proc     ro,nosuid,nodev,noexec,relatime
│ ├─/proc/sysrq-trigger            proc[/sysrq-trigger]                                         proc     ro,nosuid,nodev,noexec,relatime
│ ├─/proc/acpi                     tmpfs                                                        tmpfs    ro,relatime,uid=165536,gid=165536,inode64
│ ├─/proc/keys                     devtmpfs[/null]                                              devtmpfs rw,relatime,size=32915868k,nr_inodes=8228967,mode=755,inode64
│ ├─/proc/timer_list               devtmpfs[/null]                                              devtmpfs rw,relatime,size=32915868k,nr_inodes=8228967,mode=755,inode64
│ ├─/proc/scsi                     tmpfs                                                        tmpfs    ro,relatime,uid=165536,gid=165536,inode64
│ ├─/proc/swaps                    sysboxfs[/proc/swaps]                                        fuse     rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other
│ ├─/proc/sys                      sysboxfs[/proc/sys]                                          fuse     rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other
│ └─/proc/uptime                   sysboxfs[/proc/uptime]                                       fuse     rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other
├─/dev                             tmpfs                                                        tmpfs    rw,nosuid,size=65536k,mode=755,uid=165536,gid=165536,inode64
│ ├─/dev/mqueue                    mqueue                                                       mqueue   rw,nosuid,nodev,noexec,relatime
│ ├─/dev/pts                       devpts                                                       devpts   rw,nosuid,noexec,relatime,gid=165541,mode=620,ptmxmode=666
│ ├─/dev/null                      devtmpfs[/null]                                              devtmpfs rw,relatime,size=32915868k,nr_inodes=8228967,mode=755,inode64
│ ├─/dev/random                    devtmpfs[/random]                                            devtmpfs rw,relatime,size=32915868k,nr_inodes=8228967,mode=755,inode64
│ ├─/dev/kmsg                      devtmpfs[/null]                                              devtmpfs rw,relatime,size=32915868k,nr_inodes=8228967,mode=755,inode64
│ ├─/dev/shm                       shm                                                          tmpfs    rw,nosuid,nodev,noexec,relatime,size=65536k,inode64
│ ├─/dev/termination-log           /dev/root[/var/lib/kubelet/pods/74b5abbb-e331-4002-8640-2018979ba168/containers/coder/eb185c4a]
│ │                                                                                             ext4     rw,relatime,idmapped,discard,errors=remount-ro
│ ├─/dev/full                      devtmpfs[/full]                                              devtmpfs rw,relatime,size=32915868k,nr_inodes=8228967,mode=755,inode64
│ ├─/dev/tty                       devtmpfs[/tty]                                               devtmpfs rw,relatime,size=32915868k,nr_inodes=8228967,mode=755,inode64
│ ├─/dev/zero                      devtmpfs[/zero]                                              devtmpfs rw,relatime,size=32915868k,nr_inodes=8228967,mode=755,inode64
│ └─/dev/urandom                   devtmpfs[/urandom]                                           devtmpfs rw,relatime,size=32915868k,nr_inodes=8228967,mode=755,inode64
├─/etc/resolv.conf                 /var/lib/sysbox/shiftfs/d130821a-5714-4862-8fa5-41ce3be80f56[/resolv.conf]
│                                                                                               shiftfs  rw,nosuid,nodev,noexec,relatime
├─/etc/hostname                    /var/lib/sysbox/shiftfs/d130821a-5714-4862-8fa5-41ce3be80f56[/hostname]
│                                                                                               shiftfs  rw,relatime
├─/run/.containerenv               /var/lib/sysbox/shiftfs/d130821a-5714-4862-8fa5-41ce3be80f56[/.containerenv]
│                                                                                               shiftfs  rw,relatime
├─/var/lib/docker                  /dev/sdb[/var/lib/docker]                                    ext4     rw,relatime
├─/ws                              /dev/sdb[/workspaces]                                        ext4     rw,relatime,idmapped
├─/home                            /dev/sdb[/home]                                              ext4     rw,relatime,idmapped
├─/etc/hosts                       /dev/root[/var/lib/kubelet/pods/74b5abbb-e331-4002-8640-2018979ba168/etc-hosts]
│                                                                                               ext4     rw,relatime,idmapped,discard,errors=remount-ro
├─/run/secrets/kubernetes.io/serviceaccount
│                                  /var/lib/sysbox/shiftfs/ba472717-3446-4b24-9d37-6530a72a68a3 shiftfs  ro,relatime
├─/var/lib/k0s                     /dev/root[/var/lib/sysbox/k0s/2e900d261762e1e66a1ef8be87bed70c6767097fb145311cf4fc095d278e6ded]
│                                                                                               ext4     rw,relatime,discard,errors=remount-ro
├─/var/lib/buildkit                /dev/root[/var/lib/sysbox/buildkit/2e900d261762e1e66a1ef8be87bed70c6767097fb145311cf4fc095d278e6ded]
│                                                                                               ext4     rw,relatime,discard,errors=remount-ro
├─/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs
│                                  /dev/root[/var/lib/sysbox/containerd/2e900d261762e1e66a1ef8be87bed70c6767097fb145311cf4fc095d278e6ded]
│                                                                                               ext4     rw,relatime,discard,errors=remount-ro
├─/var/lib/rancher/k3s             /dev/root[/var/lib/sysbox/rancher-k3s/2e900d261762e1e66a1ef8be87bed70c6767097fb145311cf4fc095d278e6ded]
│                                                                                               ext4     rw,relatime,discard,errors=remount-ro
├─/var/lib/rancher/rke2            /dev/root[/var/lib/sysbox/rancher-rke2/2e900d261762e1e66a1ef8be87bed70c6767097fb145311cf4fc095d278e6ded]
│                                                                                               ext4     rw,relatime,discard,errors=remount-ro
├─/var/lib/kubelet                 /dev/root[/var/lib/sysbox/kubelet/2e900d261762e1e66a1ef8be87bed70c6767097fb145311cf4fc095d278e6ded]
│                                                                                               ext4     rw,relatime,discard,errors=remount-ro
├─/usr/src/linux-headers-5.15.0-1054-gke
│                                  /dev/root[/usr/src/linux-headers-5.15.0-1054-gke]            ext4     ro,relatime,idmapped,discard,errors=remount-ro
├─/usr/src/linux-gke-headers-5.15.0-1054
│                                  /dev/root[/usr/src/linux-gke-headers-5.15.0-1054]            ext4     ro,relatime,idmapped,discard,errors=remount-ro
└─/usr/lib/modules/5.15.0-1054-gke /dev/root[/usr/lib/modules/5.15.0-1054-gke]                  ext4     ro,relatime,idmapped,discard,errors=remount-ro

@raphaelfff
Author

Bump @ctalledo, what is the next step?

@ctalledo
Member

Hi @raphaelfff,

So per your description above, this is the mount that is showing up with nobody:nogroup, correct?

├─/var/lib/docker                  /dev/sdb[/var/lib/docker]                                    ext4     rw,relatime

And I can see in the pod.yaml that it's backed by a PVC.

Don't know exactly why it's showing up with nobody:nogroup (as opposed to root:root), but let me provide a bit of background to see if we can solve it.

When a Sysbox container starts, it maps the root in the container to an unprivileged user at host level (e.g., 0 -> 100000). Furthermore, when Sysbox sees that the container has a bind-mount of a host dir into the container's /var/lib/docker, Sysbox will try to either ID-map or else chown the contents of that host dir, such that they show up with proper ownership (e.g. root:root) inside the container. When the container stops, then Sysbox will revert the operation (i.e., remove the ID-map, or chown back).
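To make the mapping concrete, here is a minimal sketch of how a host-side UID translates into the UID seen inside a user namespace, given a single uid_map entry. The function name is hypothetical, and the example range (container 0 mapped to host 165536, size 65536) is an assumption inferred from the `uid=165536` values in the findmnt output and the `size=65536` annotation in the pod spec. On a plain bind mount with no ID-mapping, host UIDs outside the mapped range show up as the kernel's overflow UID, which defaults to 65534 (nobody).

```shell
# Translate a host-side UID into the UID seen inside a user namespace,
# given one uid_map entry: <first-inside> <first-outside> <count>.
# UIDs outside the mapped range appear as the overflow UID (65534, "nobody").
# Hypothetical helper for illustration only.
uid_seen_in_container() {
    host_uid=$1; inside=$2; outside=$3; count=$4
    if [ "$host_uid" -ge "$outside" ] && [ "$host_uid" -lt $((outside + count)) ]; then
        echo $((host_uid - outside + inside))
    else
        echo 65534
    fi
}

# Usage (the map values are assumptions; read the real ones from
# /proc/self/uid_map inside the container):
#   uid_seen_in_container 165536 0 165536 65536   # container root
#   uid_seen_in_container 0      0 165536 65536   # host root, unmapped
```

Under these assumptions, files left owned by a host UID outside the container's current range (for example, after an incomplete chown) would appear as nobody:nogroup inside the pod.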

This process must be failing somehow. In the past, before ID-map mounts were supported in the kernel, Sysbox would use chown and the process would sometimes fail (or be too slow) if the host dir had too many files (which is sometimes the case on /var/lib/docker mounts).

With ID-mapped mounts it's much better (no more chowing), but it requires "overlayfs-over-ID-mapped-mounts" support which landed in kernel 5.19+.

Questions:

  1. What kernel version do your K8s nodes have? Ideally they would be 5.19+.
  2. Can you provide the output of the sysbox-mgr log (journalctl -u sysbox-mgr)? If that log shows "shifting uids at ..." then it's using the chown operation instead of ID-mapped-mounts, which could point to the problem.
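The two checks above could be scripted roughly as follows. This is only a sketch: `kernel_at_least` is a hypothetical helper that compares the major.minor part of a `uname -r` string, and the grep pattern simply reuses the "shifting uids" wording quoted above.

```shell
# Succeeds if a kernel release string (as printed by `uname -r`) is at
# least the given major.minor version. Hypothetical helper.
kernel_at_least() {
    release=$1; want_major=$2; want_minor=$3
    major=${release%%.*}          # text before the first dot
    rest=${release#*.}            # text after the first dot
    minor=${rest%%[!0-9]*}        # leading digits of the remainder
    [ "$major" -gt "$want_major" ] ||
        { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
}

# Usage on a K8s node (not executed here):
#   kernel_at_least "$(uname -r)" 5 19 || echo "kernel older than 5.19"
#   journalctl -u sysbox-mgr | grep -i "shifting uids"
```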

Also: make sure the host dir (PVC) that is mounted into the pod's /var/lib/docker is only mounted into one such pod at a time (i.e., /var/lib/docker can't be shared simultaneously by multiple pods with docker engines inside).

That's all that comes to mind.

Again, I think Coder should be helping you here too (even if it turns out to be a Sysbox problem).

@raphaelfff
Author

Thanks for your answer

Here are some more deets:

$ uname -r
5.15.0-1054-gke

So that means that it's not actually benefiting from ID-mapped mounts...

  2. The sysbox-mgr logs are gone; I'll provide them when the issue happens again.

One question: since /var/lib/docker is on a volume, it may be mounted on node A on day one and on node B on day two. If the chown failed, I guess when starting on node B it would have the wrong UID/GID, and that could cause the nobody ownership?

The mount is ReadWriteOnce, and the strategy is set to Recreate; that should mean only a single pod is able to read/write.

@ctalledo
Member

ctalledo commented Jun 4, 2024

Hi @raphaelfff,

So that means that it's not actually benefiting from ID-mapped mounts...

Correct, at least not for the volume mounted at /var/lib/docker. That means it must be using chown, which is not ideal: it can be quite slow depending on the size of /var/lib/docker, which can grow to several GBs over time, and if it takes too long, the pod start or stop can time out and we end up with inconsistent user/group IDs in the files ... not good.

One question: since /var/lib/docker is on a volume, it may be mounted on node A on day one and on node B on day two. If the chown failed, I guess when starting on node B it would have the wrong UID/GID, and that could cause the nobody ownership?

That's exactly right ... which is why chown is not a good solution (though it was the only one before ID-mapped-mounts on overlayfs appeared in kernel 5.19+).
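One way to look for the kind of inconsistent ownership described above is to count files whose numeric owner falls outside the UID range assigned to the container. The sketch below is an assumption-laden illustration, not something Sysbox provides: `count_outside_range` is a hypothetical helper, it relies on GNU find's `-printf`, and it is meant to be run on the node (where real host UIDs are visible) against the host directory backing the volume.

```shell
# Count entries under a directory whose numeric owner UID is outside
# the range [first, first + count). Requires GNU find for -printf.
# Hypothetical helper for illustration only.
count_outside_range() {
    dir=$1; first=$2; count=$3
    find "$dir" -xdev -printf '%U\n' |
        awk -v lo="$first" -v n="$count" \
            '$1 < lo || $1 >= lo + n { c++ } END { print c + 0 }'
}

# Usage on the node (path and range are placeholders, not taken from
# the thread):
#   count_outside_range /path/to/volume/var/lib/docker 165536 65536
```

A non-zero count while the container is running would be consistent with a chown pass that did not complete, though it does not prove that this is the cause.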

Sounds like you need a K8s node with kernel 5.19+ in order for this to work reliably. With ID-mapped-mounts, the "chown" is basically instant (it's done via user-id/group-id mapping in the kernel), so it works much better.

The mount is ReadWriteOnce, and the strategy is set to Recreate; that should mean only a single pod is able to read/write.

OK that's perfect.

@raphaelfff
Author

I'm gonna have to wait for GKE to upgrade their Ubuntu Containerd image kernel...
Can you think of a way this nobody issue can be fixed manually (e.g., mounting the volume into another pod and running some chmod)?

@ctalledo
Member

ctalledo commented Jul 9, 2024

Hi @raphaelfff,

Apologies for the belated response.

I'm gonna have to wait for GKE to upgrade their Ubuntu Containerd image kernel...

Yes, I am afraid that's the only option; I am amazed GKE is still on kernel 5.15 when Linux is at 6.9 already (!). You would think they would at least offer an option to customize the kernel, but there isn't an easy one as far as I can tell.

Can you think of a way this nobody issue can be fixed manually (e.g., mounting the volume into another pod and running some chmod)?

The problem with chown failing usually occurs when the contents of /var/lib/docker grow to several GBs. So maybe a poor-man's solution is to keep that below 1GB, by running docker system prune periodically (?)
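A periodic prune along those lines might look like the sketch below, run inside the workspace container (for example from a cron job or systemd timer). The helper names are hypothetical and the 1024 MB default just mirrors the 1GB figure suggested above. Note that `docker system prune` only removes unused data, so it cannot guarantee the directory stays under the limit.

```shell
# Succeeds when a size in MB exceeds a limit in MB.
over_limit() { [ "$1" -gt "$2" ]; }

# Prune unused Docker data when /var/lib/docker grows past a limit
# (default 1024 MB). Hypothetical helper for illustration only.
prune_if_large() {
    limit_mb=${1:-1024}
    # du -sm prints "<size-in-MB><TAB><path>"; keep only the size.
    used_mb=$(du -sm /var/lib/docker 2>/dev/null | cut -f1)
    if over_limit "${used_mb:-0}" "$limit_mb"; then
        docker system prune -f
    fi
}

# Usage (not executed here):
#   prune_if_large 1024
```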
