K8s + Sysbox: mount sysfs fails (EPERM) during pod creation #67

Closed
rodnymolina opened this issue Sep 12, 2020 · 10 comments
Labels: enhancement (New feature or request)

Comments

@rodnymolina
Member

rodnymolina commented Sep 12, 2020

I ran into this one while trying to scope the level of effort required to launch K8s pods through the Sysbox runtime.

I initially stumbled into issue #66, which hasn't been properly fixed yet, and then reproduced the problem described here. Notice that even though the symptoms are identical (i.e., unable to mount sysfs), the cause seems to be different in this case, which is why we are tracking this issue separately.

After multiple attempts at bisecting the container's OCI spec, I was able to identify the spec instruction causing this problem; however, the low-level root cause has not been found yet.

The problem is reproduced whenever a sandbox container (e.g. "pause") is instantiated by the K8s master. There's nothing especially relevant in the spec of this container, except for the fact that a "path" element is passed as part of the network-namespace entry:

        "namespaces": [
            {
                "type": "pid"
            },
            {
                "type": "ipc"
            },
            {
                "type": "uts"
            },
            {
                "type": "mount"
            },
            {
                "path": "/var/run/netns/cni-ca69f110-38f9-4be8-dca4-10cbb16f8695",
                "type": "network"
            }
        ],

As per the OCI specification, a compliant runtime is expected to place the to-be-created container in the network namespace indicated by this file (which, in turn, is a bind-mount of a /proc/&lt;pid&gt;/ns/net file).

path (string, OPTIONAL) - namespace file. This value MUST be an absolute path in the runtime mount namespace. The runtime MUST place the container process in the namespace associated with that path. The runtime MUST generate an error if path is not associated with a namespace of type type. If path is not specified, the runtime MUST create a new container namespace of type type.
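
For a runtime, honoring that "path" field boils down to opening the nsfs file and calling setns() on it. A minimal sketch of that step in C (a hypothetical helper, not sysbox-runc's actual code; error handling abbreviated):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Join the net namespace pinned at "path" (e.g. /var/run/netns/cni-...),
 * as a runtime would when the OCI spec carries a "path" for the network
 * namespace. */
int join_netns(const char *path)
{
    int fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0) {
        perror("open netns path");
        return -1;
    }

    /* Passing CLONE_NEWNET makes the kernel verify that fd really refers
     * to a network namespace, per the OCI requirement quoted above. */
    if (setns(fd, CLONE_NEWNET) < 0) {
        perror("setns");
        close(fd);
        return -1;
    }

    close(fd);
    return 0;
}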

We can re-create the observed behavior by following the steps indicated below ...

Let's start by creating the shared network namespace that our POD will be part of:

rmolina@heavy-vm-bionic:~/wsp$ sudo ip netns add test-ns-1

rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$ ls -li /run/netns/
total 0
4026532321 -r--r--r-- 1 root root 0 May 13 02:31 test-ns-1
rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$

rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$ findmnt
...
├─/run                                tmpfs                  tmpfs       rw,nosuid,noexec,relatime,size=815200k,mode=755
│ ├─/run/lock                         tmpfs                  tmpfs       rw,nosuid,nodev,noexec,relatime,size=5120k
│ ├─/run/user/1000                    tmpfs                  tmpfs       rw,nosuid,nodev,relatime,size=815196k,mode=700,uid=1000,gid=1000
│ ├─/run/user/1001                    tmpfs                  tmpfs       rw,nosuid,nodev,relatime,size=815196k,mode=700,uid=1001,gid=1001
│ ├─/run/netns/test-ns-1              nsfs[net:[4026532321]] nsfs        rw
│ └─/run/netns                        tmpfs[/netns]          tmpfs       rw,nosuid,noexec,relatime,size=815200k,mode=755
│   └─/run/netns/test-ns-1            nsfs[net:[4026532321]] nsfs        rw
├─/boot                               /dev/sda1              ext4        rw,relatime
...
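
As an aside, what "ip netns add" did above is roughly the following (a rough C sketch, assuming /run/netns already exists and running as root; error paths trimmed): it creates a new net-ns and keeps it alive by bind-mounting /proc/self/ns/net onto /run/netns/&lt;name&gt;, which is exactly the nsfs entry shown by findmnt.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>
#include <unistd.h>

int main(void)
{
    const char *pin = "/run/netns/test-ns-1";   /* same name as above */

    /* Create the empty file that will pin the namespace. */
    int fd = open(pin, O_RDONLY | O_CREAT | O_EXCL, 0444);
    if (fd < 0) { perror("create pin file"); return 1; }
    close(fd);

    /* Detach into a brand-new network namespace. */
    if (unshare(CLONE_NEWNET) < 0) { perror("unshare"); return 1; }

    /* Bind-mount our new net-ns onto the pin file so it outlives this
     * process; this is the nsfs[net:[...]] mount visible above. */
    if (mount("/proc/self/ns/net", pin, "none", MS_BIND, NULL) < 0) {
        perror("bind-mount netns");
        return 1;
    }
    return 0;
}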

Let's now add this network-ns file to our hand-crafted spec:

		"namespaces": [
			{
				"type": "pid"
			},
		        {
			        "path": "/var/run/netns/test-ns-1",
				"type": "network"
			},
			{
				"type": "ipc"
			},
			{
				"type": "uts"
			},
			{
				"type": "mount"
			},
			{
				"type": "cgroup"
			}
		],

The problem is reproduced right away:

rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$ sudo sysbox-runc run ubuntu-1
container_linux.go:364: starting container process caused "process_linux.go:533: container init caused \"rootfs_linux.go:58: setting up rootfs mounts caused \\\"rootfs_linux.go:928: mounting \\\\\\\"sysfs\\\\\\\" to rootfs \\\\\\\"/home/rmolina/wsp/05-12-2020/sysbox/ubuntu/rootfs\\\\\\\" at \\\\\\\"sys\\\\\\\" caused \\\\\\\"operation not permitted\\\\\\\"\\\"\""
rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$

As expected, the problem is not reproduced with upstream runc in its default configuration (no user-ns); otherwise, this would also fail in all K8s deployments. However, the exact same issue is reproduced the moment we request user-ns creation.

runc shows no issue when relying on the above spec:

rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$ sudo runc run ubuntu-1
#

Let's modify the spec to explicitly activate user-ns creation:

root@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu# cat /etc/subuid
lxd:100000:65536
root:100000:65536
vagrant:165536:65536
rmolina:231072:65536
sysbox:296608:268435456

root@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu# cat config.json
...
        "linux": {
        "uidMappings": [
            {
                "hostID": 296608,
                "containerID": 0,
                "size": 268435456
            }
        ],
        "gidMappings": [
            {
                "hostID": 296608,
                "containerID": 0,
                "size": 268435456
            }
        ],
            "namespaces": [
                        {
                                "type": "pid"
                        },
                        {
                                "path": "/var/run/netns/test-ns-1",
                                "type": "network"
                        },
                        {
                                "type": "ipc"
                        },
                        {
                                "type": "uts"
                        },
                        {
                                "type": "mount"
                        },
                        {
                                "type": "user"
                        },
                        {
                                "type": "cgroup"
                        }
                ],
...

Trying runc once again shows the same problem reported by sysbox-runc:

rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$ sudo runc run ubuntu-1
WARN[0000] exit status 1
ERRO[0000] container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"rootfs_linux.go:58: mounting \\\"sysfs\\\" to rootfs \\\"/home/rmolina/wsp/05-12-2020/sysbox/ubuntu/rootfs\\\" at \\\"/sys\\\" caused \\\"operation not permitted\\\"\""
container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"rootfs_linux.go:58: mounting \\\"sysfs\\\" to rootfs \\\"/home/rmolina/wsp/05-12-2020/sysbox/ubuntu/rootfs\\\" at \\\"/sys\\\" caused \\\"operation not permitted\\\"\""
rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$

The problem seems to be caused by some kernel limitation or requirement imposed on user namespaces and their relationship with network namespaces. Note that the issue is also reproduced when leaving the runtimes out of the equation:


<-- Unsharing a new network-ns (-n): sysfs mount succeeds:

rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$ sudo unshare -m -u -i -n -p -U -f -r bash -c "mkdir /root/sys && mount -t sysfs sysfs /root/sys"
rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$ echo $?
0

rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$ sudo rm -rf /root/sys

<-- Without a new network-ns (kept in the host's net-ns): sysfs mount fails:

rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$ sudo unshare -m -u -i -p -U -f -r bash -c "mkdir /root/sys && mount -t sysfs sysfs /root/sys"
mount: /root/sys: permission denied.
rmolina@heavy-vm-bionic:~/wsp/05-12-2020/sysbox/ubuntu$
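
The same two cases can be expressed directly at the syscall level, which makes it clearer that the only variable is whether a new net-ns is created together with the user-ns (a rough sketch, to be run as root; the mountpoint path is arbitrary):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <unistd.h>

/* Pass any argument to also unshare a new net-ns (the "-n" case). */
int main(int argc, char **argv)
{
    int flags = CLONE_NEWUSER | CLONE_NEWNS;
    if (argc > 1)
        flags |= CLONE_NEWNET;

    mkdir("/tmp/sysfs-test", 0755);

    if (unshare(flags) < 0) { perror("unshare"); return 1; }

    /* With CLONE_NEWNET, the current net-ns is owned by the new user-ns
     * and the mount succeeds; without it we are still in the host net-ns
     * (owned by the init user-ns) and mount(2) fails with EPERM. */
    if (mount("sysfs", "/tmp/sysfs-test", "sysfs", 0, NULL) < 0) {
        perror("mount sysfs");
        return 1;
    }

    printf("sysfs mounted without problem\n");
    return 0;
}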

More details to come ...

rodnymolina added the "enhancement" (New feature or request) label on Sep 12, 2020
@rodnymolina
Member Author

Btw, it's interesting to note that Docker doesn't have a proper solution to this problem either: they don't support user-namespaces alongside the "--net=host" functionality.

https://docs.docker.com/engine/security/userns-remap/#user-namespace-known-limitations

On the other hand, I can see some logic written in Docker (libnetwork), as well as in the K8s dockershim implementation, to deal with these scenarios through the use of container "hooks". But it's not clear to me how mature this implementation is, nor whether anyone is actually using K8s with user-namespaces.

References:

opencontainers/runc#799
moby/moby#21800
opencontainers/runc#807
systemd/systemd#1555
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=87a8ebd637dafc255070f503909a053cf0d98d3f

Will get back to this when done with the ongoing release cycle.

@rodnymolina
Member Author

I found what could be a valid explanation for the behavior observed above. As suspected, the kernel imposes certain restrictions on users trying to mount procfs and sysfs from within a non-init user-namespace.

In these scenarios, the kernel expects the user creating the container to have CAP_SYS_ADMIN rights in the user-ns that owns the network-ns in question. By the time we mount sysfs we are already "inside" the new user-ns, and by then we have no rights over any resource owned by the root (init) user-ns, including the root network-ns.

This is the kernel patch that added this restriction:

https://lists.linuxfoundation.org/pipermail/containers/2013-August/033388.html

A potential solution I can think of is to extend sysbox-runc's existing "proxy" handler so that the parent process performs the sysfs mount on behalf of the container's init process. I'll get back to this in a couple of weeks, once we are done with our current release cycle.
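
To sketch the idea in the meantime (purely illustrative; this is not how sysbox-runc is implemented today, and the helper below is hypothetical): a process that remains root in the init user-ns can setns() into the container's net and mount namespaces, but not its user-ns, and perform the mount from there. The kernel check should then pass because root in the init user-ns has CAP_SYS_ADMIN over the user-ns that owns the shared net-ns.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: "pid" is the container's init process and "target"
 * a path that resolves inside its mount namespace (e.g. <rootfs>/sys).
 * Must run as root in the init user-ns. Error handling abbreviated. */
static int mount_sysfs_on_behalf(pid_t pid, const char *target)
{
    char nspath[64];
    int fd;

    /* Join the container's net-ns first, so the sysfs instance we mount
     * is tagged with the container's network namespace, not the host's. */
    snprintf(nspath, sizeof(nspath), "/proc/%d/ns/net", pid);
    fd = open(nspath, O_RDONLY | O_CLOEXEC);
    if (fd < 0 || setns(fd, CLONE_NEWNET) < 0) { perror("join net-ns"); return -1; }
    close(fd);

    /* Then join its mount namespace so the mount lands inside the container. */
    snprintf(nspath, sizeof(nspath), "/proc/%d/ns/mnt", pid);
    fd = open(nspath, O_RDONLY | O_CLOEXEC);
    if (fd < 0 || setns(fd, CLONE_NEWNS) < 0) { perror("join mnt-ns"); return -1; }
    close(fd);

    /* We never joined the container's user-ns, so we still hold
     * CAP_SYS_ADMIN in the init user-ns, which owns the shared net-ns;
     * the kernel's sysfs permission check should therefore pass. */
    if (mount("sysfs", target, "sysfs", 0, NULL) < 0) {
        perror("mount sysfs");
        return -1;
    }
    return 0;
}

How this would interact with the rest of the rootfs setup sequence (e.g., doing it before pivot_root) is part of what would need to be worked out.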

@rodnymolina
Member Author

Ref #64.

@ctalledo
Member

Here is a quick experiment to reproduce this issue with unshare:

$ sudo unshare -n bash
$ unshare -U -m -i -p -u -C -f -r --mount-proc bash
$ mount -t sysfs sysfs sys
mount: /home/cesar//rootfs/sys: permission denied
$ unshare -n bash
$ mount -t sysfs sysfs sys
(no problem)

In other words, when the netns is created before the userns, mounting sysfs inside that userns fails (for some reason). But if a netns is created inside the userns, mounting sysfs works without problem.

@kylecarbs

Any updates on this or possible workarounds?

@rodnymolina
Member Author

@kylecarbs, unfortunately there's no workaround for this one at the moment, but we're actively working on this issue and we expect to have good news soon. Please stay tuned.

@ctalledo
Member

Found the reason for the behavior in my earlier experiment above: when a process mounts sysfs, the kernel checks that the process has CAP_SYS_ADMIN in the user namespace that owns the process's current network namespace.

In the example above, the net-ns was created before the user-ns was created, so that net-ns is associated with the init user-ns, not the newly created user-ns. As a result, a process inside the user-ns can't mount sysfs because it does not have CAP_SYS_ADMIN in the init user-ns.

However, when we later unshare the net-ns inside the user-ns, the situation changes: that new net-ns is associated with that user-ns, and the process invoking the mount of sysfs does have CAP_SYS_ADMIN in that user-ns. As a result, the sysfs mount succeeds.
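
As a side note, this ownership relationship can be checked directly: the kernel exposes a namespace's owning user-ns through the NS_GET_USERNS ioctl on nsfs files (kernel >= 4.9). A small sketch (hypothetical helper, error handling trimmed):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <linux/nsfs.h>     /* NS_GET_USERNS */

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <namespace file>\n", argv[0]);
        return 1;
    }

    int nsfd = open(argv[1], O_RDONLY | O_CLOEXEC);
    if (nsfd < 0) { perror("open"); return 1; }

    /* Ask the kernel for the user namespace that owns this namespace. */
    int ownerfd = ioctl(nsfd, NS_GET_USERNS);
    if (ownerfd < 0) { perror("ioctl(NS_GET_USERNS)"); return 1; }

    struct stat st;
    if (fstat(ownerfd, &st) < 0) { perror("fstat"); return 1; }

    printf("%s is owned by user-ns inode %lu\n", argv[1],
           (unsigned long)st.st_ino);
    return 0;
}

Running it against a netns pinned with "ip netns add" (or against /proc/&lt;pid&gt;/ns/net of a process sharing the pod's net-ns) and comparing the printed inode with "readlink /proc/self/ns/user" should show the init user-ns as the owner, which is exactly why the in-container sysfs mount is denied.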

@ctalledo
Member

ctalledo commented Nov 12, 2020

This issue was found by Rodny while trying to scope the level of effort required to launch K8s PODs using the Sysbox runtime (aka sysbox pods).

Sysbox always uses the Linux user-ns in the containers/pods it creates. It's a must-have for proper functionality & isolation.

From the prior comment, it's clear that the pod's user-ns must be created before the network-ns in order for sysfs mounts to work inside the pod. This requirement applies to other kernel network resources too (e.g., those exposed via procfs).

As a result, in order for K8s to create pods with sysbox, the user-ns associated with the pod must be created before the network ns (and all other namespaces too) associated with that pod.

This can't be done by sysbox itself, because per the K8s CRI spec, it's the CRI implementation (e.g., dockershim, containerd, or cri-o) that sequences this.

I've done some research and found that cri-o has experimental support for enabling the user-ns in pods. That is, upstream versions of cri-o are capable of creating the user-ns for a pod first, and then creating the remaining namespaces as required.

At a high level, the way this would work is:

  1. The user creates a pod with an annotation indicating use of the user-ns, and sets its runtimeClass to the sysbox runtime.
  2. K8s deploys that pod on a node.
  3. The kubelet on that node talks to cri-o to create the pod with the user-ns and sysbox.
  4. cri-o creates the user-ns and network-ns, then tells sysbox to create the containers for the pod.
  5. sysbox creates the containers.

Other than cri-o, I don't believe the other CRI implementations (dockershim and containerd) support user-ns functionality. dockershim is in fact out of the question because it does not even support the runtimeClass spec required for K8s to deploy containers with sysbox.

Thus, it looks like an initial implementation of K8s + sysbox would require the latest versions of cri-o at this time.

I am working on this right now.

ctalledo changed the title from "Unable to mount sysfs (EPERM) in shared network-ns scenarios" to "Sysbox fails mount sysfs (EPERM) during container creation in shared network-ns scenarios" on Nov 12, 2020
ctalledo changed the title to "Sysbox fails mount sysfs (EPERM) during container creation" on Nov 12, 2020
@ctalledo
Member

ctalledo commented Nov 12, 2020

In the prior comments, we determined that the EPERM error that sysbox gets when mounting sysfs into a container occurred in scenarios where the network-ns for the sys container was created before the user-ns for that same sys container.

But there is another way to hit this EPERM error too: when sysbox runs in an environment where sysfs is mounted read-only.

For example, I was able to reproduce this by having K8s deploy a privileged pod that had Docker and Sysbox inside. The pod was deployed using the OCI runc runtime. Inside the privileged pod, I started Docker and Sysbox and then tried to create a system container. But this failed with:

root@pod-with-sysbox:~/nestybox/sysbox# docker run --runtime=sysbox-runc -it --rm alpine                                                                       
docker: Error response from daemon: OCI runtime create failed: container_linux.go:364: starting container process caused "process_linux.go:533: container init caused \"rootfs_linux.go:62: setting up rootfs mounts caused \\\"rootfs_linux.go:932: mounting \\\\\\\"sysfs\\\\\\\" to rootfs \\\\\\\"/var/lib/docker/overlay2
/fbdaf2935a2b8ffe777060a1db2e63be8da034f35c47315d5b544a4ca6718bf6/merged\\\\\\\" at \\\\\\\"sys\\\\\\\" caused \\\\\\\"operation not permitted\\\\\\\"\\\"\"": unknown.

The reason for this failure is that inside the privileged pod, sysfs is mounted read-only:

root@pod-with-sysbox:~/nestybox/sysbox# findmnt | grep "sysfs"                                                                                                 
|-/sys                                      sysfs                                                                                                                                 sysfs   ro,nosuid,nodev,noexec,relatime    

This is a bit unexpected since the pod is privileged, so we would expect sysfs to be mounted read-write. The reason sysfs is mounted read-only is that, when the pod was created, the K8s pause container mounted it as read-only by default, and this read-only attribute propagates to the other containers in the pod. The propagation occurs because sysfs is tightly coupled to the network namespace, and all containers in the pod share that namespace.

The fix is simple: remount sysfs as read-write:

root@pod-with-sysbox:~/nestybox/sysbox# mount -o remount,rw /sys /sys                                                                                          
root@pod-with-sysbox:~/nestybox/sysbox# findmnt | grep "sysfs"                                                                                                  
|-/sys                                      sysfs                                                                                                                                 sysfs   rw,relatime

After this, I was able to deploy a sys container with Docker + Sysbox without problem.

ctalledo changed the title from "Sysbox fails mount sysfs (EPERM) during container creation" to "K8s + Sysbox: mount sysfs fails (EPERM) during pod creation" on Nov 27, 2020
@ctalledo
Member

Closing as the problem and solution are understood. The solution is to use a CRI that supports user-namespaces (e.g., CRI-O) in order to deploy K8s pods with the sysbox runtime. This task is tracked by issue #64.

staticfloat added a commit to staticfloat/Sandbox.jl that referenced this issue Jul 29, 2022
It turns out that we can't bindmount `sysfs` if we're using the
unprivileged executor, which is our favorite executor to use.
X-ref: nestybox/sysbox#67 (comment)

This reverts commit a58ccf0.