Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libct/cgroups: add SkipDevices to Resources #2490

Merged
merged 1 commit into from
Jul 6, 2020

Conversation

kolyshkin
Copy link
Contributor

The kubelet uses libct/cgroups code to set up cgroups. It creates a
parent cgroup (kubepods) to put the containers into.

The problem (for cgroupv2 that uses eBPF for device configuration) is
the hard requirement to have devices cgroup configured results in
leaking an eBPF program upon every kubelet restart. program. If kubelet
is restarted 64+ times, the cgroup can't be configured anymore.

Work around this by adding a SkipDevices flag to Resources.

A check was added so that if SkipDevices is set, such a "container"
can't be started (to make sure it is only used for non-containers).

Fixes #2474

giuseppe
giuseppe previously approved these changes Jun 30, 2020
@kolyshkin kolyshkin marked this pull request as ready for review June 30, 2020 19:43
@kolyshkin
Copy link
Contributor Author

@cyphar PTAL

mrunalp
mrunalp previously approved these changes Jul 1, 2020
Copy link
Member

@cyphar cyphar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few minor things, but looks good otherwise.

libcontainer/cgroups/fs/fs.go Outdated Show resolved Hide resolved
libcontainer/cgroups/systemd/v1.go Outdated Show resolved Hide resolved
libcontainer/cgroups/systemd/v2.go Outdated Show resolved Hide resolved
libcontainer/cgroups/systemd/v2.go Show resolved Hide resolved
libcontainer/configs/cgroup_linux.go Show resolved Hide resolved
The kubelet uses libct/cgroups code to set up cgroups. It creates a
parent cgroup (kubepods) to put the containers into.

The problem (for cgroupv2 that uses eBPF for device configuration) is
the hard requirement to have devices cgroup configured results in
leaking an eBPF program upon every kubelet restart.  program. If kubelet
is restarted 64+ times, the cgroup can't be configured anymore.

Work around this by adding a SkipDevices flag to Resources.

A check was added so that if SkipDevices is set, such a "container"
can't be started (to make sure it is only used for non-containers).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@kolyshkin
Copy link
Contributor Author

Updated to address @cyphar review comments.

@cyphar
Copy link
Member

cyphar commented Jul 3, 2020

@kolyshkin

One question about the justification for this change:

The problem (for cgroupv2 that uses eBPF for device configuration) is
the hard requirement to have devices cgroup configured results in
leaking an eBPF program upon every kubelet restart. program. If kubelet
is restarted 64+ times, the cgroup can't be configured anymore.

That's a bug in runc (#2366 is tracking that), and is something we need to fix anyway (we should be either doing BPF_F_REPLACE or manually replacing the program inatomically, or even possibly reimplementing our eBPF rules to use eBPF maps that we can atomically replace).

Is that the only reason for this change (looking back at #2474, it looks like I didn't catch this when discussing the need for this)? I'd be happy to have it anyway (since the requirement for the devices cgroup doesn't make sense in certain circumstances), but I just want to make sure that we're aware that this doesn't fix the underlying issue (which will be at least somewhat more involved).

Copy link
Member

@cyphar cyphar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside from my point about this patch not fixing the motivating cgroupv2 bug, LGTM.

@kolyshkin
Copy link
Contributor Author

I do agree this looks like a workaround for #2366.

From the other side, it does not make sense to require devices to be configured in cgroups that are not used to run containers, but rather to hold other cgroups (not sure what's the term here, parent cgroups? meta cgroups?). That is also the reason why I am thinking about renaming SkipDevices to ParentCgroup or something like that (although currently it only does what SkipDevices says).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

api, cgroupv2: skip setting the devices cgroup
4 participants