
Joining cgroups blindly causes performance problems #861

Open
cyphar opened this issue May 31, 2016 · 22 comments

@cyphar
Member

cyphar commented May 31, 2016

It turns out that joining cgroups we don't use causes non-zero performance degradation. The most obvious case is blkio, which can make operations 10 times slower. The following test assumes you have some block device /dev/sdX formatted as ext4 (this test used a spinning hard drive, but a USB flash drive would also work):

# mount /dev/sdX /workspace
# echo $$ > /sys/fs/cgroup/blkio/cgroup.procs
# time dd if=/dev/zero of=/workspace/test.bin bs=512 count=1000 oflag=dsync
1000+0 records in
1000+0 records out
512000 bytes (512 kB) copied, 2.09553 s, 244 kB/s

real    0m2.097s
user    0m0.000s
sys     0m0.144s
# mkdir /sys/fs/cgroup/blkio/test
# echo $$ >/sys/fs/cgroup/blkio/test/cgroup.procs
# time dd if=/dev/zero of=/workspace/test.bin bs=512 count=1000 oflag=dsync
1000+0 records in
1000+0 records out
512000 bytes (512 kB) copied, 19.5512 s, 26.2 kB/s

real    0m19.553s
user    0m0.000s
sys     0m0.132s 

This is already a known issue upstream (in the kernel), but it is a general problem (most cgroup controllers have special cases for their root cgroup to maximise performance -- but few have optimisations to reduce the hierarchy based on which cgroups have limits set).

Unfortunately, the naive solution (not joining cgroups if we don't intend to use them in config.json) causes obvious issues with runc update (and commands that make assumptions about which cgroups we've joined). So we'd have to write quite a bit of code to create new cgroups and join container processes to them if the user requests a limit that wasn't required before. We could do it with the freezer cgroup and some enumeration.
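For illustration, the manual equivalent of that deferred join would look roughly like this (the cgroup name, the weight value, and $CONTAINER_PID are placeholders; runc would do this through libcontainer rather than a shell):

# The container starts with no blkio limits, so its processes stay in the root blkio cgroup.
# Later, when a blkio limit is requested via "runc update":
# mkdir -p /sys/fs/cgroup/blkio/mycontainer
# echo 500 > /sys/fs/cgroup/blkio/mycontainer/blkio.weight              <--- apply the newly requested limit
# echo $CONTAINER_PID > /sys/fs/cgroup/blkio/mycontainer/cgroup.procs   <--- only now join the cgroup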

This is (somewhat) related to the lazy cgroup work that we should do as a part of #774.

The performance issue described in moby/moby#21485 occurs because of this.

@dqminh
Contributor

dqminh commented May 31, 2016

Hmm, the performance degradation is known, but I didn't know it was that bad for blkio 😵
Do you have a link to the related mailing-list discussion, if there is one?
On the other hand, not joining cgroups that aren't specified in config.json SGTM.

@cyphar
Member Author

cyphar commented May 31, 2016

@dqminh Unfortunately I discovered this after discussions with our kernel team on internal mailing lists. I'll ask them if they can link to an upstream discussion, though.

As for not joining unspecified cgroups, I'll work on this once I figure out why Ubuntu isn't affected by this as badly.

@cyphar
Member Author

cyphar commented Jun 7, 2016

This could potentially also be fixed by the lazy cgroup handling (by only attaching to cgroups that we are using, and then attaching later if a user tries to update the limits).

/cc @brauner @monstermunchkin

@dqminh
Contributor

dqminh commented Jun 8, 2016

@cyphar can you add instructions on how to replicate this issue reliably (host environment, kernel version, etc.)? I've tried both on bare metal with kernel 4.4 and on an Ubuntu Xenial VM on DigitalOcean, but was unable to replicate it.

@gfyrag

gfyrag commented Jun 24, 2016

Hello, same problem here.
I cannot find the configuration to apply in the config.json file; do you know where I can find it?

@lanrat

lanrat commented Jun 28, 2017

I too am running into this issue. Has anyone found a solution, even if it is a hack or temporary?

I haven't found a way to get runc to work without cgroups...

@rodrigooshiro

That's old... It was fixed in newer versions of Docker; in my case the problem was solved by changing a systemd config file:

Read more about it here:
kubernetes/kubernetes#39682

LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
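
For anyone else applying that fix, those directives belong in the [Service] section of the Docker unit; a systemd drop-in is the usual way to add them (the file name and path below are just one common choice):

# mkdir -p /etc/systemd/system/docker.service.d
# cat > /etc/systemd/system/docker.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
EOF
# systemctl daemon-reload
# systemctl restart docker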

@cyphar
Member Author

cyphar commented Jun 29, 2017

@ipeoshir That is a separate issue. I found the rlimit problem while trying to figure out this one; those limits are not cgroup limits.

@qianzhangxa

@cyphar It seems that I cannot reproduce this issue in my environment:

# pwd
/var/lib/qzhang
# df -mh .
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdb        37G   76M   35G   1% /var/lib
# mount | grep xvdb
/dev/xvdb on /var/lib type ext4 (rw,relatime,seclabel,data=ordered)
/dev/xvdb on /var/lib/docker/overlay type ext4 (rw,relatime,seclabel,data=ordered)

# echo $$ > /sys/fs/cgroup/blkio/cgroup.procs
# time dd if=/dev/zero of=test.bin bs=512 count=1000 oflag=dsync
1000+0 records in
1000+0 records out
512000 bytes (512 kB, 500 KiB) copied, 0.451274 s, 1.1 MB/s

real    0m0.490s
user    0m0.000s
sys     0m0.035s

# mkdir /sys/fs/cgroup/blkio/test
# echo $$ >/sys/fs/cgroup/blkio/test/cgroup.procs
# time dd if=/dev/zero of=test.bin bs=512 count=1000 oflag=dsync
1000+0 records in
1000+0 records out
512000 bytes (512 kB, 500 KiB) copied, 0.481559 s, 1.1 MB/s

real    0m0.484s
user    0m0.000s
sys     0m0.041s

So as you can see, there is no performance degradation in my test. I am using CoreOS and CentOS 7.2; the kernel versions are 4.7.3-coreos-r3 and 3.10 (CentOS 7.2). So it seems this is not a general performance issue? Or at least it does not apply to CoreOS and CentOS 7.2? Also, do you know if there is a kernel ticket tracking this issue?

@cyphar
Member Author

cyphar commented Aug 25, 2017

@qianzhangxa We had an internal bug about this issue. I will have to go check again what the exact reproducer was, and double-check that it still occurs, but it was definitely happening on a stock kernel the last time I tried. What I/O scheduler are you using? I believe this bug doesn't happen with deadline; it only happens with CFQ.
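
For reference, the scheduler in use can be checked and switched per block device like this (sdX is a placeholder for the device under test; the scheduler shown in brackets is the active one):

# cat /sys/block/sdX/queue/scheduler            <--- e.g. "noop deadline [cfq]"
# echo cfq > /sys/block/sdX/queue/scheduler     <--- switch to CFQ to try to reproduce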

@qianzhangxa

Yes, I am using the deadline scheduler; I will try CFQ later.

Also, I see there is a CFQ control policy in the blkio cgroup. Do you know the difference between the CFQ control policy and the CFQ scheduler? Can I use the CFQ control policy without the CFQ scheduler?

@cyphar
Member Author

cyphar commented Aug 25, 2017

I assume you'd need to use the CFQ scheduler in order to use the control policies that relate to it? I'm not sure though.

@qianzhangxa

@cyphar I did some experiments, and I think this performance issue happens only when the I/O scheduler for the disk is set to cfq and the filesystem is ext4/ext3 mounted with the data=ordered option.

Also, to use the blkio cgroup control functionality we have to set the I/O scheduler to cfq; if it is set to deadline, the blkio.weight, blkio.weight_device, and blkio.leaf_weight[_device] proportional-weight policy files will NOT take effect.
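
As a quick illustration of those proportional-weight knobs (the cgroup names and weight values below are arbitrary, and they only take effect while the device is using CFQ):

# mkdir /sys/fs/cgroup/blkio/low /sys/fs/cgroup/blkio/high
# echo 100 > /sys/fs/cgroup/blkio/low/blkio.weight        <--- small share under contention
# echo 1000 > /sys/fs/cgroup/blkio/high/blkio.weight      <--- large share under contention
# echo $$ > /sys/fs/cgroup/blkio/low/cgroup.procs         <--- this shell's IO is now weighted at 100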

@cyphar
Member Author

cyphar commented Aug 28, 2017

@qianzhangxa

I did some experiments, and I think this performance issue will happen only when the IO scheduler for the disk is set to cfq and the filesystem is ext4/ext3 with the data=ordered option.

The performance issue is only this obvious with ext4 and data=ordered. In the original bug report it looks like a core CFQ problem (and some discussions with kernel devs have confirmed this). My main reason for opening this issue is that it's not necessary for us to join cgroups if we're not going to set any limits (and we can always move the process on runc update anyway) -- with the obvious exception of the cgroupsPath setting. There are other possible performance implications in some other cgroups, but they're not as obvious.
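
In other words, a container started without blkio limits could stay in the root blkio cgroup until something like the following is run, at which point runc would create the per-container blkio cgroup and move the processes into it (the container name is a placeholder; check runc update --help on your version for the exact flag):

# runc update --blkio-weight 500 mycontainer    <--- first blkio limit requested after start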

@qianzhangxa

@cyphar Got it, thanks!

BTW, for that core CFQ problem, do you know if there is any ticket in the kernel community tracking it? I'd like to get more details.

@cyphar
Member Author

cyphar commented Aug 28, 2017

I don't think there is one. Most of the discussion was on some internal mailing lists with our kernel devs, and I think the conversation stalled.

@qianzhangxa

@cyphar I found that this performance issue may not be related to the blkio cgroup, because even when a process has not joined any blkio cgroup, the performance issue still happens.

# At this point, the shell process does not join any blkio cgroup
# dd if=/dev/zero of=test.bin bs=512 count=1000 oflag=dsync         
1000+0 records in
1000+0 records out
512000 bytes (512 kB, 500 KiB) copied, 16.2307 s, 31.5 kB/s    <--- Performance issue

# echo $$ > /sys/fs/cgroup/blkio/test/cgroup.procs    <--- Join a sub blkio cgroup
# dd if=/dev/zero of=test.bin bs=512 count=1000 oflag=dsync
1000+0 records in
1000+0 records out
512000 bytes (512 kB, 500 KiB) copied, 16.252 s, 31.5 kB/s    <---Performance issue

# echo $$ > /sys/fs/cgroup/blkio/cgroup.procs    <--- Join the root blkio cgroup
# dd if=/dev/zero of=test.bin bs=512 count=1000 oflag=dsync 
1000+0 records in
1000+0 records out
512000 bytes (512 kB, 500 KiB) copied, 1.19245 s, 429 kB/s    <--- No performance issue

So I think the behavior is:

  1. If the process does not join any blkio cgroups, the performance issue will happen.
  2. If the process joins a sub blkio cgroup, the performance issue will happen.
  3. If the process joins the root blkio cgroup, the performance issue will not happen.

What confuses me is (3): I am not sure why the performance issue disappears only when the process joins the root blkio cgroup. Any ideas?

@cyphar
Member Author

cyphar commented Sep 22, 2017

@qianzhangxa Your options (1) and (3) are the same, unless you didn't mount the blkio cgroup at all until the second step. "Not joining" in this context means staying in the root -- all processes start in the root cgroup (once a hierarchy is mounted). The reason (3) has no performance impact is that blkio weighting between two cgroups is implemented by adding latency to competing cgroups (to avoid CFQ incorrectly giving one cgroup more weight than it should get). This logic doesn't apply in the root cgroup, because no cgroup competes with the root.

@qianzhangxa

@cyphar The blkio cgroup was automatically mounted when the OS was booted.

# mount | grep blkio 
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)

# dd if=/dev/zero of=test.bin bs=512 count=1000 oflag=dsync    
1000+0 records in
1000+0 records out
512000 bytes (512 kB, 500 KiB) copied, 7.7286 s, 66.2 kB/s

# echo $$ > /sys/fs/cgroup/blkio/cgroup.procs       
# dd if=/dev/zero of=test.bin bs=512 count=1000 oflag=dsync 
1000+0 records in
1000+0 records out
512000 bytes (512 kB, 500 KiB) copied, 1.25528 s, 408 kB/s

The above test was done on Ubuntu 17.04, and the test in my last post was done on Ubuntu 16.04. I also thought "not joining" in this context means staying in the root (all processes start in the root cgroup), so I expected (1) and (3) to give similar results. But as you can see in the test above, (1) has a significant performance problem compared with (3), which is really confusing me.

@hqhq
Contributor

hqhq commented Sep 26, 2017

@qianzhangxa There is no such thing as a process that hasn't joined a cgroup: every process is in some cgroup for each subsystem that is enabled and mounted. It is also not always true that a "not joined" process is in the root cgroup. In your case, your shell process was placed in a sub-cgroup controlled by systemd after the OS booted; you can confirm it with cat /proc/self/cgroup. On Ubuntu it is usually in a sub-cgroup like /user/1000.user/1.session.
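
For anyone following along, the check and the move back to the root hierarchy look like this (the sub-cgroup shown in the output will differ by distro and systemd version):

# grep blkio /proc/self/cgroup                   <--- e.g. ...:blkio:/user.slice means the shell is in a sub-cgroup
# echo $$ > /sys/fs/cgroup/blkio/cgroup.procs    <--- move the shell into the root blkio cgroup
# grep blkio /proc/self/cgroup                   <--- should now end in ":blkio:/"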

@qianzhangxa

Yes @hqhq! I see the process is initially in the sub-cgroup /sys/fs/cgroup/blkio/user.slice; that's why both (1) and (2) have the performance issue. Thanks!

@eero-t

eero-t commented Mar 24, 2020

it's not necessary for us to join cgroups if we're not going to set any limits (and we can always move the process on runc update anyway).

Moving isn't a valid strategy for all controllers; see the "Memory ownership" section in the documentation: https://www.kernel.org/doc/Documentation/cgroup-v2.txt

"A memory area is charged to the cgroup which instantiated it and stays
charged to the cgroup until the area is released. Migrating a process
to a different cgroup doesn't move the memory usages that it
instantiated while in the previous cgroup to the new cgroup."
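
As a rough cgroup-v1 sketch of that behaviour (the cgroup names are placeholders, memory.move_charge_at_immigrate is left at its default of 0, and exact numbers will vary):

# mkdir /sys/fs/cgroup/memory/old /sys/fs/cgroup/memory/new
# echo $$ > /sys/fs/cgroup/memory/old/cgroup.procs
# dd if=/dev/zero of=/tmp/fill.bin bs=1M count=100          <--- page cache charged to "old"
# echo $$ > /sys/fs/cgroup/memory/new/cgroup.procs          <--- migrate the shell
# cat /sys/fs/cgroup/memory/old/memory.usage_in_bytes       <--- the charge stays with "old"; it does not follow the process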
