
Joining cgroups blindly causes performance problems #861

Open
cyphar opened this issue May 31, 2016 · 22 comments

@cyphar
Member

cyphar commented May 31, 2016

It turns out that joining cgroups we don't use causes non-zero performance degradation. The most obvious case is blkio, which can make operations 10 times slower. The following test assumes you have some block device /dev/sdX formatted as ext4 (this test used a spinning hard drive, but a USB flash drive would also work):

# mount /dev/sdX /workspace
# echo $$ > /sys/fs/cgroup/blkio/cgroup.procs
# time dd if=/dev/zero of=/workspace/test.bin bs=512 count=1000 oflag=dsync
1000+0 records in
1000+0 records out
512000 bytes (512 kB) copied, 2.09553 s, 244 kB/s

real    0m2.097s
user    0m0.000s
sys     0m0.144s
# mkdir /sys/fs/cgroup/blkio/test
# echo $$ >/sys/fs/cgroup/blkio/test/cgroup.procs
# time dd if=/dev/zero of=/workspace/test.bin bs=512 count=1000 oflag=dsync
1000+0 records in
1000+0 records out
512000 bytes (512 kB) copied, 19.5512 s, 26.2 kB/s

real    0m19.553s
user    0m0.000s
sys     0m0.132s 

This is already a known issue upstream (in the kernel), but it is a general problem (most cgroup controllers have special cases for their root cgroup to maximise performance -- but few have optimisations to reduce the hierarchy based on which cgroups have limits set).

Unfortunately, the naive solution (not joining cgroups if we don't intend to use them in config.json) causes obvious issues with runc update (and commands that make assumptions about which cgroups we've joined). So we'd have to write quite a bit of code to create new cgroups and join container processes to them if the user requests a limit that wasn't required before. We could do it with the freezer cgroup and some enumeration.
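For illustration, the manual equivalent of that deferred join would look roughly like this (the cgroup name, the weight value, and $CONTAINER_PID are placeholders; runc would do this through libcontainer rather than a shell):

# The container starts with no blkio limits, so its processes stay in the root blkio cgroup.
# Later, when a blkio limit is requested via "runc update":
# mkdir -p /sys/fs/cgroup/blkio/mycontainer
# echo 500 > /sys/fs/cgroup/blkio/mycontainer/blkio.weight              <--- apply the newly requested limit
# echo $CONTAINER_PID > /sys/fs/cgroup/blkio/mycontainer/cgroup.procs   <--- only now join the cgroup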

This is (somewhat) related to the lazy cgroup work that we should do as a part of #774.

The performance issue described in moby/moby#21485 occurs because of this.

@dqminh
Contributor

dqminh commented May 31, 2016

Hmm, the performance degradation is known, but I didn't know it was that bad for blkio 😵
Do you have a link to the related mailing-list discussion, if there is one?
On the other hand, not joining cgroups that aren't specified in config.json SGTM.

@cyphar
Member Author

cyphar commented May 31, 2016

@dqminh Unfortunately I discovered this after discussions with our kernel team on internal mailing lists. I'll ask them if they can link to an upstream discussion, though.

As for not joining unspecified cgroups, I'll work on this once I figure out why Ubuntu isn't affected by this as badly.

@cyphar
Member Author

cyphar commented Jun 7, 2016

This could potentially also be fixed by the lazy cgroup handling (by only attaching to cgroups that we are using, and then attaching later if a user tries to update the limits).

/cc @brauner @monstermunchkin

@dqminh
Contributor

dqminh commented Jun 8, 2016

@cyphar can you add instructions on how to replicate this issue reliably (host environment, kernel version, etc.)? I've tried both on bare metal with kernel 4.4 and on an Ubuntu Xenial VM on DigitalOcean, but was unable to replicate it.

@gfyrag

gfyrag commented Jun 24, 2016

Hello, same problem here.
I cannot find the configuration to apply in the config.json file; do you know where I can find it?

@lanrat

lanrat commented Jun 28, 2017

I too am running into this issue. Has anyone found a solution, even if it is a hack or temporary?

I haven't found a way to get runc to work without cgroups...

@rodrigooshiro

That's old... It was fixed in newer versions of Docker; in my case the problem was solved by changing a systemd config file:

Read more about it here:
kubernetes/kubernetes#39682

LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
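
For anyone else applying that fix, those directives belong in the [Service] section of the Docker unit; a systemd drop-in is the usual way to add them (the file name and path below are just one common choice):

# mkdir -p /etc/systemd/system/docker.service.d
# cat > /etc/systemd/system/docker.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
EOF
# systemctl daemon-reload
# systemctl restart docker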

@cyphar
Member Author

cyphar commented Jun 29, 2017

@ipeoshir That is a separate issue. I found the rlimit problem while trying to figure out this one; those limits are not cgroup limits.

@qianzhangxa

@cyphar It seems that I cannot reproduce this issue in my environment:

# pwd
/var/lib/qzhang
# df -mh .
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdb        37G   76M   35G   1% /var/lib
# mount | grep xvdb
/dev/xvdb on /var/lib type ext4 (rw,relatime,seclabel,data=ordered)
/dev/xvdb on /var/lib/docker/overlay type ext4 (rw,relatime,seclabel,data=ordered)

# echo $$ > /sys/fs/cgroup/blkio/cgroup.procs
# time dd if=/dev/zero of=test.bin bs=512 count=1000 oflag=dsync
1000+0 records in
1000+0 records out
512000 bytes (512 kB, 500 KiB) copied, 0.451274 s, 1.1 MB/s

real    0m0.490s
user    0m0.000s
sys     0m0.035s

# mkdir /sys/fs/cgroup/blkio/test
# echo $$ >/sys/fs/cgroup/blkio/test/cgroup.procs
# time dd if=/dev/zero of=test.bin bs=512 count=1000 oflag=dsync
1000+0 records in
1000+0 records out
512000 bytes (512 kB, 500 KiB) copied, 0.481559 s, 1.1 MB/s

real    0m0.484s
user    0m0.000s
sys     0m0.041s

So as you can see, there is no performance degradation in my test. I am using CoreOS and CentOS 7.2; the kernel versions are 4.7.3-coreos-r3 and 3.10 (CentOS 7.2). So it seems this is not a general performance issue? Or at least it does not apply to CoreOS and CentOS 7.2? Also, do you know if there is a kernel ticket tracking this issue?

@cyphar
Member Author

cyphar commented Aug 25, 2017

@qianzhangxa We had an internal bug about this issue. I will have to go check again what the exact reproducer was, and double-check that it still occurs, but it was definitely happening on a stock kernel the last time I tried. What I/O scheduler are you using? I believe this bug doesn't happen with deadline; it only happens with CFQ.
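
For reference, the scheduler in use can be checked and switched per block device like this (sdX is a placeholder for the device under test; the scheduler shown in brackets is the active one):

# cat /sys/block/sdX/queue/scheduler            <--- e.g. "noop deadline [cfq]"
# echo cfq > /sys/block/sdX/queue/scheduler     <--- switch to CFQ to try to reproduce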

@qianzhangxa

Yes, I am using the deadline scheduler; I will try CFQ later.

Also, I see there is a CFQ control policy in the blkio cgroup. Do you know the difference between the CFQ control policy and the CFQ scheduler? Can I use the CFQ control policy without the CFQ scheduler?

@cyphar
Member Author

cyphar commented Aug 25, 2017

I assume you'd need to use the CFQ scheduler in order to use the control policies that relate to it? I'm not sure though.

@qianzhangxa

@cyphar I did some experiments, and I think this performance issue happens only when the I/O scheduler for the disk is set to cfq and the filesystem is ext4/ext3 mounted with the data=ordered option.

Also, to use the blkio cgroup control functionality we have to set the I/O scheduler to cfq; if it is set to deadline, the blkio.weight, blkio.weight_device, and blkio.leaf_weight[_device] proportional-weight policy files will NOT take effect.
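
As a quick illustration of those proportional-weight knobs (the cgroup names and weight values below are arbitrary, and they only take effect while the device is using CFQ):

# mkdir /sys/fs/cgroup/blkio/low /sys/fs/cgroup/blkio/high
# echo 100 > /sys/fs/cgroup/blkio/low/blkio.weight        <--- small share under contention
# echo 1000 > /sys/fs/cgroup/blkio/high/blkio.weight      <--- large share under contention
# echo $$ > /sys/fs/cgroup/blkio/low/cgroup.procs         <--- this shell's IO is now weighted at 100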

@cyphar
Member Author

cyphar commented Aug 28, 2017

@qianzhangxa

I did some experiments, and I think this performance issue will happen only when the IO scheduler for the disk is set to cfq and the filesystem is ext4/ext3 with the data=ordered option.

The performance issue is only this obvious with ext4 and data=ordered. In the original bug report it looks like a core CFQ problem (and some discussions with kernel devs have confirmed this). My main reason for opening this issue is that it's not necessary for us to join cgroups if we're not going to set any limits (and we can always move the process on runc update anyway) -- with the obvious exception of the cgroupsPath setting. There are other possible performance implications in some other cgroups, but they're not as obvious.
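
In other words, a container started without blkio limits could stay in the root blkio cgroup until something like the following is run, at which point runc would create the per-container blkio cgroup and move the processes into it (the container name is a placeholder; check runc update --help on your version for the exact flag):

# runc update --blkio-weight 500 mycontainer    <--- first blkio limit requested after start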

@qianzhangxa

@cyphar Got it, thanks!

BTW, for that core CFQ problem, do you know if there is any ticket in the kernel community tracking it? I'd like to get more details.

@cyphar
Member Author

cyphar commented Aug 28, 2017

I don't think there is one. Most of the discussion was on some internal mailing lists with our kernel devs, and I think the conversation stalled.

@qianzhangxa

@cyphar I found that this performance issue may not be related to the blkio cgroup, because even when a process has not joined any blkio cgroup, the performance issue still happens.

# At this point, the shell process does not join any blkio cgroup
# dd if=/dev/zero of=test.bin bs=512 count=1000 oflag=dsync         
1000+0 records in
1000+0 records out
512000 bytes (512 kB, 500 KiB) copied, 16.2307 s, 31.5 kB/s    <--- Performance issue

# echo $$ > /sys/fs/cgroup/blkio/test/cgroup.procs    <--- Join a sub blkio cgroup
# dd if=/dev/zero of=test.bin bs=512 count=1000 oflag=dsync
1000+0 records in
1000+0 records out
512000 bytes (512 kB, 500 KiB) copied, 16.252 s, 31.5 kB/s    <---Performance issue

# echo $$ > /sys/fs/cgroup/blkio/cgroup.procs    <--- Join the root blkio cgroup
# dd if=/dev/zero of=test.bin bs=512 count=1000 oflag=dsync 
1000+0 records in
1000+0 records out
512000 bytes (512 kB, 500 KiB) copied, 1.19245 s, 429 kB/s    <--- No performance issue

So I think the behavior is:

  1. If the process does not join any blkio cgroups, the performance issue will happen.
  2. If the process joins a sub blkio cgroup, the performance issue will happen.
  3. If the process joins the root blkio cgroup, the performance issue will not happen.

What confuses me is (3): I am not sure why the performance issue disappears only when the process joins the root blkio cgroup. Any ideas?

@cyphar
Member Author

cyphar commented Sep 22, 2017

@qianzhangxa Your options (1) and (3) are the same, unless you didn't mount the blkio cgroup at all until the second step. "Not joining" in this context means staying in the root -- all processes start in the root cgroup (once a hierarchy is mounted). The reason (3) has no performance impact is that blkio weighting between two cgroups is implemented by adding latency to competing cgroups (to avoid CFQ incorrectly giving one cgroup more weight than it should get). This logic doesn't apply in the root cgroup, because no cgroup competes with the root.

@qianzhangxa

@cyphar The blkio cgroup was automatically mounted when the OS was booted.

# mount | grep blkio 
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)

# dd if=/dev/zero of=test.bin bs=512 count=1000 oflag=dsync    
1000+0 records in
1000+0 records out
512000 bytes (512 kB, 500 KiB) copied, 7.7286 s, 66.2 kB/s

# echo $$ > /sys/fs/cgroup/blkio/cgroup.procs       
# dd if=/dev/zero of=test.bin bs=512 count=1000 oflag=dsync 
1000+0 records in
1000+0 records out
512000 bytes (512 kB, 500 KiB) copied, 1.25528 s, 408 kB/s

The above test was done on Ubuntu 17.04, and the test in my last post was done on Ubuntu 16.04. I also thought "not joining" in this context means staying in the root (all processes start in the root cgroup), so I expected (1) and (3) to give similar results. But as you can see in the test above, (1) has a significant performance problem compared with (3), which is really confusing me.

@hqhq
Contributor

hqhq commented Sep 26, 2017

@qianzhangxa There is no such thing as a process that hasn't joined a cgroup: every process is in some cgroup for each subsystem that is enabled and mounted. It is also not always true that a "not joined" process is in the root cgroup. In your case, your shell process was placed in a sub-cgroup controlled by systemd after the OS booted; you can confirm it with cat /proc/self/cgroup. On Ubuntu it is usually in a sub-cgroup like /user/1000.user/1.session.
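
For anyone following along, the check and the move back to the root hierarchy look like this (the sub-cgroup shown in the output will differ by distro and systemd version):

# grep blkio /proc/self/cgroup                   <--- e.g. ...:blkio:/user.slice means the shell is in a sub-cgroup
# echo $$ > /sys/fs/cgroup/blkio/cgroup.procs    <--- move the shell into the root blkio cgroup
# grep blkio /proc/self/cgroup                   <--- should now end in ":blkio:/"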

@qianzhangxa

Yes @hqhq! I see the process is initially in the sub-cgroup /sys/fs/cgroup/blkio/user.slice; that's why both (1) and (2) have the performance issue. Thanks!

@eero-t

eero-t commented Mar 24, 2020

it's not necessary for us to join cgroups if we're not going to set any limits (and we can always move the process on runc update anyway).

Moving isn't a valid strategy for all controllers; see the "Memory ownership" section in the documentation: https://www.kernel.org/doc/Documentation/cgroup-v2.txt

"A memory area is charged to the cgroup which instantiated it and stays
charged to the cgroup until the area is released. Migrating a process
to a different cgroup doesn't move the memory usages that it
instantiated while in the previous cgroup to the new cgroup."
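
As a rough cgroup-v1 sketch of that behaviour (the cgroup names are placeholders, memory.move_charge_at_immigrate is left at its default of 0, and exact numbers will vary):

# mkdir /sys/fs/cgroup/memory/old /sys/fs/cgroup/memory/new
# echo $$ > /sys/fs/cgroup/memory/old/cgroup.procs
# dd if=/dev/zero of=/tmp/fill.bin bs=1M count=100          <--- page cache charged to "old"
# echo $$ > /sys/fs/cgroup/memory/new/cgroup.procs          <--- migrate the shell
# cat /sys/fs/cgroup/memory/old/memory.usage_in_bytes       <--- the charge stays with "old"; it does not follow the process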
