sysbox-fs high cpu usage, infinite unmount calls on shared device #808
Hi @sibidharan, thanks for trying Sysbox and reporting the issue.
I don't know that sharing the device into the container will work as expected; since Sysbox containers use the Linux user namespace, the device will probably show up with restricted ownership/permissions inside the container.

Regarding the continuous unmounts: maybe there's a bug in Sysbox, but let me provide a bit of context so you can better understand what's going on (and hopefully help us root cause the problem). Sysbox intercepts the `umount` syscall issued by processes inside the container. My guess is that systemd inside the Sysbox container is unmounting the mounts under `/run/systemd/mount-rootfs` over and over.

Let's start by having you post the output of `findmnt` from inside the container. Thanks!
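A sketch of how that output might be captured (the `docker exec` form and the grep filter are assumptions, not from the original comment):

```sh
# List the container's mount table and filter for the systemd staging
# path that shows up in the sysbox-fs log.
docker exec <container> findmnt -o TARGET,SOURCE,FSTYPE | grep mount-rootfs
```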
Yes, for this I created a udev rule that chmods the device node so it's accessible in the container. As for the unmounts, I've posted the requested output.
Hi @sibidharan, thanks for the info. Looking again at this, I think systemd is actually recursively bind-mounting the container's rootfs (at `/run/systemd/mount-rootfs`) and then unmounting it. Assuming I am right, you can try the following to manually reproduce (inside the Sysbox container):
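The exact commands were lost in extraction; based on the description, the reproduction presumably looks something like this:

```sh
# Recursively bind-mount the container's rootfs the way systemd does,
# then tear it down; each submount's unmount is trapped by sysbox-fs.
mkdir -p /run/systemd/mount-rootfs
mount --rbind / /run/systemd/mount-rootfs
umount -R /run/systemd/mount-rootfs
```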
After the last command, you should see the unmount attempts show up in the sysbox-fs log.
Thank you for this valuable input. I was able to boil down the process that's issuing the infinite unmount calls under `/run/systemd/mount-rootfs`.
I indeed took the `findmnt` output as you suggested. Here is the output:
I see ... interesting; it must be related to systemd though, since it's trying to unmount under a systemd path (`/run/systemd/mount-rootfs`).
Sorry, I meant: take the `findmnt` while the issue is actively occurring. It may be that you need to issue it repeatedly (e.g., under `watch`) to catch the short-lived mounts.
Yes, I ran `findmnt` under `watch`.
Got it ... yes, drop the interval to 0.5 secs (the minimum supported, I believe). Hopefully that works; if it doesn't then we need to think of another approach to debug it ...
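Presumably something like this (the exact command was lost in extraction; `watch` and `findmnt` are standard procps/util-linux tools):

```sh
# Sample the container's mount table twice per second to catch the
# short-lived systemd bind mount before it is torn down again.
watch -n 0.5 'findmnt | grep mount-rootfs'
```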
If that doesn't work, maybe the unmount is happening in a dedicated mount namespace. In that case, from inside the sysbox container, find the PID of the process issuing the unmounts and run `findmnt` from within its mount namespace (e.g., via `nsenter`).
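A sketch of that approach (the concrete commands were lost; the PID has to come from the sysbox-fs log's "Received umount syscall from pid ..." lines):

```sh
# Enter the mount namespace of the process issuing the umount calls and
# inspect what's mounted under the systemd staging path.
PID=1092497   # the pid reported in the sysbox-fs log; substitute your own
nsenter -t "$PID" -m -- findmnt -R /run/systemd/mount-rootfs
```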
Hi @ctalledo, sorry for the delayed response. I found more info about the process issuing the unmounts: it's the `e2scrub` service, started by its systemd timer unit, and this timer starts it automatically every Sunday. I guess this is not needed inside the container, am I correct? Is disabling these services enough, or does this have to be fixed in Sysbox as well?
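For reference, the schedule can be inspected like this (standard systemd tooling, not from the original comment; `e2scrub_all.timer` is the stock e2fsprogs unit name):

```sh
# Show when the e2scrub timer last fired and when it fires next;
# the thread notes it fires every Sunday.
systemctl list-timers e2scrub_all.timer
systemctl cat e2scrub_all.timer
```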
However, nothing showed up under `watch`. I also tried the `nsenter` approach, but `nsenter` didn't work, btw. Hope this is enough info for you; let me know if you want more.
Hi @sibidharan, thanks for the extra info.
Since we are having trouble understanding the issue fully, I can't yet say if it's a Sysbox bug or not. I think the easiest path would be to disable the `e2scrub` services inside the container.
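Masking the units inside the container would look roughly like this (unit names are the stock e2fsprogs ones; adjust to what the image actually ships):

```sh
# Stop and mask the e2scrub units so systemd can never start them again.
systemctl stop e2scrub_all.timer e2scrub_reap.service
systemctl mask e2scrub_all.service e2scrub_all.timer e2scrub_reap.service
```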
I can confirm that disabling all e2scrub-related services fixes the issue as of now.
Hey @ctalledo, thank you for looking into that. I'm experiencing the same problem, but with a slight difference: in my case, e2scrub is not running inside the container at all, and disabling the service on the Sysbox host also doesn't resolve the issue. The CPU is still at 100% usage due to sysbox-fs. The logs are pretty much the same. Is there any solution for this?
The issue appeared out of nowhere after rebooting the host. I didn't perform any updates or make any changes. I simply rebooted the hosts, and the problem appeared on three out of five of them. Furthermore, when I try to set up new hosts, they also exhibit this issue right from the start. This is very strange, and there seems to be some dependency on something that I can't figure out.
Hi @bushev,
Mmm ... strange. Must be the workload inside the sysbox container triggering the problem. Is there an image I can use to reproduce? Any info that allows me to reproduce this locally would help. For context, one area that could trigger high CPU usage by sysbox-fs is when the Sysbox container is executing lots of the syscalls Sysbox traps (e.g., `mount` and `umount`), or accessing the `/proc` and `/sys` files that sysbox-fs emulates.
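One rough way to check for such a syscall storm from the host (generic strace usage; the container name is a placeholder):

```sh
# Attach to the container's init and count umount(2) calls it (and any
# children it forks while traced) make over ten seconds.
CONT_PID=$(docker inspect -f '{{.State.Pid}}' <container>)
sudo timeout 10 strace -f -c -e trace=umount2 -p "$CONT_PID"
```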
We are not making many requests of that kind. Our image is over 20 GB, so it's not easy to share directly. I'll need to slim it down and make sure the issue still reproduces, which will take some time. If the problem persists, I'll get back to you.
Since this issue affected our production environment, I had to find a quick workaround to keep the service running. The only solution I've found so far is to restart the Sysbox service on the host immediately after starting each new container. This drops the CPU load from 80% down to just a few percent. So far, I haven't noticed any disruptions in the operation of the containers or the applications running inside them. I wanted to share this workaround here in case anyone else encounters the same problem and needs a temporary solution.
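In script form, the workaround presumably amounts to this (assuming the stock `sysbox` systemd service name; the image name is a placeholder):

```sh
# Start the container, then bounce Sysbox; per the report above this
# drops sysbox-fs CPU usage from ~80% to a few percent.
docker run --runtime=sysbox-runc -d --name app <image>
sudo systemctl restart sysbox
```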
Hi, I can reproduce a similar issue using the nestybox docker image. Hope that helps :)

```yaml
services:
  dind-sandbox:
    image: nestybox/ubuntu-noble-systemd-docker:latest
    container_name: dind_sandbox
    restart: unless-stopped
    runtime: sysbox-runc
```

Note: enabling debug logs on sysbox-fs slows it down enough that it no longer uses all the CPU.
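To reproduce with it (standard compose usage; the `pgrep` filter is an addition, not from the comment):

```sh
docker compose up -d
top -p "$(pgrep -f sysbox-fs | head -1)"   # watch sysbox-fs CPU usage climb
```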
[EDIT]: Added sysbox info.
Hi @Kl0ven, thanks for reporting. Yes, I am able to reproduce with the compose file above. Thanks.
For the reproduction above, I tracked the high CPU utilization to a service running inside the container. I need to understand what that service is doing and why it's consuming so many CPU cycles when running inside a Sysbox container. I suspect it's walking the container filesystem and causing high CPU utilization once it hits the files under the `/proc` and `/sys` subtrees that sysbox-fs emulates. I disabled the service and CPU utilization goes back to normal.
I opted for a quick solution of disabling the offending service in the image. I've pushed the new image to Dockerhub already, so you should not see the problem using that image. A "proper fix" will require digging down a bit more to figure out what that service is doing, when running inside a Sysbox container, that causes CPU utilization to be so high.
Should we also disable the related timer?
Hi @sibidharan, thanks; yes, we should disable that timer too.
I have narrowed down the issue and created a minimal Dockerfile to reproduce the problem. It appears that the anomaly occurs after installing `mono-devel`:

```dockerfile
FROM nestybox/ubuntu-noble-systemd-docker:latest AS base

RUN apt update && apt install -y mono-devel

CMD ["/sbin/init"]
```

The following commands illustrate the build and runtime environment:

```sh
docker build -t test -f Dockerfile .
docker run --runtime sysbox-runc -it --name test --rm test
```
When I launch a KASM container with Sysbox, sharing a GPU via `--device=/dev/dri/renderD128`, the sysbox-fs logs go crazy and CPU usage goes high. I enabled debug logs and I see this. If I restart the sysbox-fs service, the issue goes away temporarily on deployed containers (though I am unable to `docker exec` into the running containers afterwards), but if I deploy a new container, the issue starts again, whether from sharing devices or something else (?).

Any idea what causes the infinite loop of `/run/systemd/mount-rootfs/sys/devices/virtual` unmount calls, which goes away when sysbox-fs is restarted?

Log file: sysbox-fs.log

(After some researching...)

I can see a lot of "Received umount syscall from pid 1092497" messages for different targets, and those seem to go through fine. I searched for the first occurrence of `umount` in the log file, tracing every umount call. From line 7379 of the log file we can see the first occurrence of an umount call to `/run/systemd/mount-rootfs/sys/devices/virtual` that gets ignored, and from then on it's just an infinite loop. For every container I deploy with a device this adds up, and the log file fills with these messages; I have to turn off the debug log or it consumes a lot of storage. It just doesn't stop, and it only happens if I pass `--device=/dev/dri/renderD128`. With the little knowledge I have, I understand that these infinite umount calls are related to the device I passed, which somehow causes an infinite loop.

I went through the code at https://github.com/nestybox/sysbox-fs/blob/master/nsenter/utils.go; this file could potentially enter a cleanup loop that repeatedly sends unmount calls, which later get ignored by seccomp (as shown in the log) here: https://github.com/nestybox/sysbox-fs/blob/4c2bc153f33af1bd30a227a14ecfc8174ff280d5/seccomp/umount.go#L128

Can we skip unmounting these devices, which are sure to get ignored by seccomp, thus saving a lot of CPU? Is my understanding of what's going on correct? If so, how do we solve this issue?
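For anyone triaging a similar log, a quick way to quantify the loop (the log path assumes the default sysbox-fs location; the message text is from the excerpt above):

```sh
# Count the repeated, ignored umount attempts against the systemd path.
grep -c 'umount.*mount-rootfs/sys/devices/virtual' /var/log/sysbox-fs.log
```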