podman stop: Unable to clean up network: unmounting network namespace: device or resource busy #19721
EBUSY should only be the error if the netns is still mounted. However, we unmount directly before the remove call, so that should not cause problems. The umount call uses MNT_DETACH, so it may not actually unmount right away. I have no idea why this flag is used or how it interacts with the nsfs mount points.
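To make the race concrete, here is a minimal Go sketch, assuming the cleanup path reduces to an unmount followed by a file removal (the function name and path handling are hypothetical; `unix.Unmount`, `unix.MNT_DETACH`, and `os.Remove` are real APIs). With MNT_DETACH the unmount is lazy, so the subsequent remove can run before the mount is actually gone and fail with EBUSY:

```go
// Hypothetical reduction of the cleanup path discussed above; not the
// actual podman code.
package netnscleanup

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// cleanupNetns unmounts the bind-mounted netns file, then unlinks it.
func cleanupNetns(path string) error {
	// MNT_DETACH makes the unmount lazy: the syscall returns at once,
	// but the mount may only disappear later, in the background.
	if err := unix.Unmount(path, unix.MNT_DETACH); err != nil {
		return fmt.Errorf("unmounting network namespace: %w", err)
	}
	// If the lazy unmount has not completed yet, this unlink can fail
	// with EBUSY, matching the error in this issue's title.
	if err := os.Remove(path); err != nil {
		return fmt.Errorf("removing network namespace file: %w", err)
	}
	return nil
}
```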
Seeing this a lot, but again, usually with #19702 ("Storage ... removed") and I can't dual-assign flakes. So since that one already has a ton of sample failures, I'm going to start assigning flakes here. Maybe alternating. Here are last night's failures:
One of those also includes this fun set of errors; does this help?
I've got a little more data. There seem to be two not-quite-identical failure modes, depending on root or rootless:
That is:
They're too similar for me to split this into two separate issues, but I'll listen to the opinion of experts. HTH.
Do you think you can remove the MNT_DETACH?
I have no idea why it was added in the first place; maybe it is needed? Git blame goes all the way back to 8c52aa1, which claims it needs MNT_DETACH but provides no explanation at all of why. @mheon
The root issue must be something else entirely; the symptoms look like we try to clean up twice.
WHEW! After much suffering, I removed MNT_DETACH.
I think I originally added the MNT_DETACH flag because we were seeing intermittent failures during cleanup due to the namespace still being in use, and I was expecting that making the unmount lazy would resolve things.
I'm giving up on this: I am pulling the stderr-on-teardown checks from my flake-check PR. It's too much, costing me way too much time between this and #19702. Until these two are fixed, I can't justify the time it takes me to sort through these flakes. FWIW, here is the catalog so far:
Seen in: int podman fedora-37/fedora-38/rawhide root/rootless container/host boltdb/sqlite
Seen in: int/sys fedora-37/fedora-38/rawhide root/rootless container/host boltdb/sqlite
I see different errors posted here, and they are not all the same! The original report is the EBUSY error from the title. Just to confirm, I looked at https://git.kernel.org/pub/scm/network/iproute2/iproute2.git/tree/ip/ipnetns.c#n735 to see how iproute2 removes a netns, and this is not something I can understand: the unlink should not fail with EBUSY if the path has been unmounted. But I also see ENOENT errors, and for those there must be some way that we, for whatever reason, end up in the cleanup path twice.
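For comparison, here is a minimal Go sketch of the netns-file lifecycle that iproute2's ipnetns.c implements, under the assumption that podman's netns files work the same way (the helper names are illustrative; `unix.Mount` and `unix.Unmount` are real calls). The key point of the comment above: once the unmount has succeeded, the unlink should not see EBUSY.

```go
// Illustrative sketch mirroring what iproute2's ipnetns.c does with named
// network namespaces; not podman's actual code.
package netnsdemo

import (
	"os"

	"golang.org/x/sys/unix"
)

// bindNetns shows how a named netns file is created.
func bindNetns(path string) error {
	// The netns file starts life as an empty regular file...
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	f.Close()
	// ...onto which the current process's netns is bind-mounted.
	return unix.Mount("/proc/self/ns/net", path, "none", unix.MS_BIND, "")
}

// deleteNetns shows the removal side: unmount first, then unlink.
func deleteNetns(path string) error {
	// iproute2 also passes MNT_DETACH here; EINVAL means the path was
	// not mounted at all, which we tolerate.
	if err := unix.Unmount(path, unix.MNT_DETACH); err != nil && err != unix.EINVAL {
		return err
	}
	return os.Remove(path)
}
```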
I do see the
That is my fault, and I'm sorry. When scanning flakes to assign them to buckets, I look for common patterns but don't always compare everything exactly. I will be more careful.
New flake. Not quite the same error message, but similar enough that I'm assigning here. From f39 rootless:
Another ENOENT, in f39 rootless:
The latest list. Note that some of these are ENOENT, and some are EBUSY. Until given a reason to treat these as different bugs, I will continue to lump them together.
If it provides any clues, I am observing similar behaviour when the container is configured with the restart policy unless-stopped.
With #18442, this is now blowing up. Different error messages, but I'm pretty sure it's all the same bug. New EACCES variant:
and the old ENOENT variant:
...and an ENOENT variant with a shorter error message:
Here are today's failures, plus one from January.
Still happening with brand-new (March 19) VMs. Three failures in just one CI run:
- int podman fedora-38 root host boltdb
- int podman fedora-39 rootless host sqlite
- int podman rawhide root host sqlite
Is this (f39 rootless) the same error???
I'm going to assume it is, and file it as such, unless told otherwise.
No, that is most likely something different.
My periodic ping. This seems to be happening a lot more with recent VMs.
Where are we on this one? I just saw this failure, f40 root, on VMs from containers/automation_images#349 with netavark 1.10.3-3:
(The error in this one is ENOENT, not EBUSY.)
FYI: the reason you see this more is that I enabled the warnings check in AfterEach() in #18442, so previously we just didn't see them. In the logs above they all failed in AfterEach. As mentioned before and in other issues, the problem is that something tries to clean up twice, but I cannot see why or where that would happen.
Here's one with a lot more context; does that help? (Three total failures in this log, so be sure to hit Page-End, then click on each individual failure.)
The "pid file blah" message is new, does it help? In f39 rootless:
|
No, not really.
Any workarounds? I have the same issue.
This is (mostly) a flake tracker for CI. If you have a reproducer for this issue, I would be very happy if you could share it.
@Luap99 It does indeed seem to happen when using a wrong configuration. Sometimes the only workaround is to fully reboot the machine, as the rootlessport processes stay open.
Can you share the podman commands? What does "wrong configuration" mean?
@Luap99 I'm using Podman Quadlet. The problem occurs when a container fails at some point during startup, such as when you have accidentally introduced a configuration mistake. It's possible to stop the container, but for some reason the network process doesn't know what to do anymore, and you end up with the IPAM error or the "cannot clean up network" error. The only way out is to kill all of Podman's network-related processes, but even that may not be enough. After rebooting the machine, the container and its configured network start just fine. I haven't used Docker rootless on Linux for a long time, but I can remember the same problem there. It just seems to lose track of the networking and doesn't know what has actually been killed or not. As I said before, the rootlessport processes stay open, even when you stop the container.
One example that triggers this is a webserver image: add a site/server block to the configuration with a deliberate mistake in it. When you now start the container(s), the main process crashes because of the configuration error. So you stop the container and adjust the configuration, but when starting again, you end up with network errors. Sometimes it can be fixed by killing all related processes. This is my full config, if you're interested: https://github.com/foxws/foxws/tree/main/podman
This one has been happening A LOT in my no-retry PR. And, in case it helps, also in my parallel-system-tests PR.
Just as an FYI, here is the past week's worth of this bug, i.e. the instances where "netns ref counter out of sync" does not appear in the error message:
Funny observation about this one, and I'm not 100% certain, but I think that when I see this one (the one without "netns sync"), (1) it happens in pairs on the same CI job, and (2) I see
Podman might call us more than once on the same path. If the path is not mounted or does not exist, simply return no error. Second, retry the unmount/remove until it succeeds. For some reason we must use MNT_DETACH, as otherwise the unmount call will fail all the time. However, MNT_DETACH means it unmounts asynchronously in the background. Now, if we call remove on the file and the unmount is not yet done, the remove will fail with EBUSY. In this case we try again until it works or we get another error. This should help containers/podman#19721
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
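A hedged Go sketch of the retry strategy this commit message describes (the function name, retry count, and backoff are illustrative, not the actual netavark/podman implementation): tolerate a second call on an already-cleaned path, and retry the unlink while the lazy unmount completes.

```go
// Illustrative sketch of the commit's retry approach; not the real code.
package netnscleanup

import (
	"errors"
	"fmt"
	"os"
	"time"

	"golang.org/x/sys/unix"
)

// removeNetnsFile tolerates being called twice and retries the unlink
// while the lazy (MNT_DETACH) unmount finishes in the background.
func removeNetnsFile(path string) error {
	if err := unix.Unmount(path, unix.MNT_DETACH); err != nil {
		// EINVAL: not mounted; ENOENT: already gone. Either way,
		// a second cleanup call is not an error.
		if errors.Is(err, unix.EINVAL) || errors.Is(err, unix.ENOENT) {
			return nil
		}
		return fmt.Errorf("unmounting network namespace: %w", err)
	}
	// MNT_DETACH is asynchronous, so the unlink can briefly see EBUSY.
	for i := 0; i < 50; i++ {
		err := os.Remove(path)
		if err == nil || errors.Is(err, os.ErrNotExist) {
			return nil
		}
		if !errors.Is(err, unix.EBUSY) {
			return fmt.Errorf("removing network namespace file: %w", err)
		}
		time.Sleep(10 * time.Millisecond) // arbitrary backoff for this sketch
	}
	return fmt.Errorf("removing %s: still busy after retries", path)
}
```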
We cannot unlock and then lock again without syncing the state, as this will then save a potentially old state, causing very bad things such as double netns cleanup issues. The fix here is simple: move the saveContainerError() under the same lock. The comment about the re-lock is just wrong: not doing this under the same lock would cause us to update the error after something else had already changed the container. Most likely this was caused by a misunderstanding of how Go defers work. Given that they run Last In, First Out (LIFO), it is safe as long as our defer function comes after the defer unlock() call. I think this issue is very bad and might have caused a variety of other weird flakes. In fact, I am confident that this fixes the double cleanup errors. Fixes containers#21569 Also fixes the netns removal ENOENT issues seen in containers#19721.
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
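A small runnable Go sketch of the defer-ordering argument (saveContainerError is named after the helper in the commit message; the container type here is hypothetical). Because defers run LIFO, registering the error-saving defer after `defer mu.Unlock()` means it executes before the unlock, i.e. while the lock is still held:

```go
package main

import (
	"fmt"
	"sync"
)

type container struct {
	mu      sync.Mutex
	lastErr error
}

// saveContainerError is a hypothetical stand-in for the helper named in
// the commit message; here it just records the error on the container.
func saveContainerError(c *container, err error) {
	c.lastErr = err
}

func cleanup(c *container) (retErr error) {
	c.mu.Lock()
	defer c.mu.Unlock() // registered first, so it runs LAST (LIFO)
	// Registered second, so it runs FIRST: the error is saved while the
	// lock is still held, avoiding the unlock/relock window the commit
	// message describes.
	defer func() {
		if retErr != nil {
			saveContainerError(c, retErr)
		}
	}()
	// ... actual cleanup work would go here ...
	return fmt.Errorf("simulated cleanup failure")
}

func main() {
	c := &container{}
	_ = cleanup(c)
	fmt.Println("saved error:", c.lastErr)
}
```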
Fixed in #23519
Seeing this quite often in my PR with #18442 cherry-picked:
Example: int f38 rootless.
It almost always happens together with the "Storage ... removed" flake (#19702), e.g.:
So I mostly file under that issue, because my flake tool has no provision for multiple buckets.
No pattern yet (that I can see) in when it fails, which tests, etc.