podman stop: Unable to clean up network: unmounting network namespace: device or resource busy #19721

Closed
edsantiago opened this issue Aug 23, 2023 · 43 comments
Labels: flakes (Flakes from Continuous Integration)

@edsantiago
Member

Seeing this quite often in my PR with #18442 cherrypicked:

? Enter [AfterEach] TOP-LEVEL
$ podman [options] stop --all -t 0
time="2023-08-22T18:28:28-05:00" level=error msg="Unable to clean up network for container XYZ:
    "unmounting network namespace for container XYZ:
        failed to remove ns path /run/user/3418/netns/netns-9965a9b5-facb-7fe9-44e3-f99ec7d69365:
            remove /run/user/3418/netns/netns-9965a9b5-facb-7fe9-44e3-f99ec7d69365: device or resource busy\""

Example: int f38 rootless.

It almost always happens together with the "Storage ... removed" flake (#19702).

So I mostly file these under that issue, because my flake tool has no provision for multiple buckets.

No pattern yet (that I can see) in when it fails, which tests, etc.

@edsantiago edsantiago added the flakes Flakes from Continuous Integration label Aug 23, 2023
@Luap99
Member

Luap99 commented Aug 24, 2023

EBUSY should only be returned if the netns is still mounted. However, we unmount directly before the remove call, so that should not cause problems.

The umount call uses MNT_DETACH, so it may not actually unmount right away. I have no idea why this flag is used or how it interacts with the nsfs mount points.
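A minimal Go sketch of the sequence under discussion: a lazy (MNT_DETACH) unmount of the netns bind mount, immediately followed by removing the mount-point file. This is an illustration only, not the actual containers/common code; the path is hypothetical.

    package main

    import (
        "fmt"
        "os"

        "golang.org/x/sys/unix"
    )

    // unmountAndRemove mirrors the pattern being discussed: detach the mount
    // lazily, then remove the bind-mount target file.
    func unmountAndRemove(nsPath string) error {
        // MNT_DETACH detaches the mount right away but lets the kernel finish
        // the unmount asynchronously once nothing uses the namespace anymore.
        // (EINVAL here would just mean the path is not a mount point.)
        if err := unix.Unmount(nsPath, unix.MNT_DETACH); err != nil && err != unix.EINVAL {
            return fmt.Errorf("failed to unmount NS: at %s: %w", nsPath, err)
        }
        // Because the unmount may still be in flight, this remove can race with
        // it and fail with EBUSY ("device or resource busy"), the flake above.
        if err := os.Remove(nsPath); err != nil {
            return fmt.Errorf("failed to remove ns path %s: %w", nsPath, err)
        }
        return nil
    }

    func main() {
        // Hypothetical path, for illustration only.
        if err := unmountAndRemove("/run/user/1000/netns/netns-example"); err != nil {
            fmt.Println(err)
        }
    }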

@edsantiago
Member Author

Seeing this a lot, but again usually together with #19702 ("Storage ... removed"), and I can't dual-assign flakes. Since that one already has a ton of sample failures, I'm going to start assigning flakes here instead, maybe alternating. Here are last night's failures:

  • fedora-37 : int podman fedora-37 root host sqlite
    • 08-23 23:14 in TOP-LEVEL [AfterEach] Podman pod create podman start infra container different image
  • fedora-38 : int podman fedora-38 root host sqlite
    • 08-23 23:11 in TOP-LEVEL [AfterEach] Podman play kube podman play kube test with DirectoryOrCreate HostPath type volume
  • fedora-38 : int podman fedora-38 rootless host boltdb
    • 08-22 19:57 in TOP-LEVEL [AfterEach] Podman UserNS support podman --userns=container:CTR
  • rawhide : int podman rawhide rootless host sqlite
    • 08-23 23:11 in TOP-LEVEL [AfterEach] Podman UserNS support podman --userns=container:CTR
    • 08-23 23:11 in TOP-LEVEL [AfterEach] Podman run podman run a container based on local image with short options

One of those also includes this fun set of errors; does this help?

time="2023-08-23T21:47:02-05:00" level=error msg="IPAM error: failed to get ips for container ID 59fb21294fac284ee4dddf37c0ecb98e27118bdb0274314170465c452f2e8dce on network podman-default-kube-network"
           time="2023-08-23T21:47:02-05:00" level=error msg="IPAM error: failed to find ip for subnet 10.89.0.0/24 on network podman-default-kube-network"
           time="2023-08-23T21:47:02-05:00" level=error msg="tearing down network namespace configuration for container 59fb21294fac284ee4dddf37c0ecb98e27118bdb0274314170465c452f2e8dce: netavark: open container netns: open /run/netns/netns-0de355c1-cf59-05e0-c2cd-279f59a296f3: IO error: No such file or directory (os error 2)"

@edsantiago
Member Author

I've got a little more data. There seem to be two not-quite-identical failure modes, depending on root or rootless:

f38 rootless:

Unable to clean up network for container XYZ:
    unmounting network namespace for container XYZ:
    failed to remove ns path /run/user/3418/netns/netns-9965a9b5-facb-7fe9-44e3-f99ec7d69365:
    remove /run/user/3418/netns/netns-9965a9b5-facb-7fe9-44e3-f99ec7d69365: device or resource busy

f38 root:

IPAM error: failed to get ips for container ID XYZ on network podman
IPAM error: failed to find ip for subnet 10.88.0.0/16 on network podman
tearing down network namespace configuration for container XYZ:
    netavark: open container netns:
    open /run/netns/netns-050fe4d4-d5da-48ff-3b6b-9b4a1f445235: IO error: No such file or directory (os error 2)
Unable to clean up network for container XYZ:
    unmounting network namespace for container XYZ:
    failed to unmount NS: at /run/netns/netns-050fe4d4-d5da-48ff-3b6b-9b4a1f445235: no such file or directory

That is:

  1. root has two IPAM errors
  2. root has a "tearing down ... open container netns" error
  3. rootless says "failed to remove ns path ... EBUSY", root says "failed to unmount NS ... ENOENT"

They're too similar for me to split this into two separate issues, but I'll listen to the opinion of experts.

HTH.

@rhatdan
Member

rhatdan commented Aug 24, 2023

Do you think you can remove the MNT_DETACH?

@Luap99
Member

Luap99 commented Aug 25, 2023

I have no idea why it was added in the first place; maybe it is needed?

Git blame goes all the way back to 8c52aa1, which claims it needs MNT_DETACH but provides no explanation of why. @mheon
This code was forked from CNI upstream, and that never used it, so...


f38 root:

IPAM error: failed to get ips for container ID XYZ on network podman
IPAM error: failed to find ip for subnet 10.88.0.0/16 on network podman
tearing down network namespace configuration for container XYZ:
    netavark: open container netns:
    open /run/netns/netns-050fe4d4-d5da-48ff-3b6b-9b4a1f445235: IO error: No such file or directory (os error 2)
Unable to clean up network for container XYZ:
    unmounting network namespace for container XYZ:
    failed to unmount NS: at /run/netns/netns-050fe4d4-d5da-48ff-3b6b-9b4a1f445235: no such file or directory

That is:

1. root has two IPAM errors

2. root has a "tearing down ... open container netns" error

The root issue must be something else entirely; the symptoms look like we try to clean up twice.

@edsantiago
Member Author

WHEW! After much suffering, I removed MNT_DETACH. That causes absolutely everything to fail hard, even system tests, which so far have been immune to this flake.

@mheon
Member

mheon commented Aug 26, 2023

I think I originally added the MNT_DETACH flag because we were seeing intermittent failures during cleanup due to the namespace still being in use, and I was expecting that making the unmount lazy would resolve things.

@edsantiago
Member Author

I'm giving up on this: I am pulling the stderr-on-teardown checks from my flake-check PR. It's too much, costing me way too much time between this and #19702. Until these two are fixed, I can't justify the time it takes me to sort through these flakes.

FWIW, here is the catalog so far:

  • fedora-37 : int podman fedora-37 root container sqlite
    • 08-28 08:57 in TOP-LEVEL [AfterEach] Podman run podman run a container based on local image with short options
    • 08-27 09:08 in TOP-LEVEL [AfterEach] Podman start podman start multiple containers
    • 08-26 18:25 in TOP-LEVEL [AfterEach] Podman start podman start multiple containers
    • 08-24 11:23 in TOP-LEVEL [AfterEach] Podman cp podman cp file
    • 08-24 11:23 in TOP-LEVEL [AfterEach] Podman run networking podman run network bind to 127.0.0.1
  • fedora-37 : int podman fedora-37 root host sqlite
    • 08-26 18:25 in TOP-LEVEL [AfterEach] Podman run podman run a container based on local image with short options
    • 08-24 11:25 in TOP-LEVEL [AfterEach] Podman cp podman cp file
    • 08-23 23:14 in TOP-LEVEL [AfterEach] Podman pod create podman start infra container different image
  • fedora-37 : int podman fedora-37 rootless host sqlite
    • 08-28 21:50 in TOP-LEVEL [AfterEach] Podman play kube podman play kube test HostAliases
    • 08-28 13:20 in TOP-LEVEL [AfterEach] Podman start podman start multiple containers
    • 08-28 11:35 in TOP-LEVEL [AfterEach] Podman pod create podman start infra container different image
    • 08-28 08:56 in TOP-LEVEL [AfterEach] Podman start podman start single container by id
    • 08-24 19:02 in TOP-LEVEL [AfterEach] Podman pod create podman start infra container different image
    • 08-24 19:02 in TOP-LEVEL [AfterEach] Podman kube generate podman generate kube sharing pid namespace
    • 08-24 11:21 in TOP-LEVEL [AfterEach] Podman UserNS support podman --userns=container:CTR
  • fedora-38 : int podman fedora-38 root container sqlite
    • 08-24 19:01 in TOP-LEVEL [AfterEach] Podman exec podman exec preserves container groups with --user and --group-add
  • fedora-38 : int podman fedora-38 root host sqlite
    • 08-28 21:52 in TOP-LEVEL [AfterEach] Podman pod start podman pod start single pod by name
    • 08-28 13:22 in TOP-LEVEL [AfterEach] Podman cp podman cp file
    • 08-28 11:35 in TOP-LEVEL [AfterEach] Podman play kube podman play kube replace non-existing pod
    • 08-28 11:35 in TOP-LEVEL [AfterEach] Podman run networking podman run network bind to 127.0.0.1
    • 08-27 09:05 in TOP-LEVEL [AfterEach] Podman container clone podman container clone basic test
    • 08-24 19:01 in TOP-LEVEL [AfterEach] Podman run networking podman run network bind to 127.0.0.1
    • 08-24 11:23 in TOP-LEVEL [AfterEach] Podman start podman start container with special pidfile
    • 08-23 23:11 in TOP-LEVEL [AfterEach] Podman play kube podman play kube test with DirectoryOrCreate HostPath type volume
  • fedora-38 : int podman fedora-38 rootless host boltdb
    • 08-22 19:57 in TOP-LEVEL [AfterEach] Podman UserNS support podman --userns=container:CTR
  • fedora-38 : int podman fedora-38 rootless host sqlite
    • 08-27 09:05 in TOP-LEVEL [AfterEach] Podman run networking podman run network bind to 127.0.0.1
  • rawhide : int podman rawhide root host sqlite
    • 08-28 21:55 in TOP-LEVEL [AfterEach] Podman run podman run a container based on local image with short options
    • 08-28 13:19 in TOP-LEVEL [AfterEach] Podman start podman container start single container by short id
    • 08-28 13:19 in TOP-LEVEL [AfterEach] Podman pod create podman start infra container different image
    • 08-28 08:53 in TOP-LEVEL [AfterEach] Podman run networking podman run network bind to 127.0.0.1
    • 08-27 21:41 in TOP-LEVEL [AfterEach] Podman play kube podman play kube test with customized hostname
    • 08-26 22:23 in TOP-LEVEL [AfterEach] Podman play kube podman play kube test optional env value from missing configmap
    • 08-24 19:00 in TOP-LEVEL [AfterEach] Podman play kube podman play kube --no-host
  • rawhide : int podman rawhide rootless host sqlite
    • 08-28 11:36 in TOP-LEVEL [AfterEach] Podman start podman start multiple containers
    • 08-27 21:40 in TOP-LEVEL [AfterEach] Podman start podman start container with special pidfile
    • 08-26 20:35 in TOP-LEVEL [AfterEach] Podman play kube podman play kube multi doc yaml with multiple services, pods and deployments
    • 08-26 20:35 in TOP-LEVEL [AfterEach] Podman cp podman cp file
    • 08-24 19:00 in TOP-LEVEL [AfterEach] Podman UserNS support podman --userns=container:CTR
    • 08-23 23:11 in TOP-LEVEL [AfterEach] Podman UserNS support podman --userns=container:CTR
    • 08-23 23:11 in TOP-LEVEL [AfterEach] Podman run podman run a container based on local image with short options

Seen in: int podman fedora-37/fedora-38/rawhide root/rootless container/host boltdb/sqlite

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@edsantiago
Member Author

  • fedora-37 : int podman fedora-37 root container sqlite
    • 08-28-2023 08:57 in TOP-LEVEL [AfterEach] Podman run podman run a container based on local image with short options
    • 08-27-2023 09:08 in TOP-LEVEL [AfterEach] Podman start podman start multiple containers
    • 08-26-2023 18:25 in TOP-LEVEL [AfterEach] Podman start podman start multiple containers
    • 08-24-2023 11:23 in TOP-LEVEL [AfterEach] Podman cp podman cp file
    • 08-24-2023 11:23 in TOP-LEVEL [AfterEach] Podman run networking podman run network bind to 127.0.0.1
  • fedora-37 : int podman fedora-37 root host sqlite
    • 08-26-2023 18:25 in TOP-LEVEL [AfterEach] Podman run podman run a container based on local image with short options
    • 08-24-2023 11:25 in TOP-LEVEL [AfterEach] Podman cp podman cp file
    • 08-23-2023 23:14 in TOP-LEVEL [AfterEach] Podman pod create podman start infra container different image
  • fedora-37 : int podman fedora-37 rootless host sqlite
    • 08-28-2023 21:50 in TOP-LEVEL [AfterEach] Podman play kube podman play kube test HostAliases
    • 08-28-2023 13:20 in TOP-LEVEL [AfterEach] Podman start podman start multiple containers
    • 08-28-2023 11:35 in TOP-LEVEL [AfterEach] Podman pod create podman start infra container different image
    • 08-28-2023 08:56 in TOP-LEVEL [AfterEach] Podman start podman start single container by id
    • 08-24-2023 19:02 in TOP-LEVEL [AfterEach] Podman pod create podman start infra container different image
    • 08-24-2023 19:02 in TOP-LEVEL [AfterEach] Podman kube generate podman generate kube sharing pid namespace
    • 08-24-2023 11:21 in TOP-LEVEL [AfterEach] Podman UserNS support podman --userns=container:CTR
  • fedora-38 : int podman fedora-38 root container sqlite
    • 08-24-2023 19:01 in TOP-LEVEL [AfterEach] Podman exec podman exec preserves container groups with --user and --group-add
  • fedora-38 : int podman fedora-38 root host sqlite
    • 08-28-2023 21:52 in TOP-LEVEL [AfterEach] Podman pod start podman pod start single pod by name
    • 08-28-2023 13:22 in TOP-LEVEL [AfterEach] Podman cp podman cp file
    • 08-28-2023 11:35 in TOP-LEVEL [AfterEach] Podman play kube podman play kube replace non-existing pod
    • 08-28-2023 11:35 in TOP-LEVEL [AfterEach] Podman run networking podman run network bind to 127.0.0.1
    • 08-27-2023 09:05 in TOP-LEVEL [AfterEach] Podman container clone podman container clone basic test
    • 08-24-2023 19:01 in TOP-LEVEL [AfterEach] Podman run networking podman run network bind to 127.0.0.1
    • 08-24-2023 11:23 in TOP-LEVEL [AfterEach] Podman start podman start container with special pidfile
    • 08-23-2023 23:11 in TOP-LEVEL [AfterEach] Podman play kube podman play kube test with DirectoryOrCreate HostPath type volume
  • fedora-38 : int podman fedora-38 rootless host boltdb
  • fedora-38 : int podman fedora-38 rootless host sqlite
    • 08-27-2023 09:05 in TOP-LEVEL [AfterEach] Podman run networking podman run network bind to 127.0.0.1
  • rawhide : int podman rawhide root host sqlite
    • 08-28-2023 21:55 in TOP-LEVEL [AfterEach] Podman run podman run a container based on local image with short options
    • 08-28-2023 13:19 in TOP-LEVEL [AfterEach] Podman start podman container start single container by short id
    • 08-28-2023 13:19 in TOP-LEVEL [AfterEach] Podman pod create podman start infra container different image
    • 08-28-2023 08:53 in TOP-LEVEL [AfterEach] Podman run networking podman run network bind to 127.0.0.1
    • 08-27-2023 21:41 in TOP-LEVEL [AfterEach] Podman play kube podman play kube test with customized hostname
    • 08-26-2023 22:23 in TOP-LEVEL [AfterEach] Podman play kube podman play kube test optional env value from missing configmap
    • 08-24-2023 19:00 in TOP-LEVEL [AfterEach] Podman play kube podman play kube --no-host
  • rawhide : int podman rawhide rootless host sqlite
    • 08-28-2023 11:36 in TOP-LEVEL [AfterEach] Podman start podman start multiple containers
    • 08-27-2023 21:40 in TOP-LEVEL [AfterEach] Podman start podman start container with special pidfile
    • 08-26-2023 20:35 in TOP-LEVEL [AfterEach] Podman play kube podman play kube multi doc yaml with multiple services, pods and deployments
    • 08-26-2023 20:35 in TOP-LEVEL [AfterEach] Podman cp podman cp file
    • 08-24-2023 19:00 in TOP-LEVEL [AfterEach] Podman UserNS support podman --userns=container:CTR
    • 08-23-2023 23:11 in TOP-LEVEL [AfterEach] Podman UserNS support podman --userns=container:CTR
    • 08-23-2023 23:11 in TOP-LEVEL [AfterEach] Podman run podman run a container based on local image with short options
  • rawhide : sys podman rawhide rootless host sqlite

Seen in: int/sys fedora-37/fedora-38/rawhide root/rootless container/host boltdb/sqlite

@Luap99
Member

Luap99 commented Sep 29, 2023

I see different errors posted here that are not the same!

The original report says:
remove /run/user/3418/netns/netns-9965a9b5-facb-7fe9-44e3-f99ec7d69365: device or resource busy\""

Just to confirm, I looked at ip netns, which also uses a simple bind mount for the netns paths; to delete, it also simply calls umount with MNT_DETACH followed by unlink(), which is exactly what our code does as well.

https://git.kernel.org/pub/scm/network/iproute2/iproute2.git/tree/ip/ipnetns.c#n735
https://github.com/containers/common/blob/6856d56252121a665cb820777982cc3f61f815af/pkg/netns/netns_linux.go#L191-L197

So this is not something I understand; the unlink should not fail with EBUSY if the path is unmounted.

But I also see:
failed to unmount NS: at /run/user/3138/netns/netns-fa758949-40a2-a5e2-2226-5523b2a4c0e7: no such file or directory
These are two different things. "No such file or directory" means we are trying to clean up again after something else has already cleaned that up.
This matches the other error that is logged when this happens: "Storage for container 86154f69fa1feff30896274cc265e37fa78745adb1bd2778927053f8bbe7be36 has been removed"

The same goes for errors like this:

time="2023-08-23T21:47:02-05:00" level=error msg="IPAM error: failed to get ips for container ID 59fb21294fac284ee4dddf37c0ecb98e27118bdb0274314170465c452f2e8dce on network podman-default-kube-network"
           time="2023-08-23T21:47:02-05:00" level=error msg="IPAM error: failed to find ip for subnet 10.89.0.0/24 on network podman-default-kube-network"
           time="2023-08-23T21:47:02-05:00" level=error msg="tearing down network namespace configuration for container 59fb21294fac284ee4dddf37c0ecb98e27118bdb0274314170465c452f2e8dce: netavark: open container netns: open /run/netns/netns-0de355c1-cf59-05e0-c2cd-279f59a296f3: IO error: No such file or directory (os error 2)"

There must be some way that we, for whatever reason, end up in the cleanup path twice.

@mhoran

mhoran commented Sep 29, 2023

I do see the device or resource busy error every now and then. Oddly, just recently it seemed to eventually clear itself up. When I enter podman unshare, I cannot umount the file, nor remove it: mount reports it is not mounted; rm reports device or resource busy. However, I believe that from outside the namespace I can delete the file. lsns --type=net does not show any corresponding nsfs entry. Quite odd indeed.
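(As a hedged diagnostic aside, not podman code: one way to check from Go whether such a path is still backed by an nsfs mount, as opposed to a plain leftover file on the tmpfs, is a Statfs call. The path below is hypothetical.)

    package main

    import (
        "fmt"
        "os"

        "golang.org/x/sys/unix"
    )

    // isNsfsMount reports whether path is still backed by a mounted namespace
    // (nsfs). If it is, removing the file fails with EBUSY; if it is just a
    // leftover regular file on the tmpfs, it can be deleted normally.
    func isNsfsMount(path string) (bool, error) {
        var fs unix.Statfs_t
        if err := unix.Statfs(path, &fs); err != nil {
            return false, err
        }
        return fs.Type == unix.NSFS_MAGIC, nil
    }

    func main() {
        path := "/run/user/1000/netns/netns-example" // hypothetical path
        mounted, err := isNsfsMount(path)
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            return
        }
        fmt.Printf("%s is an nsfs mount: %v\n", path, mounted)
    }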

@edsantiago
Member Author

I see different errors posted here that are not the same!

That is my fault, and I'm sorry. When scanning flakes to assign them to buckets, I look for common patterns but don't always compare everything exactly. I will be more careful.

@edsantiago
Member Author

New flake. Not quite the same error message, but similar enough that I'm assigning here. From f39 rootless:

[+1802s] not ok 599 podman kube --network
...
<+010ms> # $ podman pod rm -t 0 -f test_pod
<+197ms> # time="2023-10-25T08:47:25-05:00" level=error msg="Unable to clean up network for container [infractrid]: \"unmounting network namespace for container [infractrid]: failed to unmount NS: at /run/user/6452/netns/netns-whatever: no such file or directory\""
         # 6b2f9b7e780b9986feb72921c5d8d059687b831537aa1a14d0456b3d67b3d3ee
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #| FAIL: Command succeeded, but issued unexpected warnings
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@edsantiago
Member Author

Another ENOENT, in f39 rootless:

[+1763s] not ok 607 [700] podman kube play --replace external storage
...
<+011ms> # $ podman stop -a -t 0
<+451ms> # 5562c7db09b37c167612109c79237ac32f7fb7575d0b9819ad8949b8499e4cec
         # b48e32245140e2fbe17564503b67b1ff5c1263e2fd4aa3fcdea8afde28c5a626
         #
<+011ms> # $ podman pod rm -t 0 -f test_pod
<+190ms> # time="2023-11-10T13:36:19-06:00" level=error msg="getting rootless network namespace: failed to Statfs \"/run/user/2815/netns/rootless-netns-bfe0fe1f8f170aff795c\": no such file or directory"
         # time="2023-11-10T13:36:19-06:00" level=error msg="Unable to clean up network for container 5562c7db09b37c167612109c79237ac32f7fb7575d0b9819ad8949b8499e4cec: \"unmounting network namespace for container 5562c7db09b37c167612109c79237ac32f7fb7575d0b9819ad8949b8499e4cec: failed to unmount NS: at /run/user/2815/netns/netns-6ae41715-c698-4a92-8658-020349c94f6f: no such file or directory\""
         # 19faf1a389d250f9a4a71fcdae449d6f2c38a89e2e51acc9bde5ec912db1093f
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #| FAIL: Command succeeded, but issued unexpected warnings
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@edsantiago
Member Author

The latest list. Note that some of these are ENOENT, and some are EBUSY. Until given a reason to treat these as different bugs, I will continue to lump them together.

  • fedora-37 : int podman fedora-37 rootless host sqlite
  • fedora-38 : int podman fedora-38 root host boltdb
    • 12-05 10:02 in Podman stop podman stop --ignore bogus container
    • 11-30 19:39 in Podman stop podman stop --ignore bogus container
    • 11-29 20:10 in Podman stop podman stop --ignore bogus container
    • 11-28 15:37 in Podman pod stop podman pod stop single pod by name
    • 11-06-2023 11:53 in Podman stop podman stop container by id
    • 11-06-2023 11:53 in Podman stop podman stop --ignore bogus container
  • fedora-38 : int podman fedora-38 root host sqlite
  • fedora-38 : int podman fedora-38 rootless host boltdb
  • fedora-38 : int podman fedora-38 rootless host sqlite
  • fedora-38 : sys podman fedora-38 root host boltdb
  • fedora-39 : int podman fedora-39 root host sqlite
  • fedora-39 : int podman fedora-39 rootless host boltdb
  • fedora-39 : int podman fedora-39 rootless host sqlite
  • fedora-39β : int podman fedora-39β root host boltdb
  • rawhide : int podman rawhide root host sqlite
    • 11-06-2023 07:35 in Podman stop podman stop --ignore bogus container
    • 09-27-2023 21:53 in Podman stop podman stop --ignore bogus container
    • 09-27-2023 15:38 in Podman prune podman system prune with running, exited pod and volume prune set true
  • rawhide : int podman rawhide rootless host sqlite
Counts: int(29) sys(1) | podman(30) | fedora-38(20) rawhide(4) fedora-39(3) fedora-39β(2) fedora-37(1) | root(19) rootless(11) | host(30) | boltdb(16) sqlite(14)

@piotr-kubiak

The root issue must be something else entirely; the symptoms look like we try to clean up twice.

If it provides any clue, I am observing similar behaviour when the container is configured with the restart policy unless-stopped.

@edsantiago
Member Author

With #18442, this is now blowing up. Different error messages, but I'm pretty sure it's all the same bug.

New EACCES variant:

# podman [options] stop --all -t 0
time="2024-03-04T06:21:40-06:00" level=error msg="Unable to clean up network for container 4f387d5ee492fa77fa287eadbbdb6725aa4e24b879de087cd6d89b6f59014e84: \"netavark: remove aardvark entries: check aardvark-dns netns: IO error: Permission denied (os error 13)\""

and the old ENOENT variant:

# podman [options] stop --all -t 0
time="2024-03-04T06:26:19-06:00" level=error msg="IPAM error: failed to get ips for container ID f5d3911edc288ff629303bfb7d9bb18224d755b2c80fb3fda9cdd6827efe94fe on network podman"
time="2024-03-04T06:26:19-06:00" level=error msg="IPAM error: failed to find ip for subnet 10.88.0.0/16 on network podman"
time="2024-03-04T06:26:19-06:00" level=error msg="netavark: open container netns: open /run/netns/netns-8fd7cfa2-c3d6-d361-98ca-bdf199cb29f3: IO error: No such file or directory (os error 2)"
time="2024-03-04T06:26:19-06:00" level=error msg="Unable to clean up network for container f5d3911edc288ff629303bfb7d9bb18224d755b2c80fb3fda9cdd6827efe94fe: \"unmounting network namespace for container f5d3911edc288ff629303bfb7d9bb18224d755b2c80fb3fda9cdd6827efe94fe: failed to remove ns path: remove /run/netns/netns-8fd7cfa2-c3d6-d361-98ca-bdf199cb29f3: no such file or directory, failed to unmount NS: at /run/netns/netns-8fd7cfa2-c3d6-d361-98ca-bdf199cb29f3: no such file or directory\""

...and an ENOENT variant with a shorter error message:

$ podman [options] stop --all -t 0
time="2024-03-04T06:27:52-06:00" level=error msg="Unable to clean up network for container 52fd74d45cfc07d19cd49cbda438f9e57f07b37bc80627064cffb1bdc02461ad: \"unmounting network namespace for container 52fd74d45cfc07d19cd49cbda438f9e57f07b37bc80627064cffb1bdc02461ad: failed to remove ns path: remove /run/user/2221/netns/netns-db173238-0a50-83c3-bb71-ca4e7d30a28b: no such file or directory, failed to unmount NS: at /run/user/2221/netns/netns-db173238-0a50-83c3-bb71-ca4e7d30a28b: no such file or directory\""

Here are today's failures, plus one from January:

  • fedora-39 : int podman fedora-39 root host sqlite
    • 03-04 07:36 in TOP-LEVEL [AfterEach] Podman kube play with auto update annotations for first container only
  • fedora-39 : sys podman fedora-39 rootless host sqlite
  • rawhide : int podman rawhide root host sqlite
    • 03-04 07:35 in TOP-LEVEL [AfterEach] Podman start podman container start single container by id
  • rawhide : int podman rawhide rootless host sqlite
    • 03-04 07:35 in TOP-LEVEL [AfterEach] Podman run networking podman run network bind to 127.0.0.1
Counts: int(3) sys(1) | podman(4) | fedora-39(2) rawhide(2) | root(2) rootless(2) | host(4) | sqlite(4)

@edsantiago
Member Author

Still happening with brand-new (March 19) VMs. Three failures in just one CI run:

int podman fedora-38 root host boltdb

int podman fedora-39 rootless host sqlite

int podman rawhide root host sqlite

@edsantiago
Member Author

Is this (f39 rootless) the same error???

$ podman [options] stop --all -t 0
time="2024-03-20T12:25:48-05:00" \
    level=error \
    msg="Unable to clean up network for container SHA: \
        \"1 error occurred:\\n
        \\t* netavark: remove aardvark entries: check aardvark-dns netns: IO error: Permission denied (os error 13)\\n\\n\""

I'm going to assume it is, and file it as such, unless told otherwise.

@Luap99
Member

Luap99 commented Mar 20, 2024

Is this (f39 rootless) the same error???

$ podman [options] stop --all -t 0
time="2024-03-20T12:25:48-05:00" \
    level=error \
    msg="Unable to clean up network for container SHA: \
        \"1 error occurred:\\n
        \\t* netavark: remove aardvark entries: check aardvark-dns netns: IO error: Permission denied (os error 13)\\n\\n\""

I'm going to assume it is, and file it as such, unless told otherwise.

No, that is most likely something different.

@edsantiago
Member Author

My periodic ping. This seems to be happening a lot more with recent VMs:

  • fedora-38 : int podman fedora-38 rootless host boltdb
    • 04-02 12:16 in TOP-LEVEL [AfterEach] Podman UserNS support podman --userns=container:CTR
    • 03-25 16:49 in TOP-LEVEL [AfterEach] Podman kube play with TerminationGracePeriodSeconds set
  • fedora-39 : int podman fedora-39 root host sqlite
    • 04-01 15:46 in TOP-LEVEL [AfterEach] Podman run podman run a container based on local image with short options
    • 03-25 17:55 in TOP-LEVEL [AfterEach] Podman run networking podman run network bind to 127.0.0.1
    • 03-20 13:40 in TOP-LEVEL [AfterEach] Podman start podman start multiple containers
    • 03-19 17:53 in TOP-LEVEL [AfterEach] Podman start podman start container --filter
  • fedora-39 : int podman fedora-39 rootless host sqlite
    • 03-19 23:26 in TOP-LEVEL [AfterEach] Podman kube play test with reserved PublishAll annotation in yaml
    • 03-19 16:19 in TOP-LEVEL [AfterEach] Podman run networking podman run network bind to 127.0.0.1
  • rawhide : int podman rawhide root host sqlite
    • 04-01 13:31 in TOP-LEVEL [AfterEach] Podman pod start podman pod start single pod by name
    • 03-26 11:41 in TOP-LEVEL [AfterEach] Podman pod start podman pod start single pod by name
    • 03-23 21:48 in TOP-LEVEL [AfterEach] Podman kube play with auto update annotations for first container only
    • 03-23 21:48 in TOP-LEVEL [AfterEach] Podman kube play test volumes-from annotation with source container in pod
  • rawhide : int podman rawhide rootless host sqlite
    • 04-01 15:46 in TOP-LEVEL [AfterEach] Podman kube play with image data
    • 04-01 13:27 in TOP-LEVEL [AfterEach] Podman kube play test with reserved privileged annotation in yaml
    • 03-20 13:37 in TOP-LEVEL [AfterEach] Podman kube play replace
    • 03-20 10:22 in TOP-LEVEL [AfterEach] Podman kube play override with udp should keep tcp from YAML file
    • 03-19 17:39 in TOP-LEVEL [AfterEach] Podman kube play cap add
    • 03-19 17:39 in TOP-LEVEL [AfterEach] Podman kube play override with tcp should keep udp from YAML file
Counts: int(18) | podman(18) | rawhide(10) fedora-39(6) fedora-38(2) | rootless(10) root(8) | host(18) | sqlite(16) boltdb(2)

@edsantiago
Member Author

Where are we on this one? I just saw this failure, f40 root, on VMs from containers/automation_images#349 with netavark 1.10.3-3:

# podman [options] stop --all -t 0
time="2024-05-08T08:40:48-05:00" level=error msg="IPAM error: failed to get ips for container ID 4966574a794623c18a431d97adf2ea6192819e96755529acd35a529670985b69 on network podman"
time="2024-05-08T08:40:48-05:00" level=error msg="IPAM error: failed to find ip for subnet 10.88.0.0/16 on network podman"
time="2024-05-08T08:40:48-05:00" level=error msg="netavark: open container netns: open /run/netns/netns-dce2075a-1bca-f84d-c945-d1bc1641f2f6: IO error: No such file or directory (os error 2)"
time="2024-05-08T08:40:48-05:00" level=error msg="Unable to clean up network for container 4966574a794623c18a431d97adf2ea6192819e96755529acd35a529670985b69: \"unmounting network namespace for container 4966574a794623c18a431d97adf2ea6192819e96755529acd35a529670985b69: failed to remove ns path: remove /run/netns/netns-dce2075a-1bca-f84d-c945-d1bc1641f2f6: no such file or directory, failed to unmount NS: at /run/netns/netns-dce2075a-1bca-f84d-c945-d1bc1641f2f6: no such file or directory\""

(error in this one is ENOENT, not EBUSY)

@Luap99
Member

Luap99 commented May 15, 2024

FYI: the reason you see this more is that I enabled the warnings check in AfterEach() in #18442.

So previously we just didn't see them; in the logs above they all failed in AfterEach. As mentioned before and in other issues, the problem is that something tries to clean up twice, but I cannot see why or where this would be happening.

@edsantiago
Member Author

Here's one with a lot more context; does that help? (Three total failures in this log, so be sure to hit Page-End and then click on each individual failure.)

@edsantiago
Member Author

The "pid file blah" message is new, does it help? In f39 rootless:

   $ podman [options] pod rm -fa -t 0
   time="2024-05-23T07:10:51-05:00" level=error msg="IPAM error: failed to get ips for container ID 5d282f67ae0c245f323779c5a57c579bd6de58bc7fbb688da46ffec96e8cc7b9 on network podman-default-kube-network"
   time="2024-05-23T07:10:51-05:00" level=error msg="rootless netns ref counter out of sync, counter is at -1, resetting it back to 0"
>> time="2024-05-23T07:10:51-05:00" level=warning msg="failed to read rootless netns program pid: open /tmp/CI_E7Th/podman-e2e-1332300349/subtest-2709937137/runroot/networks/rootless-netns/rootless-netns-conn.pid: no such file or directory"
   time="2024-05-23T07:10:51-05:00" level=error msg="IPAM error: failed to find ip for subnet 10.89.0.0/24 on network podman-default-kube-network"
   time="2024-05-23T07:10:51-05:00" level=error msg="1 error occurred:\n\t* netavark: open container netns: open /run/user/4848/netns/netns-496a4e9f-0424-fab8-8bd4-9ffb106b0d86: IO error: No such file or directory (os error 2)\n\n"
   time="2024-05-23T07:10:51-05:00" level=error msg="Unable to clean up network for container 5d282f67ae0c245f323779c5a57c579bd6de58bc7fbb688da46ffec96e8cc7b9: \"unmounting network namespace for container 5d282f67ae0c245f323779c5a57c579bd6de58bc7fbb688da46ffec96e8cc7b9: failed to remove ns path: remove /run/user/4848/netns/netns-496a4e9f-0424-fab8-8bd4-9ffb106b0d86: no such file or directory, failed to unmount NS: at /run/user/4848/netns/netns-496a4e9f-0424-fab8-8bd4-9ffb106b0d86: no such file or directory\""

@Luap99
Member

Luap99 commented May 23, 2024

No, not really.

@francoism90

Any workarounds? I have the same issue.

@Luap99
Member

Luap99 commented Jun 21, 2024

Any workarounds? I have the same issue.

This is (mostly) a flake tracker for CI; if you have a reproducer for this issue, I would be very happy if you could share it.
Then maybe I can understand the issue and see whether there is a workaround/fix.

@francoism90

@Luap99 It does indeed seem to happen when using a wrong configuration.

Sometimes the only workaround is to fully reboot the machine, as the rootlessport process isn't killed and isn't even killable.

@Luap99
Member

Luap99 commented Jun 21, 2024

@Luap99 It does indeed seem to happen when using a wrong configuration.

Sometimes the only workaround is to fully reboot the machine, as the rootlessport process isn't killed and isn't even killable.

Can you share podman commands? What does "wrong configuration" mean?
I also don't see how this is related to rootlessport. The rootlessport process should exit on its own after conmon exits. If it does not do that, then it is a bug; however, it should always be killable via SIGKILL, I assume.

@francoism90

francoism90 commented Jun 22, 2024

@Luap99 I'm using Podman Quadlet.

The problem occurs when a container fails at some point during startup, for example because you accidentally made a configuration mistake.

It's possible to stop the container, but for some reason the network process doesn't know what to do anymore, and you end up with the IPAM error or the "cannot clean up network" error.

The only way out is to kill all the network-related processes for Podman, but even that may not be enough. After rebooting the machine, the container and its configured network start just fine.

I haven't used rootless Docker on Linux for a long time, but I remember the same problem. It just seems to lose track of the networking and doesn't know what has actually been killed. As I said before, the rootlessport processes stay around even when you stop the container.

@francoism90

francoism90 commented Jun 22, 2024

One example that triggers this is using a webserver image, e.g. nginx-alpine:latest. Make sure it's using a network and running rootless, and also use requires=nginx as a service option for other containers that communicate with the container.

Add a site/server block in the nginx container and make a mistake in its configuration, for example listen 8080 instead of listen 8080; (note the missing trailing ;), something that can easily be overlooked. I don't know whether an invalid proxy address also triggers it.

When you now start the container(s), the main process will crash because of the configuration error. So you stop the container and adjust the configuration, but when starting again you'll end up with network errors.

Sometimes it can be fixed by killing all the processes, like rootlessport and things connected to it, but sometimes you really have to reboot to recover. After a reboot everything works as expected.

This is my full config, if you're interested: https://github.com/foxws/foxws/tree/main/podman

@edsantiago
Member Author

This one has been happening A LOT in my no-retry PR and, in case it helps, also in my parallel-system-tests PR.

  • debian-13 : int podman debian-13 root host sqlite
  • debian-13 : int podman debian-13 rootless host sqlite
    • 07-22 11:54 in TOP-LEVEL [AfterEach] Podman kube play multi doc yaml with multiple services, pods and deployments
    • 07-22 07:49 in TOP-LEVEL [AfterEach] Podman kube generate - --privileged container
    • 07-17 17:12 in TOP-LEVEL [AfterEach] Podman kube play test with reserved privileged annotation in yaml
    • 07-16 07:30 in TOP-LEVEL [AfterEach] Podman kube play test with reserved init annotation in yaml
    • 07-11 16:50 in TOP-LEVEL [AfterEach] Podman start podman container start single container by short id
    • 07-07 17:26 in TOP-LEVEL [AfterEach] Podman kube play test with reserved CIDFile annotation in yaml
    • 06-03-2024 13:10 in TOP-LEVEL [AfterEach] Podman kube play test with reserved init annotation in yaml
    • 05-15-2024 10:10 in TOP-LEVEL [AfterEach] Podman kube play should not rename pod if container in pod has same name
    • 05-15-2024 10:10 in TOP-LEVEL [AfterEach] Podman kube play with image data
    • 05-15-2024 10:10 in TOP-LEVEL [AfterEach] Podman kube generate - --privileged container
  • debian-13 : sys podman debian-13 rootless host sqlite
  • fedora-38 : int podman fedora-38 root container boltdb
  • fedora-38 : int podman fedora-38 root host boltdb
    • 04-04-2024 10:31 in TOP-LEVEL [AfterEach] Podman start podman start single container by id
    • 03-19-2024 12:02 in TOP-LEVEL [AfterEach] Podman kube play test restartPolicy
    • 03-05-2024 15:21 in TOP-LEVEL [AfterEach] Podman kube play should not rename pod if container in pod has same name
  • fedora-38 : int podman fedora-38 rootless host boltdb
    • 04-02-2024 12:16 in TOP-LEVEL [AfterEach] Podman UserNS support podman --userns=container:CTR
    • 03-25-2024 16:49 in TOP-LEVEL [AfterEach] Podman kube play with TerminationGracePeriodSeconds set
  • fedora-39 : int podman fedora-39 root container boltdb
    • 07-11 18:35 in TOP-LEVEL [AfterEach] Podman kube play cap add
  • fedora-39 : int podman fedora-39 root container sqlite
    • 04-02-2024 21:15 in TOP-LEVEL [AfterEach] Podman pod create podman start infra container different image
    • 04-02-2024 18:05 in TOP-LEVEL [AfterEach] Podman pod create podman start infra container different image
  • fedora-39 : int podman fedora-39 root host sqlite
    • 04-04-2024 21:56 in TOP-LEVEL [AfterEach] Podman run networking podman run network bind to 127.0.0.1
    • 04-03-2024 09:52 in TOP-LEVEL [AfterEach] Podman start podman start container --filter
    • 04-01-2024 15:46 in TOP-LEVEL [AfterEach] Podman run podman run a container based on local image with short options
    • 03-25-2024 17:55 in TOP-LEVEL [AfterEach] Podman run networking podman run network bind to 127.0.0.1
    • 03-20-2024 13:40 in TOP-LEVEL [AfterEach] Podman start podman start multiple containers
    • 03-19-2024 17:53 in TOP-LEVEL [AfterEach] Podman start podman start container --filter
    • 03-05-2024 07:27 in TOP-LEVEL [AfterEach] Podman start podman start single container by id
    • 03-04-2024 12:20 in TOP-LEVEL [AfterEach] Podman start podman start single container by id
    • 03-04-2024 10:42 in TOP-LEVEL [AfterEach] Podman kube play with TerminationGracePeriodSeconds set
  • fedora-39 : int podman fedora-39 rootless host boltdb
  • fedora-39 : int podman fedora-39 rootless host sqlite
    • 04-02-2024 16:21 in TOP-LEVEL [AfterEach] Podman kube play RunAsUser
    • 03-19-2024 23:26 in TOP-LEVEL [AfterEach] Podman kube play test with reserved PublishAll annotation in yaml
    • 03-19-2024 16:19 in TOP-LEVEL [AfterEach] Podman run networking podman run network bind to 127.0.0.1
    • 03-19-2024 12:00 in TOP-LEVEL [AfterEach] Podman start podman container start single container by short id
    • 03-06-2024 11:07 in TOP-LEVEL [AfterEach] Podman kube play with TerminationGracePeriodSeconds set
    • 03-06-2024 09:32 in 16973
  • fedora-40 : int podman fedora-40 root container sqlite
    • 07-15 11:05 in TOP-LEVEL [AfterEach] Podman start podman start multiple containers
    • 07-11 16:52 in TOP-LEVEL [AfterEach] Podman start podman start container with special pidfile
    • 06-03-2024 07:23 in TOP-LEVEL [AfterEach] Podman start podman container start single container by id
  • fedora-40 : int podman fedora-40 root host sqlite
    • 05-15-2024 11:37 in TOP-LEVEL [AfterEach] Podman cp podman cp file
    • 05-15-2024 11:37 in TOP-LEVEL [AfterEach] Podman start podman container start single container by id
    • 05-15-2024 10:11 in TOP-LEVEL [AfterEach] Podman pod create podman start infra container different image
    • 05-08-2024 10:03 in TOP-LEVEL [AfterEach] Podman cp podman cp file
  • fedora-40 : int podman fedora-40 rootless host sqlite
    • 05-23-2024 08:19 in TOP-LEVEL [AfterEach] Podman kube play test get all key-value pairs from optional configmap as envs
    • 05-23-2024 08:19 in TOP-LEVEL [AfterEach] Podman kube play with configmap in multi-doc yaml succeeds for optional env value with missing key
  • fedora-40 : sys podman fedora-40 rootless host sqlite
    • 07-22 12:05 in [sys] [700] podman play --service-container
  • rawhide : int podman rawhide root host sqlite
    • 07-11 13:45 in TOP-LEVEL [AfterEach] Podman start podman start container with special pidfile
    • 07-07 17:28 in TOP-LEVEL [AfterEach] Podman start podman start single container by name
    • 05-15-2024 10:12 in TOP-LEVEL [AfterEach] Podman cp podman cp file
    • 05-06-2024 11:02 in TOP-LEVEL [AfterEach] Podman run networking podman run network bind to 127.0.0.1
    • 04-04-2024 21:56 in TOP-LEVEL [AfterEach] Podman start podman container start single container by id
    • 04-02-2024 21:12 in TOP-LEVEL [AfterEach] Podman start podman start single container by id
    • 04-02-2024 21:12 in TOP-LEVEL [AfterEach] Podman kube play with configmap in multi-doc yaml and files uses all env values from both sources
    • 04-01-2024 13:31 in TOP-LEVEL [AfterEach] Podman pod start podman pod start single pod by name
    • 03-26-2024 11:41 in TOP-LEVEL [AfterEach] Podman pod start podman pod start single pod by name
    • 03-23-2024 21:48 in TOP-LEVEL [AfterEach] Podman kube play with auto update annotations for first container only
    • 03-23-2024 21:48 in TOP-LEVEL [AfterEach] Podman kube play test volumes-from annotation with source container in pod
    • 03-19-2024 12:02 in TOP-LEVEL [AfterEach] Podman run podman run a container based on remote image
    • 03-05-2024 15:21 in TOP-LEVEL [AfterEach] Podman run networking podman run network bind to 127.0.0.1
    • 03-04-2024 07:35 in TOP-LEVEL [AfterEach] Podman start podman container start single container by id
  • rawhide : int podman rawhide rootless host sqlite
    • 07-22 10:58 in TOP-LEVEL [AfterEach] Podman kube play with image data
    • 06-05-2024 10:05 in TOP-LEVEL [AfterEach] Podman kube play with configmap in multi-doc yaml and files uses all env values from both sources
    • 06-05-2024 10:05 in TOP-LEVEL [AfterEach] Podman kube play test volumes-from annotation with source containers external
    • 05-23-2024 08:18 in TOP-LEVEL [AfterEach] Podman kube play test get all key-value pairs from configmap as envs
    • 04-01-2024 15:46 in TOP-LEVEL [AfterEach] Podman kube play with image data
    • 04-01-2024 13:27 in TOP-LEVEL [AfterEach] Podman kube play test with reserved privileged annotation in yaml
    • 03-20-2024 13:37 in TOP-LEVEL [AfterEach] Podman kube play replace
    • 03-20-2024 10:22 in TOP-LEVEL [AfterEach] Podman kube play override with udp should keep tcp from YAML file
    • 03-19-2024 17:39 in TOP-LEVEL [AfterEach] Podman kube play cap add
    • 03-19-2024 17:39 in TOP-LEVEL [AfterEach] Podman kube play override with tcp should keep udp from YAML file
    • 03-04-2024 13:48 in TOP-LEVEL [AfterEach] Podman kube play test with init containers and annotation set
    • 03-04-2024 07:35 in TOP-LEVEL [AfterEach] Podman run networking podman run network bind to 127.0.0.1
Counts: int(76) sys(2) | podman(78) | rawhide(26) fedora-39(24) debian-13(12) fedora-40(10) fedora-38(6) | rootless(40) root(38) | host(71) container(7) | sqlite(65) boltdb(13)

@edsantiago
Member Author

Just as an FYI, here is the past week's worth of this bug, limited to the instances where "netns ref counter out of sync" does not appear in the error message:

  • debian-13 : int podman debian-13 root host sqlite
    • 08-01 08:08 in TOP-LEVEL [AfterEach] Podman kube play with replicas limits the count to 1 and emits a warning
  • debian-13 : int podman debian-13 rootless host sqlite
    • 07-31 11:15 in TOP-LEVEL [AfterEach] Podman pod create podman start infra container different image
    • 07-31 11:15 in TOP-LEVEL [AfterEach] Podman start podman start single container by id
  • fedora-40 : int podman fedora-40 root container sqlite
    • 07-31 17:47 in TOP-LEVEL [AfterEach] Podman start podman start container with special pidfile
    • 07-31 11:16 in TOP-LEVEL [AfterEach] Podman container clone podman container clone basic test
    • 07-30 07:45 in TOP-LEVEL [AfterEach] Podman cp podman cp file
  • fedora-40 : int podman fedora-40 root host sqlite
    • 08-01 07:10 in TOP-LEVEL [AfterEach] Podman play kube with build Check that image is built using Containerfile
    • 07-31 17:48 in TOP-LEVEL [AfterEach] Podman pod start podman pod start single pod by name
  • fedora-40 : int podman fedora-40 rootless host sqlite
    • 07-31 10:19 in TOP-LEVEL [AfterEach] Podman UserNS support podman --userns=container:CTR
  • rawhide : int podman rawhide root host sqlite
    • 07-31 17:50 in TOP-LEVEL [AfterEach] Podman kube play support List kind
    • 07-31 11:20 in TOP-LEVEL [AfterEach] Podman pod start podman pod start single pod by name
  • rawhide : int remote rawhide root host sqlite [remote]
    • 08-01 07:12 in Podman events podman events network connection
Counts: int(12) | podman(11) remote(1) | fedora-40(6) rawhide(3) debian-13(3) | root(9) rootless(3) | host(9) container(3) | sqlite(12)

@edsantiago
Member Author

Funny observation about this one (I'm not 100% certain): I think that when I see this variant (the one without "netns sync"), (1) it happens in pairs on the same CI job, and (2) I see "failed to get ips". See here and compare the two failures. HTH.

@Luap99
Member

Luap99 commented Aug 2, 2024

Note that netns ref counter out of sync is a rootless-only thing, so you should never see it as root regardless. I see some issues around the ref counter going sideways on errors, but that needs some root-cause error first; this error looks like a good candidate, which then cascades into further errors such as the missing IPAM entries, etc.

Luap99 added a commit to Luap99/common that referenced this issue Aug 6, 2024
Podman might call us more than once on the same path. If the path is not
mounted or does not exist, simply return no error.

Second, retry the unmount/remove until the unmount succeeds. For some
reason we must use MNT_DETACH, as otherwise the unmount call fails
all the time. However, MNT_DETACH means it unmounts asynchronously in the
background. Now if we call remove on the file and the unmount is not
done yet, it will fail with EBUSY. In this case we try again until it
works or we get another error.

This should help containers/podman#19721

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
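A rough Go sketch of the retry approach described in that commit message (illustrative only, not the exact containers/common patch; the function name and path are made up):

    package main

    import (
        "errors"
        "fmt"
        "os"
        "time"

        "golang.org/x/sys/unix"
    )

    // unmountAndRemoveWithRetry sketches the two ideas from the commit message:
    // treat an already-gone path as success, and retry the remove while the lazy
    // (MNT_DETACH) unmount is still completing in the background.
    func unmountAndRemoveWithRetry(nsPath string) error {
        if _, err := os.Stat(nsPath); errors.Is(err, os.ErrNotExist) {
            // Already cleaned up, possibly by an earlier call; nothing to do.
            return nil
        }
        // EINVAL just means the path is not a mount point; that is fine here.
        if err := unix.Unmount(nsPath, unix.MNT_DETACH); err != nil && !errors.Is(err, unix.EINVAL) {
            return fmt.Errorf("failed to unmount NS at %s: %w", nsPath, err)
        }
        // The detached unmount finishes asynchronously, so the remove may briefly
        // report EBUSY; keep retrying until it succeeds or fails differently.
        for i := 0; i < 50; i++ {
            err := os.Remove(nsPath)
            if err == nil || errors.Is(err, os.ErrNotExist) {
                return nil
            }
            if !errors.Is(err, unix.EBUSY) {
                return fmt.Errorf("failed to remove ns path %s: %w", nsPath, err)
            }
            time.Sleep(10 * time.Millisecond)
        }
        return fmt.Errorf("failed to remove ns path %s: still busy after retries", nsPath)
    }

    func main() {
        if err := unmountAndRemoveWithRetry("/run/user/1000/netns/netns-example"); err != nil {
            fmt.Println(err)
        }
    }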
@Luap99 Luap99 self-assigned this Aug 6, 2024
Luap99 added a commit to Luap99/common that referenced this issue Aug 8, 2024, with the same commit message as above.
TomSweeneyRedHat pushed a commit to TomSweeneyRedHat/common that referenced this issue Aug 9, 2024, with the same commit message as above (additionally signed off by tomsweeneyredhat <tsweeney@redhat.com>).
Luap99 added a commit to Luap99/libpod that referenced this issue Aug 12, 2024
We cannot unlock and then lock again without syncing the state, as this
will then save a potentially old state, causing very bad things such as
double netns cleanup issues.

The fix here is simple: move the saveContainerError() call under the same
lock. The comment about the re-lock is just wrong. Not doing this under
the same lock would cause us to update the error after something else
had already changed the container.

Most likely this was caused by a misunderstanding of how Go defers work.
Given they run Last In - First Out (LIFO), it is safe as long as our
defer function comes after the defer unlock() call.

I think this issue is very bad and might have caused a variety of other
weird flakes. In fact, I am confident that this fixes the double cleanup
errors.

Fixes containers#21569
Also fixes the netns removal ENOENT issues seen in containers#19721.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
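An illustrative Go sketch of the defer ordering that commit message relies on: defers run last-in, first-out, so a deferred error-save registered after the deferred unlock still runs before the unlock, i.e. while the lock is still held. The types and method here are simplified stand-ins, not podman's actual implementation:

    package main

    import (
        "fmt"
        "sync"
    )

    type container struct {
        mu        sync.Mutex
        lastError string
    }

    func (c *container) saveContainerError(msg string) {
        // Runs while c.mu is still held; see the defer ordering in cleanup.
        c.lastError = msg
        fmt.Println("saved error under lock:", msg)
    }

    func (c *container) cleanup() {
        c.mu.Lock()
        defer c.mu.Unlock()                         // registered first, runs last
        defer c.saveContainerError("netns cleanup") // registered last, runs first

        // ... cleanup work that may produce an error ...
    }

    func main() {
        c := &container{}
        c.cleanup()
    }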
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/podman that referenced this issue Aug 12, 2024, with the same commit message as above.
@Luap99
Member

Luap99 commented Aug 15, 2024

Fixed in #23519

@Luap99 Luap99 closed this as completed Aug 15, 2024