
race: new rm -fa with dependencies is not actually waiting for container removal #18874

Closed
edsantiago opened this issue Jun 13, 2023 · 15 comments · Fixed by #19863
Labels: flakes (Flakes from Continuous Integration), locked - please file new issue/PR

Comments

@edsantiago (Member)

[It] podman rm -fa with dependencies
...
 podman [options] rm -fa
fdb8e9ac4cdf66db5548e8dbf8b0fc73f9c8cbb5fee01efa480787a1706928c4
d85817e78f2be4c058f93bc912f960d513a405260ba213757577b44b21973399
# podman [options] ps -aq
d85817e78f2b

[FAILED] expected 1 to equal 0 (as in, expected no output from "ps -aq")

This is a big one, failing often; it's only a matter of time before it hits us even in main.

  • fedora-37 : int podman fedora-37 rootless host sqlite
    • 06-13 09:53 in Podman rm podman rm -fa with dependencies
  • fedora-38 : int podman fedora-38 root container sqlite
    • 06-13 09:50 in Podman rm podman rm -fa with dependencies
    • 06-13 08:27 in Podman rm podman rm -fa with dependencies
  • fedora-38 : int podman fedora-38 root host sqlite
    • 06-13 09:54 in Podman rm podman rm -fa with dependencies
@edsantiago edsantiago added the flakes Flakes from Continuous Integration label Jun 13, 2023
@Luap99 (Member) commented Jun 13, 2023

@mheon PTAL

@mheon (Member) commented Jun 13, 2023

I'll take a look this afternoon

@mheon (Member) commented Jun 13, 2023

@edsantiago I don't suppose you have any idea what makes this reproduce? I'm not seeing any obvious reason why podman rm -a would exit early.

Is there a possibility it's not removing the second container at all? Or is it definitely being removed, just after the command exits?

@edsantiago (Member, Author)

Sorry, I haven't looked into it beyond reporting. I will do so now - will start by looking at the test for obvious races, and will then see if I can reproduce it.

@Luap99 (Member) commented Jun 13, 2023

@mheon If you look at the logs, you can see that in AfterEach() both podman stop --all and podman rm -fa output this cid, so yes, I think this is not a race; the container was not removed at all for some reason.

@edsantiago (Member, Author)

Gosh, I thought it'd be so easy to reproduce. It's not. Giving up for now; here's my attempt at a reproducer:

#!/bin/bash

td=/tmp/mypodmantmpdir
mkdir -p $td/root $td/runroot

PM="bin/podman --cgroup-manager cgroupfs --storage-driver vfs --events-backend file --db-backend sqlite --network-backend netavark --tmpdir $td --root $td/root --runroot $td/runroot"
IMAGE=quay.io/libpod/alpine:latest

$PM rm -fa

set -e

while :; do
    $PM run -d --name mytop $IMAGE top
    $PM run -d --network container:mytop $IMAGE top

    $PM rm -fa
    left=$($PM ps -aq)
    echo $left
    test -z "$left"
done

@edsantiago (Member, Author)

Could not reproduce on my f38 laptop over many hours. Then tried 1mt: reproduced on the second iteration, podman main @ 7c76907. Given how quickly it reproduces -- less than a minute, usually less than five seconds -- I confirmed that none of the options matter. Here is the barest-bones reproducer:

#!/bin/bash

PM=bin/podman
IMAGE=quay.io/libpod/alpine:latest

$PM rm -fa

set -e

while :; do
    $PM run -d --name mytop $IMAGE top
    $PM run -d --network container:mytop $IMAGE top

    $PM rm -fa
    left=$($PM ps -aq)
    echo $left
    test -z "$left"
done

@mheon in answer to your question, the container is not being removed. Here's my ps a few minutes after script failure:

# bin/podman ps
CONTAINER ID  IMAGE                         COMMAND     CREATED        STATUS        PORTS       NAMES
659d48b527d8  quay.io/libpod/alpine:latest  top         4 minutes ago  Up 4 minutes              mytop

(that's "ps", not "ps -a").

@edsantiago (Member, Author)

Incidentally, my observations (Cirrus logs as well as dozens of iterations on 1mt) show that it is always the first container that survives. This may be just chance, but it seems unlikely over this many failures.

@edsantiago (Member, Author)

More (possibly useless) data. This reproduces so quickly on 1mt that I wrote a test to confirm my assertion above, that the survivor is always cid1. It is (so far, over hundreds of iterations) but I'm seeing, in my loop, rather a lot of these warnings:

error opening file `/run/crun/SHA/status`: No such file or directory

I don't know where they're coming from, and I don't know what that SHA refers to. It can't be a fatal error, because the script runs under set -e and keeps going.

Oh, here's more fun. In the process of playing, I ended up with the umount/EINVAL bug (#18831)!

# ./reproducer
Error: creating container storage: the container name "mytop" is already in use by 335f21f6bee6407c79bf12181427c7db01b431fa2533112e44c9487002a14bd2. You have to remove that container to be able to reuse that name: that name is already in use

# bin/podman ps -aq   <<<--- no output
# bin/podman ps -a --external
CONTAINER ID  IMAGE                         COMMAND     CREATED        STATUS      PORTS       NAMES
335f21f6bee6  quay.io/libpod/alpine:latest  storage     6 minutes ago  Storage                 mytop

# bin/podman rm -af    <<<-- no output, no effect

#  bin/podman rm -f 335f
WARN[0000] Unmounting container "335f" while attempting to delete storage: unmounting "/var/lib/containers/storage/overlay/e3149559a6e1ae65a7b0b2e140ceb2bb7e7bdc06b9654bd5d46ba5d3adf7ad46/merged": invalid argument 
Error: removing storage for container "335f": unmounting "/var/lib/containers/storage/overlay/e3149559a6e1ae65a7b0b2e140ceb2bb7e7bdc06b9654bd5d46ba5d3adf7ad46/merged": invalid argument

From this point on, even podman system reset -f is of no help. Podman is completely hosed, nothing else works. bin/podman umount -a fails, EINVAL. mount (system mount, not podman mount) shows nothing unusual. I guess this means I am done with this 1mt.

In case it helps, here is my final confirm-the-survivor reproducer:

#!/bin/bash

PM=bin/podman
IMAGE=quay.io/libpod/alpine:latest

$PM rm -fa
echo

set -e

while :; do
    cid1=$($PM run -d --name mytop $IMAGE top)
    cid2=$($PM run -d --network container:mytop $IMAGE top)

    rm=$($PM rm -fa)
    # Order is not predictable. Usually it's cid2 cid1, but sometimes reverse
    #    if [[ "$(echo $rm)" != "$cid2 $cid1" ]]; then
    #        echo
    #        echo "rm: got  $rm"
    #        echo "expected $cid2 $cid1"
    #        exit 1
    #    fi

    left=$($PM ps -aq --notruncate)
    if [[ -n "$left" ]]; then
        if [[ "$left" != "$cid1" ]]; then
            echo
            echo "WHOA! Leftover container is not cid1!"
            echo " cid1 : $cid1"
            echo " cid2 : $cid2"
            echo " left : $left"
            exit 1
        fi
        echo "Triggered, usual case."
        $PM rm -fa >/dev/null
    fi
done

@Luap99 (Member) commented Jun 22, 2023

@mheon Can you take a look? I think this should be addressed before the next release.

@edsantiago (Member, Author)

I don't think this helps in any way, given the super-easy reproducer, but just in case:

  • debian-12 : int podman debian-12 root host sqlite
    • 06-20 20:05 in Podman rm podman rm -fa with dependencies
  • fedora-37 : int podman fedora-37 root container sqlite
    • 06-20 20:10 in Podman rm podman rm -fa with dependencies
    • 06-15 08:56 in Podman rm podman rm -fa with dependencies
    • 06-14 23:32 in Podman rm podman rm -fa with dependencies
  • fedora-37 : int podman fedora-37 root host sqlite
    • 06-21 15:11 in Podman rm podman rm -fa with dependencies
    • 06-15 18:12 in Podman rm podman rm -fa with dependencies
    • 06-15 15:23 in Podman rm podman rm -fa with dependencies
  • fedora-37 : int podman fedora-37 rootless host sqlite
    • 06-15 18:06 in Podman rm podman rm -fa with dependencies
    • 06-13 14:11 in Podman rm podman rm -fa with dependencies
    • 06-13 11:20 in Podman rm podman rm -fa with dependencies
    • 06-13 09:53 in Podman rm podman rm -fa with dependencies
  • fedora-38 : int podman fedora-38 root container sqlite
    • 06-15 18:08 in Podman rm podman rm -fa with dependencies
    • 06-15 15:17 in Podman rm podman rm -fa with dependencies
    • 06-14 20:46 in Podman rm podman rm -fa with dependencies
    • 06-13 09:50 in Podman rm podman rm -fa with dependencies
    • 06-13 08:27 in Podman rm podman rm -fa with dependencies
  • fedora-38 : int podman fedora-38 root host sqlite
    • 06-13 09:54 in Podman rm podman rm -fa with dependencies
  • fedora-38 : int podman fedora-38 rootless host sqlite
    • 06-15 16:46 in Podman rm podman rm -fa with dependencies
    • 06-14 20:47 in Podman rm podman rm -fa with dependencies
    • 06-14 14:17 in Podman rm podman rm -fa with dependencies

Basically, all distros, root & rootless. Have not seen it happen with podman-remote or on aarch64 (yet).

@edsantiago (Member, Author) commented Jul 7, 2023

New variation:

# podman [options] run --http-proxy=false --name ctr1 -d quay.io/libpod/alpine:latest top
2e485c2f06869bf64450cc183c807ff4946cf6ab58a8166a36da427bd6089a19
# podman [options] run -d --network container:ctr1 quay.io/libpod/alpine:latest top
1e73e0e60800b74061afc1f6dcb332a63f5ec6236ec7a86567bd8acdceec6997
# podman [options] rm -fa
send signal to pidfd: No such process
1e73e0e60800b74061afc1f6dcb332a63f5ec6236ec7a86567bd8acdceec6997
2e485c2f06869bf64450cc183c807ff4946cf6ab58a8166a36da427bd6089a19

[FAILED] rm -fa error logged
Expected
    <string>: send signal to pidfd: No such process
to be empty

That is: both containers were stopped and rm'ed (at least according to the log), but there was an error/warning logged.

UPDATE: never mind, this is probably #18452 (the "open pidfd" flake)
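As an aside, the "send signal to pidfd: No such process" message is the classic ESRCH race: the process exits between the moment the runtime decides to signal it and the moment the signal is sent. A minimal, hypothetical sketch of how such an error can be treated as benign during teardown (this is not Podman's actual code; `signalIfRunning` is an invented helper, and it uses plain `kill(2)` rather than pidfds for portability):

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// signalIfRunning sends sig to pid, but treats ESRCH ("no such
// process") as benign: if the process is already gone, that is
// exactly the end state we wanted when tearing a container down.
// Hypothetical sketch, not libpod code.
func signalIfRunning(pid int, sig syscall.Signal) error {
	err := syscall.Kill(pid, sig)
	if errors.Is(err, syscall.ESRCH) {
		return nil // already exited; nothing to do
	}
	return err
}

func main() {
	// A PID far above the kernel's default pid_max is almost
	// certainly unallocated, so Kill fails with ESRCH, which the
	// helper swallows. Signal 0 only probes for existence.
	fmt.Println(signalIfRunning(1<<31-1, 0))
}
```

The same "gone is good" reasoning applies whether the signal travels via a pidfd or a plain PID; only the ESRCH detection point differs.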

@nifoc commented Jul 27, 2023

Given that a reproducer already exists, I have nothing specific to contribute, other than to say that after upgrading to 4.6.0 this has started to affect us as well.

@mheon (Member) commented Jul 27, 2023

I'll take a further look tomorrow

@edsantiago (Member, Author)

Welcome back, @mheon! I hope your PTO was restful. Quick reminder that this is still a huge problem, flaking very often (but passing on ginkgo retry):

  • debian-12 : int podman debian-12 root host sqlite
    • 07-31 16:41 in Podman rm podman rm -fa with dependencies
  • debian-13 : int podman debian-13 root host sqlite
    • 08-09 20:44 in Podman rm podman rm -fa with dependencies
  • debian-13 : int podman debian-13 rootless host sqlite
    • 08-03 22:41 in TOP-LEVEL [AfterEach] Podman pod restart podman pod restart multiple pods
  • fedora-37 : int podman fedora-37 root container sqlite
    • 08-13 10:25 in Podman rm podman rm -fa with dependencies
    • 08-10 23:07 in Podman rm podman rm -fa with dependencies
    • 08-10 08:48 in Podman rm podman rm -fa with dependencies
    • 08-03 23:59 in Podman rm podman rm -fa with dependencies
    • 08-03 19:01 in Podman rm podman rm -fa with dependencies
  • fedora-37 : int podman fedora-37 root host sqlite
    • 08-11 22:33 in Podman rm podman rm -fa with dependencies
  • fedora-38 : int podman fedora-38 root container sqlite
    • 08-12 09:19 in Podman rm podman rm -fa with dependencies
    • 08-09 20:44 in Podman rm podman rm -fa with dependencies
  • fedora-38 : int podman fedora-38 root host sqlite
    • 08-03 23:55 in Podman rm podman rm -fa with dependencies
    • 08-03 13:25 in Podman rm podman rm -fa with dependencies
  • fedora-38 : int podman fedora-38 rootless host sqlite
    • 08-11 15:09 in Podman rm podman rm -fa with dependencies
    • 08-09 23:27 in Podman rm podman rm -fa with dependencies
  • rawhide : int podman rawhide root host sqlite
    • 08-09 23:25 in Podman rm podman rm -fa with dependencies
    • 08-08 17:49 in Podman rm podman rm -fa with dependencies
    • 08-03 13:24 in Podman rm podman rm -fa with dependencies

mheon added a commit to mheon/libpod that referenced this issue Sep 5, 2023
When removing a container's dependency, getting an error that the
container has already been removed (ErrNoSuchCtr and
ErrCtrRemoved) should not be fatal. We wanted the container gone,
it's gone, no need to error out.

[NO NEW TESTS NEEDED] This is a race and thus hard to test for.

Fixes containers#18874

Signed-off-by: Matt Heon <mheon@redhat.com>
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Dec 6, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 6, 2023