CI flake: Panic in Spec Teardown: send on closed channel #6518

Closed
edsantiago opened this issue Jun 8, 2020 · 19 comments · Fixed by #8541
Labels: flakes, kind/bug, kind/test-flake, locked - please file new issue/PR

Comments

@edsantiago (Member)

cirrus-flake-xref is reporting three instances of this since May 26:

Panic in Spec Teardown (AfterEach) [2.677 seconds]
         Podman containers 
         /var/tmp/go/src/github.com/containers/libpod/pkg/bindings/test/containers_test.go:19
           podman wait to pause|unpause condition [AfterEach]
           /var/tmp/go/src/github.com/containers/libpod/pkg/bindings/test/containers_test.go:282
         
           Test Panicked
           send on closed channel
           /usr/lib/golang/src/runtime/panic.go:969
         
           Full Stack Trace
           panic(0x10397a0, 0x134e430)
           	/usr/lib/golang/src/runtime/panic.go:969 +0x166
           github.com/containers/libpod/pkg/bindings/test.glob..func5.20.2(0xc0000b2a38, 0xc00004dc90, 0xc000319ae8, 0xc000011180)
           	/var/tmp/go/src/github.com/containers/libpod/pkg/bindings/test/containers_test.go:308 +0xc8
           created by github.com/containers/libpod/pkg/bindings/test.glob..func5.20
           	/var/tmp/go/src/github.com/containers/libpod/pkg/bindings/test/containers_test.go:304 +0x6cc

Log links:

Link to containers_test.go:308.

All failures have been in fedora-32 special_testing_bindings.
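
For reference, "send on closed channel" is an unrecoverable runtime panic in Go: any goroutine that sends on a channel after it has been closed crashes the whole test process, which is what Ginkgo reports as "Test Panicked". A minimal stand-alone snippet (not from the test suite) that produces the same panic:

```go
package main

func main() {
	ch := make(chan error)
	close(ch)
	ch <- nil // panic: send on closed channel
}
```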

@mheon added the flakes label Jun 8, 2020
@edsantiago (Member Author)

Still happening, and (as far as my logs can tell) still only on f32:

Podman containers [AfterEach] podman wait to pause|unpause condition

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@rhatdan (Member) commented Jul 25, 2020

@edsantiago Still an issue?

@edsantiago (Member Author)

And another one just now, on my own PR #7070

@edsantiago (Member Author)

Another one yesterday:

Still no sign of it on anything other than f32

edsantiago added a commit to edsantiago/libpod that referenced this issue Jul 29, 2020
The "podman wait to pause|unpause condition" test is failing
several times a day, always a flake. Issue containers#6518.

Disable it until the cause can be identified and fixed.

Signed-off-by: Ed Santiago <santiago@redhat.com>
@github-actions

A friendly reminder that this issue had no activity for 30 days.

@rhatdan (Member) commented Aug 28, 2020

@edsantiago Still an issue?

@edsantiago (Member Author)

Uh, well, no, because it was flaking so much that I disabled the test (#7143). I haven't tried to generate a reproducer, but if today is calm I will try to do so. I've removed the stale-issue label because until the test is reenabled, we don't know.

edsantiago added a commit to edsantiago/libpod that referenced this issue Aug 31, 2020
Reference: containers#6518, a very-frequently-flaking CI test, disabled
a month ago (containers#7143) because it was triggering so often in CI.
Unfortunately, that seems to have simply swept the problem
under the rug. AFAICT nobody has bothered to look at the
root bug, so let's just reenable. If the problem persists,
I'll let annoyed developers squeaky-wheel 6518 so there's
some incentive to fix it. If the problem has miraculously
gone away in the last month, that's a win too.

(This test failure does not reproduce on my laptop, nor
does it lend itself to devising a simple reproducer on
a test VM.)

Also: since containers#5325 appears to have been closed as fixed,
remove a 'Skip' that references it. Unfortunately this
also requires removing a lot of other cruft. This was
an incidental oh-by-the-way addition that I thought
would be trivial but ended up causing a much larger diff.

Signed-off-by: Ed Santiago <santiago@redhat.com>
@edsantiago (Member Author)

Yes, still happening, in post-merge testing on master:

[+0517s] •! Panic in Spec Teardown (AfterEach) [2.851 seconds]
         Podman containers 
         /var/tmp/go/src/github.com/containers/podman/pkg/bindings/test/containers_test.go:17
           podman wait to pause|unpause condition [AfterEach]
           /var/tmp/go/src/github.com/containers/podman/pkg/bindings/test/containers_test.go:272
         
           Test Panicked
           send on closed channel
           /usr/lib/golang/src/runtime/panic.go:969
         
           Full Stack Trace
           panic(0x104b380, 0x13644d0)
           	/usr/lib/golang/src/runtime/panic.go:969 +0x166
           github.com/containers/podman/v2/pkg/bindings/test.glob..func5.20.2(0xc000010200, 0xc0001d7640, 0xc000c7eba8, 0xc0010b8520)
           	/var/tmp/go/src/github.com/containers/podman/pkg/bindings/test/containers_test.go:298 +0xc8
           created by github.com/containers/podman/v2/pkg/bindings/test.glob..func5.20
           	/var/tmp/go/src/github.com/containers/podman/pkg/bindings/test/containers_test.go:294 +0x6cc

Links, from most- to least-specific:

edsantiago added a commit to edsantiago/libpod that referenced this issue Sep 14, 2020
@rhatdan added the kind/bug and kind/test-flake labels Oct 7, 2020
@github-actions bot commented Nov 8, 2020

A friendly reminder that this issue had no activity for 30 days.

@rhatdan (Member) commented Nov 10, 2020

@edsantiago still seeing this?

@edsantiago (Member Author)

Sorry, I haven't had time to look (at this nor the other stale-issue that you haven't pinged me about yet). Won't have time until Thursday most likely. But I will, I promise.

@edsantiago (Member Author)

Yes, still happening. I'll skip the September instances, and just list October/November:

Note to self: the search term for cirrus-flake-summarize is "unpause"

edsantiago added a commit to edsantiago/libpod that referenced this issue Dec 1, 2020
It's continuing to flake, and I see no activity on containers#6518.

Flakes are evil. Let's just disable the test again, until someone
takes the initiative to fix the bug.

Signed-off-by: Ed Santiago <santiago@redhat.com>
@rhatdan (Member) commented Dec 1, 2020

@edsantiago What do you think of this patch? I think the issue is that the second errChan is set up before the first one completes, causing the issue you are seeing. If we separate the channels, we should not be closing the channel before it is used.

diff.txt

@edsantiago (Member Author)

I don't see how there could be a race here -- the code looks really sequential to me. But Go has subtleties I don't understand. I'm willing and even eager to give your approach a try for a few months, if you'd like to submit that! I will close this once your PR goes through CI and merges. Thank you!

@rhatdan (Member) commented Dec 1, 2020

I am wondering if threads would skip over the wait error channel, but you may be right. The only way I could see this happening would be if the second err = make(chan error) could fire before the close(errChan) in the first function happened.
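
A stand-alone sketch of that hypothesis (hypothetical placeholder names, not the actual bindings test code): because both goroutines close over the same errChan variable, reassigning it with a second make() before the first goroutine reaches its close(errChan) means the first goroutine closes the second channel, and the second goroutine's send then hits a closed channel. The race is timing-dependent, which would fit it only showing up as an occasional flake:

```go
package main

import "time"

// fakeWait stands in for the blocking wait call in the real test
// (hypothetical placeholder, not the podman bindings API).
func fakeWait() error {
	time.Sleep(10 * time.Millisecond)
	return nil
}

func main() {
	errChan := make(chan error)
	go func() {
		errChan <- fakeWait()
		close(errChan) // may not run until after errChan is reassigned below
	}()
	<-errChan // first wait result received

	// Reassigning the shared variable races with the close() above: if the
	// reassignment wins, the first goroutine closes this new channel
	// instead of its own.
	errChan = make(chan error)
	go func() {
		errChan <- fakeWait() // can panic: send on closed channel
		close(errChan)
	}()
	<-errChan
}
```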

rhatdan added a commit to rhatdan/podman that referenced this issue Dec 1, 2020
The It("podman wait to pause|unpause condition"... test is
flaking every so often when a messages is sent in the second
function to a channel.  It is my believe that in between the time
the first function sends a message to the channel and before it closes
the channel the second errChan=make() has happened.  This would mean that
the fist function closes the second errChan, and then when the second
function sends a message to the second errChan, it fails and blows up with
the error you are seeing.

By creating a different variable for the second channel, we eliminate the race.

Fixes: containers#6518

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
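
Continuing the hypothetical sketch above, the fix described in this commit amounts to giving each wait its own channel variable, so a goroutine can only ever close the channel it actually sends on:

```go
package main

import "time"

// fakeWait again stands in for the blocking wait call (hypothetical).
func fakeWait() error {
	time.Sleep(10 * time.Millisecond)
	return nil
}

func main() {
	pauseChan := make(chan error)
	go func() {
		pauseChan <- fakeWait()
		close(pauseChan)
	}()
	<-pauseChan

	// A distinct variable for the second wait: nothing can close it out
	// from under the goroutine that owns it, so the send cannot hit a
	// closed channel.
	unpauseChan := make(chan error)
	go func() {
		unpauseChan <- fakeWait()
		close(unpauseChan)
	}()
	<-unpauseChan
}
```
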
@github-actions bot added the locked - please file new issue/PR label Sep 22, 2023
@github-actions bot locked as resolved and limited conversation to collaborators Sep 22, 2023