
Cirrus CI: agent stopped responding #8068

Closed
edsantiago opened this issue Oct 19, 2020 · 15 comments · Fixed by #8188
Labels: flakes (Flakes from Continuous Integration), locked - please file new issue/PR

Comments

@edsantiago
Member

This is now far and away the most common CI flake.

@edsantiago edsantiago added the flakes Flakes from Continuous Integration label Oct 19, 2020
@cevich
Member

cevich commented Oct 20, 2020

@edsantiago Ref: https://cirrus-ci.org/faq/#ci-agent-stopped-responding

Note: There's a difference between this error happening on a VM-backed task and a container task.

Otherwise, I did mail support the other week about this, and they had also noted a recent uptick in these errors. However, since the agent process is the most basic of all the Cirrus-CI moving parts, and the issue is intermittent, it's going to be really hard to pin down a precise cause. It could be anything from memory/CPU exhaustion to network hiccups to outright cloud-VM failures, or any combination of those 😕

Though it's possible that #8080 could help. I'd be curious to know if there's any change in the occurrence rate once that merges.

@cevich
Member

cevich commented Oct 20, 2020

Oh, I'd also add: I have played around on a live VM while tests were running and deliberately killed (kill -9) the agent process. There's a significant timeout, on the order of 30-60 seconds, before Cirrus-CI (the service) notices and kills the VM.
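
For reference, the experiment was roughly the following (a sketch; the exact name the agent process shows up under in ps is an assumption, so the pgrep pattern would need adjusting):

    # On a live CI VM while tests are running (the "agent" pattern is an assumption):
    pgrep -af agent                        # find the Cirrus agent process and its PID
    sudo kill -9 "$(pgrep -o -f agent)"    # kill the oldest matching process outright
    date                                   # note the time: the task kept "running" for
                                           # roughly 30-60s before Cirrus-CI noticed and killed the VM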

@cevich
Member

cevich commented Oct 21, 2020

FWIW: Google does have a way to monitor things like CPU/memory usage (through an agent of their own). However, having played a bit with the logging/reporting side of this, it's much more intended for long-running VMs. It also has the added problem that OOM/CPU/network trouble could impact the logging agent itself 😖 Worse, if a VM dies (and therefore also the logging agent), we stand a good chance of losing the most important/valuable data (i.e. the set not yet sent to the logging service).

[sigh]

Honestly I'm really not sure what can be done here @edsantiago. I feel the automatic retry Cirrus-CI performs in these situations might be the best we can do, assuming there's no other obvious failure pattern. I know the Cirrus-CI service side of the equation is well monitored, so failures on their end will be dealt with (having met some threshold). It's great that you're collecting these statistics, and that may prove useful in the future...even if (as in this case) the results are seemingly random.

@edsantiago
Member Author

You keep saying "automatic retry". I've never seen that: in my experience we all have to manually restart these failures, even hours afterward. I think there is a disconnect somewhere.

@cevich
Member

cevich commented Oct 21, 2020

Let me see if I can find an example for you...

@cevich
Member

cevich commented Oct 21, 2020

...here: In PR #8021, "int podman fedora-32 root host" ran three times (not two):

  1. The first iteration, created at 13:56, failed 22 minutes into 'main'.
  2. The automatic retry at 14:39 failed 36 minutes into 'main' (note the yellow "Automatic Re-Run" at the top).
  3. Cirrus-CI stopped there; it only gives one automatic retry.
  4. The manual re-run at 15:25 passed after 28 minutes in 'main'.

While I'm looking at it...it's impossible for us to deduce any pattern from the int tests (in general). They run in random order, and three in parallel. We can fix the order of them, but cannot control which tests happen to be executing in parallel at any given moment. If we run them serially, we'll easily start hitting the 2-hour hard timeout (for the "remote" tests).

Curiously, eyeballing your list above, the 'sys remote' tasks appear to have the greatest number of incidents. This is likely the best, most actionable data-set, given those tests run one at a time and in a fixed order. Is there any way you can include task "duration" with your reporting (for the above, and going forward)?

That might help narrow down if a particular test is causing trouble, given a (roughly) fixed execution speed.

@cevich
Member

cevich commented Oct 21, 2020

I've spent quite some time pondering this issue today.

I double-checked and confirmed we are using the "Premium" (highest) networking tier in GCP. Cirrus-CI uses GKE in the same zone as our VMs, so even though the packets are briefly routed over the internet, the latency (from agent processes to their service) should be quite low. Beyond complaining about a specific outage or hiccup, there's not much we can control on the networking side of things.

Similarly, we can't control for problems within the agent process itself (bugs); though for this class of issue, Cirrus-CI would notice widespread problems (beyond just our projects). Really, the only thing we have any direct control over is what we do on the VMs and how they're set up. Unfortunately, for this class of agent-affecting problems, we completely lose all traces of the cause when Cirrus-CI clobbers the VM 😖

Focusing on what we can control, and following Occam's Razor, it seems to me the simplest theory would be:

Some test, or podman operation has a bug which occasionally hogs CPU/Memory/Network resources from the agent.

@baude @mheon WDYT about disabling the ginkgo test-randomization (even temporarily)? The idea being, tests would run in roughly the same order every time. If there's a "bad" test (or podman operation), the failures would cluster around certain VM lifetimes before failure. If there's a way for @edsantiago to log the "agent-failed" flake task IDs, I can pull the timing data via the Cirrus-CI GraphQL interface.
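
For concreteness, disabling the randomization essentially means pinning ginkgo's seed; a minimal sketch (how the flag gets threaded through our Makefile/test scripts is glossed over here, and the node count just mirrors the current three-way parallelism):

    # Pin the spec-ordering seed so every run executes tests in the same order
    # (ginkgo normally picks a new random seed per run).
    ginkgo -seed=1 -nodes=3 -r ./test/e2e/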

Despite losing the logs at the time of failure, pinning down a duration-before-failure would greatly reduce the number of haystacks we need to search through. For example, if jobs run in sequence and we graph the time-to-failure, we might be able to see failure clusters which can then be loosely correlated to a smaller number of tests. That would be very conducive to more intense scrutiny of those tests only, narrowing down the set of possible causes.
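
As a sketch of the timing-data side, a single flaked task's duration can be pulled from the public Cirrus-CI GraphQL endpoint; the exact field names (e.g. durationInSeconds) are my best guess and should be checked against the published schema:

    # Pull name/status/duration for one task by its numeric ID (placeholder below).
    TASK_ID=1234567890
    curl -s https://api.cirrus-ci.com/graphql \
         -H 'Content-Type: application/json' \
         --data '{"query": "{ task(id: \"'"$TASK_ID"'\") { name status durationInSeconds } }"}'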

Save for additional insights...this is about the best I can do to support resolution of this issue. In any case, it's going to be a LONG and slow slog 😞

@edsantiago
Member Author

@cevich the error is happening on sys tests too, which are not randomized.

@cevich
Member

cevich commented Oct 26, 2020

Yes...hmmm, maybe that's an okay starting place then, to check whether a similar set of tests is causing trouble?

Another thought I just had: it's possible to block destruction of a VM (by Cirrus) from the VM itself (using gcloud). So maybe it's time for a temporary PR with that change. Perhaps script it to allow VM destruction only after the system tests complete; then we simply re-run the tests repeatedly until we're left with a set of orphaned VMs representing only failure cases.
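
A sketch of what that could look like from the VM itself (whether the VM's service account may update its own instance is an assumption; deletion protection is just one way to block destruction):

    # Look up this VM's own name and zone from the GCE metadata server.
    NAME=$(curl -s -H 'Metadata-Flavor: Google' \
        http://metadata.google.internal/computeMetadata/v1/instance/name)
    ZONE=$(curl -s -H 'Metadata-Flavor: Google' \
        http://metadata.google.internal/computeMetadata/v1/instance/zone | awk -F/ '{print $NF}')

    # Block deletion until we explicitly allow it again.
    gcloud compute instances update "$NAME" --zone "$ZONE" --deletion-protection

    # ...system tests run here; on success, re-allow destruction...
    gcloud compute instances update "$NAME" --zone "$ZONE" --no-deletion-protection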

@edsantiago
Member Author

Here's a new one:

Failed to stop: I/O error

@cevich
Member

cevich commented Oct 28, 2020

@edsantiago Thanks for reporting, this is a different/unrelated problem. IIRC it's networking related, something to do with the backend of Cirrus-CI. They're aware of the problem and have taken steps to lower occurrences (extending timeouts / adding retry capability).

@cevich
Member

cevich commented Oct 29, 2020

I managed to catch three Fedora VMs that hit this problem in one go. The VMs are completely hung: inaccessible by ssh and unresponsive to pings. However, since I blocked Cirrus-CI from removing them, I was able to examine their serial-console output. This is what appears (more or less) on every one:

[ 1203.905090] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1203.912275] #PF: supervisor read access in kernel mode
[ 1203.917588] #PF: error_code(0x0000) - not-present page
[ 1203.922881] PGD 8000000124e2d067 P4D 8000000124e2d067 PUD 138d5e067 PMD 0 
[ 1203.929939] Oops: 0000 [#1] SMP PTI
[ 1203.933584] CPU: 1 PID: 224075 Comm: podman Not tainted 5.8.12-200.fc32.x86_64 #1
[ 1203.941238] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[ 1203.950632] RIP: 0010:__list_del_entry_valid+0x2d/0x4c
[ 1203.955945] Code: 4c 8b 47 08 48 b8 00 01 00 00 00 00 ad de 48 39 c2 0f 84 71 00 00 00 48 b8 22 01 00 00 00 00 ad de 49 39 c0 0f 84 91 00 00 00 <49> 8b 30 48 39 fe 0f 85 71 00 00 00 48 8b 52 08 48 39 f2 0f 85 56
[ 1203.974869] RSP: 0018:ffffa21d81a37990 EFLAGS: 00010017
[ 1203.980246] RAX: dead000000000122 RBX: ffff8e5f73f11150 RCX: 0000000000000000
[ 1203.987534] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8e5f61e4f138
[ 1203.994839] RBP: ffff8e5f61e4f098 R08: 0000000000000000 R09: ffff8e5f5e84be78
[ 1204.002131] R10: ffff8e5f5e84be00 R11: ffff8e5f5e84be00 R12: ffff8e5f61e4f010
[ 1204.009421] R13: ffff8e5f5e84be00 R14: 0000000000000000 R15: ffff8e5f73f51a40
[ 1204.016709] FS:  00007f55bb665700(0000) GS:ffff8e5f7bd00000(0000) knlGS:0000000000000000
[ 1204.024958] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1204.030873] CR2: 0000000000000000 CR3: 0000000136bf0006 CR4: 00000000001606e0
[ 1204.038186] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1204.045477] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1204.052766] Call Trace:
[ 1204.055440]  bfq_idle_extract+0x52/0xb0
[ 1204.059468]  bfq_put_idle_entity+0x12/0x60
[ 1204.063722]  bfq_bfqq_served+0xb0/0x190
[ 1204.067718]  bfq_dispatch_request+0x2c2/0x1070
[ 1204.072370]  blk_mq_do_dispatch_sched+0xa4/0x140
[ 1204.077175]  __blk_mq_sched_dispatch_requests+0x159/0x160
[ 1204.082728]  blk_mq_sched_dispatch_requests+0x30/0x60
[ 1204.087940]  __blk_mq_run_hw_queue+0x49/0x110
[ 1204.092496]  __blk_mq_delay_run_hw_queue+0x15b/0x170
[ 1204.097618]  blk_mq_sched_insert_requests+0x70/0xf0
[ 1204.102665]  blk_mq_flush_plug_list+0xfb/0x1a0
[ 1204.107304]  blk_finish_plug+0x36/0x50
[ 1204.111250]  ext4_writepages+0xaec/0xfd0
[ 1204.116053]  ? iov_iter_copy_from_user_atomic+0xe6/0x3b0
[ 1204.121617]  ? balance_dirty_pages_ratelimited+0xa0/0x380
[ 1204.127170]  ? generic_perform_write+0x121/0x190
[ 1204.131944]  do_writepages+0x28/0xa0
[ 1204.135747]  ? wbc_attach_and_unlock_inode+0xc5/0x130
[ 1204.140956]  __filemap_fdatawrite_range+0xa7/0xe0
[ 1204.145854]  file_write_and_wait_range+0x61/0xb0
[ 1204.150632]  ext4_sync_file+0xed/0x350
[ 1204.154565]  __x64_sys_fdatasync+0x47/0x80
[ 1204.158849]  do_syscall_64+0x4d/0x90
[ 1204.162614]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1204.167852] RIP: 0033:0x4c3ebb
[ 1204.171065] Code: ff e9 69 ff ff ff cc cc cc cc cc cc cc cc cc e8 cb 40 f8 ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
[ 1204.189985] RSP: 002b:000000c00002cd28 EFLAGS: 00000212 ORIG_RAX: 000000000000004b
[ 1204.197710] RAX: ffffffffffffffda RBX: 000000c00006a000 RCX: 00000000004c3ebb
[ 1204.205001] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000009
[ 1204.212304] RBP: 000000c00002cd68 R08: 0000000000000001 R09: 000000c0001de000
[ 1204.219597] R10: 0000000000000000 R11: 0000000000000212 R12: 000000000000004e
[ 1204.226889] R13: 000000000000004d R14: 0000000000000200 R15: 0000000000000000
[ 1204.234184] Modules linked in: ip6table_filter ip6table_nat ip6_tables xt_multiport xt_nat xt_addrtype xt_mark xt_conntrack iptable_filter xt_MASQUERADE xt_comment iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 veth overlay tun bridge stp llc sunrpc intel_rapl_msr intel_rapl_common virtio_net sb_edac rapl net_failover failover i2c_piix4 pvpanic fuse drm ip_tables crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel virtio_scsi serio_raw
[ 1204.275177] CR2: 0000000000000000
[ 1204.278647] ---[ end trace 592670dad46cafd9 ]---
[ 1204.283424] RIP: 0010:__list_del_entry_valid+0x2d/0x4c
[ 1204.288717] Code: 4c 8b 47 08 48 b8 00 01 00 00 00 00 ad de 48 39 c2 0f 84 71 00 00 00 48 b8 22 01 00 00 00 00 ad de 49 39 c0 0f 84 91 00 00 00 <49> 8b 30 48 39 fe 0f 85 71 00 00 00 48 8b 52 08 48 39 f2 0f 85 56
[ 1204.309135] RSP: 0018:ffffa21d81a37990 EFLAGS: 00010017
[ 1204.314515] RAX: dead000000000122 RBX: ffff8e5f73f11150 RCX: 0000000000000000
[ 1204.321806] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8e5f61e4f138
[ 1204.329117] RBP: ffff8e5f61e4f098 R08: 0000000000000000 R09: ffff8e5f5e84be78
[ 1204.336431] R10: ffff8e5f5e84be00 R11: ffff8e5f5e84be00 R12: ffff8e5f61e4f010
[ 1204.343734] R13: ffff8e5f5e84be00 R14: 0000000000000000 R15: ffff8e5f73f51a40
[ 1204.351053] FS:  00007f55bb665700(0000) GS:ffff8e5f7bd00000(0000) knlGS:0000000000000000
[ 1204.359400] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1204.365312] CR2: 0000000000000000 CR3: 0000000136bf0006 CR4: 00000000001606e0
[ 1204.372603] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1204.379910] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

The great news about this is: I can work around it by switching to a different I/O scheduler.
The bad news: This bug was supposedly fixed a long time ago, and confirmed as such.
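
For the record, the inspection and the workaround look roughly like this (instance name, zone, and disk device below are placeholders for our CI VMs):

    # Read the serial console of a hung VM, no ssh required:
    gcloud compute instances get-serial-port-output INSTANCE_NAME --zone us-central1-a

    # On a healthy VM, check which I/O scheduler is active and switch away from bfq:
    cat /sys/block/sda/queue/scheduler            # e.g. "mq-deadline kyber [bfq] none"
    echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler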

@cevich
Member

cevich commented Oct 29, 2020

Welp...it appears there are a TON of bugs open and re-opened on this (https://bugzilla.redhat.com/buglist.cgi?quicksearch=bfq), so it's not only us seeing this problem re-emerge. One of the bugs contains a lively discussion re: resetting the default to 'deadline' for all types of storage, going forward.
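
For our images, one way to pin the scheduler persistently (rather than flipping sysfs at runtime) would be a udev rule; a sketch, assuming the VM disks show up as sd*:

    # Force mq-deadline on (assumed) sd* disks instead of the bfq default.
    echo 'ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="mq-deadline"' \
        | sudo tee /etc/udev/rules.d/60-io-scheduler.rules
    sudo udevadm control --reload-rules && sudo udevadm trigger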

@edsantiago
Member Author

@cevich nice work! Thank you!

@cevich
Member

cevich commented Oct 30, 2020

You're welcome...I'm really happy it wasn't some obscure cause as I feared, and that it's something I've seen before.

After reading through the various bugzillas, this is definitely a re-emergence in Fedora and now affecting RHEL also. Previously, the podman integration tests were instrumental in helping upstream develop a fix. However, that took MONTHS, and being twice-burned by this, I'm shy about participating again 😕

My current (Friday) thinking: Insert head into pile of finely granulated silicates.

@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 22, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 22, 2023