
placeholder issue for quay.io flakes #16973

Closed
edsantiago opened this issue Jan 2, 2023 · 22 comments
Labels
flakes (Flakes from Continuous Integration), locked - please file new issue/PR

Comments

@edsantiago
Member

edsantiago commented Jan 2, 2023

This is a placeholder only. I'm seeing a lot of quay.io flakes. We (containers team) can't do anything about those, but the way my flake logger works, the individual failures are in skillions of different tests, which makes it hard for my brain to grok. Creating a single issue, and assigning them all here, makes it much easier to track (and hence to ignore).

The most common symptom is: (EDIT: removed, this turned out to be a different bug)

Symptoms are:

Error: reading blob sha256:xxx: fetching blob: received unexpected HTTP status: 502 Bad Gateway

and

Error: copying system image from manifest list: parsing image configuration: Get "https://cdn03.quay.io/[...]": remote error: tls: handshake failure

and

Error: copying system image from manifest list: parsing image configuration: Get "https://cdn03.quay.io/[...]": dial tcp: lookup cdn03.quay.io: no such host
@edsantiago edsantiago added the flakes label Jan 2, 2023
@edsantiago
Member Author

This is hitting us really hard. Reopening.

@edsantiago
Member Author

From last 18 days:

@edsantiago
Member Author

I've got my bug report written but not yet sent to quay. Before I send... I reviewed flake logs, and see this one:

Error: initializing source docker://k8s.gcr.io/pause:3.2: pinging container registry k8s.gcr.io: Get "https://k8s.gcr.io/v2/": dial tcp: lookup k8s.gcr.io: no such host

...which is close enough to the cdn03.quay.io flake that I'm putting my bug report on pause. I do not see other non-quay instances (other than deliberate ones used in testing). What confidence do we have that this is not a bug in our infrastructure, or even in podman (with quay appearing more often simply because that's what we use)?

@rhatdan
Member

rhatdan commented Jan 25, 2023

Could this be something where we need better retries? Although this looks like name resolution failing.
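
Something like this hypothetical wrapper is what I have in mind (sketch only; the function name, retry count, and delay are made up, and it obviously wouldn't help if the lookup fails consistently):

# Sketch of a retry wrapper around pull; not anything podman or the CI does today.
retry_pull() {
    local image=$1
    local tries=3 delay=5 i
    for ((i = 1; i <= tries; i++)); do
        podman pull "$image" && return 0
        echo "pull of $image failed (attempt $i/$tries); retrying in ${delay}s" >&2
        sleep "$delay"
    done
    return 1
}

retry_pull "$SOME_IMAGE"    # placeholder: any image reference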

@edsantiago
Member Author

I noticed this morning, and reported on my (stalled) quay ticket, that cdn03.quay.io is a CNAME whereas cdn0[12] have A records. Also, they all have a TTL of 60. DNS experts, brainstorm please, how can that possibly be relevant?
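
For anyone who wants to double-check the records, something along these lines will show the type and TTL (answers will of course vary over time and by resolver):

# for h in cdn01.quay.io cdn02.quay.io cdn03.quay.io; do dig +noall +answer "$h"; done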

@edsantiago
Member Author

Last five days. This is no longer our top flake (the "happened during" one, #17193, now holds the top position), but it's still a problem. I'm wondering if they're actually the same issue, presenting in different ways.

@edsantiago
Member Author

Here's the response from quay support:

As I already mentioned, we are not aware of any ongoing issues with Quay.io.

Please double-check on your end if the following endpoints/ports are enabled on the firewall:

https://docs.openshift.com/container-platform/4.12/installing/install_config/configuring-firewall.html#configuring-firewall_configuring-firewall

Also, please enable podman debug mode so we can get a better overview of the issue [1]

[1] https://access.redhat.com/solutions/394744

Obviously, the firewall is fine, because this happens only with cdn03.quay.io. And obviously, we can't enable debug logs because this is in tests that rely on checking output. So we're on our own.
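
(For the record, the "debug mode" quay support is asking for is just podman's --log-level flag, along the lines of the command below; the image name is only an example. We can't use it here because the tests assert on exact output.)

# podman --log-level=debug pull quay.io/podman/hello 2>podman-debug.log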

@edsantiago
Member Author

Found instances in Fedora CI. January 5:

# Trying to pull docker.io/library/alpine:latest...
# Error: creating build container: copying system image from manifest list: parsing image configuration:
   Get "https://cdn03.quay.io/sha256/96/961769676411f082461f9ef46626dd7a2d1e2b2a38e6a44364bcbecf51e66dd4
          ?X-Amz-Algorithm=AWS4-HMAC-SHA256
          &X-Amz-Credential=AKIAI5LUAQGPZRPNKSJA%2F20230105%2Fus-east-1%2Fs3%2Faws4_request
          &X-Amz-Date=20230105T103439Z
          &X-Amz-Expires=600
          &X-Amz-SignedHeaders=host
          &X-Amz-Signature=c41cfd33f39bd14d1c54b8d3e17a7a3a89c065dbadd19ca0ee20adc396e8fd58
          &cf_sign=D2KIzF3p3koqfnru3PENyDGTQueiCW1uMTRdTbJkPjOUk%2B7BwIdfRE32rp03Jb3JXiKvHMC6UePqNQzRY7cT3gghnVMXVnmKXlp%2B%2FrsapQtGlNJyMpxQ0gLWcesBUv4b7ex95zPpoXZfcIImjVizDHfD5X7bZFMLqQf3oUgmFsElcuO%2BD%2B0NeQEwr7XPLiXv1T9W86fXqjO4J1tTz9QJNMloIxlssezMe4IIjcIsHl8LOO%2FdfH8Mq0AGCSBrlVJjvFAi%2BrtsL2qb7mfPvCKiBY%2F5Ouhs%2BBpWQGqCkSd2yiM7mQVPqhORIzGIHTUH%2B3BHaLL5mGQ0UxMRviMWT4OIw%3D%3D
          &cf_expiry=1672915479
          &region=us-east-1": dial tcp: lookup cdn03.quay.io: no such host

Also this one and this one, all of them January 5, all of them Fedora gating tests. All errors look similar, so I won't paste the full headers; if they're important to know, someone please grab them quick because Fedora CI logs don't last long.

That tentatively rules out a problem local to our CI setup.

@edsantiago
Member Author

I do not see this flake in RHEL gating tests. I've scoured the logs of the last 50 runs, both automatically via script and manually (to double-check my script), and see only one place where cdn0 or \.quay\.io appears. It's a tls handshake failure, not a DNS lookup failure, so I filed it under the Other issue. (Previously filed under the unexpected-EOF issue.)
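
A grep along these lines does the same scan (the path is illustrative, not my actual script):

# grep -rEn 'cdn0|\.quay\.io' downloaded-rhel-gating-logs/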

Given the frequency with which this issue triggers on Fedora, I find it very curious not to find it in RHEL. So curious, in fact, that I spun up a VM to look at something:

# grep ^hosts /etc/nsswitch.conf
hosts:      files dns myhostname
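
Contrast that with what a stock Fedora VM typically has (quoting from memory, so the exact line may vary by release), where systemd-resolved sits in the lookup path:

# grep ^hosts /etc/nsswitch.conf
hosts:      files myhostname resolve [!UNAVAIL=return] dns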

@edsantiago
Member Author

The problem persists (see below)... but only in old branches and in the build step (which, I think, runs before my systemd-resolved-disable step). I haven't seen it in other tests on main.
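
(For anyone not following along: the disable step amounts to roughly the sketch below, though the real step in #17505 may differ in detail; the upstream nameserver is just a placeholder.)

# systemctl disable --now systemd-resolved
# rm -f /etc/resolv.conf                             # usually a symlink to resolved's stub
# printf 'nameserver 1.1.1.1\n' > /etc/resolv.conf   # placeholder upstream resolver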

PR traffic has been very low the last few days, so this is not a significant result... but it's an interesting one.

@edsantiago
Member Author

Followup to my postscrum this morning: the non-test flake above happened in prebuild, which (if I'm reading the YML correctly) happens only in the Build steps, not in tests. I don't think it's worth introducing complexity into prebuild for this. (And, I hope it's needless to say, we do not want to backport the systemd-resolved-disable hack to maintenance branches.)

As a reminder, the point here is not to disable systemd-resolved. That would be great for CI, but would leave us with failures in gating tests and in customerland. My hope here is to isolate the conditions leading to the flake, and maybe learn from that.

@edsantiago
Member Author

No new cdn03 flakes today. And in response to my comment yesterday about doing the systemd fix earlier... I think it's actually helpful to keep things as they are: if we eliminate the flake window completely, we can't tell whether quay fixed something and the flake is truly gone. The fact that we now see the flake only in a much smaller window suggests to me that the problem is with systemd-resolved.

@github-actions

github-actions bot commented Apr 5, 2023

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Apr 5, 2023

@edsantiago any update?

@edsantiago
Member Author

Disabling systemd-resolved (#17505) REALLY helped. We still see the flakes in CI steps that run before my disable, so I have containers/automation_images#269 in progress to update the CI VMs. That won't help old branches, nor will it help the other quay flakes. But still, this is the flake list from the last 25 days:

We were getting multiple flakes per day. This is (not counting the v4.x ones) five in almost a month.

@dustymabe
Contributor

@edsantiago we've been seeing some quay flakes in FCOS (which also uses systemd-resolved): coreos/fedora-coreos-pipeline#852

One thing that is interesting to me is that we only see the flakes on our aarch64 machine, which happens to sit in AWS (the other machines don't). By chance, does your CI run in AWS?

@jlebon
Contributor

jlebon commented May 8, 2023

@edsantiago Have you had a chance to reach out to systemd maintainers about systemd-resolved being the source of flakes in your CI?

@edsantiago
Member Author

@jlebon I haven't tried, and am unlikely to.

@lnykryn

lnykryn commented May 10, 2023

> I do not see this flake in RHEL gating tests.

Just for the record, we don't install resolved by default on RHEL. We don't even officially support it; it is marked as a Technology Preview.

@cevich
Member

cevich commented Aug 7, 2023

FWIW, I just did a Google search for this problem and came up with an interesting hit. Not sure what to make of that, other than that it's maybe a problem somebody has experienced in OpenShift land.

@cevich
Member

cevich commented Aug 7, 2023

I'm attempting to remove @edsantiago's workaround in #19541, where we can run CI a few times to see if it's still flaking.

@edsantiago
Member Author

Closing. The cdn03 flake -- far and away the most pernicious one -- has been fixed by avoiding systemd-resolved. We still see frequent flakes on quay itself, but that's not really something we can fix.

@github-actions github-actions bot added the locked - please file new issue/PR label Jan 24, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 24, 2024