
placeholder issue for quay.io flakes #16973

Closed
edsantiago opened this issue Jan 2, 2023 · 22 comments
Labels
flakes (Flakes from Continuous Integration), locked - please file new issue/PR

Comments

@edsantiago
Member

edsantiago commented Jan 2, 2023

This is a placeholder only. I'm seeing a lot of quay.io flakes. We (containers team) can't do anything about those, but the way my flake logger works, the individual failures are in skillions of different tests, which makes it hard for my brain to grok. Creating a single issue, and assigning them all here, makes it much easier to track (and hence to ignore).

The most common symptom is: (EDIT: removed, this turned out to be a different bug)

Symptoms are:

Error: reading blob sha256:xxx: fetching blob: received unexpected HTTP status: 502 Bad Gateway

and

Error: copying system image from manifest list: parsing image configuration: Get "https://cdn03.quay.io/[...]": remote error: tls: handshake failure

and

Error: copying system image from manifest list: parsing image configuration: Get "https://cdn03.quay.io/[...]": dial tcp: lookup cdn03.quay.io: no such host
@edsantiago edsantiago added the flakes label Jan 2, 2023
@edsantiago
Member Author

This is hitting us really hard. Reopening.

@edsantiago
Member Author

From last 18 days:

@edsantiago
Member Author

I've got my bug report written but not yet sent to quay. Before I send... I reviewed flake logs, and see this one:

Error: initializing source docker://k8s.gcr.io/pause:3.2: pinging container registry k8s.gcr.io: Get "https://k8s.gcr.io/v2/": dial tcp: lookup k8s.gcr.io: no such host

...which is close enough to the cdn03.quay.io flake that I'm putting my bug report on pause. I do not see other non-quay instances (other than deliberate ones used in testing). What confidence do we have that this is not a bug in our infrastructure, or even in podman (with quay appearing more often simply because that's what we use)?

@rhatdan
Member

rhatdan commented Jan 25, 2023

Could this be something where we need better retries? Although this looks like name resolution failing.
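
Something like this hypothetical wrapper is what I have in mind (sketch only; the function name, retry count, and delay are made up, and it obviously wouldn't help if the lookup fails consistently):

# Sketch of a retry wrapper around pull; not anything podman or the CI does today.
retry_pull() {
    local image=$1
    local tries=3 delay=5 i
    for ((i = 1; i <= tries; i++)); do
        podman pull "$image" && return 0
        echo "pull of $image failed (attempt $i/$tries); retrying in ${delay}s" >&2
        sleep "$delay"
    done
    return 1
}

retry_pull "$SOME_IMAGE"    # placeholder: any image reference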

@edsantiago
Member Author

I noticed this morning, and reported on my (stalled) quay ticket, that cdn03.quay.io is a CNAME whereas cdn0[12] have A records. Also, they all have a TTL of 60. DNS experts, brainstorm please, how can that possibly be relevant?
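
For anyone who wants to double-check the records, something along these lines will show the type and TTL (answers will of course vary over time and by resolver):

# for h in cdn01.quay.io cdn02.quay.io cdn03.quay.io; do dig +noall +answer "$h"; done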

@edsantiago
Member Author

Last five days. This is no longer our top flake (the "happened during" one, #17193, now holds the top position), but it's still a problem. I'm wondering if they're actually the same issue, presenting in different ways.

@edsantiago
Member Author

Here's the response from quay support:

As I already mentioned, we are not aware of any ongoing issues with Quay.io.

Please double-check on your end if the following endpoints/ports are enabled on the firewall:

https://docs.openshift.com/container-platform/4.12/installing/install_config/configuring-firewall.html#configuring-firewall_configuring-firewall

Also, please enable podman debug mode so we can get a better overview of the issue [1]

[1] https://access.redhat.com/solutions/394744

Obviously, the firewall is fine, because this happens only with cdn03.quay.io. And obviously, we can't enable debug logs because this is in tests that rely on checking output. So we're on our own.
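
(For the record, the "debug mode" quay support is asking for is just podman's --log-level flag, along the lines of the command below; the image name is only an example. We can't use it here because the tests assert on exact output.)

# podman --log-level=debug pull quay.io/podman/hello 2>podman-debug.log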

@edsantiago
Member Author

Found instances in Fedora CI. January 5:

# Trying to pull docker.io/library/alpine:latest...
# Error: creating build container: copying system image from manifest list: parsing image configuration:
   Get "https://cdn03.quay.io/sha256/96/961769676411f082461f9ef46626dd7a2d1e2b2a38e6a44364bcbecf51e66dd4
          ?X-Amz-Algorithm=AWS4-HMAC-SHA256
          &X-Amz-Credential=AKIAI5LUAQGPZRPNKSJA%2F20230105%2Fus-east-1%2Fs3%2Faws4_request
          &X-Amz-Date=20230105T103439Z
          &X-Amz-Expires=600
          &X-Amz-SignedHeaders=host
          &X-Amz-Signature=c41cfd33f39bd14d1c54b8d3e17a7a3a89c065dbadd19ca0ee20adc396e8fd58
          &cf_sign=D2KIzF3p3koqfnru3PENyDGTQueiCW1uMTRdTbJkPjOUk%2B7BwIdfRE32rp03Jb3JXiKvHMC6UePqNQzRY7cT3gghnVMXVnmKXlp%2B%2FrsapQtGlNJyMpxQ0gLWcesBUv4b7ex95zPpoXZfcIImjVizDHfD5X7bZFMLqQf3oUgmFsElcuO%2BD%2B0NeQEwr7XPLiXv1T9W86fXqjO4J1tTz9QJNMloIxlssezMe4IIjcIsHl8LOO%2FdfH8Mq0AGCSBrlVJjvFAi%2BrtsL2qb7mfPvCKiBY%2F5Ouhs%2BBpWQGqCkSd2yiM7mQVPqhORIzGIHTUH%2B3BHaLL5mGQ0UxMRviMWT4OIw%3D%3D
          &cf_expiry=1672915479
          &region=us-east-1": dial tcp: lookup cdn03.quay.io: no such host

Also this one and this one, all of them January 5, all of them Fedora gating tests. All errors look similar, so I won't paste the full headers; if they're important to know, someone please grab them quick because Fedora CI logs don't last long.

That tentatively rules out a problem local to our CI setup.

@edsantiago
Member Author

I do not see this flake in RHEL gating tests. I've scoured the logs of the last 50 runs, both automatically via script and manually (to double-check my script), and see only one place where cdn0 or \.quay\.io appears. It's a tls handshake failure, not a DNS lookup failure, so I filed it under the Other issue. (Previously filed under the unexpected-EOF issue.)
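
A grep along these lines does the same scan (the path is illustrative, not my actual script):

# grep -rEn 'cdn0|\.quay\.io' downloaded-rhel-gating-logs/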

Given the frequency with which this issue triggers on Fedora, I find it very curious not to find it in RHEL. So curious, in fact, that I spun up a VM to look at something:

# grep ^hosts /etc/nsswitch.conf
hosts:      files dns myhostname
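
Contrast that with what a stock Fedora VM typically has (quoting from memory, so the exact line may vary by release), where systemd-resolved sits in the lookup path:

# grep ^hosts /etc/nsswitch.conf
hosts:      files myhostname resolve [!UNAVAIL=return] dns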

@edsantiago
Member Author

The problem persists (see below)... but only in old branches and in the build step (which, I think, runs before my systemd-resolved-disable step). I haven't seen it in other tests on main.
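
(For anyone not following along: the disable step amounts to roughly the sketch below, though the real step in #17505 may differ in detail; the upstream nameserver is just a placeholder.)

# systemctl disable --now systemd-resolved
# rm -f /etc/resolv.conf                             # usually a symlink to resolved's stub
# printf 'nameserver 1.1.1.1\n' > /etc/resolv.conf   # placeholder upstream resolver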

PR traffic has been very low the last few days, so this is not a significant result... but it's an interesting one.

@edsantiago
Member Author

Followup to my postscrum this morning: the non-test flake above happened in prebuild, which (if I'm reading the YML correctly) happens only in the Build steps, not in tests. I don't think it's worth introducing complexity into prebuild for this. (And, I hope it's needless to say, we do not want to backport the systemd-resolved-disable hack to maintenance branches.)

As a reminder, the point here is not to disable systemd-resolved. That would be great for CI, but would leave us with failures in gating tests and in customerland. My hope here is to isolate the conditions leading to the flake, and maybe learn from that.

@edsantiago
Member Author

No new cdn03 flakes today. And in response to my comment yesterday about doing the systemd fix earlier... I think it's actually helpful to keep things as they are: if we eliminate the flake window completely, we can't tell whether quay fixed something and the flake is truly gone. The fact that we now see the flake only in a much smaller window suggests to me that the problem is with systemd-resolved.

@github-actions

github-actions bot commented Apr 5, 2023

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Apr 5, 2023

@edsantiago any update?

@edsantiago
Member Author

Disabling systemd-resolved (#17505) REALLY helped. We still see the flakes in CI steps that run before my disable, so I have containers/automation_images#269 in progress to update the CI VMs. That won't help old branches, nor will it help the other quay flakes. But still, this is the flake list from the last 25 days:

We were getting multiple flakes per day. This is (not counting the v4.x ones) five in almost a month.

@dustymabe
Contributor

@edsantiago we've been seeing some quay flakes in FCOS (which also uses systemd-resolved): coreos/fedora-coreos-pipeline#852

One thing that is interesting to me is that we only see the flakes on our aarch64 machine, which happens to sit in AWS (the other machines don't). By chance, does your CI run in AWS?

@jlebon
Contributor

jlebon commented May 8, 2023

@edsantiago Have you had a chance to reach out to systemd maintainers about systemd-resolved being the source of flakes in your CI?

@edsantiago
Member Author

@jlebon I haven't tried, and am unlikely to.

@lnykryn

lnykryn commented May 10, 2023

> I do not see this flake in RHEL gating tests.

Just for the record, we don't install resolved by default on RHEL. We don't even officially support it; it is marked as a Technology Preview.

@cevich
Member

cevich commented Aug 7, 2023

FWIW, I just did a Google search for this problem and came up with an interesting hit. Not sure what to make of that, other than that it's maybe a problem somebody has experienced in OpenShift land.

@cevich
Member

cevich commented Aug 7, 2023

I'm attempting to remove @edsantiago's workaround in #19541, where we can run CI a few times to see if it's still flaking.

@edsantiago
Member Author

Closing. The cdn03 flake -- far and away the most pernicious one -- has been fixed by avoiding systemd-resolved. We still see frequent flakes on quay itself, but that's not really something we can fix.

@github-actions github-actions bot added the locked - please file new issue/PR label Jan 24, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 24, 2024