Root reprovisioning CI failures #619
If it's a race, one reason this might appear in CI versus locally is that we're running a lot of other tests in parallel in CI. I'm trying this in a local run:
Another possible reason this is happening in CI is that those nodes have a lot of physical CPUs.
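If anyone wants to chase the parallelism angle locally, something along these lines should approximate it; this is just a sketch, and the test-name glob and parallelism count are assumptions, not the exact CI invocation:

```
# From inside a coreos-assembler working directory, run the root-reprovision
# kola tests with elevated parallelism to mimic a busy CI node.
cosa kola run --parallel 8 'ext.config.root-reprovision.*'
```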
OK exciting, I had to go to …
I think this is something like: we need to …
A few runs of …
This reverts commit 522ca22. These are constantly failing in CI, we don't know why yet. ref https://github.com/coreos/fedora-coreos-config/issues/591
Will look into this.
We have a new Clevis release now with the fixes we need, so add the packages to the manifest. This is all that's needed to support root-on-LUKS since the rest of the rootfs replacement stack is already LUKS-aware. Sample Ignition config:

```
{
  "ignition": { "version": "3.2.0-experimental" },
  "storage": {
    "luks": [
      {
        "name": "myluksdev",
        "device": "/dev/disk/by-id/virtio-disk1",
        "clevis": { "tpm2": true },
        "label": "root"
      }
    ],
    "filesystems": [
      {
        "device": "/dev/disk/by-id/dm-name-myluksdev",
        "format": "xfs",
        "wipeFilesystem": true,
        "label": "root"
      }
    ]
  }
}
```

Not adding tests for now until we've resolved coreos/fedora-coreos-tracker#619.
We have a new Clevis release now with the fixes we need, so add the packages to the manifest. This is all that's needed to support root-on-LUKS since the rest of the rootfs replacement stack is already LUKS-aware. Sample Ignition config:

```
{
  "ignition": { "version": "3.2.0-experimental" },
  "storage": {
    "luks": [
      {
        "name": "myluksdev",
        "device": "/dev/disk/by-id/virtio-disk1",
        "clevis": { "tpm2": true },
        "label": "root"
      }
    ],
    "filesystems": [
      {
        "device": "/dev/mapper/myluksdev",
        "format": "xfs",
        "wipeFilesystem": true,
        "label": "root"
      }
    ]
  }
}
```

Not adding tests for now until we've resolved coreos/fedora-coreos-tracker#619.
Do you remember which of the two tests hit this? (…)
I'm not sure, and it looks like my kola run got GC'd. If I had to guess I'd say …
This reverts commit 77787a0. I can't reproduce the failures in coreos/fedora-coreos-tracker#619 locally or on the CI cluster, and the logs from previous failures are stale. Let's re-enable the tests and if it comes back we'll debug more deeply.
I can't reproduce this, either locally or on the CI cluster. Let's optimistically revert the revert for now and I'll dig into errors if they pop up?
Let's keep this open for now. If anyone sees this happening again, please post the logs as an attachment here so they don't get GC'ed.
https://jenkins-coreos-ci.apps.ocp.ci.centos.org/blue/organizations/jenkins/github-ci%2Fcoreos%2Fcoreos-assembler/detail/PR-1715/6/pipeline
Both tests from Sohan's links above show SELinux-related issues:
Hmm, it almost seems like there's a mismatch between the labels in the policy and the ones on disk? I'm trying to reproduce this manually on the same CI cluster but not having any luck so far. I'll have to poke at this through PRs.
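For reference, a minimal sketch of how one could check the policy-vs-disk hypothesis on a node or image where this reproduces; the paths below are placeholders, not taken from the CI logs:

```
# Compare the on-disk SELinux context of a file against what the loaded policy expects.
matchpathcon -V /var/lib/systemd/random-seed
# Dry run: list anything restorecon would want to relabel under a suspect tree.
restorecon -nvR /etc
```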
One interesting thing is that the revert was merged 4 days ago and only now are we seeing failures. This coincides with the cluster currently having NFS issues: https://pagure.io/centos-infra/issue/26#comment-685779. Ordinarily these tests run in an emptyDir, though Jenkins still has to write back the job logs over NFS, and perhaps hangs in the hypervisor are causing hangs in the guest? Not sure.
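If NFS is a suspect, one quick sanity check from inside the pod would be to confirm what actually backs the directory the job does its heavy I/O in (the path below is a placeholder):

```
# Show the source device and filesystem type backing the working directory;
# an emptyDir should show up as the node's local filesystem rather than nfs.
findmnt -T /srv
```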
So what's different about these jobs from a CI perspective, though? Is it the amount of data written into the qcow2 overlay?
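One hedged way to test the overlay theory would be to compare the allocated size of the qcow2 overlays left behind by a passing and a failing test; the path below is a placeholder, since kola's temp layout may differ:

```
# "disk size" in the output is the actual allocation, i.e. roughly how much
# data the guest wrote into the overlay during the test.
qemu-img info /var/tmp/kola-temp/disk.qcow2
```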
The loopback issue is fixed now and we need a newer kernel for the rootfs reprovisioning tests to work. Closes: coreos/fedora-coreos-tracker#619
The loopback issue should be fixed now and we need a newer kernel for the rootfs reprovisioning tests to work. Closes: coreos/fedora-coreos-tracker#619
OK, finally got to the end of this. I think it's only Ignition which is consistently failing on this, and that is fixed by coreos/ignition#1093. It should've been obvious that we were using an older kernel which doesn't support reading labels. The smoking gun was comparing the output of …
vs
…
Notice the …. I do think there are other flakes, but at this point I think they're rare enough to live with for now. (At the very least, there's the xfs traceback Colin found above, and the ….)
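For anyone hitting something similar, a hedged way to double-check whether the kernel in use actually exposes SELinux labels on the filesystem in question (the path is just an example):

```
# Kernel version in the environment under test.
uname -r
# If labels are readable, this dumps the security.selinux xattrs; if the
# kernel/filesystem combination can't read them, the output is empty.
getfattr -m security.selinux -d /sysroot/ostree 2>/dev/null
ls -Z /sysroot/ostree
```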
While chasing down problems seen on a hotfixed RHCOS 4.9 build, we observed some races between randomizing the rootfs UUID and mounting the rootfs. The fix for this is believed to be in coreos/fedora-coreos-config#1357. Speaking with @sandeen about the race, he identified this issue as one that could be traced to the same root cause. See also downstream BZ https://bugzilla.redhat.com/show_bug.cgi?id=2055258
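To make the race concrete, the sequence being raced is roughly the following; this is only a sketch of the shape of the problem, not the actual units or scripts involved:

```
# Step A: regenerate the rootfs UUID on first boot so cloned images don't
# end up with identical filesystem UUIDs.
xfs_admin -U generate /dev/disk/by-label/root
# Step B: mount the root filesystem. If B can start before A has fully
# finished (no strict ordering between the two), the mount may observe a
# half-updated superblock and fail.
mount /dev/disk/by-label/root /sysroot
```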
The tests in coreos/fedora-coreos-config#503 are failing periodically in CI runs on multiple repos, e.g.:
Seems to relate to SELinux in at least some cases.
Sohan and I couldn't reproduce this locally but I am increasingly thinking it's a race condition.