
Root reprovisioning CI failures #619

Closed
cgwalters opened this issue Aug 31, 2020 · 15 comments · Fixed by coreos/ignition#1093
Labels: jira for syncing to jira

@cgwalters
Member

The tests in coreos/fedora-coreos-config#503 are failing periodically in CI runs on multiple repos, e.g.:

Seems to relate to SELinux in at least some cases.

Sohan and I couldn't reproduce this locally, but I'm increasingly thinking it's a race condition.

@cgwalters
Member Author

If it's a race, one reason this might appear in CI versus locally is that we're running a lot of other tests in parallel in CI. I'm trying this in a local run:

$ kola run --parallel 8 --multiply 8 'ext.fedora-coreos-config.*'

Another possible reason this is happening in CI is that those nodes have a lot of physical CPUs.

@cgwalters
Member Author

cgwalters commented Aug 31, 2020

OK exciting, I had to go to --parallel 32 --multiply 32 to hit this interesting race:

XFS (vdc4): Internal error !uuid_equal(&mp->m_sb.sb_uuid, &head->h_fs_uuid) at line 278 of file fs/xfs/xfs_log_recover.c.  Caller xlog_header_check_mount+0x35/0xc0 [xfs]
CPU: 0 PID: 528 Comm: mount Not tainted 5.7.17-200.fc32.x86_64 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
 dump_stack+0x64/0x88
 xfs_corruption_error+0x85/0x90 [xfs]
 ? xlog_header_check_mount+0x35/0xc0 [xfs]
 xlog_header_check_mount+0x62/0xc0 [xfs]
 ? xlog_header_check_mount+0x35/0xc0 [xfs]
 xlog_find_verify_log_record+0xfe/0x200 [xfs]
 xlog_find_head+0x1b0/0x370 [xfs]
 xlog_find_tail+0x34/0x340 [xfs]
 ? try_to_wake_up+0x26a/0x750
 xlog_recover+0x1c/0x140 [xfs]
 ? xfs_trans_ail_init+0xa8/0xd0 [xfs]
 xfs_log_mount+0x156/0x2b0 [xfs]
 xfs_mountfs+0x431/0x8a0 [xfs]
 ? xfs_mru_cache_create+0x12d/0x180 [xfs]
 xfs_fc_fill_super+0x35f/0x590 [xfs]
 ? xfs_setup_devices+0x80/0x80 [xfs]
 get_tree_bdev+0x15c/0x250
 vfs_get_tree+0x25/0xb0
 do_mount+0x7b7/0xad0
 ? memdup_user+0x4e/0x90
 __x64_sys_mount+0x8e/0xd0
 do_syscall_64+0x5b/0xf0
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f0e32af857e
Code: 48 8b 0d 1d 09 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ea 08 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007fff6a08c5e8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 00007f0e32c1f204 RCX: 00007f0e32af857e
RDX: 000056300c41cb70 RSI: 000056300c415db0 RDI: 000056300c417a90
RBP: 000056300c415b60 R08: 0000000000000000 R09: 000056300c415010
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 000056300c417a90 R14: 000056300c41cb70 R15: 000056300c415b60
XFS (vdc4): Corruption detected. Unmount and run xfs_repair
XFS (vdc4): log has mismatched uuid - can't recover
XFS (vdc4): failed to find log head
XFS (vdc4): log mount/recovery failed: error -117
mount: /sysroot: mount(2) system call failed: Structure needs cleaning.
XFS (vdc4): log mount failed

I think this is something like: we need to xfs_freeze after unmounting the rootfs before rebooting?
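In shell terms, the hypothesis would look roughly like this (a sketch only; the device and mountpoint names are illustrative, not what the initramfs actually runs):

```
# Quiesce the filesystem so the XFS log is flushed and clean before reboot,
# so log recovery on the next mount never sees a stale log.
mount /dev/vdc4 /sysroot
# ... save the rootfs contents off to RAM ...
xfs_freeze -f /sysroot   # freeze: forces dirty data and the log to disk
xfs_freeze -u /sysroot   # thaw
umount /sysroot          # a clean unmount also writes an unmount record
systemctl reboot
```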

@cgwalters
Member Author

A few runs of kola run --parallel 12 --multiply 32 'ext.fedora-coreos-config.root-reprovision.*' all passed without hitting anything. May try to reproduce in the coreosci environment once I have a dev namespace.

cgwalters referenced this issue in cgwalters/fedora-coreos-config Sep 4, 2020
This reverts commit 522ca22.

These are constantly failing in CI; we don't know why yet.
ref https://github.com/coreos/fedora-coreos-config/issues/591
@jlebon
Member

jlebon commented Sep 8, 2020

Will look into this.

@jlebon jlebon self-assigned this Sep 8, 2020
@jlebon jlebon changed the title from "CI flakes from https://github.com/coreos/fedora-coreos-config/pull/503" to "Root reprovisioning CI failures" Sep 8, 2020
@jlebon jlebon transferred this issue from coreos/fedora-coreos-config Sep 8, 2020
@jlebon jlebon added the jira for syncing to jira label Sep 8, 2020
jlebon added a commit to jlebon/fedora-coreos-config that referenced this issue Sep 9, 2020
We have a new Clevis release now with the fixes we need, so add the
packages to the manifest. This is all that's needed to support
root-on-LUKS since the rest of the rootfs replacement stack is already
LUKS-aware.

Sample Ignition config:

```
{
  "ignition": {
    "version": "3.2.0-experimental"
  },
  "storage": {
    "luks": [
      {
        "name": "myluksdev",
        "device": "/dev/disk/by-id/virtio-disk1",
        "clevis": {
          "tpm2": true
        },
        "label": "root"
      }
    ],
    "filesystems": [
      {
        "device": "/dev/disk/by-id/dm-name-myluksdev",
        "format": "xfs",
        "wipeFilesystem": true,
        "label": "root"
      }
    ]
  }
}
```

Not adding tests for now until we've resolved
coreos/fedora-coreos-tracker#619.
jlebon added a commit to jlebon/fedora-coreos-config that referenced this issue Sep 9, 2020
We have a new Clevis release now with the fixes we need, so add the
packages to the manifest. This is all that's needed to support
root-on-LUKS since the rest of the rootfs replacement stack is already
LUKS-aware.

Sample Ignition config:

```
{
  "ignition": {
    "version": "3.2.0-experimental"
  },
  "storage": {
    "luks": [
      {
        "name": "myluksdev",
        "device": "/dev/disk/by-id/virtio-disk1",
        "clevis": {
          "tpm2": true
        },
        "label": "root"
      }
    ],
    "filesystems": [
      {
        "device": "/dev/mapper/myluksdev",
        "format": "xfs",
        "wipeFilesystem": true,
        "label": "root"
      }
    ]
  }
}
```

Not adding tests for now until we've resolved
coreos/fedora-coreos-tracker#619.
@jlebon
Member

jlebon commented Sep 11, 2020

> OK exciting, I had to go to --parallel 32 --multiply 32 to hit this interesting race:

Do you remember which of the two tests hit this? (filesystem-only or raid1?)

@cgwalters
Member Author

> Do you remember which of the two tests hit this? (filesystem-only or raid1?)

I'm not sure, and it looks like my kola run got GC'd. If I had to guess, I'd say filesystem-only.

jlebon added a commit to jlebon/fedora-coreos-config that referenced this issue Sep 11, 2020
This reverts commit 77787a0.

I can't reproduce the failures in
coreos/fedora-coreos-tracker#619 locally or on
the CI cluster, and the logs from previous failures are stale. Let's
re-enable the tests and if it comes back we'll debug more deeply.
@jlebon
Member

jlebon commented Sep 11, 2020

I can't reproduce this, either locally or on the CI cluster. Let's optimistically revert the revert for now and I'll dig into errors if they pop up?

cgwalters pushed a commit to coreos/fedora-coreos-config that referenced this issue Sep 11, 2020
This reverts commit 77787a0.

I can't reproduce the failures in
coreos/fedora-coreos-tracker#619 locally or on
the CI cluster, and the logs from previous failures are stale. Let's
re-enable the tests and if it comes back we'll debug more deeply.
@jlebon
Member

jlebon commented Sep 11, 2020

Let's keep this open for now. If anyone sees this happening again, please post the logs as an attachment here so it doesn't get GC'ed.

@cgwalters
Member Author

https://jenkins-coreos-ci.apps.ocp.ci.centos.org/blue/organizations/jenkins/github-ci%2Fcoreos%2Fcoreos-assembler/detail/PR-1715/6/pipeline
had the raid1 test hang, apparently indefinitely:

[    5.097711] ignition[433]: op(1): [finished] loading QEMU firmware config module
[    5.107959] systemd[1]: Condition check resulted in Ignition (fetch) being skipped.
[  OK  ] Finished dracut pre-mount hook.
[    5.116769] systemd[1]: Starting Ignition OSTree: detect rootfs replacement...
[    5.122724] ignition[433]: fetch-offline: fetch-offline passed
[    5.126681] systemd[1]: Finished Copy CoreOS Firstboot Networking Config.
[    5.134695] ignition[433]: Ignition finished successfully
[    5.136552] systemd[1]: Starting dracut initqueue hook...
[    5.138407] systemd[1]: Finished dracut initqueue hook.
[    5.144700] systemd[1]: Reached target Remote File Systems (Pre).
[    5.146754] systemd[1]: Reached target Remote File Systems.
[    5.150697] systemd[1]: Condition check resulted in Acquire live PXE rootfs image being skipped.
[  OK  ] Finished Ignition OSTree: detect rootfs replacement.
         Starting Ignition OSTree: save rootfs...
[    5.163909] ignition-ostree-dracut-rootfs[442]: Detected rootfs replacement in fetched Ignition config: /run/ignition.json
[    5.168333] systemd[1]: Condition check resulted in Persist osmet files (PXE) being skipped.
[    5.171938] systemd[1]: Starting dracut pre-mount hook...
[    5.174295] systemd[1]: Finished dracut pre-mount hook.
[    5.189550] systemd[1]: Finished Ignition OSTree: detect rootfs replacement.
[    5.191891] systemd[1]: Starting Ignition OSTree: save rootfs...
[    5.326556] SGI XFS with ACLs, security attributes, scrub, quota, no debug enabled
[    5.331608] XFS (vdc4): Mounting V5 Filesystem
[    5.342457] XFS (vdc4): Ending clean mount
[    5.346204] xfs filesystem being mounted at /sysroot supports timestamps until 2038 (0x7fffffff)
[    5.350338] ignition-ostree-dracut-rootfs[461]: Moving rootfs to RAM...
[*     ] A start job is running for Ignition…STree: save rootfs (5s / no limit)
...
[     *] A start job is running for Ignition…STree: save rootfs (9s / no limit)
[   11.764682] XFS (vdc4): Unmounting Filesystem
[   11.862630] systemd[1]: Finished Ignition OSTree: save rootfs.
[  OK  ] Finished Ignition OSTree: save rootfs.
[   11.866260] kauditd_printk_skb: 19 callbacks suppressed
[   11.866261] audit: type=1130 audit(1600124341.468:30): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=ignition-ostree-rootfs-save comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[   11.875167] systemd[1]: Starting Ignition (disks)...
         Starting Ignition (disks)...
[   11.890131] ignition[476]: Ignition 2.6.0
[   11.891777] ignition[476]: Stage: disks
[   11.894807] ignition[476]: reading system config file "/usr/lib/ignition/base.ign"
[   11.901967] ignition[476]: disks: createRaids: op(1): [started]  waiting for devices [/dev/disk/by-id/virtio-disk1 /dev/disk/by-id/virtio-disk2]
[   11.911678] ignition[476]: disks: createRaids: op(1): [finished] waiting for devices [/dev/disk/by-id/virtio-disk1 /dev/disk/by-id/virtio-disk2]
[   11.915716] ignition[476]: disks: createRaids: created device alias for "/dev/disk/by-id/virtio-disk1": "/run/ignition/dev_aliases/dev/disk/by-id/virtio-disk1" -> "/dev/vda"
[   11.920660] ignition[476]: disks: createRaids: created device alias for "/dev/disk/by-id/virtio-disk2": "/run/ignition/dev_aliases/dev/disk/by-id/virtio-disk2" -> "/dev/vdb"
[   11.925725] ignition[476]: disks: createRaids: op(2): [started]  creating "foobar"
[   12.034804] md/raid1:md127: not clean -- starting background reconstruction
[   12.037095] md/raid1:md127: active with 2 out of 2 mirrors
[   12.039095] md127: detected capacity change from 0 to 5363466240
[   12.044198] md: resync of RAID array md127
[   12.163572] ignition[476]: disks: createRaids: op(2): [finished] creating "foobar"
[   12.181909] ignition[476]: disks: createFilesystems: op(3): [started]  waiting for devices [/dev/md/foobar]
[   12.185192] ignition[476]: disks: createFilesystems: op(3): [finished] waiting for devices [/dev/md/foobar]
[   12.188500] ignition[476]: disks: createFilesystems: created device alias for "/dev/md/foobar": "/run/ignition/dev_aliases/dev/md/foobar" -> "/dev/md127"
[   12.192382] ignition[476]: disks: createFilesystems: op(4): [started]  determining filesystem type of "/dev/md/foobar"
[   12.207473] ignition[476]: disks: createFilesystems: op(4): [finished] determining filesystem type of "/dev/md/foobar"
[   12.212738] ignition[476]: disks: createFilesystems: found  filesystem at "/dev/md/foobar" with uuid "" and label ""
[   12.216217] ignition[476]: disks: createFilesystems: op(5): [started]  wiping filesystem signatures from "/run/ignition/dev_aliases/dev/md/foobar"
[   12.229653] ignition[476]: disks: createFilesystems: op(5): [finished] wiping filesystem signatures from "/run/ignition/dev_aliases/dev/md/foobar"
[   12.239151] ignition[476]: disks: createFilesystems: op(6): [started]  creating "xfs" filesystem on "/run/ignition/dev_aliases/dev/md/foobar"
[   ***] A start job is running for Ignition (disks) (12s / no limit)
[   15.195801] ignition[476]: disks: createFilesystems: op(6): [finished] creating "xfs" filesystem on "/run/ignition/dev_aliases/dev/md/foobar"
[   15.214291] ignition[476]: disks: op(7): [started]  waiting for udev to settle
[   15.281972] ignition[476]: disks: op(7): [finished] waiting for udev to settle
[   15.285835] ignition[476]: disks: disks passed
[   15.295958] ignition[476]: Ignition finished successfully
[  OK  ] Finished Ignition (disks).
[   15.301102] audit: type=1130 audit(1600124344.903:31): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=ignition-disks comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[  OK  ] Reached target Initrd Root Device.
[   15.313912] systemd[1]: Finished Ignition (disks).
         Starting Ignition OSTree: restore rootfs...
[   15.325062] systemd[1]: Reached target Initrd Root Device.
         Starting Ignition OSTree: …rate filesystem UUID (root)...
[   15.341523] systemd[1]: Starting Ignition OSTree: restore rootfs...
[   15.345379] systemd[1]: Starting Ignition OSTree: Regenerate filesystem UUID (root)...
[  OK  ] [   15.360753] ignition-ostree-firstboot-uuid[498]: No changes required for /dev/disk/by-label/root TYPE=xfs UUID=67f15e39-f361-4364-9fdc-1786bc168312
Finished Ignition OSTree: …nerate filesystem UUID (root).
[   15.367086] audit: type=1130 audit(1600124344.969:32): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=ignition-ostree-uuid-root comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[   15.374568] systemd[1]: Finished Ignition OSTree: Regenerate filesystem UUID (root).
[   15.378564] XFS (md127): Mounting V5 Filesystem
[   15.389628] XFS (md127): Ending clean mount
[   15.395693] xfs filesystem being mounted at /sysroot supports timestamps until 2038 (0x7fffffff)
[   15.400226] ignition-ostree-dracut-rootfs[497]: Restoring rootfs from RAM...
[ ***  ] A start job is running for Ignition…e: restore rootfs (15s / no limit)
...
[   ***] A start job is running for Ignition…e: restore rootfs (28s / no limit)

@jlebon
Member

jlebon commented Sep 15, 2020

Both tests from Sohan's links above show SELinux-related issues:

[   71.843503] SELinux:  Context unlabeled is not valid (left unmapped).
[   71.850076] systemd[1]: Failed to compute init label, ignoring.
[   71.851847] systemd[1]: Successfully loaded SELinux policy in 457.134ms.
...
[  OK  ] Listening on Device-mapper event daemon FIFOs.
[   72.466410] systemd[1]: lvm2-lvmpolld.socket: Failed to determine SELinux label: Invalid argument
[   72.469014] systemd[1]: lvm2-lvmpolld.socket: Failed to listen on sockets: Invalid argument
[   72.471594] systemd[1]: lvm2-lvmpolld.socket: Failed with result 'resources'.
[   72.473514] systemd[1]: Failed to listen on LVM2 poll daemon socket.
[FAILED] Failed to listen on LVM2 poll daemon socket.
See 'systemctl status lvm2-lvmpolld.socket' for details.
[   72.477160] systemd[1]: multipathd.socket: Failed to determine SELinux label: Invalid argument
[   72.479823] systemd[1]: multipathd.socket: Failed to listen on sockets: Invalid argument
[   72.482379] systemd[1]: multipathd.socket: Failed with result 'resources'.
[   72.484246] systemd[1]: Failed to listen on multipathd control socket.
[FAILED] Failed to listen on multipathd control socket.
See 'systemctl status multipathd.socket' for details.
...

Hmm, almost seems like there's a mismatch between the labels in the policy and the ones on disk?

I'm trying to reproduce this manually on the same CI cluster but not having any luck so far. I'll have to poke at this through PRs.
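If it does reproduce, one way to check for that kind of mismatch would be standard SELinux tooling along these lines (the paths are illustrative):

```
# Labels as stored on disk:
ls -Z /var/home /var/mnt
# Labels the loaded policy says those paths should have:
matchpathcon /var/home /var/mnt
# Dry-run scan for mismatches (prints what it would relabel, changes nothing):
restorecon -nvR /var
```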

@jlebon
Member

jlebon commented Sep 15, 2020

One interesting thing is that the revert was merged 4 days ago, and only now are we seeing failures. This coincides with the cluster currently having NFS issues: https://pagure.io/centos-infra/issue/26#comment-685779.

Ordinarily these tests run in an emptyDir, though Jenkins still has to write back the job logs over NFS, and perhaps hangs in the hypervisor are causing hangs in the guest? Not sure.

@cgwalters
Member Author

So what's different about these jobs from a CI perspective, though? Is it the amount of data written into the qcow2 overlay?
I wonder if it's something like us writing that qcow2 data into the container overlayfs in /var/tmp instead of the emptyDir? Hopefully it's not ending up on NFS somehow.
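A quick way to check where that data is actually landing (standard mount inspection, nothing kola-specific; the path is illustrative):

```
# Is /var/tmp backed by the container overlayfs, the emptyDir, or NFS?
findmnt -T /var/tmp -o TARGET,SOURCE,FSTYPE
df -hT /var/tmp
```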

jlebon added a commit to jlebon/ignition that referenced this issue Sep 15, 2020
The loopback issue is fixed now and we need a newer kernel for the
rootfs reprovisioning tests to work.

Closes: coreos/fedora-coreos-tracker#619
jlebon added a commit to jlebon/ignition that referenced this issue Sep 15, 2020
The loopback issue should be fixed now and we need a newer kernel for
the rootfs reprovisioning tests to work.

Closes: coreos/fedora-coreos-tracker#619
@jlebon
Member

jlebon commented Sep 15, 2020

OK, finally got to the end of this. I think it's only Ignition that is consistently failing on this, and that is fixed by coreos/ignition#1093. In hindsight it should've been obvious that we were using an older kernel, which doesn't support reading labels before the policy is loaded. The smoking gun was comparing the output of setfiles in a successful test vs. a broken one:

[   42.030869] ignition-ostree-populate-var[725]: Relabeled /sysroot//var/home from (null) to system_u:object_r:home_root_t:s0
...

vs

[   55.415904] ignition-ostree-populate-var[674]: Relabeled /sysroot//var/mnt from unlabeled to system_u:object_r:var_t:s0
...

Notice the `(null)` vs. `unlabeled` here. Prior to 5.7.9, `unlabeled` is what the kernel returns when trying to read a label before the policy is loaded.
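The difference should be visible directly via the xattr interface, e.g. with something like this run before the policy is loaded (a sketch; the path is illustrative):

```
# On kernels < 5.7.9 this reports security.selinux="unlabeled" for a file
# with no label; on newer kernels the read fails instead, which setfiles
# then reports as (null).
getfattr -n security.selinux /sysroot/var/mnt
```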

I do think there are other flakes, but at this point I think they're rare enough to live with for now. (At the very least, there's the XFS traceback Colin found above, and the raid1 test sometimes hanging.) My proposal is to open separate issues for those if they become a real problem and we're inclined to dig deeper.

jlebon added a commit to jlebon/ignition that referenced this issue Sep 15, 2020
The loopback issue should be fixed now and we need a newer kernel for
the rootfs reprovisioning tests to work.

Closes: coreos/fedora-coreos-tracker#619
@miabbott
Member

miabbott commented Feb 16, 2022

While chasing down problems seen on a hotfixed RHCOS 4.9 build, we observed some races between randomizing the rootfs UUID and mounting the rootfs. The fix for this is believed to be in coreos/fedora-coreos-config#1357.

When we spoke with @sandeen about the race, he identified this issue as one that could be traced to the same root cause.

See also downstream BZ https://bugzilla.redhat.com/show_bug.cgi?id=2055258
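For context, the reprovisioning step in question is essentially a UUID rewrite followed by a mount, so the window looks something like this (illustrative commands, not the actual unit contents):

```
# If the mount races ahead of the UUID rewrite landing on disk, the log's
# h_fs_uuid check in xfs_log_recover.c can trip, as in the traceback
# earlier in this issue.
xfs_admin -U generate /dev/disk/by-label/root   # randomize the filesystem UUID
udevadm settle                                  # let udev re-read the new UUID
mount /dev/disk/by-label/root /sysroot
```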
