
[Tracking] arm64 kola test failures #474

Closed
14 tasks done
wrl opened this issue Aug 10, 2021 · 13 comments

wrl commented Aug 10, 2021

SELinux-related (see PR: flatcar-archive/coreos-overlay#1245):

Missing multiarch images (see PR: flatcar-archive/coreos-overlay#1179):

Polkit-related (see PR: flatcar-archive/coreos-overlay#1263):

  • cl.basic/DbusPerms: kolet: Process exited with status 1

According to @pothos, these are related to a "kernel soft lockup coming from slow QEMU emulation".

"waiting for UPDATE_STATUS_UPDATED_NEED_REBOOT: time limit exceeded"

  • cl.update.payload

"cluster failed starting machines: connect: connection refused"

  • cl.ignition.v2.btrfsroot

Also ref #470 to normalise the test running environment. As it stands now, kola is happy to run its test suite without verifying that the host system is configured properly (and, what constitutes "properly" is still somewhat hazy).
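A preflight check along the lines of #470 might look like the sketch below. The checks and binary names here are illustrative assumptions, not what kola actually verifies:

```shell
#!/bin/sh
# Hypothetical host preflight before running kola; the specific checks
# are assumptions for illustration, not kola's real behaviour.
preflight() {
    qemu_bin="${1:-qemu-system-aarch64}"
    # The right QEMU binary must be on PATH before kola can boot machines.
    if ! command -v "$qemu_bin" >/dev/null 2>&1; then
        echo "error: $qemu_bin not found on PATH" >&2
        return 1
    fi
    # Without /dev/kvm, QEMU falls back to slow TCG emulation, which is
    # exactly what triggers the soft-lockup flakiness discussed here.
    if [ ! -e /dev/kvm ]; then
        echo "warn: /dev/kvm missing; tests will run under slow emulation" >&2
    fi
    return 0
}

preflight sh && echo "host looks usable"
```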

wrl added the kind/bug and platform/arm64 labels on Aug 10, 2021
wrl self-assigned this on Aug 10, 2021

pothos commented Aug 10, 2021

cl.ignition.v2.btrfsroot and cl.update.payload both should work, and if there are temporary failures it's due to the kernel soft lockup coming from slow QEMU emulation.


wrl commented Aug 10, 2021

@pothos What's the path forward on those, then? Are we sunk unless we're testing those on actual hardware?


pothos commented Aug 10, 2021

We can tweak the QEMU invocation but otherwise it's like on amd64 where some tests have to be rerun, too.
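That rerun workflow can be sketched as a small wrapper; the kola invocation in the trailing comment is an approximate placeholder, not a verified command line:

```shell
# Retry a flaky command a few times before declaring failure; soft lockups
# under slow emulation are transient, so a rerun usually passes.
retry() {
    attempts="$1"; shift
    i=1
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0
        echo "attempt $i/$attempts failed, retrying..." >&2
        i=$((i + 1))
    done
    return 1
}

# Illustrative usage (test name from this thread, flags approximate):
# retry 3 kola run cl.update.payload
```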


wrl commented Aug 10, 2021

Alright, I have updated the list to reflect this.


pothos commented Aug 10, 2021

For completeness, here is a paste of the failure messages from the 2955.0.0 release:
https://pastebin.com/raw/y93uUVs1

The cl.flannel.* tests also seem to be related to flanneld.service itself not coming up, which causes systemctl is-system-running to return a failure.
The underlying cause there is that quay.io/coreos/flannel:v0.12.0 is not a multiarch image (one has to explicitly use quay.io/coreos/flannel:v0.12.0-arm64). This can be resolved by updating to quay.io/coreos/flannel:v0.13.0 or quay.io/coreos/flannel:v0.14.0, both of which are multiarch images.
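That tag scheme can be captured in a tiny helper for anyone scripting around the old images. The function name is made up; the tag layout follows the description above:

```shell
# Pick the right flannel image reference for a given version and host arch.
# v0.12.0 only published per-arch tags; v0.13.0/v0.14.0 ship multiarch
# manifests, so a single tag works on every architecture.
flannel_image() {
    version="$1"
    arch="$2"
    case "$version" in
        v0.12.0)
            # No multiarch manifest: non-amd64 hosts need an explicit suffix.
            if [ "$arch" = "amd64" ]; then
                echo "quay.io/coreos/flannel:${version}"
            else
                echo "quay.io/coreos/flannel:${version}-${arch}"
            fi
            ;;
        *)
            # Multiarch manifest: one tag fits all.
            echo "quay.io/coreos/flannel:${version}"
            ;;
    esac
}

flannel_image v0.12.0 arm64   # → quay.io/coreos/flannel:v0.12.0-arm64
flannel_image v0.14.0 arm64   # → quay.io/coreos/flannel:v0.14.0
```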


pothos commented Aug 10, 2021

Same problem for etcd: the quay.io/coreos/etcd:v3.3.25 image is not a multiarch image, and we need to update to quay.io/coreos/etcd:v3.5.0.
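One way to catch this class of problem up front is to check an image's manifest list for an arm64 entry. The JSON below is a trimmed, hand-written stand-in for what `docker manifest inspect` would return for a multiarch image; in practice you would pipe the real command's output in:

```shell
# Hand-written stand-in for the (trimmed) output of
# `docker manifest inspect quay.io/coreos/etcd:v3.5.0`.
manifest='{"manifests":[
  {"platform":{"architecture":"amd64","os":"linux"}},
  {"platform":{"architecture":"arm64","os":"linux"}}]}'

if printf '%s' "$manifest" | grep -q '"architecture":"arm64"'; then
    echo "multiarch: arm64 supported"
else
    echo "arm64 missing: pin an arch-specific tag instead"
fi
```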

tormath1 commented:

AFAIK the SELinux tools are not installed on arm64; that's why the basic tests are failing: they can't find the SELinux binaries.


pothos commented Aug 11, 2021

Yes, I started a PR for installing SELinux some time ago, and it needs to be done again for the current main: flatcar-archive/coreos-overlay#135


pothos commented Aug 11, 2021

I also got a temporary verity test failure and filed a PR to make it more robust: flatcar/mantle#202


tormath1 commented Aug 17, 2021

@pothos @wrl for docker.userns I got it working by removing the failing setenforce 1, but it appears that we have a race condition with torcx - that would explain why we did not get the same test results in our latest meeting.

Sometimes, torcx is not sealed, so we can't run docker: The program docker is managed by torcx, which did not run.

Comparing the two console.txt files from _kola_temp, we have the following:

// success
$ cat _kola_temp/qemu-2021-08-17-1538-10573/docker.userns/99f84f72-844c-4c59-8df9-be4b180e164c/console.txt | grep -i torcx
[   77.765874] systemd[1]: Condition check resulted in Populate torcx store to satisfy profile being skipped.
[  217.515997] systemd[1]: Reached target Verify torcx succeeded.
[  OK  ] Reached target Verify torcx succeeded.
// failure
$ cat _kola_temp/qemu-latest/docker.userns/9fae9ad7-40ac-4c98-8ced-9f209d2c507d/console.txt| grep -i torcx
[   67.623773] systemd[1]: Condition check resulted in Populate torcx store to satisfy profile being skipped.

And comparing journal logs:

// success
$ cat _kola_temp/qemu-2021-08-17-1538-10573/docker.userns/99f84f72-844c-4c59-8df9-be4b180e164c/journal.txt | grep -i torcx
...
Aug 17 15:40:26.839093 /usr/lib/systemd/system-generators/torcx-generator[755]: time="2021-08-17T13:40:26Z" level=info msg="store skipped" err="open /usr/share/oem/torcx/store: no such file or directory" path=/usr/share/oem/torcx/store
Aug 17 15:40:26.843044 /usr/lib/systemd/system-generators/torcx-generator[755]: time="2021-08-17T13:40:26Z" level=info msg="store skipped" err="open /var/lib/torcx/store/2942.0.0: no such file or directory" path=/var/lib/torcx/store/2942.0.0
Aug 17 15:40:26.846258 /usr/lib/systemd/system-generators/torcx-generator[755]: time="2021-08-17T13:40:26Z" level=info msg="store skipped" err="open /var/lib/torcx/store: no such file or directory" path=/var/lib/torcx/store
Aug 17 15:41:27.152038 /usr/lib/systemd/system-generators/torcx-generator[755]: time="2021-08-17T13:41:27Z" level=debug msg="image unpacked" image=docker path=/run/torcx/unpack/docker reference=com.coreos.cl
...
// failure
$ cat _kola_temp/qemu-latest/docker.userns/9fae9ad7-40ac-4c98-8ced-9f209d2c507d/journal.txt | grep -i torcx
...
Aug 17 16:06:53.561320 /usr/lib/systemd/system-generators/torcx-generator[756]: time="2021-08-17T14:06:53Z" level=info msg="store skipped" err="open /usr/share/oem/torcx/store: no such file or directory" path=/usr/share/oem/torcx/store
Aug 17 16:06:53.593780 /usr/lib/systemd/system-generators/torcx-generator[756]: time="2021-08-17T14:06:53Z" level=info msg="store skipped" err="open /var/lib/torcx/store/2942.0.0: no such file or directory" path=/var/lib/torcx/store/2942.0.0
Aug 17 16:06:53.595867 /usr/lib/systemd/system-generators/torcx-generator[756]: time="2021-08-17T14:06:53Z" level=info msg="store skipped" err="open /var/lib/torcx/store: no such file or directory" path=/var/lib/torcx/store

On the successful run we have the image unpacked log entry and friends, but on the failing run the log stops at the last store skipped and then there is no further output from torcx, as if it was simply killed.

EDIT: by connecting to a machine with a failing Docker, we can see that the torcx-generator seems to have run:

$ /usr/lib/systemd/system-generators/torcx-generator
DEBU[0000] common configuration parsed                   base_dir=/var/lib/torcx/ conf_dir=/etc/torcx/ run_dir=/run/torcx/ store_paths="[/usr/share/torcx/store /usr/share/oem/torcx/store/2942.0.0 /usr/share/oem/torcx/store /var/lib/torcx/store/2942.0.0 /var/lib/torcx/store]"
INFO[0000] torcx already run
$ docker ps
The program docker is managed by torcx, which did not run.

The issue might be on the sealing part then...

EDIT (bis): I suspect we have an error in the unpack / sealing part, but it is not caught / displayed in the logs. According to the systemd generator documentation, generators should not even rely on syslog, yet torcx does use syslog. There is also this issue: systemd/systemd#15638

Right now any errors in systemd generators go nowhere unless the generator is configured to send output to kmsg

I'll try forwarding errors to /dev/kmsg.
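A sketch of that forwarding idea follows. The helper name and the "torcx-generator:" tag are made up; only the /dev/kmsg destination and the `<3>` err-priority prefix are standard kernel log conventions:

```shell
# Log a message where a systemd generator's output can actually be seen:
# /dev/kmsg. KMSG_DEV is overridable purely so this sketch is testable
# without root; a real generator would write to /dev/kmsg directly.
kmsg_log() {
    target="${KMSG_DEV:-/dev/kmsg}"
    # "<3>" marks err priority in the kernel log ring buffer,
    # so the message shows up in dmesg / on the console.
    printf '<3>torcx-generator: %s\n' "$*" >> "$target"
}

# Usage inside a generator wrapper:
# kmsg_log "unpack failed for image docker"
```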

EDIT (ter): I compared the contents of /run/torcx/unpack/docker/bin/:

working docker:

core@localhost ~ $ ls -ali /run/torcx/unpack/docker/
total 0
 2 drwxr-xr-x 6 root root 120 Aug 19 13:27 .
 1 drwxr-xr-x 3 root root  60 Aug 19 13:26 ..
 3 drwxr-xr-x 2 root root  60 Aug 19 13:26 .torcx
 5 drwxr-xr-x 2 root root 320 Aug 19 13:27 bin
20 drwxr-xr-x 3 root root 100 Aug 19 13:27 lib
33 drwxr-xr-x 3 root root  60 Aug 19 13:27 usr

failing docker:

$ ls -aliZ /run/torcx/unpack/docker/
total 0
2 drwxr-xr-x 4 root root ?  80 Aug 19 12:51 .
1 drwxr-xr-x 3 root root ?  60 Aug 19 12:51 ..
3 drwxr-xr-x 2 root root ?  60 Aug 19 12:51 .torcx
5 drwxr-xr-x 2 root root ? 180 Aug 19 12:51 bin

Even the bin directory has fewer entries, so that seems to be the root cause: I suspect the untar is killed by a memory issue, but there is no OOM killer message, so it dies silently.

FINAL EDIT: we increased the memory for ARM64 QEMU - let's see. (flatcar/mantle#211)
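For reference, the memory bump boils down to raising QEMU's -m value. A small helper to build that flag fragment is sketched below; the function name, machine/CPU flags, and default size are illustrative, not mantle's actual code:

```shell
# Build a QEMU argument fragment for a given guest memory size in MiB.
# Raising -m gives the torcx untar enough headroom to finish instead of
# being silently killed, as described above.
qemu_mem_args() {
    mem_mib="${1:-4096}"
    printf '%s\n' "-machine virt -cpu max -m ${mem_mib} -nographic"
}

qemu_mem_args 4096   # → -machine virt -cpu max -m 4096 -nographic
```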


pothos commented Aug 26, 2021

I found one more case where the tests with verity would go wrong when not using QEMU: flatcar/mantle#218


pothos commented Sep 1, 2021

I created a new PR to install the SELinux tools and enable SELinux on arm64: flatcar-archive/coreos-overlay#1245 (replacing flatcar-archive/coreos-overlay#135).
This will hopefully allow us to enable the SELinux tests for arm64 again.


pothos commented Sep 3, 2021

The tests now pass with SELinux enabled:

ok - coreos.selinux.enforce
ok - coreos.selinux.boolean
ok - docker.userns
