
systemd.unified_cgroup_hierarchy=0 kernel argument missing on new nodes #710

Closed
spasche opened this issue Jun 23, 2021 · 24 comments · Fixed by openshift/okd-machine-os#154

@spasche

spasche commented Jun 23, 2021

Describe the bug

I deployed new nodes on an OKD cluster with user-provisioned infrastructure using libvirt KVM virtual machines.
The cluster was at version 4.7.0-0.okd-2021-06-13-090745 when the new nodes were added.
On the new nodes, we could see pods failing. For instance, BuildConfig pods were failing with the error:

error: failed to retrieve cgroup limits: cannot determine cgroup limits: open /sys/fs/cgroup/memory/memory.limit_in_bytes: no such file or directory

After investigating, I noticed that the new nodes were missing the systemd.unified_cgroup_hierarchy=0 kernel boot parameter, which exposes /sys/fs/cgroup/memory/.

rpm-ostree status on the new nodes:

rpm-ostree status
State: idle
Deployments:
● pivot://quay.io/openshift/okd-content@sha256:ce3e69194860476064a7a36075a9de3b46b2eaf6851ceedae41a69e5124a3637
              CustomOrigin: Managed by machine-config-operator
                   Version: 47.34.202106191111-0 (2021-06-19T11:14:24Z)

  pivot://quay.io/openshift/okd-content@sha256:ce3e69194860476064a7a36075a9de3b46b2eaf6851ceedae41a69e5124a3637
              CustomOrigin: Managed by machine-config-operator
                   Version: 47.34.202106191111-0 (2021-06-19T11:14:24Z)

As a workaround, I deployed the following MachineConfig:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-openshift-machineconfig-worker-kargs
spec:
  kernelArguments:
    - 'systemd.unified_cgroup_hierarchy=0'
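
For reference, applying and verifying this workaround could look like the following (w13 is just an example node name; the machine-config operator drains and reboots the workers to apply the new kernel argument):

oc apply -f 99-openshift-machineconfig-worker-kargs.yaml
# wait for the worker pool to finish rolling out (UPDATED=True)
oc get mcp worker -w
# then confirm the argument made it onto the node's command line
oc debug node/w13 -- chroot /host cat /proc/cmdline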

Version

4.7.0-0.okd-2021-06-13-090745 when the new nodes were added.
User provisioned infrastructure using libvirt KVM virtual machines.
Cluster later updated to 4.7.0-0.okd-2021-06-19-191547 (which didn't fix the issue).

How reproducible

The same issue happened on the 5 nodes that were added.

Log bundle

I can't attach the link here for confidentiality reasons. I'll happily send the link by email.

@vrutkovs
Member

Check that the 99-okd-master-disable-mitigations machine config has REVMRVRFIG1pdGlnYXRpb25zPWF1dG8sbm9zbXQKQUREIHN5c3RlbWQudW5pZmllZF9jZ3JvdXBfaGllcmFyY2h5PTAK, which is base64 for:

DELETE mitigations=auto,nosmt
ADD systemd.unified_cgroup_hierarchy=0

This was resolved in 06-19-191547

I can't attach the link here for confidentiality reasons

Remove confidential information from must-gather and upload it to any public cloud service?

@spasche
Author

spasche commented Jun 23, 2021

Hi Vadim,

Thanks for the quick response. Indeed, the machine config is missing the kernel argument:

oc get mc 99-okd-master-disable-mitigations -o jsonpath='{.spec.config.storage.files[0].contents.source}'|cut -d, -f2|base64 -d
DELETE mitigations=auto,nosmt

But the cluster is at version 06-19-191547:

oc get clusterversion 
NAME      VERSION                         AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.okd-2021-06-19-191547   True        False         2d4h    Cluster version is 4.7.0-0.okd-2021-06-19-191547

Remove confidential information from must-gather and upload it to any public cloud service?

I was wondering if that would strip too much information. Here's the archive with some of the files removed: https://1drv.ms/u/s!AloPbwYP-ZZVhYANQM67l38sGsSUMg?e=1gWxqh

@vrutkovs
Member

Seems CVO is not tracking manifests from machine-os-content.

Workaround: apply 99-okd-master-disable-mitigations and 99-okd-worker-disable-mitigations manually
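
A sketch of what the worker variant of that MachineConfig could look like, assuming the same /etc/pivot/kernel-args payload that shows up in the rendered configs later in this thread (the base64 string decodes to the DELETE/ADD lines quoted above); the master variant would differ only in the role label and name:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-okd-worker-disable-mitigations
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
        - contents:
            source: data:text/plain;charset=utf-8;base64,REVMRVRFIG1pdGlnYXRpb25zPWF1dG8sbm9zbXQKQUREIHN5c3RlbWQudW5pZmllZF9jZ3JvdXBfaGllcmFyY2h5PTAK
          mode: 384
          overwrite: true
          path: /etc/pivot/kernel-args

Note that, as discussed further down, /etc/pivot/kernel-args only takes effect when nodes are provisioned, not on already-running hosts.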

@spasche
Author

spasche commented Jun 28, 2021

Thanks, I'll do that 👍

@spasche spasche closed this as completed Jun 28, 2021
@vrutkovs
Member

Let's keep this open - the fix for this landed in 4.8/4.7 nightlies, but it didn't make it to stable just yet

@spasche
Author

spasche commented Jun 29, 2021

I'm still facing the issue. The machineconfig creates the /etc/pivot/kernel-args file on the nodes, but the kernel parameters are not applied.

oc describe node w13|grep currentConfig
    machineconfiguration.openshift.io/currentConfig: rendered-worker-1822ff96e4a1ce5a40dcf6636c9c1f50

oc get mc/rendered-worker-1822ff96e4a1ce5a40dcf6636c9c1f50 -o yaml|grep -B1 -A5 REVMRVRFIG1pdGlnYXRpb25zPWF1dG8
      - contents:
          source: data:text/plain;charset=utf-8;base64,REVMRVRFIG1pdGlnYXRpb25zPWF1dG8sbm9zbXQKQUREIHN5c3RlbWQudW5pZmllZF9jZ3JvdXBfaGllcmFyY2h5PTAK
        mode: 384
        overwrite: true
        path: /etc/pivot/kernel-args

oc debug node/w13
chroot /host

sh-5.1# cat /etc/pivot/kernel-args
DELETE mitigations=auto,nosmt
ADD systemd.unified_cgroup_hierarchy=0

sh-5.1# rpm-ostree kargs
console=tty0 console=ttyS0,115200n8 ignition.platform.id=qemu $ignition_firstboot ostree=/ostree/boot.1/fedora-coreos/549522e4a296d39371f7aac527b23be29f85899cf11f2b09aa7b3f988a18ae21/0 root=UUID=a8dd10c7-7340-400b-b784-0c10c82ae327 rw rootflags=prjquota

Is /etc/pivot/kernel-args supposed to work in "day 2" setup? Or should there be a manifest that uses machineconfig.spec.kernelArguments instead?

@vrutkovs
Member

vrutkovs commented Jul 3, 2021

Is /etc/pivot/kernel-args supposed to work in "day 2" setup?

No, it would apply for new nodes only. You can tune kernel args on hosts manually - see rpm-ostree kargs command help
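
For example, a manual fix on one existing node might look like this (run from a debug shell on the node; the change only takes effect after a reboot, and w13 is just an example node name):

oc debug node/w13
chroot /host
rpm-ostree kargs --append=systemd.unified_cgroup_hierarchy=0
systemctl reboot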

@vrutkovs
Member

vrutkovs commented Jul 3, 2021

@spasche
Author

spasche commented Jul 5, 2021

No, it would apply for new nodes only. You can tune kernel args on hosts manually - see rpm-ostree kargs command help

Ok, thanks for the clarification 👍

@mgamboa

mgamboa commented Jul 10, 2021

Hi, still having the same issue:
error: failed to retrieve cgroup limits: cannot determine cgroup limits: open /sys/fs/cgroup/memory/memory.limit_in_bytes: no such file or directory

[root@controller custom]# oc get clusterversion
NAME      VERSION                         AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.okd-2021-07-03-190901   True        False         156m    Cluster version is 4.7.0-0.okd-2021-07-03-190901

[root@controller custom]# oc get mc/rendered-worker-96cbb060223827560ba29638c4a8a409 -o yaml |grep -B1 -A5 REVMRVRFIG1pdGlnYXRpb25zPWF1dG
      - contents:
          source: data:text/plain;charset=utf-8;base64,REVMRVRFIG1pdGlnYXRpb25zPWF1dG8sbm9zbXQKQUREIHN5c3RlbWQudW5pZmllZF9jZ3JvdXBfaGllcmFyY2h5PTAK
        mode: 384
        overwrite: true
        path: /etc/pivot/kernel-args
        user:
          name: root

[root@worker0 core]# rpm-ostree kargs
random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ignition.platform.id=metal $ignition_firstboot ostree=/ostree/boot.0/rhcos/6291feade27c9cf562b8682be31eba789477cb65905d1e00adb65f0fcf096fde/0 root=UUID=57dc2f8b-17d9-45e9-82cf-c24748a73d58 rw rootflags=prjquota systemd.unified_cgroup_hierarchy=0

On the node, the memory folder doesn't exist under /sys/fs/cgroup:
[root@worker0 cgroup]# pwd
/sys/fs/cgroup
drwxr-xr-x. 2 root root 0 Jul 10 06:28 -.mount
-r--r--r--. 1 root root 0 Jul 10 06:28 cgroup.controllers
-rw-r--r--. 1 root root 0 Jul 10 06:28 cgroup.max.depth
-rw-r--r--. 1 root root 0 Jul 10 06:28 cgroup.max.descendants
-rw-r--r--. 1 root root 0 Jul 10 06:28 cgroup.procs
-r--r--r--. 1 root root 0 Jul 10 06:28 cgroup.stat
-rw-r--r--. 1 root root 0 Jul 10 09:32 cgroup.subtree_control
-rw-r--r--. 1 root root 0 Jul 10 06:28 cgroup.threads
-rw-r--r--. 1 root root 0 Jul 10 06:28 cpu.pressure
-r--r--r--. 1 root root 0 Jul 10 06:28 cpu.stat
-r--r--r--. 1 root root 0 Jul 10 06:28 cpuset.cpus.effective
-r--r--r--. 1 root root 0 Jul 10 06:28 cpuset.mems.effective
drwxr-xr-x. 2 root root 0 Jul 10 06:28 dev-hugepages.mount
drwxr-xr-x. 2 root root 0 Jul 10 06:28 dev-mqueue.mount
drwxr-xr-x. 2 root root 0 Jul 10 06:28 etc.mount
drwxr-xr-x. 2 root root 0 Jul 10 06:28 init.scope
-rw-r--r--. 1 root root 0 Jul 10 06:28 io.cost.model
-rw-r--r--. 1 root root 0 Jul 10 06:28 io.cost.qos
-rw-r--r--. 1 root root 0 Jul 10 06:28 io.pressure
-r--r--r--. 1 root root 0 Jul 10 06:28 io.stat
drwxr-xr-x. 4 root root 0 Jul 10 06:28 kubepods
drwxr-xr-x. 4 root root 0 Jul 10 06:28 kubepods.slice
drwxr-xr-x. 2 root root 0 Jul 10 06:28 machine.slice
-r--r--r--. 1 root root 0 Jul 10 06:28 memory.numa_stat
-rw-r--r--. 1 root root 0 Jul 10 06:28 memory.pressure
-r--r--r--. 1 root root 0 Jul 10 06:28 memory.stat
drwxr-xr-x. 2 root root 0 Jul 10 06:28 sys-fs-fuse-connections.mount
drwxr-xr-x. 2 root root 0 Jul 10 06:28 sys-kernel-config.mount
drwxr-xr-x. 2 root root 0 Jul 10 06:28 sys-kernel-debug.mount
drwxr-xr-x. 2 root root 0 Jul 10 06:28 sys-kernel-tracing.mount
drwxr-xr-x. 333 root root 0 Jul 10 09:32 system.slice
drwxr-xr-x. 3 root root 0 Jul 10 09:31 user.slice
drwxr-xr-x. 2 root root 0 Jul 10 06:28 usr.mount

Just for testing, I created the folder manually; now I have the memory directory:

drwxr-xr-x. 2 root root 0 Jul 10 06:28 -.mount
-r--r--r--. 1 root root 0 Jul 10 06:28 cgroup.controllers
-rw-r--r--. 1 root root 0 Jul 10 06:28 cgroup.max.depth
-rw-r--r--. 1 root root 0 Jul 10 06:28 cgroup.max.descendants
-rw-r--r--. 1 root root 0 Jul 10 06:28 cgroup.procs
-r--r--r--. 1 root root 0 Jul 10 06:28 cgroup.stat
-rw-r--r--. 1 root root 0 Jul 10 09:33 cgroup.subtree_control
-rw-r--r--. 1 root root 0 Jul 10 06:28 cgroup.threads
-rw-r--r--. 1 root root 0 Jul 10 06:28 cpu.pressure
-r--r--r--. 1 root root 0 Jul 10 06:28 cpu.stat
-r--r--r--. 1 root root 0 Jul 10 06:28 cpuset.cpus.effective
-r--r--r--. 1 root root 0 Jul 10 06:28 cpuset.mems.effective
drwxr-xr-x. 2 root root 0 Jul 10 06:28 dev-hugepages.mount
drwxr-xr-x. 2 root root 0 Jul 10 06:28 dev-mqueue.mount
drwxr-xr-x. 2 root root 0 Jul 10 06:28 etc.mount
drwxr-xr-x. 2 root root 0 Jul 10 06:28 init.scope
-rw-r--r--. 1 root root 0 Jul 10 06:28 io.cost.model
-rw-r--r--. 1 root root 0 Jul 10 06:28 io.cost.qos
-rw-r--r--. 1 root root 0 Jul 10 06:28 io.pressure
-r--r--r--. 1 root root 0 Jul 10 06:28 io.stat
drwxr-xr-x. 4 root root 0 Jul 10 06:28 kubepods
drwxr-xr-x. 4 root root 0 Jul 10 06:28 kubepods.slice
drwxr-xr-x. 2 root root 0 Jul 10 06:28 machine.slice
drwxr-xr-x. 2 root root 0 Jul 10 09:33 memory
-r--r--r--. 1 root root 0 Jul 10 06:28 memory.numa_stat
-rw-r--r--. 1 root root 0 Jul 10 06:28 memory.pressure
-r--r--r--. 1 root root 0 Jul 10 06:28 memory.stat
drwxr-xr-x. 2 root root 0 Jul 10 06:28 sys-fs-fuse-connections.mount
drwxr-xr-x. 2 root root 0 Jul 10 06:28 sys-kernel-config.mount
drwxr-xr-x. 2 root root 0 Jul 10 06:28 sys-kernel-debug.mount
drwxr-xr-x. 2 root root 0 Jul 10 06:28 sys-kernel-tracing.mount
drwxr-xr-x. 333 root root 0 Jul 10 09:32 system.slice
drwxr-xr-x. 3 root root 0 Jul 10 09:31 user.slice
drwxr-xr-x. 2 root root 0 Jul 10 06:28 usr.mount

After cd memory there are now some files inside the directory, but the memory.limit_in_bytes file is still missing:
[root@worker0 memory]# ls -l
total 0
-r--r--r--. 1 root root 0 Jul 10 09:33 cgroup.controllers
-r--r--r--. 1 root root 0 Jul 10 09:33 cgroup.events
-rw-r--r--. 1 root root 0 Jul 10 09:33 cgroup.freeze
-rw-r--r--. 1 root root 0 Jul 10 09:33 cgroup.max.depth
-rw-r--r--. 1 root root 0 Jul 10 09:33 cgroup.max.descendants
-rw-r--r--. 1 root root 0 Jul 10 09:33 cgroup.procs
-r--r--r--. 1 root root 0 Jul 10 09:33 cgroup.stat
-rw-r--r--. 1 root root 0 Jul 10 09:33 cgroup.subtree_control
-rw-r--r--. 1 root root 0 Jul 10 09:33 cgroup.threads
-rw-r--r--. 1 root root 0 Jul 10 09:33 cgroup.type
-rw-r--r--. 1 root root 0 Jul 10 09:33 cpu.max
-rw-r--r--. 1 root root 0 Jul 10 09:33 cpu.pressure
-r--r--r--. 1 root root 0 Jul 10 09:33 cpu.stat
-rw-r--r--. 1 root root 0 Jul 10 09:33 cpu.weight
-rw-r--r--. 1 root root 0 Jul 10 09:33 cpu.weight.nice
-rw-r--r--. 1 root root 0 Jul 10 09:33 cpuset.cpus
-r--r--r--. 1 root root 0 Jul 10 09:33 cpuset.cpus.effective
-rw-r--r--. 1 root root 0 Jul 10 09:33 cpuset.cpus.partition
-rw-r--r--. 1 root root 0 Jul 10 09:33 cpuset.mems
-r--r--r--. 1 root root 0 Jul 10 09:33 cpuset.mems.effective
-r--r--r--. 1 root root 0 Jul 10 09:33 hugetlb.1GB.current
-r--r--r--. 1 root root 0 Jul 10 09:33 hugetlb.1GB.events
-r--r--r--. 1 root root 0 Jul 10 09:33 hugetlb.1GB.events.local
-rw-r--r--. 1 root root 0 Jul 10 09:33 hugetlb.1GB.max
-r--r--r--. 1 root root 0 Jul 10 09:33 hugetlb.1GB.rsvd.current
-rw-r--r--. 1 root root 0 Jul 10 09:33 hugetlb.1GB.rsvd.max
-r--r--r--. 1 root root 0 Jul 10 09:33 hugetlb.2MB.current
-r--r--r--. 1 root root 0 Jul 10 09:33 hugetlb.2MB.events
-r--r--r--. 1 root root 0 Jul 10 09:33 hugetlb.2MB.events.local
-rw-r--r--. 1 root root 0 Jul 10 09:33 hugetlb.2MB.max
-r--r--r--. 1 root root 0 Jul 10 09:33 hugetlb.2MB.rsvd.current
-rw-r--r--. 1 root root 0 Jul 10 09:33 hugetlb.2MB.rsvd.max
-rw-r--r--. 1 root root 0 Jul 10 09:33 io.bfq.weight
-rw-r--r--. 1 root root 0 Jul 10 09:33 io.latency
-rw-r--r--. 1 root root 0 Jul 10 09:33 io.max
-rw-r--r--. 1 root root 0 Jul 10 09:33 io.pressure
-r--r--r--. 1 root root 0 Jul 10 09:33 io.stat
-rw-r--r--. 1 root root 0 Jul 10 09:33 io.weight
-r--r--r--. 1 root root 0 Jul 10 09:33 memory.current
-r--r--r--. 1 root root 0 Jul 10 09:33 memory.events
-r--r--r--. 1 root root 0 Jul 10 09:33 memory.events.local
-rw-r--r--. 1 root root 0 Jul 10 09:33 memory.high
-rw-r--r--. 1 root root 0 Jul 10 09:33 memory.low
-rw-r--r--. 1 root root 0 Jul 10 09:33 memory.max
-rw-r--r--. 1 root root 0 Jul 10 09:33 memory.min
-r--r--r--. 1 root root 0 Jul 10 09:33 memory.numa_stat
-rw-r--r--. 1 root root 0 Jul 10 09:33 memory.oom.group
-rw-r--r--. 1 root root 0 Jul 10 09:33 memory.pressure
-r--r--r--. 1 root root 0 Jul 10 09:33 memory.stat
-r--r--r--. 1 root root 0 Jul 10 09:33 memory.swap.current
-r--r--r--. 1 root root 0 Jul 10 09:33 memory.swap.events
-rw-r--r--. 1 root root 0 Jul 10 09:33 memory.swap.high
-rw-r--r--. 1 root root 0 Jul 10 09:33 memory.swap.max
-r--r--r--. 1 root root 0 Jul 10 09:33 pids.current
-r--r--r--. 1 root root 0 Jul 10 09:33 pids.events
-rw-r--r--. 1 root root 0 Jul 10 09:33 pids.max

How can I make this file available on all the nodes? Comparing with RHCOS in an OCP deployment, the file exists there and I don't have any issue. I don't know whether FCOS is not including the memory folder and the memory limit file, or whether OKD is not sending the correct configuration to the node.

@vrutkovs
Member

[root@worker0 core]# rpm-ostree kargs
random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ignition.platform.id=metal $ignition_firstboot ostree=/ostree/boot.0/rhcos/6291feade27c9cf562b8682be31eba789477cb65905d1e00adb65f0fcf096fde/0 root=UUID=57dc2f8b-17d9-45e9-82cf-c24748a73d58 rw rootflags=prjquota systemd.unified_cgroup_hierarchy=0

on the node into the /sys/fs/cgroup doesn't exist the folder memory

Does cat /proc/cmdline have systemd.unified_cgroup_hierarchy=0?

@mgamboa

mgamboa commented Jul 11, 2021

[root@worker1 core]# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-6291feade27c9cf562b8682be31eba789477cb65905d1e00adb65f0fcf096fde/vmlinuz-5.12.7-300.fc34.x86_64 random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ignition.platform.id=metal ostree=/ostree/boot.0/rhcos/6291feade27c9cf562b8682be31eba789477cb65905d1e00adb65f0fcf096fde/0 root=UUID=9f9d16e8-ef76-4447-bad1-c219a98fd433 rw rootflags=prjquota

It doesn't have systemd.unified_cgroup_hierarchy=0.

@mgamboa

mgamboa commented Jul 11, 2021

Also checking the file /etc/pivot/kernel-args:

DELETE mitigations=auto,nosmt
ADD systemd.unified_cgroup_hierarchy=0

The argument is present there.

@mgamboa

mgamboa commented Jul 11, 2021

As a workaround I'm using this command for now:

rpm-ostree kargs --append=systemd.unified_cgroup_hierarchy=0

But how can we implement this from the OKD machine config?

Or do we need to run rpm-ostree kargs every time for new nodes?

@spasche
Author

spasche commented Jul 12, 2021

but of course how we can implement from the okd machine config?

See my example MachineConfig in the description (i.e. use spec.kernelArguments).

Which version of OKD were you using when you created the impacted nodes?
This is supposed to be fixed for nodes created on version 4.7.0-0.okd-2021-07-03-190901 or later

@mgamboa

mgamboa commented Jul 12, 2021

Well, it looks like it is not working; as you can see, my version is 190901:

[root@controller custom]# oc get clusterversion
NAME      VERSION                         AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.okd-2021-07-03-190901

@mgamboa

mgamboa commented Jul 12, 2021

I was using user-provisioned infrastructure (UPI) for the new installation; I don't know if that makes any difference in the setup.

@spasche
Author

spasche commented Jul 12, 2021

What is relevant here is the version of the cluster you had when deploying the new nodes. Did you start a fresh installation of the cluster using 190901?

I was affected because I deployed the nodes when using version 4.7.0-0.okd-2021-06-13-090745 and then later updated the cluster.

@mgamboa

mgamboa commented Jul 14, 2021

Yes, this was a fresh install of 190901. That makes it strange that the correct configuration isn't applied, since the fix is already included in that version.

@mgamboa

mgamboa commented Jul 14, 2021

I can confirm the issue. I just deployed a new worker node and it still doesn't add systemd.unified_cgroup_hierarchy=0 to the kernel arguments:
[root@worker3 core]# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-6291feade27c9cf562b8682be31eba789477cb65905d1e00adb65f0fcf096fde/vmlinuz-5.12.7-300.fc34.x86_64 random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ignition.platform.id=metal ostree=/ostree/boot.0/rhcos/6291feade27c9cf562b8682be31eba789477cb65905d1e00adb65f0fcf096fde/0 root=UUID=9f9d16e8-ef76-4447-bad1-c219a98fd433 rw rootflags=prjquota

I made it work using rpm-ostree kargs --append=systemd.unified_cgroup_hierarchy=0

@vrutkovs
Member

Please attach (or upload to a public file-sharing service) a must-gather archive for that cluster.

@mgamboa

mgamboa commented Jul 19, 2021

Do you want the must-gather even though the nodes are already patched, or do you want me to try installing from scratch and then send the must-gather?

@creativie

Hi
Same issue here. But my OKD 4.7 was upgraded from 4.5 and some configs are still from the old version.

oc get mc  | grep mitigations
99-master-disable-mitigations                                                                 3.1.0             265d
99-okd-master-disable-mitigations                                                             3.1.0             22d
99-okd-worker-disable-mitigations                                                             3.1.0             22d
99-worker-disable-mitigations                                                                 3.1.0             265d
oc get mc 99-master-disable-mitigations -o jsonpath='{.spec.config.storage.files[0].contents.source}'|cut -d, -f2|base64 -d
DELETE mitigations=auto,nosmt
oc get mc 99-okd-master-disable-mitigations -o jsonpath='{.spec.config.storage.files[0].contents.source}'|cut -d, -f2|base64 -d
DELETE mitigations=auto,nosmt
ADD systemd.unified_cgroup_hierarchy=0

After deleting old configs, everything works fine.
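
For reference, removing the stale old-style configs (the ones without the okd- prefix) could be as simple as the following; the machine-config operator then re-renders the pools from the remaining 99-okd-*-disable-mitigations configs and rolls out the nodes:

oc delete mc 99-master-disable-mitigations 99-worker-disable-mitigations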

@ssams

ssams commented Oct 27, 2021

Stumbled across this when troubleshooting a failing BuildConfig on our cluster; the error message in the build pod was the same as posted in the original description:

error: failed to retrieve cgroup limits: cannot determine cgroup limits: open /sys/fs/cgroup/memory/memory.limit_in_bytes: no such file or directory

Our cluster was also installed as v4.5 a while ago and updated to 4.7 since then (bare metal). One of the compute nodes was replaced/reinstalled recently using a current CoreOS image, and thus apparently was using cgroupsv2, the kernel parameter was not present in the config of the fresh installation (it was present on the older/original compute nodes, hence the BuildConfig only failed when the build was run on the specific node, but not on others).
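
As a generic sanity check (not from this thread's must-gather), one way to see which cgroup hierarchy a node actually booted with is to check the filesystem type of /sys/fs/cgroup: it reports cgroup2fs for the unified v2 hierarchy and tmpfs for the v1 layout.

stat -fc %T /sys/fs/cgroup/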

I can confirm the observation and suggested resolution posted by @creativie in #710 (comment), we also had old machine configs present and after deleting 99-master-disable-mitigations and 99-worker-disable-mitigations the operator reconfigured the nodes as expected and the missing kernel parameter is now present also on the newer installation.
