systemd.unified_cgroup_hierarchy=0 kernel argument missing on new nodes #710
Comments
Check that
This was resolved in 06-19-191547
Remove confidential information from must-gather and upload it to any public cloud service?
Hi Vadim, thanks for the quick response. Indeed, the machine config is missing the kernel argument:
But the cluster is at version 06-19-191547:
I was wondering if that would strip too much information. Here's the archive with some of the files removed: https://1drv.ms/u/s!AloPbwYP-ZZVhYANQM67l38sGsSUMg?e=1gWxqh
Seems CVO is not tracking the manifests. Workaround: apply 99-okd-master-disable-mitigations and 99-okd-worker-disable-mitigations manually.
Thanks, I'll do that 👍
Let's keep this open - the fix for this landed in 4.8/4.7 nightlies, but didn't make it to stable just yet.
I'm still facing the issue. The machineconfig creates the
Is it applied to existing nodes as well?
No, it would apply to new nodes only. You can tune kernel args on hosts manually - see the rpm-ostree kargs approach below.
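A minimal sketch of that manual per-host tuning, assuming SSH access to the affected node (it uses the same rpm-ostree command mentioned later in this thread):

```bash
# Append the missing kernel argument to the node's boot configuration
sudo rpm-ostree kargs --append=systemd.unified_cgroup_hierarchy=0

# The argument only takes effect on the next boot
sudo systemctl reboot
```

Note that this has to be repeated for every node provisioned this way, which is why the MachineConfig route in the description is preferable.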
Ok, thanks for the clarification 👍
Hi, still having the same issue. I checked with `oc get clusterversion` and `oc get mc/rendered-worker-96cbb060223827560ba29638c4a8a409 -o yaml | grep -B1 -A5 REVMRVRFIG1pdGlnYXRpb25zPWF1dG` on the controller, and ran `rpm-ostree kargs` on the node. In /sys/fs/cgroup the memory folder doesn't exist. Just for testing I created the folder manually; now I have the memory directory (drwxr-xr-x. 2 root root 0 Jul 10 06:28 -.mount), and after cd'ing into memory there are some files inside the directory, but the memory.limit_in_bytes file is still missing. How can I make that file available on all the nodes? Comparing with RHCOS in an OCP deployment, the file exists there and I don't have any issue. I don't know why FCOS is not including the memory folder and the memory limit file, or whether OKD is not sending the correct configuration to the node.
Does /proc/cmdline on that node contain systemd.unified_cgroup_hierarchy=0?
Running cat /proc/cmdline on worker1 shows it doesn't have systemd.unified_cgroup_hierarchy=0.
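For reference, a quick sketch of how to verify the boot parameter and the active cgroup hierarchy on a node (plain Linux commands, nothing OKD-specific):

```bash
# Is the kernel argument present on the current boot?
grep -o 'systemd.unified_cgroup_hierarchy=0' /proc/cmdline || echo "argument missing"

# Which cgroup hierarchy is mounted? 'tmpfs' indicates cgroup v1, 'cgroup2fs' indicates unified cgroup v2
stat -fc %T /sys/fs/cgroup

# Under cgroup v1 the memory controller files should exist
ls /sys/fs/cgroup/memory/memory.limit_in_bytes
```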
Also checking the file, here is the argument:
As a workaround I'm just using that command for now, but of course, how can we implement this from the OKD machine config? Or do we need to use rpm-ostree kargs every time for new nodes?
See my example MachineConfig in the description. Which version of OKD were you using when you created the impacted nodes?
Well, it looks like it's not working; as you can see, my version is 190901.
I was using user-provisioned infrastructure (UPI) for the new installation; I don't know if that makes any difference in the setup.
What is relevant here is the version of the cluster you had when deploying the new nodes. Did you start a fresh installation of the cluster using 190901? I was affected because I deployed the nodes when using version 4.7.0-0.okd-2021-06-13-090745 and then later updated the cluster.
Yes, this was a fresh install of 190901. That's what makes it weird: it doesn't apply the correct configuration even though the fix is already in that version.
I can confirm the issue. I just deployed a new worker node and it still doesn't add systemd.unified_cgroup_hierarchy=0 to the kernel arguments. I made it work using rpm-ostree kargs --append=systemd.unified_cgroup_hierarchy=0.
Please attach (or upload to a public file sharing service) a must-gather archive for that cluster.
Do you want the must-gather even though it's already patched, or do you want me to try installing from scratch and send the must-gather?
Hi, after deleting old configs, everything works fine.
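A hedged sketch of that cleanup; the config name below is hypothetical, since which old MachineConfigs are stale depends on the cluster's history:

```bash
# List MachineConfigs and look for stale custom ones left over from earlier installs
oc get machineconfig

# Hypothetical example name - delete the stale config so the MCO re-renders the pool
oc delete machineconfig 99-worker-old-cgroup-config

# Watch the machine config pools roll out the newly rendered configuration
oc get machineconfigpool -w
```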
Stumbled across this when troubleshooting a failing BuildConfig on our cluster; the error message in the build pod was as posted in the original description in this comment:
Our cluster was also installed as v4.5 a while ago and has been updated to 4.7 since then (bare metal). One of the compute nodes was replaced/reinstalled recently using a current CoreOS image, and thus apparently was using cgroups v2; the kernel parameter was not present in the config of the fresh installation (it was present on the older/original compute nodes, hence the BuildConfig only failed when the build was run on that specific node, but not on others). I can confirm the observation and suggested resolution posted by @creativie in #710 (comment): we also had old machine configs present, and after deleting them the problem went away.
Describe the bug
I deployed new nodes on an OKD cluster with user-provisioned infrastructure using libvirt KVM virtual machines.
Cluster was at version 4.7.0-0.okd-2021-06-13-090745 when new nodes were added.
On the new nodes, we could see pods failing. For instance, BuildConfig pods were failing with the error:
After investigation, I noticed that the new nodes were missing the systemd.unified_cgroup_hierarchy=0 kernel boot parameter, which exposes /sys/fs/cgroup/memory/.
rpm-ostree status on the new nodes:
As a workaround, I deployed the following MachineConfig:
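A minimal sketch of such a MachineConfig, assuming the worker role and an illustrative name (99-worker-cgroupsv1-kargs), not necessarily the exact object deployed:

```bash
# Add the kernel argument for the worker pool via a MachineConfig
cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-cgroupsv1-kargs
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  kernelArguments:
    - systemd.unified_cgroup_hierarchy=0
EOF
```

The Machine Config Operator then renders a new worker config and reboots the nodes with the argument in place, so newly added nodes pick it up as well.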
Version
4.7.0-0.okd-2021-06-13-090745 when the new nodes were added.
User-provisioned infrastructure using libvirt KVM virtual machines.
Cluster later updated to 4.7.0-0.okd-2021-06-19-191547 (which didn't fix the issue).
How reproducible
The same issue happened on the 5 nodes that were added.
Log bundle
I can't attach the link here for confidentiality reasons. I'll happily send the link by email.