Include systemd-logind for inhibit functionality #3308

Merged
merged 1 commit into from
Aug 7, 2023


stmcginnis
Contributor

Issue number:

Closes #3305

Description of changes:

Settings were previously added so that kubelet could be configured with graceful shutdown periods for pods. This feature lets kubelet hold off a system shutdown until it has had a chance to terminate and migrate pod workloads cleanly.

Unfortunately, when those settings were added, we missed that kubelet relies internally on systemd's inhibit mechanism, which is part of systemd-logind. Without systemd-logind in Bottlerocket, the attempt to start processes with inhibit protection fails and shutdown calls take effect immediately.

This adds systemd-logind to the systemd build to enable this functionality, excluding things like PAM support and the logind CLI, since our inhibit requirements do not need them.
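
For readers unfamiliar with inhibitors, the same kind of shutdown delay lock can be taken by hand with the `systemd-inhibit` CLI. This is only a sketch of the mechanism, not how kubelet takes its own lock; the function name and its arguments are hypothetical.

```shell
# Run `sleep` under a shutdown delay inhibitor.
# $1 = name shown in the WHO column, $2 = reason shown in the WHY column,
# $3 = seconds to hold the lock (hypothetical parameters for illustration).
hold_shutdown_delay() {
    systemd-inhibit --what=shutdown --mode=delay \
        --who="$1" --why="$2" \
        sleep "$3"
}

# Example (requires a running systemd-logind):
#   hold_shutdown_delay demo "flush state before shutdown" 30 &
#   systemd-inhibit --list
```

While the wrapped command runs, the lock should appear in `systemd-inhibit --list`; it is released when the command exits.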

Testing done:

Checked initial inhibit list:

bash-5.1# systemd-inhibit --list
No inhibitors.

Configured Kubernetes graceful shutdown periods:

bash-5.1# apiclient apply <<EOF
> [settings.kubernetes]
> shutdown-grace-period = "15m"          
> shutdown-grace-period-for-critical-pods = "10m"
> EOF

Checked the kubelet service status and logs to verify that the node shutdown manager picked up the settings without errors:

Jul 31 17:58:22 ip-192-168-93-85.us-east-2.compute.internal kubelet[1179]: I0731 17:58:22.212698    1179 nodeshutdown_manager_linux.go:137] "Creating node shutdown manager" shutdownGracePeriodRequested="15m0s" shutdownGracePeriodCriticalPods="10m0s" shutdownGracePeriodByPodPriority=[{Priority:0 ShutdownGracePeriodSeconds:300} {Priority:2000000000 ShutdownGracePeriodSeconds:600}]
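
The per-priority values in that log line follow from the two settings: the critical-pod period is carved out of the total, so regular pods get the remainder. A minimal sketch of the arithmetic, assuming durations are written as whole minutes with an `m` suffix (the helper names are hypothetical, not kubelet's):

```shell
# Convert a duration such as "15m" to seconds.
# Assumption: only whole minutes with an "m" suffix are handled here.
minutes_to_seconds() {
    echo $(( ${1%m} * 60 ))
}

# Print the grace period for regular pods and for critical pods, in seconds.
# $1 = shutdown-grace-period, $2 = shutdown-grace-period-for-critical-pods
split_grace_period() {
    total=$(minutes_to_seconds "$1")
    critical=$(minutes_to_seconds "$2")
    echo "regular=$(( total - critical )) critical=$critical"
}

split_grace_period 15m 10m   # prints: regular=300 critical=600
```

These match the `ShutdownGracePeriodSeconds:300` and `ShutdownGracePeriodSeconds:600` entries logged above.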

Listed inhibitors again and verified that kubelet shows up:

bash-5.1# systemd-inhibit --list                                                                                                             
WHO     UID USER PID  COMM    WHAT     WHY                                        MODE
kubelet 0   root 4356 kubelet shutdown Kubelet needs time to handle node shutdown delay

1 inhibitors listed.
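
The same check can be scripted for repeated test runs. A small sketch, assuming the column layout shown above (WHO first, WHAT sixth), which only holds while the COMM value contains no spaces; the function name is hypothetical:

```shell
# Succeed if the inhibitor table on stdin has a row whose WHO column is $1
# and whose WHAT column includes "shutdown".
has_shutdown_inhibitor() {
    awk -v who="$1" '
        $1 == who && $6 ~ /shutdown/ { found = 1 }
        END { exit !found }
    '
}

# Example (on a node with systemd-logind running):
#   systemd-inhibit --list | has_shutdown_inhibitor kubelet && echo "kubelet inhibitor present"
```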

Rebooted one of the nodes and made sure there were no errors reported.

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

Contributor

@bcressey bcressey left a comment


Can you diff the features and options output for this? The build log will dump out a config like this:

#20 14.08 systemd 250  
#20 14.08
#20 14.08     build mode                     : release  
#20 14.08     split /usr                     : False  
#20 14.08     split bin-sbin                 : True 
...

I'd also like to see a comparison of the package contents (rpm -qlp build/rpms/...) before and after to understand all the new units that are coming in.
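
One way to produce the requested features comparison is to strip the step prefix from the build log and compare the `enabled` rows. This is a sketch under the assumption that every log line carries a `#<step> <seconds>` prefix like the sample above; the function names are hypothetical.

```shell
# Remove the "#<step> <seconds> " prefix that the build log puts on each line.
strip_step_prefix() {
    sed -E 's/^#[0-9]+ [0-9]+\.[0-9]+ ?//'
}

# Print the features on the "enabled" row of a prefix-free summary file,
# one feature per line.
enabled_features() {
    sed -n 's/^ *enabled *: *//p' "$1" | tr ',' '\n' | sed 's/^ *//; s/ *$//'
}

# Print features enabled in summary file $2 but not in summary file $1.
newly_enabled() {
    old_list=$(mktemp)
    new_list=$(mktemp)
    enabled_features "$1" | sort > "$old_list"
    enabled_features "$2" | sort > "$new_list"
    comm -13 "$old_list" "$new_list"
    rm -f "$old_list" "$new_list"
}

# Example (file names are placeholders):
#   strip_step_prefix < before.log > before.txt
#   strip_step_prefix < after.log  > after.txt
#   newly_enabled before.txt after.txt
```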

@stmcginnis
Contributor Author

stmcginnis commented Jul 31, 2023

Still collecting data, but here is the RPM difference after this change:

$ diff bottlerocket-systemd-250.11-1.output bottlerocket-systemd-250.11-1-updated.output 
> /x86_64-bottlerocket-linux-gnu/sys-root/usr/bin/systemd-inhibit
> /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/dbus-org.freedesktop.login1.service
> /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/multi-user.target.wants/systemd-logind.service
> /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/prepare-var-lib-systemd-linger.service
> /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/systemd-logind.service
> /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/systemd-logind
> /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/systemd-user-runtime-dir
> /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/udev/rules.d/70-power-switch.rules
> /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/udev/rules.d/70-uaccess.rules
> /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/udev/rules.d/71-seat.rules
> /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/udev/rules.d/73-seat-late.rules
> /x86_64-bottlerocket-linux-gnu/sys-root/usr/share/dbus-1/system-services/org.freedesktop.login1.service
> /x86_64-bottlerocket-linux-gnu/sys-root/usr/share/dbus-1/system.d/org.freedesktop.login1.conf

@stmcginnis
Contributor Author

Diff of features and options section:

$ diff trimmed-systemd-current.output trimmed-systemd-updated.output 
1d0
< 
64,65c63,64
<     enabled                        : ACL, SECCOMP, SELinux, blkid, libfdisk, efi, networkd, pstore, randomseed, repart, systemd-analyze, sysusers, tmpfiles, kmod, ldconfig, gshadow, link-udev-shared, link-systemctl-shared
<     disabled                       : AUDIT, AppArmor, IMA, PAM, SMACK, elfutils, gcrypt, gnutls, libbpf, libcryptsetup, libcryptsetup-plugins, libcurl, libfido2, libidn, libidn2, libiptc, microhttpd, openssl, p11kit, pcre2, pwquality, qrencode, tpm2, xkbcommon, zstd, lz4, xz, zlib, bzip2, backlight, binfmt, bpf-framework, coredump, environment.d, gnu-efi, firstboot, hibernate, homed, hostnamed, hwdb, importd, initrd, kernel-install, localed, logind, machined, nss-myhostname, nss-mymachines, nss-resolve, nss-systemd, oomd, portabled, quotacheck, resolve, rfkill, sysext, timedated, timesyncd, userdb, vconsole, xdg-autostart, idn, polkit, nscd, legacy-pkla, dbus, glib, tpm, man pages, html pages, man page indices, SysV compat, compat-mutable-uid-boundaries, utmp, adm group, wheel group, debug hashmap, debug mmap cache, debug siphash, valgrind, trace logging, install tests, link-networkd-shared, link-timesyncd-shared, link-boot-shared, fexecve, standalone-binaries, static-libsystemd, static-libudev, cryptolib, DNS-over-TLS
---
>     enabled                        : ACL, SECCOMP, SELinux, blkid, libfdisk, efi, logind, networkd, pstore, randomseed, repart, systemd-analyze, sysusers, tmpfiles, kmod, ldconfig, gshadow, link-udev-shared, link-systemctl-shared
>     disabled                       : AUDIT, AppArmor, IMA, PAM, SMACK, elfutils, gcrypt, gnutls, libbpf, libcryptsetup, libcryptsetup-plugins, libcurl, libfido2, libidn, libidn2, libiptc, microhttpd, openssl, p11kit, pcre2, pwquality, qrencode, tpm2, xkbcommon, zstd, lz4, xz, zlib, bzip2, backlight, binfmt, bpf-framework, coredump, environment.d, gnu-efi, firstboot, hibernate, homed, hostnamed, hwdb, importd, initrd, kernel-install, localed, machined, nss-myhostname, nss-mymachines, nss-resolve, nss-systemd, oomd, portabled, quotacheck, resolve, rfkill, sysext, timedated, timesyncd, userdb, vconsole, xdg-autostart, idn, polkit, nscd, legacy-pkla, dbus, glib, tpm, man pages, html pages, man page indices, SysV compat, compat-mutable-uid-boundaries, utmp, adm group, wheel group, debug hashmap, debug mmap cache, debug siphash, valgrind, trace logging, install tests, link-networkd-shared, link-timesyncd-shared, link-boot-shared, fexecve, standalone-binaries, static-libsystemd, static-libudev, cryptolib, DNS-over-TLS
142c141
<     logind                         : false
---
>     logind                         : true

@stmcginnis
Contributor Author

Updated to address comments. After some trial and error and testing, here is the new RPM file diff:

> /x86_64-bottlerocket-linux-gnu/sys-root/usr/bin/systemd-inhibit
> /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/dbus-org.freedesktop.login1.service
> /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/multi-user.target.wants/systemd-logind.service
> /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/prepare-var-lib-systemd-linger.service
> /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/systemd-logind.service
> /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/systemd-logind
> /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/udev/rules.d/70-power-switch.rules
> /x86_64-bottlerocket-linux-gnu/sys-root/usr/share/dbus-1/system.d/org.freedesktop.login1.conf

Same testing as mentioned in original description.

Just for more data, here's what the systemd-logind process looks like:

bash-5.1# systemd-inhibit --list
WHO     UID USER PID  COMM    WHAT     WHY                                        MODE
kubelet 0   root 4935 kubelet shutdown Kubelet needs time to handle node shutdown delay

1 inhibitors listed.
bash-5.1# cat /proc/1185/status                                                                                                                 
Name:   systemd-logind
Umask:  0022
State:  S (sleeping)
Tgid:   1185
Ngid:   0
Pid:    1185
PPid:   1
TracerPid:      0
Uid:    0       0       0       0
Gid:    0       0       0       0
FDSize: 128
Groups:  
NStgid: 1185
NSpid:  1185
NSpgid: 1185
NSsid:  1185
VmPeak:     9660 kB
VmSize:     9628 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:      5836 kB
VmRSS:      5836 kB
RssAnon:             608 kB
RssFile:            5228 kB
RssShmem:              0 kB
VmData:      740 kB
VmStk:       132 kB
VmExe:       144 kB
VmLib:      5424 kB
VmPTE:        60 kB
VmSwap:        0 kB
HugetlbPages:          0 kB
CoreDumping:    0
THP_enabled:    1
Threads:        1
SigQ:   0/30501
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000200014003
SigIgn: 0000000400001000
SigCgt: 0000000000000000
CapInh: 0000000000000000
CapPrm: 000000024420020f
CapEff: 000000024420020f
CapBnd: 000000024420020f
CapAmb: 0000000000000000
NoNewPrivs:     1
Seccomp:        2
Seccomp_filters:        29
Speculation_Store_Bypass:       vulnerable
SpeculationIndirectBranch:      always enabled
Cpus_allowed:   3
Cpus_allowed_list:      0-1
Mems_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Mems_allowed_list:      0
voluntary_ctxt_switches:        683
nonvoluntary_ctxt_switches:     67
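
The `CapEff` mask above is easier to review once decoded into capability names. A minimal sketch, using the bit numbering from `linux/capability.h`; the function name is hypothetical, and `capsh --decode` gives the same information on hosts where libcap's tools are installed.

```shell
# Capability names in bit order (bit 0 first), per linux/capability.h.
CAP_NAMES="CHOWN DAC_OVERRIDE DAC_READ_SEARCH FOWNER FSETID KILL SETGID SETUID
SETPCAP LINUX_IMMUTABLE NET_BIND_SERVICE NET_BROADCAST NET_ADMIN NET_RAW
IPC_LOCK IPC_OWNER SYS_MODULE SYS_RAWIO SYS_CHROOT SYS_PTRACE SYS_PACCT
SYS_ADMIN SYS_BOOT SYS_NICE SYS_RESOURCE SYS_TIME SYS_TTY_CONFIG MKNOD LEASE
AUDIT_WRITE AUDIT_CONTROL SETFCAP MAC_OVERRIDE MAC_ADMIN SYSLOG WAKE_ALARM
BLOCK_SUSPEND AUDIT_READ PERFMON BPF CHECKPOINT_RESTORE"

# Print one CAP_* name per set bit in the hex mask $1 (no 0x prefix),
# as found in the CapEff/CapPrm/CapBnd fields of /proc/<pid>/status.
decode_caps() {
    mask=$(( 0x$1 ))
    bit=0
    for name in $CAP_NAMES; do
        if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
            echo "CAP_$name"
        fi
        bit=$(( bit + 1 ))
    done
}

decode_caps 000000024420020f
```

For the mask shown, this lists CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_DAC_READ_SEARCH, CAP_FOWNER, CAP_LINUX_IMMUTABLE, CAP_SYS_ADMIN, CAP_SYS_TTY_CONFIG, CAP_AUDIT_CONTROL, and CAP_MAC_ADMIN.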

@stmcginnis
Contributor Author

stmcginnis commented Aug 3, 2023

Rebased on the latest develop to pick up the systemd update to 252.

I had previously tested with the combined changes, but ran another validation test with the newest build:

bash-5.1# systemd-inhibit --list
No inhibitors.
bash-5.1# apiclient apply <<EOF
> [settings.kubernetes]
> shutdown-grace-period = "15m"          
> shutdown-grace-period-for-critical-pods = "10m"
> EOF

bash-5.1# systemd-inhibit --list                                                                                                                       
WHO     UID USER PID  COMM    WHAT     WHY                                        MODE
kubelet 0   root 3954 kubelet shutdown Kubelet needs time to handle node shutdown delay

1 inhibitors listed.
bash-5.1# journalctl -u kubelet | grep shutdown
Aug 03 20:54:51 ip-192-168-63-177.us-east-2.compute.internal kubelet[3954]: I0803 20:54:51.726760    3954 nodeshutdown_manager_linux.go:137] "Creating node shutdown manager" shutdownGracePeriodRequested="15m0s" shutdownGracePeriodCriticalPods="10m0s" shutdownGracePeriodByPodPriority=[{Priority:0 ShutdownGracePeriodSeconds:300} {Priority:2000000000 ShutdownGracePeriodSeconds:600}]

@arnaldo2792
Contributor

Do you mind sharing another diff of the rpm after the rebase? (diff -u will be more readable)

@stmcginnis
Contributor Author

Here is the diff with -u:

--- systemd-base.txt    2023-08-04 14:09:34.576651598 +0000
+++ systemd-inhibit.txt 2023-08-04 14:09:45.592607534 +0000
@@ -11,6 +11,7 @@
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/bin/systemd-dissect
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/bin/systemd-escape
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/bin/systemd-id128
+/x86_64-bottlerocket-linux-gnu/sys-root/usr/bin/systemd-inhibit
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/bin/systemd-machine-id-setup
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/bin/systemd-mount
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/bin/systemd-notify
@@ -75,6 +76,7 @@
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/bluetooth.target
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/boot-complete.target
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/ctrl-alt-del.target
+/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/dbus-org.freedesktop.login1.service
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/dev-hugepages.mount
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/dev-mqueue.mount
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/exit.target
@@ -96,6 +98,7 @@
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/modprobe@.service
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/multi-user.target.wants
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/multi-user.target.wants/getty.target
+/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/multi-user.target.wants/systemd-logind.service
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/network-online.target
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/network-pre.target
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/network.target
@@ -173,6 +176,7 @@
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/systemd-journald@.service
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/systemd-journald@.socket
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/systemd-kexec.service
+/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/systemd-logind.service
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/systemd-machine-id-commit.service
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/systemd-modules-load.service
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/systemd-network-generator.service
@@ -220,6 +224,7 @@
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/systemd-fsck
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/systemd-growfs
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/systemd-journald
+/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/systemd-logind
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/systemd-makefs
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/systemd-modules-load
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/systemd-network-generator
@@ -302,6 +307,7 @@
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/udev/rules.d/70-joystick.rules
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/udev/rules.d/70-memory.rules
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/udev/rules.d/70-mouse.rules
+/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/udev/rules.d/70-power-switch.rules
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/udev/rules.d/70-touchpad.rules
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/udev/rules.d/75-net-description.rules
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/udev/rules.d/75-probe_mtd.rules
@@ -323,6 +329,7 @@
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/share/dbus-1/system-services
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/share/dbus-1/system-services/org.freedesktop.systemd1.service
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/share/dbus-1/system.d
+/x86_64-bottlerocket-linux-gnu/sys-root/usr/share/dbus-1/system.d/org.freedesktop.login1.conf
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/share/dbus-1/system.d/org.freedesktop.systemd1.conf
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/share/factory
 /x86_64-bottlerocket-linux-gnu/sys-root/usr/share/factory/etc/issue
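
When only the additions matter, the two saved listings can be reduced to the new paths without the surrounding context lines. A sketch, assuming each listing is a plain text file with one path per line (for example, saved `rpm -qlp` output); the function name is hypothetical.

```shell
# Print every path listed in file $2 that is not listed in file $1.
# Matching is exact and whole-line, so the listings need not be sorted.
added_paths() {
    grep -F -x -v -f "$1" "$2"
}

# Example, using the file names from the diff header above:
#   added_paths systemd-base.txt systemd-inhibit.txt
```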

Signed-off-by: Sean McGinnis <stmcg@amazon.com>
Development

Successfully merging this pull request may close these issues.

Kubernetes graceful shutdown not working as expected
3 participants