fix delay on reboot or power off #859

bcressey · 2020-03-17T17:20:38Z

Issue number:
#858

Description of changes:
This reverts the KillMode=mixed change from 75125f6, and replaces it with lower timeouts for starting and stopping services.

Testing done:
Terminated an instance through the API.

New settings are applied:

# systemctl show
...
DefaultTimeoutStartUSec=10s
DefaultTimeoutStopUSec=10s

Relevant console output during shutdown from an instance with running pods and host containers:

[ 9655.638443] systemd-shutdown[1]: Syncing filesystems and block devices.
[ 9655.643105] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
[ 9655.649976] systemd-journald[1026]: Received SIGTERM from PID 1 (systemd-shutdow).
[ 9665.650617] systemd-shutdown[1]: Waiting for process: containerd-shim, containerd-shim, containerd-shim, containerd-shim, containerd-shim, containerd-shim
[ 9665.659807] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
[ 9665.665462] systemd-shutdown[1]: Sending SIGKILL to PID 3536 (containerd-shim).
[ 9665.671968] systemd-shutdown[1]: Sending SIGKILL to PID 3537 (containerd-shim).
[ 9665.678394] systemd-shutdown[1]: Sending SIGKILL to PID 4389 (containerd-shim).
[ 9665.684806] systemd-shutdown[1]: Sending SIGKILL to PID 4512 (containerd-shim).
[ 9665.691255] systemd-shutdown[1]: Sending SIGKILL to PID 4548 (containerd-shim).
[ 9665.697675] systemd-shutdown[1]: Sending SIGKILL to PID 4744 (containerd-shim).
[ 9665.705557] systemd-shutdown[1]: Unmounting file systems.
...
[ 9665.777562] systemd-shutdown[1]: Powering off.
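
As a quick sanity check on the console output above, the gap between the SIGTERM and SIGKILL broadcasts can be computed from the two timestamps. This is only an illustrative sketch, not part of the change: the helper names (`timestamp_of`, `sigterm_to_sigkill_delay`) are hypothetical, and it assumes the bracketed prefix is a timestamp in seconds, as is usual for kernel console output.

```python
import re

# Two lines copied verbatim from the console output above.
CONSOLE_LOG = """\
[ 9655.643105] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
[ 9665.659807] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
"""

# Matches the "[ seconds.micros] message" prefix used on the console.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def timestamp_of(log, needle):
    """Return the timestamp (in seconds) of the first line containing needle."""
    for line in log.splitlines():
        match = LINE_RE.match(line)
        if match and needle in match.group(2):
            return float(match.group(1))
    raise ValueError(f"no line containing {needle!r}")


def sigterm_to_sigkill_delay(log):
    """Seconds between the SIGTERM broadcast and the SIGKILL broadcast."""
    sigterm = timestamp_of(log, "Sending SIGTERM to remaining processes")
    sigkill = timestamp_of(log, "Sending SIGKILL to remaining processes")
    return sigkill - sigterm


if __name__ == "__main__":
    print(f"{sigterm_to_sigkill_delay(CONSOLE_LOG):.3f} s")
```

For this log the difference is 9665.659807 - 9655.643105, about 10 seconds, which lines up with the 10s timeouts shown in the `systemctl show` output rather than the previous 90 second default.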

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

samuelkarp · 2020-03-17T17:33:51Z

Are we confident that this isn't going to cause the problem that 75125f6 was supposed to fix? In particular, it sounds like shutdown would block when there were containers that needed to exit and the containerd-shim processes weren't receiving appropriate signals.

Can you test with running containers (both host-containers and orchestrated containers) and make sure that shutdown does not block?

bcressey · 2020-03-17T18:04:52Z

> In particular, it sounds like shutdown would block when there were containers that needed to exit and the containerd-shim processes weren't receiving appropriate signals.

Per testing output, all the containerd-shim processes are sent SIGKILL. The same happens with a dev build using Docker, if I have backgrounded Docker processes running at shutdown.

Mostly this results in the same behavior as before; with KillMode=mixed other processes are not sent the initial SIGTERM, only the follow-up SIGKILL. The advantage is that shim lifecycle is now decoupled from the containerd service, making it safer to restart it.

> Can you test with running containers (both host-containers and orchestrated containers) and make sure that shutdown does not block?

That's indeed what I tested.

The shutdown timeout (and revert) was the original fix. I added the global stop timeout after observing that the libcontainer tasks would also delay shutdown by 90 seconds.

samuelkarp · 2020-03-17T18:54:29Z

> Per testing output, all the containerd-shim processes are sent SIGKILL.

Thanks for verifying!

> That's indeed what I tested.

It wasn't called out in the "Testing done" description, so I wanted to make sure.

packages/systemd/9004-core-add-separate-timeout-for-system-shutdown.patch

If we are still running processes during shutdown, they are likely to be running in containers rather than managed by the host system. They may not expect or respond to SIGTERM, which can delay the restart for up to 90 seconds.

In the case of an update, we expect containers to be drained by the operator before the system is restarted. However, if the system is powered off directly then this may not happen.

With the network down, it is unlikely that processes can complete any useful work apart from syncing data to disk. A lower timeout means we will reboot or power off more quickly, which allows the node or its replacement to come up faster.

Signed-off-by: Ben Cressey <bcressey@amazon.com>

This reverts commit 75125f6.

The reason for changing KillMode was to make the system shut down more quickly. However, this is flawed in practice: although the containerd shims are killed more quickly, the processes running inside the containers are not, since they are no longer part of the unit's control group.

This configuration is also discouraged by upstream, as it means that the containerd service cannot be safely restarted.

The default timeout for start and stop is 90 seconds, and none of the current host services require this much time. If they did, we could override the setting locally in the service unit.

Many of the processes on the system end up running inside scopes that are dynamically created by the orchestrator agent. By changing the default timeout, we ensure that these processes are stopped quickly during shutdown or restart.

Signed-off-by: Ben Cressey <bcressey@amazon.com>

bcressey requested review from iliana, etungsten and tjkirch March 17, 2020 17:21

bcressey force-pushed the shutdown-fix branch from 97f948b to 52e291f Compare March 17, 2020 17:56

etungsten approved these changes Mar 17, 2020

samuelkarp reviewed Mar 17, 2020

samuelkarp approved these changes Mar 17, 2020

iliana approved these changes Mar 17, 2020

bcressey force-pushed the shutdown-fix branch from 52e291f to 6eca20f Compare March 18, 2020 18:23

bcressey added 3 commits March 18, 2020 20:38

bcressey force-pushed the shutdown-fix branch from 6eca20f to 4781e3a Compare March 18, 2020 20:53

bcressey merged commit e849cc9 into bottlerocket-os:develop Mar 18, 2020

bcressey deleted the shutdown-fix branch March 18, 2020 22:26
