
TimeoutStartSec in podman generate systemd #11618

Closed
w4tsn opened this issue Sep 16, 2021 · 11 comments · Fixed by #12380
@w4tsn
Contributor

w4tsn commented Sep 16, 2021

/kind feature

Description

I noticed that podman generate systemd --new does not add a TimeoutStartSec= anymore (v3.3.1), leaving it up to the system's defaults configured in e.g. /etc/systemd/system.conf.

The default on Fedora Linux seems to be the systemd default of 90s. On low-performance devices, with slow storage (SD cards), or in bad network conditions, the startup of a container unit can take much longer than 90s when images first have to be pulled, especially when starting 4 to 6 containers at boot.

On a Raspberry Pi 4 quite overloaded with 6 containers and a boot load5 of 10 (it's usually around 2 to 3 once things have settled), it's impossible to get the containers started because systemd kills them off after 90s while they are still pulling their images.

What do you think about this? What is the best practice or experience here?

Should the command include a larger TimeoutStartSec= as it did in the past? Should it be added via an optional flag, or left to the user to set manually (with maybe a hint/notice in the docs) when they know they are targeting e.g. Fedora ARM or IoT on a Raspberry Pi? Or should we aim at setting different defaults on those platforms?
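
For reference, until such a flag exists the timeout can be raised per unit with a standard systemd drop-in (the unit name and value here are only examples):

```ini
# created with: systemctl edit container-myapp.service
# lands in /etc/systemd/system/container-myapp.service.d/override.conf
[Service]
TimeoutStartSec=900
```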

podman version

Version:      3.3.1
API Version:  3.3.1
Go Version:   go1.16.6
Built:        Mon Aug 30 22:45:47 2021
OS/Arch:      linux/arm64

podman system info

host:
  arch: arm64
  buildahVersion: 1.22.3
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.0.29-2.fc34.aarch64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.29, commit: '
  cpus: 4
  distribution:
    distribution: fedora
    version: "34"
  eventLogger: journald
  hostname: notourserver
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 5.13.14-200.fc34.aarch64
  linkmode: dynamic
  memFree: 705859584
  memTotal: 3994357760
  ociRuntime:
    name: crun
    package: crun-1.0-1.fc34.aarch64
    path: /usr/bin/crun
    version: |-
      crun version 1.0
      commit: 139dc6971e2f1d931af520188763e984d6cdfbf8
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  remoteSocket:
    exists: true
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.1.12-2.fc34.aarch64
    version: |-
      slirp4netns version 1.1.12
      commit: 7a104a101aa3278a2152351a082a6df71f57c9a3
      libslirp: 4.4.0
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.0
  swapFree: 3994021888
  swapTotal: 3994021888
  uptime: 52m 5.32s
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 4
    paused: 0
    running: 4
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /var/lib/containers/storage
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 4
  runRoot: /run/containers/storage
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 3.3.1
  Built: 1630356347
  BuiltTime: Mon Aug 30 22:45:47 2021
  GitCommit: ""
  GoVersion: go1.16.6
  OsArch: linux/arm64
  Version: 3.3.1
@rhatdan
Member

rhatdan commented Sep 16, 2021

@vrothberg PTAL

@Luap99
Member

Luap99 commented Sep 17, 2021

The units use sdnotify now. Maybe podman should send EXTEND_TIMEOUT_USEC= when pulling images.
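
For context, this is part of systemd's sd_notify(3) protocol: a Type=notify service can repeatedly push its start-up deadline out by sending EXTEND_TIMEOUT_USEC= over the NOTIFY_SOCKET datagram socket before the current deadline expires. A minimal Go sketch of the idea (not podman's actual code; the pull loop is hypothetical):

```go
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

// extendStartTimeout asks systemd for d more start-up time. It must be
// sent before the current timeout (or a previous extension) expires.
func extendStartTimeout(d time.Duration) error {
	sock := os.Getenv("NOTIFY_SOCKET")
	if sock == "" {
		return fmt.Errorf("NOTIFY_SOCKET not set; not running under systemd")
	}
	conn, err := net.DialUnix("unixgram", nil,
		&net.UnixAddr{Name: sock, Net: "unixgram"})
	if err != nil {
		return err
	}
	defer conn.Close()
	_, err = fmt.Fprintf(conn, "EXTEND_TIMEOUT_USEC=%d", d.Microseconds())
	return err
}

func main() {
	// Hypothetical pull loop: keep extending the deadline in 60s steps
	// while an image pull is in flight, re-sending well before expiry.
	for i := 0; i < 3; i++ { // stand-in for "while pull not finished"
		if err := extendStartTimeout(60 * time.Second); err != nil {
			fmt.Fprintln(os.Stderr, "extend timeout:", err)
		}
		time.Sleep(30 * time.Second)
	}
}
```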

@w4tsn
Contributor Author

w4tsn commented Sep 17, 2021

@Luap99 I think that would be a great solution! The longer startup period is usually only required when pulling images. Can EXTEND_TIMEOUT_USEC= be sent multiple times / updated regularly? That way podman could dynamically extend the required time until the image is downloaded, which would be a more adaptive approach than setting a fixed value here.

I fiddled around with this a bit more yesterday and found that in my case downloading a 1.2G (uncompressed) homeassistant image took maybe 20 to 30 minutes. The download itself was rather fast, but the processing until the manifest was written to the image destination and the pull command completed took the vast majority of the time.

I did some I/O monitoring and tests and found that the industrial SD card used in the Pi 4 has read/write speeds of roughly 20-40 MB/s. sysstat showed that CPU utilization was at around 0% while load15 was 10-15 and iowait was around 80-90%. I'm not entirely sure, but I assume the pull command puts a lot of I/O pressure on the device / SD card, and with the card being rather slow the system almost halts on I/O wait.
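
For anyone reproducing these numbers: the sysstat tools show this directly, e.g. watching the avg-cpu %iowait column and per-device throughput while a pull runs in another shell:

```console
$ iostat -x 2        # extended device stats every 2s; watch %iowait
$ podman pull <image>
```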

I have two thoughts on this:

  • I'm under the impression that Raspbian and docker perform way better on the same hardware. I've been having such issues on Raspberry Pi + SD card for a long time now.
  • Maybe SD card + Raspberry Pi is just not a good fit for podman containers :/

@rhatdan
Member

rhatdan commented Sep 17, 2021

Whoops, wrong link.

@vrothberg
Member

@w4tsn can you elaborate on why you think Docker performs better on your pi?

Personally, I think that we could make the TimeoutStartSec configurable for users. Pulling images for 20-30 minutes at boot sounds concerning.

Using EXTEND_TIMEOUT_USEC= sounds tempting, but it may break existing users who rely on the units/services being nuked.

@w4tsn
Contributor Author

w4tsn commented Oct 19, 2021

@vrothberg hmm. Actually it's more a hunch than a measured fact when I think about it. I've been working with Raspberry Pis using industrial SD cards for a while now, and around 2-3 years ago we switched from Raspberry Pi OS to Fedora IoT and podman. Since then I've always experienced that the commit step when retrieving images takes a really long time. Now that I've monitored iowait during podman pull, I know for a fact that it's pretty bad with those SD cards, but I didn't double-check by installing Raspberry Pi OS on our current Pis and SD cards (they might have changed over the years) and doing docker pull operations. I'll do this to verify that there is indeed a difference between docker and podman.

I've tested with Raspberry Pi OS on the same hardware (Pi + SD card) and I'm seeing that iowait is around 50-75% while the system load is at 5, and it takes around 10-15 minutes to download the homeassistant image. That's also quite a lot, and I think we might have switched to more rugged but really slow SD cards in the past. Nevertheless, podman seems to take a bit longer and puts higher iowait on the system.

Apart from that we don't observe this problem on the CM4 with eMMC, which is significantly faster than the SD cards.

While I still think it's useful to be able to control these timeout settings, using proper storage with reasonable read/write performance reduces the problem significantly.

On a side note: with quadlet this should already be configurable.
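
For example, a quadlet .container file passes a [Service] section straight through to the generated unit, so something like the following should work (file name, image, and value are illustrative):

```ini
# /etc/containers/systemd/myapp.container
[Container]
Image=docker.io/library/myapp:latest

[Service]
TimeoutStartSec=900
```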

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Nov 19, 2021

@vrothberg @w4tsn What is up with this issue?

@vrothberg
Member

I'm thinking of adding a --start-timeout flag to podman generate systemd. What do you think?

@w4tsn
Contributor Author

w4tsn commented Nov 19, 2021

I think a --start-timeout option is reasonable as it also documents that this could be an issue / something you might want to configure.

@rhatdan
Member

rhatdan commented Nov 19, 2021

SGTM

@vrothberg vrothberg self-assigned this Nov 22, 2021
@vrothberg vrothberg added the In Progress This issue is actively being worked by the assignee, please do not work on this at this time. label Nov 22, 2021
vrothberg added a commit to vrothberg/libpod that referenced this issue Nov 22, 2021
Add a new flag to set the start timeout for a generated systemd unit.
To make naming consistent, add a new --stop-timeout flag as well and let
the previous --time map to it.

Fixes: containers#11618
Signed-off-by: Valentin Rothberg <rothberg@redhat.com>
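
With the flags described in this commit, a longer start timeout could then be requested at generation time, e.g. (values illustrative):

```console
$ podman generate systemd --new --name --start-timeout 300 myapp
```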
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 21, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 21, 2023