
TimeoutStartSec in podman generate systemd #11618

Closed
w4tsn opened this issue Sep 16, 2021 · 11 comments · Fixed by #12380
@w4tsn
Contributor

w4tsn commented Sep 16, 2021

/kind feature

Description

I noticed that podman generate systemd --new does not add a TimeoutStartSec= anymore (v3.3.1), leaving it up to the system's defaults configured in e.g. /etc/systemd/system.conf.

The default on Fedora Linux seems to be the systemd default of 90s. On low-performance devices, with slow storage (SD cards), or in bad network conditions, the startup of a container unit can take much longer than 90s when images first have to be pulled, especially when starting 4 to 6 containers at boot.

On a Raspberry Pi 4 quite overloaded with 6 containers and a boot load5 of 10 (it's usually around 2 to 3 once things have settled), it's impossible to get the containers started because systemd kills them off after 90s while they are still pulling their images.

What do you think about this? What is the best practice or experience here?

Should the command include a larger TimeoutStartSec= as it did in the past? Should it be added via an optional flag, or left to the user to set manually (with maybe a hint/notice in the docs) when they know they are targeting e.g. Fedora ARM or IoT on a Raspberry Pi? Or should we aim at setting different defaults on those platforms?
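
For reference, until such a flag exists the timeout can be raised per unit with a standard systemd drop-in (the unit name and value here are only examples):

```ini
# created with: systemctl edit container-myapp.service
# lands in /etc/systemd/system/container-myapp.service.d/override.conf
[Service]
TimeoutStartSec=900
```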

podman version

Version:      3.3.1
API Version:  3.3.1
Go Version:   go1.16.6
Built:        Mon Aug 30 22:45:47 2021
OS/Arch:      linux/arm64

podman system info

host:
  arch: arm64
  buildahVersion: 1.22.3
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.0.29-2.fc34.aarch64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.29, commit: '
  cpus: 4
  distribution:
    distribution: fedora
    version: "34"
  eventLogger: journald
  hostname: notourserver
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 5.13.14-200.fc34.aarch64
  linkmode: dynamic
  memFree: 705859584
  memTotal: 3994357760
  ociRuntime:
    name: crun
    package: crun-1.0-1.fc34.aarch64
    path: /usr/bin/crun
    version: |-
      crun version 1.0
      commit: 139dc6971e2f1d931af520188763e984d6cdfbf8
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  remoteSocket:
    exists: true
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.1.12-2.fc34.aarch64
    version: |-
      slirp4netns version 1.1.12
      commit: 7a104a101aa3278a2152351a082a6df71f57c9a3
      libslirp: 4.4.0
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.0
  swapFree: 3994021888
  swapTotal: 3994021888
  uptime: 52m 5.32s
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 4
    paused: 0
    running: 4
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /var/lib/containers/storage
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 4
  runRoot: /run/containers/storage
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 3.3.1
  Built: 1630356347
  BuiltTime: Mon Aug 30 22:45:47 2021
  GitCommit: ""
  GoVersion: go1.16.6
  OsArch: linux/arm64
  Version: 3.3.1
@rhatdan
Member

rhatdan commented Sep 16, 2021

@vrothberg PTAL

@Luap99
Member

Luap99 commented Sep 17, 2021

The units use sdnotify now. Maybe podman should send EXTEND_TIMEOUT_USEC= when pulling images.
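
For context, this is part of systemd's sd_notify(3) protocol: a Type=notify service can repeatedly push its start-up deadline out by sending EXTEND_TIMEOUT_USEC= over the NOTIFY_SOCKET datagram socket before the current deadline expires. A minimal Go sketch of the idea (not podman's actual code; the pull loop is hypothetical):

```go
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

// extendStartTimeout asks systemd for d more start-up time. It must be
// sent before the current timeout (or a previous extension) expires.
func extendStartTimeout(d time.Duration) error {
	sock := os.Getenv("NOTIFY_SOCKET")
	if sock == "" {
		return fmt.Errorf("NOTIFY_SOCKET not set; not running under systemd")
	}
	conn, err := net.DialUnix("unixgram", nil,
		&net.UnixAddr{Name: sock, Net: "unixgram"})
	if err != nil {
		return err
	}
	defer conn.Close()
	_, err = fmt.Fprintf(conn, "EXTEND_TIMEOUT_USEC=%d", d.Microseconds())
	return err
}

func main() {
	// Hypothetical pull loop: keep extending the deadline in 60s steps
	// while an image pull is in flight, re-sending well before expiry.
	for i := 0; i < 3; i++ { // stand-in for "while pull not finished"
		if err := extendStartTimeout(60 * time.Second); err != nil {
			fmt.Fprintln(os.Stderr, "extend timeout:", err)
		}
		time.Sleep(30 * time.Second)
	}
}
```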

@w4tsn
Contributor Author

w4tsn commented Sep 17, 2021

@Luap99 I think that would be a great solution! The longer startup period is usually only required when pulling images. Can EXTEND_TIMEOUT_USEC= be sent multiple times / updated regularly? That way podman could dynamically extend the required time until the image is downloaded, which would be a more adaptive approach than setting a fixed value here.

I fiddled around with this a bit more yesterday and found that in my case downloading a 1.2G (uncompressed) homeassistant image took maybe 20 to 30 minutes. The download itself was rather fast, but the processing until the manifest was written to the image destination and the pull command completed took the vast majority of the time.

I did some I/O monitoring and tests and found that the industrial SD card used in the Pi 4 has read/write speeds of roughly 20-40 MB/s. sysstat showed that CPU utilization was at around 0% while load15 was 10-15 and iowait was around 80-90%. I'm not entirely sure, but I assume the pull command puts a lot of I/O pressure on the device / SD card, and with the card being rather slow the system almost halts on I/O wait.
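
For anyone reproducing these numbers: the sysstat tools show this directly, e.g. watching the avg-cpu %iowait column and per-device throughput while a pull runs in another shell:

```console
$ iostat -x 2        # extended device stats every 2s; watch %iowait
$ podman pull <image>
```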

I have two thoughts on this:

  • I'm under the impression that Raspbian and docker perform way better on the same hardware. I've been having such issues on Raspberry Pi + SD card for a long time now.
  • Maybe SD card + Raspberry Pi is just not a good fit for podman containers :/

@rhatdan
Member

rhatdan commented Sep 17, 2021

Whoops, wrong link.

@vrothberg
Member

@w4tsn can you elaborate on why you think Docker performs better on your pi?

Personally, I think that we could make the TimeoutStartSec configurable for users. Pulling images for 20-30 minutes at boot sounds concerning.

Using EXTEND_TIMEOUT_USEC= sounds tempting, but it may break existing users who rely on the units/services being nuked.

@w4tsn
Contributor Author

w4tsn commented Oct 19, 2021

@vrothberg hmm. Actually it's more a hunch than a measured fact when I think about it. I've been working with Raspberry Pis using industrial SD cards for a while now, and around 2-3 years ago we switched from Raspberry Pi OS to Fedora IoT and podman. Since then I've always experienced that the commit step when retrieving images takes a really long time. Now that I've monitored iowait during podman pull, I know for a fact that it's pretty bad with those SD cards, but I didn't double-check by installing Raspberry Pi OS on our current Pis and SD cards (they might have changed over the years) and doing docker pull operations. I'll do this to verify that there is indeed a difference between docker and podman.

I've tested with Raspberry Pi OS on the same hardware (Pi + SD card) and I'm seeing that iowait is around 50-75% while the system load is at 5, and it takes around 10-15 minutes to download the homeassistant image. That's also quite a lot, and I think we might have switched to more rugged but really slow SD cards in the past. Nevertheless, podman seems to take a bit longer and puts higher iowait on the system.

Apart from that we don't observe this problem on the CM4 with eMMC, which is significantly faster than the SD cards.

While I still think it's useful to be able to control these timeout settings, using proper storage with reasonable read/write performance reduces the problem significantly.

On a side note: with quadlet this should already be configurable.
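
For example, a quadlet .container file passes a [Service] section straight through to the generated unit, so something like the following should work (file name, image, and value are illustrative):

```ini
# /etc/containers/systemd/myapp.container
[Container]
Image=docker.io/library/myapp:latest

[Service]
TimeoutStartSec=900
```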

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Nov 19, 2021

@vrothberg @w4tsn What is up with this issue?

@vrothberg
Member

I'm thinking of adding a --start-timeout flag to podman generate systemd. What do you think?

@w4tsn
Contributor Author

w4tsn commented Nov 19, 2021

I think a --start-timeout option is reasonable as it also documents that this could be an issue / something you might want to configure.

@rhatdan
Member

rhatdan commented Nov 19, 2021

SGTM

@vrothberg vrothberg self-assigned this Nov 22, 2021
@vrothberg vrothberg added the In Progress This issue is actively being worked by the assignee, please do not work on this at this time. label Nov 22, 2021
vrothberg added a commit to vrothberg/libpod that referenced this issue Nov 22, 2021
Add a new flag to set the start timeout for a generated systemd unit.
To make naming consistent, add a new --stop-timeout flag as well and let
the previous --time map to it.

Fixes: containers#11618
Signed-off-by: Valentin Rothberg <rothberg@redhat.com>
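
With the flags described in this commit, a longer start timeout could then be requested at generation time, e.g. (values illustrative):

```console
$ podman generate systemd --new --name --start-timeout 300 myapp
```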
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 21, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 21, 2023