PodGC too slowly #13212

Open
3 of 4 tasks
shuangkun opened this issue Jun 18, 2024 · 1 comment
Labels
area/gc Garbage collection, such as TTLs, retentionPolicy, delays, and more type/bug

Comments

@shuangkun
Member

shuangkun commented Jun 18, 2024

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

My workflow has 8000 pods. All pods in the workflow succeeded within two minutes, but it took 40 minutes for every pod to be labeled with workflows.argoproj.io/completed=true. I found that AddRateLimited was too slow at adding items to the queue.

Version

latest

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: parallelism-limit-1000-abcdefghijklmnfjladjflajdfla-
  labels:
    workflows.argoproj.io/archive-strategy: "always"
spec:
  entrypoint: parallelism-limit1
  parallelism: 2000
  hostNetwork: true
  templates:
  - name: parallelism-limit1
    steps:
    - - name: sleep
        template: sleep
        withSequence:
          start: "1"
          end: "8000"
    - - name: sleep2
        template: sleep2
  - name: sleep
    metadata:
      annotations:
        #k8s.aliyun.com/eci-spot-strategy: "SpotAsPriceGo"
        k8s.aliyun.com/eci-auto-imc: "true"
    container:
      imagePullPolicy: IfNotPresent
      image: docker/whalesay:latest
      command: [sh, -c, sleep 1]
      resources:
        requests:
          cpu: 0.25
          memory: 0.5Gi
  - name: sleep2
    metadata:
      annotations:
        #k8s.aliyun.com/eci-spot-strategy: "SpotAsPriceGo"
    container:
      imagePullPolicy: IfNotPresent
      image: docker/whalesay:latest
      command: [sh, -c, sleep 100000]
      resources:
        requests:
          cpu: 0.25
          memory: 0.5Gi

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

agilgur5 added the area/gc (Garbage collection, such as TTLs, retentionPolicy, delays, and more) label on Jun 18, 2024

@tooptoop4
Contributor

maybe #13690 will help
