ci: fix flaky image gc and mysql connect + add k3s debug log #13660

Merged
merged 6 commits into from
Sep 26, 2024

Conversation

MasonM
Contributor

@MasonM MasonM commented Sep 25, 2024

Motivation

A bunch of tests have been failing intermittently for a while now, which is blocking PRs. For example, the E2E Tests (test-executor, v1.28.13+k3s1, minimal, false) test fails >50% of the time with the error ErrImageNeverPull: Container image "quay.io/argoproj/argocli:latest" is not present with pull policy of Never. (example run). Many of these issues can't be diagnosed without access to the k3s logs.

Modifications

  1. Print the k3s logs using journalctl on a build failure. You can see k3s is being run with systemd from the output of the Install and start K3S step:
[INFO]  systemd: Creating service file /etc/systemd/system/k3s.service
[INFO]  systemd: Enabling k3s unit
Created symlink /etc/systemd/system/multi-user.target.wants/k3s.service → /etc/systemd/system/k3s.service.
[INFO]  systemd: Starting k3s
  2. Add a step to clean up disk space by deleting unused files (e.g. Haskell), since the k3s logs clearly indicate pods are being evicted due to DiskPressure
  3. Update make wait PROFILE=mysql to wait for mysql to be available, which should fix intermittent failures like persistence.go:29: test panicked: dial tcp [::1]:3306: connect: connection refused
  4. Add a Kubelet configuration file to tell k3s not to garbage collect images, since that's what's causing the ErrImageNeverPull errors

Verification

Will watch action output

The `E2E Tests (test-executor, v1.28.13+k3s1, minimal, false)` test has
been flaky for a while and keeps failing with the error
`ErrImageNeverPull: Container image "quay.io/argoproj/argocli:latest" is
not present with pull policy of Never`.

This shouldn't be happening because k3s should be using cri-dockerd as
the container runtime and the "Load images" step handles loading that
image into Docker. There were changes to cri-dockerd recently
(Mirantis/cri-dockerd#373) that might be
related, but it's impossible to tell without the logs.

Signed-off-by: Mason Malone <mmalone@adobe.com>
@agilgur5
Member

agilgur5 commented Sep 25, 2024

This is a follow-up to this Slack thread where Mason correlated this to #13600 being merged

@meln5674 mentioned this might be due to image GC in #13641 (comment). Seems like you saw that in the logs too per the thread:

2024-09-25T02:54:13.4531309Z Sep 25 02:47:55 fv-az1240-855 k3s[2120]: E0925 02:47:55.910731    2120 kubelet.go:1427] "Image garbage collection failed multiple times in a row" err="Failed to garbage collect required amount of images. Attempted to free 9859177676 bytes, but only found 4090859086 bytes eligible to free."

@agilgur5 agilgur5 added the area/build Build or GithubAction/CI issues label Sep 25, 2024
@MasonM MasonM force-pushed the ci-k3s-logs branch 2 times, most recently from 9d5b579 to e368b6b Compare September 25, 2024 03:55
Signed-off-by: Mason Malone <mmalone@adobe.com>
@MasonM MasonM changed the title ci: print k3s logs on failure ci: fix flaky tests and improve debugging Sep 25, 2024
@MasonM MasonM marked this pull request as ready for review September 25, 2024 06:30
Member

@agilgur5 agilgur5 left a comment

Thanks for tracking this down Mason!

Approving this so we can unblock CI, but I do have some small comments below that we can resolve in a follow-up PR

@agilgur5 agilgur5 merged commit bf64aeb into argoproj:main Sep 26, 2024
34 checks passed
@Joibel
Member

Joibel commented Sep 26, 2024

@MasonM thanks for finding and fixing this.

MasonM added a commit to MasonM/argo-workflows that referenced this pull request Sep 27, 2024
This addresses the comments from
argoproj#13660. Also, it
hopefully fixes the flaky `CI / Windows Unit Tests (pull_request)` test
suite. The errors indicate it's trying to write temp files to `/tmp`:
```
    --- FAIL: TestArtifactoryArtifactDriver_Load/Found (0.00s)
        http_test.go:75:
            	Error Trace:	D:/a/argo-workflows/argo-workflows/workflow/artifacts/http/http_test.go:75
            	Error:      	Received unexpected error:
            	            	open /tmp/found: The system cannot find the path specified.
            	Test:       	TestArtifactoryArtifactDriver_Load/Found
```

which obviously isn't the right directory under Windows, but the test
does pass sometimes, and it seems like writing to the wrong directory
would cause consistent failures. Regardless, the tests should be using
`os.CreateTemp()` for this anyway.

Signed-off-by: Mason Malone <mmalone@adobe.com>
@MasonM
Contributor Author

MasonM commented Sep 27, 2024

Thanks @agilgur5! I entered #13670 to address your comments.
