test: fix tensorboard reattach k8s flake [RM-39] #8906
Merged
Description
Fix flake https://app.circleci.com/pipelines/github/determined-ai/determined/51619/workflows/5358e0ec-7fc1-4893-b1d3-5422a43be44a/jobs/2298153
The basic premise of the flake: the experiment's TensorBoard uploads were waiting on GCS 429 errors, which made the experiment take longer than 60 seconds.
https://circleci.com/api/v1.1/project/github/determined-ai/determined/2293269/output/145/0?file=true&allocation-id=65dcd1db5c29034450ab1342-0-build%2FABCDEFGH
My understanding of why this started happening recently is that #8780 changed how retrying works. Previously we would just skip TensorBoard uploads that failed due to 429s, but now we retry them and block on the retry.
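To illustrate the difference between skipping and retrying, here is a minimal sketch. It is not the actual Determined code: `TooManyRequests`, `upload_skip_on_429`, `upload_retry_on_429`, the attempt count, and the backoff delays are all hypothetical names and values chosen for the example.

```python
import time
from typing import Callable, List


class TooManyRequests(Exception):
    """Hypothetical stand-in for a storage backend's HTTP 429 response."""


def upload_skip_on_429(upload: Callable[[], None]) -> bool:
    """Skip-style behavior: give up on a 429 and move on immediately."""
    try:
        upload()
        return True
    except TooManyRequests:
        return False  # upload is dropped; the caller is not delayed


def upload_retry_on_429(
    upload: Callable[[], None],
    max_attempts: int = 5,      # assumed value, for illustration only
    base_delay: float = 1.0,    # assumed value, for illustration only
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry-style behavior: back off and retry, blocking the caller."""
    for attempt in range(max_attempts):
        try:
            upload()
            return True
        except TooManyRequests:
            if attempt == max_attempts - 1:
                return False
            sleep(base_delay * 2 ** attempt)  # the caller waits here
    return False


def make_flaky_upload(failures: int) -> Callable[[], None]:
    """Build an upload that raises a 429 for the first `failures` calls."""
    remaining = [failures]

    def upload() -> None:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise TooManyRequests()

    return upload


def simulated_wait(failures: int) -> float:
    """Total seconds the retrying version would sleep, without really sleeping."""
    waits: List[float] = []
    upload_retry_on_429(make_flaky_upload(failures), sleep=waits.append)
    return sum(waits)
```

Under these assumed delays, three consecutive 429s cost 1 + 2 + 4 = 7 seconds of blocking for a single upload, where the skip-style version returns immediately. A handful of such uploads could plausibly push an experiment past a 60-second test timeout.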
The fix is simply to run a shorter experiment, since experiment duration isn't what this test is checking.
Test Plan
Merge it and see if the flakes keep happening