Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: add nightly gke cluster cleanup job #9031

Merged
merged 7 commits into from
Apr 5, 2024

Conversation

amandavialva01
Copy link
Contributor

Description

Create nightly cleanup job that removes all namespaces (and therefore all k8s objects within them) from the CircleCI GKE cluster and deletes all buckets created by CI jobs used for testing with that cluster.

Successful run with resources to delete: https://app.circleci.com/pipelines/github/determined-ai/determined/52752/workflows/ef46cdef-ca8d-4345-a1e7-49e15d7473aa/jobs/2370042

Successful run with no resources to delete: https://app.circleci.com/pipelines/github/determined-ai/determined/52764/workflows/0a6c0b3f-8e4b-427e-ba92-d6f7b5e930a6/jobs/2370726

Test Plan

CI passes.

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

Ticket

DET-10117

@amandavialva01 amandavialva01 requested a review from a team as a code owner March 20, 2024 21:00
@cla-bot cla-bot bot added the cla-signed label Mar 20, 2024
@amandavialva01 amandavialva01 changed the title add nightly gke cluster cleanup job chore: add nightly gke cluster cleanup job Mar 20, 2024
@amandavialva01 amandavialva01 force-pushed the amanda/nightlyGKEClusterCleanupJob branch from fe5d76d to 080d7ac Compare March 20, 2024 21:02
Copy link

codecov bot commented Mar 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 46.63%. Comparing base (5cb7927) to head (80bcf54).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9031      +/-   ##
==========================================
- Coverage   46.64%   46.63%   -0.01%     
==========================================
  Files        1172     1172              
  Lines      143619   143619              
  Branches     2410     2410              
==========================================
- Hits        66986    66983       -3     
- Misses      76428    76431       +3     
  Partials      205      205              
Flag Coverage Δ
backend 43.38% <ø> (ø)
harness 63.99% <ø> (-0.01%) ⬇️
web 37.77% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

see 3 files with indirect coverage changes

Copy link
Contributor

@carolinaecalderon carolinaecalderon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@amandavialva01 amandavialva01 force-pushed the amanda/nightlyGKEClusterCleanupJob branch from f26f546 to fc9c543 Compare March 21, 2024 17:48
Copy link
Contributor

@corban-beaird corban-beaird left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also LGTM! 😄

@amandavialva01 amandavialva01 force-pushed the amanda/nightlyGKEClusterCleanupJob branch from cfc0dbf to e93351b Compare April 5, 2024 00:33
@amandavialva01 amandavialva01 force-pushed the amanda/nightlyGKEClusterCleanupJob branch from e93351b to d282817 Compare April 5, 2024 00:39
@amandavialva01 amandavialva01 requested review from a team as code owners April 5, 2024 00:39
@amandavialva01 amandavialva01 requested review from NicholasBlaskey, thiagodallacqua-hpe and azhou-determined and removed request for a team April 5, 2024 00:39
Copy link

cla-bot bot commented Apr 5, 2024

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have the users @djanicekpach, @jesse-amano-hpe, @determined-ci on file. In order for us to review and merge your code, please start the CLA process at https://determined.ai/cla.

After we approve your CLA, we will update the contributors list (private) and comment @cla-bot[bot] check to rerun the check.

@cla-bot cla-bot bot removed the cla-signed label Apr 5, 2024
@determined-ci
Copy link
Collaborator

Docsite preview being generated for this PR.
You can (eventually) find the generated docsite here.

@determined-ci determined-ci requested a review from a team April 5, 2024 00:39
@determined-ci determined-ci added the documentation Improvements or additions to documentation label Apr 5, 2024
@amandavialva01 amandavialva01 changed the base branch from feature/k8sTestWithSharedCluster to main April 5, 2024 00:47
Copy link
Member

@dannysauer dannysauer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving for the sake of moving forward, but there are come changes I'd like to see longer-term. :)

- run:
name: Delete GKE CI Cluster Namespaces
command: |
kubectl get namespace | grep -Eo "^test-cpu-[a-z0-9]+-[a-z0-9]+-[0-9]" | xargs -L1 kubectl delete namespace || true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The right-er way to mass delete from kubernetes is to apply labels, and then delete items which have a given label value. If the namespaces were created with a label which was, for example, like "determiend.ai/ci" and an arbitrary value (perhaps the ID of the test), then this could be as simple as:

Suggested change
kubectl get namespace | grep -Eo "^test-cpu-[a-z0-9]+-[a-z0-9]+-[0-9]" | xargs -L1 kubectl delete namespace || true
kubectl delete namespace -l determiend.ai/ci

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a great idea. I will look into adding a unique label to each namespace to simplify this cleanup job in the future

Comment on lines 2761 to 2767
gcloud compute networks get-effective-firewalls $GCP_NETWORK_NAME --project "$GCP_PROJECT_ID" \
--format="table(name)" | tail -n +2 | \
while read fw; do
if [[ $fw =~ "k8s" ]]; then
gcloud compute firewall-rules delete "$fw" --quiet --project "$GCP_PROJECT_ID"
fi
done
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The gcloud delete commands mostly take multiple targets for delete, so this could be done (faster) with a single command along the lines of this:

Suggested change
gcloud compute networks get-effective-firewalls $GCP_NETWORK_NAME --project "$GCP_PROJECT_ID" \
--format="table(name)" | tail -n +2 | \
while read fw; do
if [[ $fw =~ "k8s" ]]; then
gcloud compute firewall-rules delete "$fw" --quiet --project "$GCP_PROJECT_ID"
fi
done
gcloud compute firewall-rules delete "$fw" --quiet --project "$GCP_PROJECT_ID" $( \
gcloud compute networks get-effective-firewalls $GCP_NETWORK_NAME --project "$GCP_PROJECT_ID" --regexp=".*k8s.*" \
--format="table(name)" | tail -n +2
)

Piping into a while loop like that means any command inside the loop which reads from stdin will consume output before the read gets another chance, so it's generally ideal to avoid that structure for maintenance (works now, but what if someone adds a second command in the loop later?).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would actually be even easier with a cloud-custodian yaml file which could just select the stuff to remove. :D

Copy link
Contributor Author

@amandavialva01 amandavialva01 Apr 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like the kubectl delete namespace command removes firewall rules, target pools, and forwarding rules (since those resources are namespace). So, I'll remove these three steps from the cleaup job since they get inherently deleted from the deletion of the namespace!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, even better!

@amandavialva01 amandavialva01 force-pushed the amanda/nightlyGKEClusterCleanupJob branch from d282817 to 16d7008 Compare April 5, 2024 01:52
@cla-bot cla-bot bot added the cla-signed label Apr 5, 2024
@determined-ci determined-ci removed the documentation Improvements or additions to documentation label Apr 5, 2024
Copy link

netlify bot commented Apr 5, 2024

Deploy Preview for determined-ui canceled.

Name Link
🔨 Latest commit 80bcf54
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/660f5f82a99ac000087e898a

@amandavialva01 amandavialva01 force-pushed the amanda/nightlyGKEClusterCleanupJob branch from 16d7008 to ea01a59 Compare April 5, 2024 01:52
@amandavialva01 amandavialva01 force-pushed the amanda/nightlyGKEClusterCleanupJob branch from ea01a59 to 2f63f83 Compare April 5, 2024 01:57
@amandavialva01 amandavialva01 enabled auto-merge (squash) April 5, 2024 02:24
@amandavialva01 amandavialva01 merged commit 7fc8d7a into main Apr 5, 2024
68 of 81 checks passed
@amandavialva01 amandavialva01 deleted the amanda/nightlyGKEClusterCleanupJob branch April 5, 2024 02:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants