Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: convert k8s submissions from pods to jobs (#9296) #9438

Merged
merged 6 commits into from
May 31, 2024

Conversation

stoksc
Copy link
Contributor

@stoksc stoksc commented May 29, 2024

This change updates our Kubernetes resource manager to submit one job per Determined task instead of many pods. This is a complicated change but we think it is worth it because:

  • Jobs play nice with resource quotas and other Kubernetes features out of the box.
  • Eventually we can delegate restarts, TTL, pause/resume (using suspend), and more to jobs.
  • They allow us to better integrate with Kueue and other tools in the ml ecosystem.
  • Supporting VolcanoJobs (or similar alternatives) alongside Jobs is realistic.
  • The refactor is net positive w.r.t. test coverage (20% to 80%) and code quality.

This commit is the result of several PRs, enumerated here for easier discovery.

Ticket

Description

This is a feature branch. It already contains the core of the change, but we'll be merging release notes, docs, extra tests, and maybe some more adjustments (nothing major) before it lands later this week.

Test Plan

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

Update our Kubernetes resource manager to submit one job per Determined task instead of many pods. This is a complicated change but we think it is worth it because:
- Jobs play nice with resource quotas and other Kubernetes features out of the box.
- Eventually we can delegate restarts, TTL, pause/resume (using suspend), and more to jobs.
- They allow us to integrate with kueue immediately.
- If we want to support VolcanoJobs we are much closer (and it is easier to maintain Job+VolcanoJob than Pods+VolcanoJob).

This change also contains some unrelated CI QoL changes I found useful while working.
Copy link

netlify bot commented May 29, 2024

Deploy Preview for determined-ui canceled.

Name Link
🔨 Latest commit 5f135ae
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/6658f87f5c43e600096014f6

@stoksc
Copy link
Contributor Author

stoksc commented May 29, 2024

As of 66f9d1e CI is very happy so this should be a good starting point for any last minute refactors.

Copy link

codecov bot commented May 29, 2024

Codecov Report

Attention: Patch coverage is 75.34884% with 424 lines in your changes are missing coverage. Please review.

Project coverage is 49.03%. Comparing base (515c135) to head (5f135ae).
Report is 13 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9438      +/-   ##
==========================================
+ Coverage   48.60%   49.03%   +0.43%     
==========================================
  Files        1233     1233              
  Lines      158981   159202     +221     
  Branches     2778     2778              
==========================================
+ Hits        77271    78063     +792     
+ Misses      81536    80965     -571     
  Partials      174      174              
Flag Coverage Δ
backend 43.54% <76.64%> (+1.39%) ⬆️
harness 63.97% <9.09%> (-0.05%) ⬇️
web 44.38% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
master/internal/rm/kubernetesrm/informer.go 86.36% <100.00%> (+5.41%) ⬆️
master/internal/rm/kubernetesrm/log.go 90.00% <100.00%> (+0.52%) ⬆️
master/internal/sproto/task_actor.go 57.14% <ø> (+9.52%) ⬆️
master/pkg/cproto/state.go 31.70% <ø> (+14.63%) ⬆️
master/internal/rm/agentrm/resource_pool.go 34.13% <0.00%> (ø)
master/internal/rm/kubernetesrm/request_queue.go 89.89% <94.11%> (+5.28%) ⬆️
master/internal/sproto/task.go 39.77% <0.00%> (-1.14%) ⬇️
master/internal/webhooks/postgres_webhook.go 57.37% <0.00%> (+0.26%) ⬆️
master/internal/task/allocation.go 76.35% <80.00%> (-0.05%) ⬇️
master/pkg/tasks/task.go 60.20% <0.00%> (-0.64%) ⬇️
... and 10 more

... and 5 files with indirect coverage changes

@determined-ci determined-ci requested a review from a team May 29, 2024 18:56
@determined-ci determined-ci added the documentation Improvements or additions to documentation label May 29, 2024
Copy link
Contributor

@carolinaecalderon carolinaecalderon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; I've already approved all the commits individually

@stoksc stoksc requested review from azhou-determined and removed request for MikhailKardash May 30, 2024 23:45
Copy link
Contributor

@azhou-determined azhou-determined left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

python file lgtm

@stoksc
Copy link
Contributor Author

stoksc commented May 30, 2024

once tests pass i'm going to land this without an explicit approval from @tara-det-ai since she approved the docs pr that landed into this. jk i'm too slow

Copy link
Member

@tara-det-ai tara-det-ai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@stoksc stoksc merged commit 0fdb822 into main May 31, 2024
103 of 114 checks passed
@stoksc stoksc deleted the stoksc/feat/kubernetesjobs branch May 31, 2024 11:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants