feat: convert k8s submissions from pods to jobs #9296

Merged
stoksc merged 13 commits into stoksc/feat/kubernetesjobs from stoksc/feat/pods2jobs
May 29, 2024

@stoksc stoksc commented May 2, 2024

Ticket

[RM-203,RM-204,RM-205,RM-206,RM-208,RM-213]

Description

Update our Kubernetes resource manager to submit one job per Determined task instead of many pods. This is a complicated change but we think it is worth it because:

  • Jobs play nice with resource quotas and other Kubernetes features out of the box.
  • Eventually we can delegate restarts, TTL, pause/resume (using suspend), and more to jobs.
  • They allow us to integrate with Kueue immediately.
  • If we want to support VolcanoJobs we are much closer (and it is easier to maintain Job+VolcanoJob than Pods+VolcanoJob).
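
To make the bullets above concrete, here is a rough sketch of what "one job per task" can look like as a batch/v1 Job. It is not the spec this PR generates: the jobManifest helper, its parameters, and the chosen values (backoffLimit of 0, the container name) are assumptions for illustration. The field names (parallelism, completions, suspend, restartPolicy) and the Kueue queue-name label are standard Kubernetes and Kueue names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jobManifest builds a batch/v1 Job as a generic map so the sketch needs no
// Kubernetes client libraries. All parameters are hypothetical inputs.
func jobManifest(name, image string, numPods int, queue string) map[string]any {
	labels := map[string]any{}
	suspend := false
	if queue != "" {
		// Assumption: when a Kueue queue is given, submit the Job suspended
		// and labeled so Kueue can decide when to unsuspend it.
		labels["kueue.x-k8s.io/queue-name"] = queue
		suspend = true
	}
	return map[string]any{
		"apiVersion": "batch/v1",
		"kind":       "Job",
		"metadata":   map[string]any{"name": name, "labels": labels},
		"spec": map[string]any{
			// One Job fans out to numPods pods instead of numPods bare pods.
			"parallelism":  numPods,
			"completions":  numPods,
			"backoffLimit": 0, // assumption: restarts are not delegated to the Job yet
			"suspend":      suspend,
			"template": map[string]any{
				"spec": map[string]any{
					"restartPolicy": "Never", // Jobs require Never or OnFailure
					"containers": []any{
						map[string]any{"name": "task", "image": image},
					},
				},
			},
		},
	}
}

func main() {
	out, err := json.MarshalIndent(jobManifest("example-task", "busybox", 2, ""), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Pause/resume via suspend would then amount to flipping spec.suspend on the existing Job rather than deleting and recreating pods.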

Test Plan

Covered by automated tests.

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.
  • File ticket for rp.namespace weirdness, graceful shutdown, using indexed completions

@cla-bot cla-bot bot added the cla-signed label May 2, 2024

netlify bot commented May 2, 2024

Deploy Preview for determined-ui ready!

Name Link
🔨 Latest commit 8806bff
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/6656522457a5fe00081f0f51
😎 Deploy Preview https://deploy-preview-9296--determined-ui.netlify.app

codecov bot commented May 2, 2024

Codecov Report

Attention: Patch coverage is 75.71014%, with 419 lines in your changes missing coverage. Please review.

Project coverage is 49.03%. Comparing base (515c135) to head (8806bff).

Additional details and impacted files
@@                      Coverage Diff                       @@
##           stoksc/feat/kubernetesjobs    #9296      +/-   ##
==============================================================
+ Coverage                       48.60%   49.03%   +0.43%     
==============================================================
  Files                            1233     1233              
  Lines                          158981   159205     +224     
  Branches                         2778     2777       -1     
==============================================================
+ Hits                            77271    78074     +803     
+ Misses                          81536    80957     -579     
  Partials                          174      174              
Flag      Coverage Δ
backend   43.56% <77.00%> (+1.41%) ⬆️
harness   63.97% <9.09%> (-0.05%) ⬇️
web       44.38% <ø> (ø)

Flags with carried forward coverage won't be shown.

Files                                               Coverage Δ
master/internal/rm/kubernetesrm/informer.go         86.36% <100.00%> (+5.41%) ⬆️
master/internal/rm/kubernetesrm/log.go              90.00% <100.00%> (+0.52%) ⬆️
master/internal/sproto/task_actor.go                57.14% <ø> (+9.52%) ⬆️
master/pkg/cproto/state.go                          31.70% <ø> (+14.63%) ⬆️
master/internal/rm/agentrm/resource_pool.go         34.13% <0.00%> (ø)
master/internal/rm/kubernetesrm/request_queue.go    89.89% <94.11%> (+5.28%) ⬆️
master/internal/sproto/task.go                      40.90% <0.00%> (ø)
master/internal/webhooks/postgres_webhook.go        57.10% <0.00%> (ø)
master/internal/db/postgres_test_utils.go           82.00% <60.00%> (-0.16%) ⬇️
master/internal/task/allocation.go                  76.35% <80.00%> (-0.05%) ⬇️
... and 11 more

... and 4 files with indirect coverage changes

@stoksc stoksc force-pushed the stoksc/feat/pods2jobs branch 2 times, most recently from 07bfc25 to e99300c Compare May 6, 2024 21:36

@carolinaecalderon carolinaecalderon left a comment

notes from my first skim through

master/internal/rm/kubernetesrm/job.go (outdated)
master/internal/rm/kubernetesrm/job.go (outdated)
master/internal/rm/kubernetesrm/job.go
master/internal/rm/kubernetesrm/job.go (outdated)
master/internal/rm/kubernetesrm/job.go (outdated)
helm/charts/determined/values.yaml (outdated)
helm/charts/determined/templates/master-deployment.yaml (outdated)
master/internal/rm/kubernetesrm/spec.go (outdated)
master/internal/rm/kubernetesrm/spec.go
@stoksc stoksc changed the title Stoksc/feat/pods2jobs feat: convert k8s submissions from pods to jobs May 9, 2024
@stoksc stoksc force-pushed the stoksc/feat/pods2jobs branch 5 times, most recently from ae8e8e0 to 101af3d Compare May 16, 2024 18:27
@stoksc stoksc force-pushed the stoksc/feat/pods2jobs branch 4 times, most recently from e8a4aab to aad1242 Compare May 17, 2024 21:17
@stoksc stoksc marked this pull request as ready for review May 17, 2024 23:01
@stoksc stoksc requested review from a team as code owners May 17, 2024 23:01

stoksc commented May 17, 2024

All the failing tests are also failing on main, but I'm going to make an attempt to fix the relevant ones before landing at least.

@stoksc stoksc changed the base branch from main to stoksc/feat/kubernetesjobs May 17, 2024 23:25
@stoksc stoksc requested review from a team as code owners May 17, 2024 23:25
@stoksc stoksc requested a review from ashtonG May 17, 2024 23:25

@carolinaecalderon carolinaecalderon left a comment

Approved, but before you merge: I pointed out 2 changes that I think should either be called out explicitly in the PR description or given their own PR, because I think they're unrelated to this feat.
Also, I think you should reword the log messages into something a little clearer by putting the action verb first.
Besides that, just style comments, which I assume will make it into their own PR.

Comment on lines 335 to 339
j.syslog.Infof("saw pod %s in state %s", podName, cproto.Pulling)
j.container.State = cproto.Pulling
j.informTaskResourcesState()

j.syslog.Infof("saw pod %s in state %s", podName, cproto.Starting)

nit: I don't love the wording of these "saw pod X in state Y" messages; maybe try "pulling/starting pod %s" --> strings.ToLower(cproto.Pulling) + " pod " + podName
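
A minimal sketch of the verb-first wording suggested in this nit. The State type and the podStateMessage helper are hypothetical stand-ins, not code from this PR, and the sketch assumes cproto.State is a string-based type, which is why it converts with string(...) before lowercasing.

```go
package main

import (
	"fmt"
	"strings"
)

// State stands in for cproto.State; assumed here to be a string-based type.
type State string

const (
	Pulling  State = "PULLING"
	Starting State = "STARTING"
)

// podStateMessage puts the action first, producing messages such as
// "pulling pod <name>" instead of "saw pod <name> in state PULLING".
func podStateMessage(state State, podName string) string {
	return strings.ToLower(string(state)) + " pod " + podName
}

func main() {
	fmt.Println(podStateMessage(Pulling, "my-pod"))
	fmt.Println(podStateMessage(Starting, "my-pod"))
}
```

Lowercasing the state only reads naturally for states that happen to be verbs (pulling, starting); other states would likely need their own wording.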

@determined-ci determined-ci requested a review from a team May 21, 2024 23:29
@determined-ci determined-ci added the documentation Improvements or additions to documentation label May 21, 2024
@stoksc stoksc changed the base branch from stoksc/feat/kubernetesjobs to main May 22, 2024 00:40
@determined-ci determined-ci removed the documentation Improvements or additions to documentation label May 22, 2024

@tara-det-ai tara-det-ai left a comment

LGTM

@@ -99,11 +99,13 @@ build/mock_gen.stamp: $(MOCK_INPUTS)
mockery --quiet --name=PodInterface --srcpkg=k8s.io/client-go/kubernetes/typed/core/v1 --output internal/mocks --filename pod_iface.go
mockery --quiet --name=EventInterface --srcpkg=k8s.io/client-go/kubernetes/typed/core/v1 --output internal/mocks --filename event_iface.go
mockery --quiet --name=NodeInterface --srcpkg=k8s.io/client-go/kubernetes/typed/core/v1 --output internal/mocks --filename node_iface.go
mockery --quiet --name=JobInterface --srcpkg=k8s.io/client-go/kubernetes/typed/batch/v1 --output internal/mocks --filename job_iface.go

we need to switch to a config file ASAP

https://hpe-aiatscale.atlassian.net/browse/RM-277

req.State = msg.State
if sproto.ScheduledStates[req.State] {
k.allocationIDToRunningPods[id]++
k.allocationIDToRunningPods[msg.AllocationID] += msg.NumPods

Does the job know how many pods it has running? It is a little tragic that we need to keep track of this map.

feel free to just make a follow up ticket or ignore any of these

Contributor Author

I agree, it's tragic, but no: the job doesn't know how many are "running", where our definition of "running" is post-scheduling and bound to a node.
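
A rough sketch of the bookkeeping discussed in this thread, under the stated definition of "running" (scheduled and bound to a node). The pod and tracker types and the runningPods and podsScheduled helpers are hypothetical and pared down for illustration; only the map name allocationIDToRunningPods comes from the diff above.

```go
package main

import "fmt"

// pod is a pared-down stand-in for a Kubernetes pod; only the fields this
// sketch needs are modeled.
type pod struct {
	Name     string
	NodeName string // empty until the scheduler binds the pod to a node
}

// runningPods counts pods that have been scheduled and bound to a node,
// regardless of whether their containers have started yet.
func runningPods(pods []pod) int {
	n := 0
	for _, p := range pods {
		if p.NodeName != "" {
			n++
		}
	}
	return n
}

// tracker keeps a per-allocation count because, per the discussion above,
// the Job itself does not report this quantity.
type tracker struct {
	allocationIDToRunningPods map[string]int
}

// podsScheduled adds numPods to the allocation's count in one step, since a
// single job-level update can cover several pods.
func (t *tracker) podsScheduled(allocationID string, numPods int) {
	t.allocationIDToRunningPods[allocationID] += numPods
}

func main() {
	pods := []pod{{Name: "a", NodeName: "node-1"}, {Name: "b"}}
	t := tracker{allocationIDToRunningPods: map[string]int{}}
	t.podsScheduled("alloc-1", runningPods(pods))
	fmt.Println(t.allocationIDToRunningPods["alloc-1"])
}
```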

stoksc added 13 commits May 28, 2024 17:51
various ci fixes

add consts

fix import style

revert unneeded helm changes

last bit of review feedback

lint fixes

bring in carolina's config changes

tmp

stuff that i'm definitely keeping

amends

lints

debug logging for weird failure only in CI

debug logging

test fixes

test fixes

fixes for reattach tests

self review

more self review

fix annoyance

pass numPods to recreateJobHandler

final self review

fix: job queue state not recovered on reattach

various fixes
@stoksc stoksc changed the base branch from main to stoksc/feat/kubernetesjobs May 28, 2024 21:53

stoksc commented May 28, 2024

landing this into the feature branch (as soon as I get green tests) where it will get a release note, docs, extra tests, and maybe some more adjustments (nothing major) before it lands later this week.

@stoksc stoksc merged commit 66f9d1e into stoksc/feat/kubernetesjobs May 29, 2024
91 of 102 checks passed
@stoksc stoksc deleted the stoksc/feat/pods2jobs branch May 29, 2024 00:52
stoksc added a commit that referenced this pull request May 31, 2024
This change updates the Kubernetes resource manager to submit one Kubernetes job per Determined allocation instead of many pods. This is complicated but we think it is worth it because:
- Jobs play nice with resource quotas and other Kubernetes features out of the box.
- Eventually we can delegate restarts, TTL, pause/resume (using suspend), and more to jobs.
- They allow us to better integrate with Kueue and other tools in the ML ecosystem.
- Supporting VolcanoJobs (or similar alternatives) alongside Jobs is realistic.
- The refactor is net positive w.r.t. test coverage (20% to 80%) and code quality.

This commit is the result of several PRs, enumerated here for easier discovery.
- #9296 contains most of the code changes.
- #9443 
- #9447 
- #9450 
- #9451

Co-authored-by: Carolina Calderon <carolina.calderon@hpe.com>

7 participants