-
Notifications
You must be signed in to change notification settings - Fork 348
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: convert k8s submissions from pods to jobs #9296
Conversation
✅ Deploy Preview for determined-ui ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## stoksc/feat/kubernetesjobs #9296 +/- ##
==============================================================
+ Coverage 48.60% 49.03% +0.43%
==============================================================
Files 1233 1233
Lines 158981 159205 +224
Branches 2778 2777 -1
==============================================================
+ Hits 77271 78074 +803
+ Misses 81536 80957 -579
Partials 174 174
Flags with carried forward coverage won't be shown. Click here to find out more.
|
07bfc25
to
e99300c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
notes from my first skim through
a49c29b
to
dd9f56d
Compare
ae8e8e0
to
101af3d
Compare
e8a4aab
to
aad1242
Compare
All the failing tests are also failing on main, but I'm going to make an attempt to fix the relevant ones before landing at least. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved, but before you merge I pointed out 2 changes that I think should either be called out explicitly in the PR description or given their own PR because I think they're unrelated to this feat
Also, I think you should reword the log messages into something a little more clear by putting the action verb first.
Besides that, just style comments, which I assume will make it into their own PR
j.syslog.Infof("saw pod %s in state %s", podName, cproto.Pulling) | ||
j.container.State = cproto.Pulling | ||
j.informTaskResourcesState() | ||
|
||
j.syslog.Infof("saw pod %s in state %s", podName, cproto.Starting) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I don't love the wording of these "saw pod X in state Y", maybe try "pulling/starting/ pod %s" --> strings.ToLower(cproto.Pulling) + " pod " + stringName
6fa894e
to
6d8e486
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
ed5620b
to
96f1ce4
Compare
@@ -99,11 +99,13 @@ build/mock_gen.stamp: $(MOCK_INPUTS) | |||
mockery --quiet --name=PodInterface --srcpkg=k8s.io/client-go/kubernetes/typed/core/v1 --output internal/mocks --filename pod_iface.go | |||
mockery --quiet --name=EventInterface --srcpkg=k8s.io/client-go/kubernetes/typed/core/v1 --output internal/mocks --filename event_iface.go | |||
mockery --quiet --name=NodeInterface --srcpkg=k8s.io/client-go/kubernetes/typed/core/v1 --output internal/mocks --filename node_iface.go | |||
mockery --quiet --name=JobInterface --srcpkg=k8s.io/client-go/kubernetes/typed/batch/v1 --output internal/mocks --filename job_iface.go |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to switch to a config file ASAP
req.State = msg.State | ||
if sproto.ScheduledStates[req.State] { | ||
k.allocationIDToRunningPods[id]++ | ||
k.allocationIDToRunningPods[msg.AllocationID] += msg.NumPods |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does the job know how many pods it has running, it is a little tragic that we need to keep track of this map
feel free to just make a follow up ticket or ignore any of these
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, tragic but no it doesn't know how many are "running" where our definition of "running" is post scheduling and bound to a node.
2274790
to
10af218
Compare
various ci fixes add consts fix import style revert unneeded helm changes last bit of review feedback lint fixes bring in carolina's config changes tmp stuff that i'm definitely keeping amends lints debug logging for weird failure only in CI debug logging test fixes test fixes fixes for reattach tests self review more self review fix annoyance pass numPods to recreateJobHandler final self review fix: job queue state not recovered on reattach various fixes
10af218
to
8806bff
Compare
landing this into the feature branch (as soon as I get green tests) where it will get a release note, docs, extra tests, and maybe some more adjustments (nothing major) before it lands later this week. |
This change updates the Kubernetes resource manager to submit one Kubernetes job per Determined allocation instead of many pods. This is complicated but we think it is worth it because: - Jobs play nice with resource quotas and other Kubernetes features out of the box. - Eventually we can delegate restarts, TTL, pause/resume (using suspend), and more to jobs. - They allow us to better integrate with Kueue and other tools in the ml ecosystem. - Supporting VolcanoJobs (or similar alternatives) alongside Jobs is realistic. - The refactor is net positive w.r.t. test coverage (20% to 80%) and code quality. This commit is the result of several PRs, enumerated here for easier discovery. - #9296 contains most of the code changes. - #9443 - #9447 - #9450 - #9451 Co-authored-by: Carolina Calderon <carolina.calderon@hpe.com>
Ticket
[RM-203,RM-204,RM-205,RM-206,RM-208,RM-213]
Description
Update our Kubernetes resource manager to submit one job per Determined task instead of many pods. This is a complicated change but we think it is worth it because:
Test Plan
Covered by automated tests.
Checklist
docs/release-notes/
.See Release Note for details.