-
Notifications
You must be signed in to change notification settings - Fork 354
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add option for Checkpoint_GC pod spec in task container defaults #9406
Conversation
✅ Deploy Preview for determined-ui canceled.
|
tools/devcluster.yaml
Outdated
@@ -59,6 +59,10 @@ stages: | |||
enable_cors: true | |||
observability: | |||
enable_prometheus: true | |||
task_container_defaults: | |||
checkpoint_gc_pod_spec: | |||
apiVersion: v1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming we would only need the structure/outline of the pod spec to be defined
master/pkg/tasks/task_gc.go
Outdated
podSpec := g.LegacyConfig.Environment.PodSpec() | ||
if g.Base.TaskContainerDefaults.CheckpointGCPodSpec != nil { | ||
gcPodSpec := (*expconf.PodSpec)(g.Base.TaskContainerDefaults.CheckpointGCPodSpec) | ||
podSpec = schemas.Merge(podSpec, gcPodSpec) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should merge the pod spec from the experiment here in case it has node affinities that we won't want to add
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noted!
Replaced the code with overwriting the legacy PodSpec value
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #9406 +/- ##
==========================================
+ Coverage 48.60% 48.64% +0.04%
==========================================
Files 1233 1233
Lines 159002 159010 +8
Branches 2778 2778
==========================================
+ Hits 77283 77356 +73
+ Misses 81545 81480 -65
Partials 174 174
Flags with carried forward coverage won't be shown. Click here to find out more.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code changes look great, nice work. We need a few additional extra changes to
- We should update the Helm chart to allow users to set this when deploying Determined on K8s with Helm
cpu_pod_spec would be a good example of how to go about this
determined/helm/charts/determined/templates/master-config.yaml
Lines 225 to 227 in 3dfb9ec
{{- if .Values.taskContainerDefaults.gpuPodSpec }} | |
gpu_pod_spec: {{ .Values.taskContainerDefaults.gpuPodSpec | toJson }} | |
{{- end }} |
- doc updates. We have docs on master config / helm charts / and custom pod specs. We should update these docs to include this new feature.
cpu_pod_spec: {{ .Values.taskContainerDefaults.cpuPodSpec | toJson }} |
- ``cpuPodSpec``: Sets the default pod spec for all non-GPU tasks. See :ref:`custom-pod-specs` |
:ref:`helm-config-reference` under ``taskContainerDefaults.cpuPodSpec`` and |
- For any user facing change we need a release note. We have some instructions on how to add them here
https://github.com/determined-ai/determined/tree/main/docs/release-notes
@@ -494,6 +521,13 @@ func TestPodSpecsDefaultMerging(t *testing.T) { | |||
}, | |||
}, | |||
}, | |||
CheckpointGCPodSpec: &k8sV1.Pod{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this test fail without this change?
This is testing merging into expconf so I don't know if we need this change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. TestPodSpecsDefaultMerging
doesn't need CheckpointGCPodSpec while merging.
Tests do not fail without this.
tools/devcluster.yaml
Outdated
@@ -59,6 +59,10 @@ stages: | |||
enable_cors: true | |||
observability: | |||
enable_prometheus: true | |||
task_container_defaults: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would probably remove this change since this devcluster won't be on Kubernetes and most developers won't use this
if someone needs to test it they can add it themself
} | ||
|
||
for testCase, testVars := range tests { | ||
t.Run(testCase, func(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this test looks great, nice job
f7b456f
to
067b353
Compare
**New Features** | ||
|
||
- Kubernetes: Add a feature where Determined offers the users to provide custom Checkpoint GC pod spec for | ||
individual experiment. You can specify a custom ``checkpointGcPodSpec`` directly in the experiment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can specify a custom ``checkpointGcPodSpec`` directly in the experiment configuration file
This is for task container defaults only right and not in the experiment yaml?
Custom CheckpointGC Pod Specs | ||
******************************* | ||
|
||
Determined also provides a way to configure CheckpointGC pod specs for individual experiment. You can |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment about not how we don't let you configure per expeirment
helm/charts/determined/values.yaml
Outdated
@@ -264,6 +264,10 @@ taskContainerDefaults: | |||
# cpuPodSpec: | |||
# gpuPodSpec: | |||
|
|||
# Configure a custom CheckpointGC pod spec for experiments. If a pod spec is defined for an |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment about how we don't support configuring this per experiment
harness/determined/deploy/gke/cli.py
Outdated
@@ -280,6 +280,12 @@ def configure_helm(args: argparse.Namespace) -> None: | |||
|
|||
helm_values["taskContainerDefaults"]["cpuPodSpec"] = cpu_pod_spec | |||
|
|||
checkpoint_gc_pod_spec = make_spec( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm a little worried this will overwrite the default value
pod_spec = {"apiVersion": "v1", "kind": "Pod"} |
so when you start a cluster with a checkpoint gc spec, this will add one even if a user has not explicitly given one. This would mean we would no longer use the experiment's pod spec in the default case.
The broader point is gke/cli.py is experimental and hasn't been really been updated in like 2 years. I would avoid spending any time on this and just not make any changes to this file.
Docsite preview being generated for this PR. |
8bb2160
to
e14fe37
Compare
- Kubernetes: Add a feature where Determined offers the users to provide custom Checkpoint GC pod spec. | ||
This configuration is done using the ``task_container_defaults.checkpoint_gc_pod_spec`` field | ||
within your ``value.yaml`` file. When a custom CheckpointGC pod spec is defined, it takes | ||
precedence over the default CheckpointGC pod specifications. This provides flexibility to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be extra clear about what "default CheckpointGC pod specfications" means
I would mention that previous it would use the experiments pod spec but it can be overriden to use a configurable one with this field
|
||
Determined also provides a way to configure CheckpointGC pod specs. This configuration is done using | ||
the ``task_container_defaults.checkpoint_gc_pod_spec`` field within your ``value.yaml`` file. When a | ||
custom CheckpointGC pod spec is defined, it takes precedence over the default CheckpointGC pod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment with the release note about clarifying what "default checkpointGC pod" refers to
**New Features** | ||
|
||
- Kubernetes: Add a feature where Determined offers the users to provide custom Checkpoint GC pod spec. | ||
This configuration is done using the ``task_container_defaults.checkpoint_gc_pod_spec`` field |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the value.yaml
expects taskContainerDefaults.checkpointGcSpec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Completely missed it. Good catch!
@@ -37,10 +37,16 @@ func (g GCCkptSpec) ToTaskSpec() TaskSpec { | |||
// Set Environment. | |||
// Keep only the EnvironmentVariables provided by the experiment's config. | |||
envVars := g.LegacyConfig.Environment.EnvironmentVariables() | |||
|
|||
podSpec := g.LegacyConfig.Environment.PodSpec() | |||
if g.Base.TaskContainerDefaults.CheckpointGCPodSpec != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just to be super safe here, I don't think this can be nil but inside MergeIntoExpConfig
there is a nil check for it.
if g.Base.TaskContainerDefaults.CheckpointGCPodSpec != nil { | |
if g.Base.TaskContainerDefaults != nil && g.Base.TaskContainerDefaults.CheckpointGCPodSpec != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As g.Base.TaskContainerDefaults
is of type mismatch with nil
, it will be better to create a new func to check if it is nil
. Something like this:
if !g.Base.TaskContainerDefaults.IsNil() && g.Base.TaskContainerDefaults.CheckpointGCPodSpec != nil
func (c *TaskContainerDefaultsConfig) IsNil() bool {
return c == nil
}
Other option we have is to leverage the validate
function and check for err != nil
. Downside will be that err
might not be empty because of other parameters being checked within that function.
err := g.Base.TaskContainerDefaults.Validate()
if err != nil && g.Base.TaskContainerDefaults.CheckpointGCPodSpec != nil {
Let me know what do you think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh my bad since g.Base.TaskContainersDefaultConfig
is not a pointer we don't need to do anything
The code was good as is. I made a mistake thinking it was a pointer.
The IsNil
check won't add anything since it will always not be nil.
helm/charts/determined/values.yaml
Outdated
@@ -264,6 +264,10 @@ taskContainerDefaults: | |||
# cpuPodSpec: | |||
# gpuPodSpec: | |||
|
|||
# Configure a custom CheckpointGC pod spec for experiments. When a custom CheckpointGC pod spec | |||
# is defined, it takes precedence over the default CheckpointGC pod specifications. See |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment about the "default spec"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
last comment is about not needing the IsNil
i also requested a review from @tara-det-ai since there are doc changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Great work on this PR!
Ticket
RM-162
Description
A user complained about the below issue:
Overview of the issue: We pull a lot of stuff from the experiment config into the checkpoint GC spec without really thinking it through. One thing we pull in are the cpu requests/limits from the training job.
This assumption that the experiment’s pod spec won’t work for every user. Issue shows a community user who has a GPU affinity in their pod spec which prevents their gc job from running.
In this PR we give the user the ability to tell us what their checkpoint gc pod spec looks like. We should add a checkpoint gc pod spec that will overwrite the experiments pod spec (if a user specifies a checkpoint gc pod spec) to task container defaults.
Test Plan
CheckpointGCSpec
added to the master config.devcluster
with right value flow of the data from master config using command:det -u admin master config
. Connected though Kubernetes PodsRunning the below command
det e create e2e_tests/tests/fixtures/no_op/gc_checkpoints_decreasing.yaml e2e_tests/tests/fixtures/no_op/ -f
with
devcluster -c tools/k8s/devcluster.yaml
Monitored logs through
det master logs | grep "CHECKPOINT_GC"
commandChecklist
docs/release-notes/
.See Release Note for details.