Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add option for Checkpoint_GC pod spec in task container defaults #9406

Merged
merged 9 commits into from
May 31, 2024

Conversation

ShreyaLnuHpe
Copy link
Contributor

@ShreyaLnuHpe ShreyaLnuHpe commented May 22, 2024

Ticket

RM-162

Description

A user complained about the below issue:
Overview of the issue: We pull a lot of stuff from the experiment config into the checkpoint GC spec without really thinking it through. One thing we pull in are the cpu requests/limits from the training job.
This assumption that the experiment’s pod spec won’t work for every user. Issue shows a community user who has a GPU affinity in their pod spec which prevents their gc job from running.

In this PR we give the user the ability to tell us what their checkpoint gc pod spec looks like. We should add a checkpoint gc pod spec that will overwrite the experiments pod spec (if a user specifies a checkpoint gc pod spec) to task container defaults.

Test Plan

  1. Added unit test coverage for overwriting the default Legacy config in the Experiment Pod Spec.
  2. Added unit test coverage for the CheckpointGCSpec added to the master config.
  3. Working devcluster with right value flow of the data from master config using command: det -u admin master config. Connected though Kubernetes Pods
  4. Tested helm changes as helm chart applied the config to master using the port-forwarding.
  5. Tested with setting the following values in the determined/tools/k8s/devcluster.yaml file
        task_container_defaults:
          checkpoint_gc_pod_spec:
            apiVersion: v1
            kind: Pod
            metadata:
              name: example-pod
            spec:
              affinity:
                nodeAffinity:
                  requiredDuringSchedulingIgnoredDuringExecution:
                    nodeSelectorTerms:
                    - matchExpressions:
                      - key: does-not-exist
                        operator: Exists

Running the below command
det e create e2e_tests/tests/fixtures/no_op/gc_checkpoints_decreasing.yaml e2e_tests/tests/fixtures/no_op/ -f
with devcluster -c tools/k8s/devcluster.yaml
Monitored logs through det master logs | grep "CHECKPOINT_GC" command

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

@ShreyaLnuHpe ShreyaLnuHpe requested a review from a team as a code owner May 22, 2024 05:27
@cla-bot cla-bot bot added the cla-signed label May 22, 2024
Copy link

netlify bot commented May 22, 2024

Deploy Preview for determined-ui canceled.

Name Link
🔨 Latest commit a5a0309
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/6658b6cef46a7a0008482e9f

@@ -59,6 +59,10 @@ stages:
enable_cors: true
observability:
enable_prometheus: true
task_container_defaults:
checkpoint_gc_pod_spec:
apiVersion: v1
Copy link
Contributor Author

@ShreyaLnuHpe ShreyaLnuHpe May 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming we would only need the structure/outline of the pod spec to be defined

@ShreyaLnuHpe ShreyaLnuHpe requested a review from a team as a code owner May 22, 2024 23:47
@ShreyaLnuHpe ShreyaLnuHpe changed the title Add option for Checkpoint_GC pod spec in task container defaults chore: add option for Checkpoint_GC pod spec in task container defaults May 22, 2024
podSpec := g.LegacyConfig.Environment.PodSpec()
if g.Base.TaskContainerDefaults.CheckpointGCPodSpec != nil {
gcPodSpec := (*expconf.PodSpec)(g.Base.TaskContainerDefaults.CheckpointGCPodSpec)
podSpec = schemas.Merge(podSpec, gcPodSpec)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should merge the pod spec from the experiment here in case it has node affinities that we won't want to add

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted!
Replaced the code with overwriting the legacy PodSpec value

Copy link

codecov bot commented May 24, 2024

Codecov Report

Attention: Patch coverage is 77.77778% with 2 lines in your changes are missing coverage. Please review.

Project coverage is 48.64%. Comparing base (b96ccba) to head (a5a0309).
Report is 9 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9406      +/-   ##
==========================================
+ Coverage   48.60%   48.64%   +0.04%     
==========================================
  Files        1233     1233              
  Lines      159002   159010       +8     
  Branches     2778     2778              
==========================================
+ Hits        77283    77356      +73     
+ Misses      81545    81480      -65     
  Partials      174      174              
Flag Coverage Δ
backend 42.29% <77.77%> (+0.13%) ⬆️
harness 64.02% <ø> (ø)
web 44.38% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
master/pkg/tasks/task_gc.go 85.89% <100.00%> (+85.89%) ⬆️
master/cmd/determined-master/root.go 49.70% <50.00%> (+<0.01%) ⬆️
master/pkg/model/task_container_defaults.go 84.51% <66.66%> (-0.36%) ⬇️

... and 1 file with indirect coverage changes

Copy link
Contributor

@NicholasBlaskey NicholasBlaskey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code changes look great, nice work. We need a few additional extra changes to

  1. We should update the Helm chart to allow users to set this when deploying Determined on K8s with Helm

cpu_pod_spec would be a good example of how to go about this

{{- if .Values.taskContainerDefaults.gpuPodSpec }}
gpu_pod_spec: {{ .Values.taskContainerDefaults.gpuPodSpec | toJson }}
{{- end }}

  1. doc updates. We have docs on master config / helm charts / and custom pod specs. We should update these docs to include this new feature.

cpu_pod_spec: {{ .Values.taskContainerDefaults.cpuPodSpec | toJson }}

- ``cpuPodSpec``: Sets the default pod spec for all non-GPU tasks. See :ref:`custom-pod-specs`

:ref:`helm-config-reference` under ``taskContainerDefaults.cpuPodSpec`` and

  1. For any user facing change we need a release note. We have some instructions on how to add them here

https://github.com/determined-ai/determined/tree/main/docs/release-notes

@@ -494,6 +521,13 @@ func TestPodSpecsDefaultMerging(t *testing.T) {
},
},
},
CheckpointGCPodSpec: &k8sV1.Pod{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this test fail without this change?

This is testing merging into expconf so I don't know if we need this change

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. TestPodSpecsDefaultMerging doesn't need CheckpointGCPodSpec while merging.
Tests do not fail without this.

@@ -59,6 +59,10 @@ stages:
enable_cors: true
observability:
enable_prometheus: true
task_container_defaults:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably remove this change since this devcluster won't be on Kubernetes and most developers won't use this

if someone needs to test it they can add it themself

}

for testCase, testVars := range tests {
t.Run(testCase, func(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test looks great, nice job

@ShreyaLnuHpe ShreyaLnuHpe changed the title chore: add option for Checkpoint_GC pod spec in task container defaults feat: add option for Checkpoint_GC pod spec in task container defaults May 24, 2024
@determined-ci determined-ci added the documentation Improvements or additions to documentation label May 25, 2024
@determined-ci determined-ci requested a review from a team May 25, 2024 02:02
**New Features**

- Kubernetes: Add a feature where Determined offers the users to provide custom Checkpoint GC pod spec for
individual experiment. You can specify a custom ``checkpointGcPodSpec`` directly in the experiment
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can specify a custom ``checkpointGcPodSpec`` directly in the experiment configuration file

This is for task container defaults only right and not in the experiment yaml?

Custom CheckpointGC Pod Specs
*******************************

Determined also provides a way to configure CheckpointGC pod specs for individual experiment. You can
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment about not how we don't let you configure per expeirment

@@ -264,6 +264,10 @@ taskContainerDefaults:
# cpuPodSpec:
# gpuPodSpec:

# Configure a custom CheckpointGC pod spec for experiments. If a pod spec is defined for an
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment about how we don't support configuring this per experiment

@@ -280,6 +280,12 @@ def configure_helm(args: argparse.Namespace) -> None:

helm_values["taskContainerDefaults"]["cpuPodSpec"] = cpu_pod_spec

checkpoint_gc_pod_spec = make_spec(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm a little worried this will overwrite the default value

pod_spec = {"apiVersion": "v1", "kind": "Pod"}

so when you start a cluster with a checkpoint gc spec, this will add one even if a user has not explicitly given one. This would mean we would no longer use the experiment's pod spec in the default case.

The broader point is gke/cli.py is experimental and hasn't been really been updated in like 2 years. I would avoid spending any time on this and just not make any changes to this file.

@ShreyaLnuHpe ShreyaLnuHpe requested a review from a team as a code owner May 29, 2024 00:24
@ShreyaLnuHpe ShreyaLnuHpe requested a review from hkang1 May 29, 2024 00:24
@determined-ci
Copy link
Collaborator

Docsite preview being generated for this PR.
You can (eventually) find the generated docsite here.

- Kubernetes: Add a feature where Determined offers the users to provide custom Checkpoint GC pod spec.
This configuration is done using the ``task_container_defaults.checkpoint_gc_pod_spec`` field
within your ``value.yaml`` file. When a custom CheckpointGC pod spec is defined, it takes
precedence over the default CheckpointGC pod specifications. This provides flexibility to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be extra clear about what "default CheckpointGC pod specfications" means

I would mention that previous it would use the experiments pod spec but it can be overriden to use a configurable one with this field


Determined also provides a way to configure CheckpointGC pod specs. This configuration is done using
the ``task_container_defaults.checkpoint_gc_pod_spec`` field within your ``value.yaml`` file. When a
custom CheckpointGC pod spec is defined, it takes precedence over the default CheckpointGC pod
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment with the release note about clarifying what "default checkpointGC pod" refers to

**New Features**

- Kubernetes: Add a feature where Determined offers the users to provide custom Checkpoint GC pod spec.
This configuration is done using the ``task_container_defaults.checkpoint_gc_pod_spec`` field
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the value.yaml expects taskContainerDefaults.checkpointGcSpec

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completely missed it. Good catch!

@@ -37,10 +37,16 @@ func (g GCCkptSpec) ToTaskSpec() TaskSpec {
// Set Environment.
// Keep only the EnvironmentVariables provided by the experiment's config.
envVars := g.LegacyConfig.Environment.EnvironmentVariables()

podSpec := g.LegacyConfig.Environment.PodSpec()
if g.Base.TaskContainerDefaults.CheckpointGCPodSpec != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just to be super safe here, I don't think this can be nil but inside MergeIntoExpConfig there is a nil check for it.

Suggested change
if g.Base.TaskContainerDefaults.CheckpointGCPodSpec != nil {
if g.Base.TaskContainerDefaults != nil && g.Base.TaskContainerDefaults.CheckpointGCPodSpec != nil {

Copy link
Contributor Author

@ShreyaLnuHpe ShreyaLnuHpe May 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As g.Base.TaskContainerDefaults is of type mismatch with nil, it will be better to create a new func to check if it is nil. Something like this:

if !g.Base.TaskContainerDefaults.IsNil() && g.Base.TaskContainerDefaults.CheckpointGCPodSpec != nil

func (c *TaskContainerDefaultsConfig) IsNil() bool {  
	return c == nil
}

Other option we have is to leverage the validate function and check for err != nil. Downside will be that err might not be empty because of other parameters being checked within that function.

err := g.Base.TaskContainerDefaults.Validate()
if err != nil && g.Base.TaskContainerDefaults.CheckpointGCPodSpec != nil {

Let me know what do you think

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh my bad since g.Base.TaskContainersDefaultConfig is not a pointer we don't need to do anything

The code was good as is. I made a mistake thinking it was a pointer.

The IsNil check won't add anything since it will always not be nil.

@@ -264,6 +264,10 @@ taskContainerDefaults:
# cpuPodSpec:
# gpuPodSpec:

# Configure a custom CheckpointGC pod spec for experiments. When a custom CheckpointGC pod spec
# is defined, it takes precedence over the default CheckpointGC pod specifications. See
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment about the "default spec"

Copy link
Contributor

@NicholasBlaskey NicholasBlaskey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

last comment is about not needing the IsNil

i also requested a review from @tara-det-ai since there are doc changes

Copy link
Member

@tara-det-ai tara-det-ai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@NicholasBlaskey NicholasBlaskey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Great work on this PR!

@ShreyaLnuHpe ShreyaLnuHpe merged commit 8a9839a into main May 31, 2024
85 of 98 checks passed
@ShreyaLnuHpe ShreyaLnuHpe deleted the shreya/addGcPodSpecField branch May 31, 2024 16:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants