
chore: master config updates for multirm [RM-3, RM-4, RM-5, RM-7, RM-29] #8831

Merged
5 commits merged on Feb 23, 2024

Conversation

NicholasBlaskey
Contributor

@NicholasBlaskey NicholasBlaskey commented Feb 12, 2024

Description

Performs master config migrations for multirm:

  1. Adds name and metadata to the resource managers config.

  2. Moves the rm config from resource_manager to resource_managers, which is now a list.

  3. Moves resource_pools from the root level to under each element of the resource_managers array.

Does this in a backwards compatible way.
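The migration in steps 1–3 above can be sketched as a before/after master config. This is an illustrative sketch based on the description; the exact key layout in the shipped config may differ.

```yaml
# Before: a single resource manager, with pools at the root level.
resource_manager:
  type: agent
resource_pools:
  - pool_name: default

# After: a list of named resource managers, each owning its pools.
resource_managers:
  - type: agent
    name: default       # new field
    metadata: {}        # new field
    resource_pools:
      - pool_name: default
```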

Test Plan

Should be covered by unit tests.

For Helm changes, use these values:

maxSlotsPerPod: 1
additional_resource_managers:
  - resource_manager:
      name: nick
      type: kubernetes
      max_slots_per_pod: 2
    resource_pools:
      - pool_name: test

and ensure that the master fails with:

panic: expected only one resource manager got 3 instead

Commentary (optional)

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

Ticket

@cla-bot cla-bot bot added the cla-signed label Feb 12, 2024

netlify bot commented Feb 12, 2024

Deploy Preview for determined-ui ready!

Name Link
🔨 Latest commit 4482e95
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/65d8b4b4664c630007caa8ae
😎 Deploy Preview https://deploy-preview-8831--determined-ui.netlify.app/


codecov bot commented Feb 12, 2024

Codecov Report

Attention: Patch coverage is 80.66667%, with 29 lines in your changes missing coverage. Please review.

Project coverage is 47.55%. Comparing base (72d54be) to head (4482e95).
Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8831      +/-   ##
==========================================
+ Coverage   47.49%   47.55%   +0.06%     
==========================================
  Files        1065     1065              
  Lines      170236   170298      +62     
  Branches     2241     2241              
==========================================
+ Hits        80847    80987     +140     
+ Misses      89231    89153      -78     
  Partials      158      158              
Flag Coverage Δ
backend 43.37% <80.66%> (+0.30%) ⬆️
harness 63.79% <ø> (ø)
web 42.54% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
master/internal/experiment.go 26.70% <100.00%> (ø)
...ster/internal/rm/agentrm/agent_resource_manager.go 18.56% <ø> (ø)
...nal/rm/kubernetesrm/kubernetes_resource_manager.go 0.00% <ø> (ø)
master/internal/spec_util.go 56.32% <100.00%> (-0.99%) ⬇️
master/internal/config/resource_config.go 94.64% <94.11%> (+47.02%) ⬆️
master/internal/rm/agentrm/agents.go 32.43% <0.00%> (-0.91%) ⬇️
master/internal/config/resource_manager_config.go 71.59% <73.68%> (+27.14%) ⬆️
master/internal/config/config.go 59.64% <89.06%> (+11.68%) ⬆️
master/internal/core.go 2.26% <0.00%> (-0.01%) ⬇️

... and 7 files with indirect coverage changes

@NicholasBlaskey NicholasBlaskey changed the title Nickb multirm config chore: support multiple resource manage config items [DET-10140, DET-10140] Feb 12, 2024
@NicholasBlaskey NicholasBlaskey changed the title chore: support multiple resource manage config items [DET-10140, DET-10140] chore: support multiple resource manage config items [DET-10140, DET-10141] Feb 13, 2024
@NicholasBlaskey NicholasBlaskey changed the title chore: support multiple resource manage config items [DET-10140, DET-10141] chore: master config updates for multirm [DET-10140, DET-10141, DET-10142, DET-10143] Feb 13, 2024
@NicholasBlaskey NicholasBlaskey changed the title chore: master config updates for multirm [DET-10140, DET-10141, DET-10142, DET-10143] chore: master config updates for multirm [RM-3, RM-4, RM-5, RM-7] Feb 16, 2024
@NicholasBlaskey NicholasBlaskey marked this pull request as ready for review February 16, 2024 18:50
@NicholasBlaskey NicholasBlaskey requested a review from a team as a code owner February 16, 2024 18:50
@@ -10,33 +11,87 @@ import (
"github.com/determined-ai/determined/master/pkg/union"
)

const defaultResourcePoolName = "default"
const (
defaultResourceManagerName = "defaultrm"
Contributor Author

thoughts on this name?

Contributor

can we abbreviate this to defaultRMName and add a comment above it that this refers to Resource Manager?
Same below with 'defaultRPName'

Contributor Author

oh I meant the value "defaultrm" that will be shown in the webui for single resource manager clusters

Do we think defaultrm is fine? Or maybe just default?

Contributor

@carolinaecalderon carolinaecalderon Feb 20, 2024

Ah I see. I prefer default -- but either is fine. @stoksc ?

}

// Pools returns pools for config.
func (r *ResourceManagerConfigV1) Pools() []ResourcePoolConfig {
Contributor Author

This is why I didn't just put ResourcePools or Name on ResourceManagerConfigV1

"side note: if you can avoid common fields on union structs, you should. E.g. I would recommend avoiding putting AdditionalFluentOutputs where it is; if it's common to both subfields, just put it in both subfields. But common fields on union structs are really hard to deal with in json-schema, and all-around just add mechanics to the system that are hard to think about."

This is a two-year-old, unrelated thread from Ryan, so I don't know if it still makes sense to follow it.


Contributor

cool that you remembered that the thread could be applicable here -- I don't see a problem with following it

@@ -58,13 +113,17 @@ type AgentResourceManagerConfig struct {
// Deprecated: use DefaultComputeResourcePool instead.
DefaultGPUResourcePool string `json:"default_gpu_resource_pool,omitempty"`

Name string `json:"name"`
Metadata string `json:"metadata"`
Contributor Author

should we enforce anything about metadata? Like it being valid json?

Contributor

I'd default to following the precedent set by the RawAttributes field in the SCIMUser struct.
It looks like the field is declared as RawAttributes map[string]interface{} `bun:"raw_attributes" json:"-"`, and it looks like we do require it to be valid json:

	if err = json.Unmarshal(body, &u.RawAttributes); err != nil {
		return nil, newBadRequestError(err)
	}

Contributor Author

made metadata a map[string]any

@NicholasBlaskey NicholasBlaskey changed the title chore: master config updates for multirm [RM-3, RM-4, RM-5, RM-7] chore: master config updates for multirm [RM-3, RM-4, RM-5, RM-7, RM-29] Feb 20, 2024
Contributor

@carolinaecalderon carolinaecalderon left a comment

LGTM, I have a few questions, just for my understanding.

Comment on lines +192 to +198
var configCopy Config
if err := copier.CopyWithOption(
&configCopy, &c, copier.Option{DeepCopy: true, IgnoreEmpty: true},
); err != nil {
return nil, fmt.Errorf("copying config: %w", err)
}

Contributor

question: why do you want to copy the config before you make it printable?

Contributor Author

The way this code worked before this change was kind of a minefield for a terrible bug.

We were editing a shallow copy of the config, so it was really easy to accidentally overwrite the actual config. Masking the other pools is pretty subtle to do on a shallow copy.
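The shallow-copy hazard described here is easy to reproduce in plain Go: assigning a struct copies slice headers, not their backing arrays, which is why the PR reaches for a deep copy (via github.com/jinzhu/copier with DeepCopy: true). A self-contained sketch with hypothetical config shapes, not Determined's actual types:

```go
package main

import "fmt"

// Hypothetical config shapes for illustration only.
type pool struct{ Name string }
type config struct{ Pools []pool }

func main() {
	orig := config{Pools: []pool{{Name: "default"}}}

	// Shallow copy: the struct is copied, but Pools still points at the
	// same backing array, so "masking" the copy clobbers the original.
	shallow := orig
	shallow.Pools[0].Name = "********"
	fmt.Println(orig.Pools[0].Name) // the original was overwritten

	// Deep copy: clone the slice first, then mutate safely.
	orig = config{Pools: []pool{{Name: "default"}}}
	deep := config{Pools: append([]pool(nil), orig.Pools...)}
	deep.Pools[0].Name = "********"
	fmt.Println(orig.Pools[0].Name) // still "default"
}
```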

Comment on lines +257 to +259
// We must resolve resources before we apply pool defaults.
if err := c.ResolveResource(); err != nil {
return err
Contributor

question: why did we previously have this change after applying pool defaults?

Contributor Author

I think before it just didn't matter, because applyBackwardsCompatibility would add a default config/scheduler, so you would never hit the defaultAgentRM code path in ResolveResources.

I think this is still the case, but it came up in my tests, and I think it is more correct like this.

Comment on lines -299 to -304
case rp.AgentReattachEnabled && c.ResourceManager.KubernetesRM != nil:
errs = append(errs, fmt.Errorf(
"agent_reattach_enabled does not impact Kubernetes resources behavior; "+
"reattach is always enabled for Kubernetes resource pools",
))
case rp.AgentReattachEnabled:
Contributor

why do we want to get rid of this?

Contributor Author

The warning will still be there but it will just be

				"agent_reattach_enabled is set for resource pool %s but will be ignored; "+
					"as of 0.21.0 this feature is always on", rp.PoolName,

instead of a specific message for Kubernetes

I can change it back, but it has been a long time, so I think it is unnecessary to have a different message for k8s.

Contributor

@stoksc stoksc left a comment

lgtm modulo feedback

helm/charts/determined/templates/master-config.yaml (outdated, resolved)
@@ -171,6 +171,12 @@ stringData:
default_aux_resource_pool: {{.Values.defaultAuxResourcePool}}
default_compute_resource_pool: {{.Values.defaultComputeResourcePool}}


{{- if .Values.additional_resource_managers}}
Contributor

don't forget a ticket to add this config to the Values.yaml, commented out, as an example (as you pointed out, once we are feature complete)


}
if maxSlotsPerPod := *c.ResourceManager.KubernetesRM.MaxSlotsPerPod; maxSlotsPerPod < 0 {
return fmt.Errorf("max_slots_per_pod must be >= 0 got %d", maxSlotsPerPod)
// TODO(multirm) change behavior on max slots per pod.
Contributor

link ticket

))
case rp.AgentReattachEnabled:
for _, r := range c.ResourceManagers() {
for _, rp := range r.ResourcePools {
Contributor

we should put the "if reattach enabled is true" check back in so we don't print deprecations for each pool no matter the config

Contributor Author

this is why we should write those super simple unit tests
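The bug the reviewer caught: without the flag check, the deprecation message prints once per pool regardless of configuration. A sketch of the guarded loop, with hypothetical trimmed-down types (the warning text is the one quoted earlier in the thread):

```go
package main

import "fmt"

// Hypothetical, trimmed-down config shapes for illustration.
type resourcePool struct {
	PoolName             string
	AgentReattachEnabled bool
}
type resourceManager struct{ ResourcePools []resourcePool }

// deprecationWarnings warns only for pools that actually set the flag,
// rather than unconditionally for every pool.
func deprecationWarnings(rms []resourceManager) []string {
	var warns []string
	for _, r := range rms {
		for _, rp := range r.ResourcePools {
			if !rp.AgentReattachEnabled { // the guard the review asked to restore
				continue
			}
			warns = append(warns, fmt.Sprintf(
				"agent_reattach_enabled is set for resource pool %s but will be ignored; "+
					"as of 0.21.0 this feature is always on", rp.PoolName))
		}
	}
	return warns
}

func main() {
	rms := []resourceManager{{ResourcePools: []resourcePool{
		{PoolName: "a", AgentReattachEnabled: true},
		{PoolName: "b"},
	}}}
	fmt.Println(len(deprecationWarnings(rms))) // only pool "a" warns
}
```

A unit test over this function is exactly the "super simple unit test" the author mentions: two pools in, one warning out.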

@@ -438,7 +451,11 @@ func ReadRMPreemptionStatus(rpName string) bool {
}

func readRMPreemptionStatus(config *Config, rpName string) bool {
for _, rpConfig := range config.ResourcePools {
// TODO(multirm) make this be correct for len(resourceManagers) > 0 by taking
Contributor

ticket this

} else {
poolNames[rp.PoolName] = true
agentRMCount := 0
seenResourceManagerNames := map[string]bool{}
Contributor

nit: map style and naming style not consistent with code around it


@@ -60,6 +86,9 @@ type AgentResourceManagerConfig struct {

RequireAuthentication bool `json:"require_authentication"`
ClientCA string `json:"client_ca"`

Name string `json:"name"`
Metadata map[string]any `json:"metadata"`
Contributor

maybe just map[string]string

I'm trying to avoid eventually seeing rm.Metadata["my-field"].(type)

@@ -133,6 +163,9 @@ type KubernetesResourceManagerConfig struct {
DefaultAuxResourcePool string `json:"default_aux_resource_pool"`
DefaultComputeResourcePool string `json:"default_compute_resource_pool"`
NoDefaultResourcePools bool `json:"no_default_resource_pools"`

Name string `json:"name"`
Metadata map[string]any `json:"metadata"`
Contributor

ditto

Comment on lines 641 to 642
c, ok := m.config.GetAgentRMConfig()
agentRM := c.ResourceManager.AgentRM
Contributor

very uncomfortable that there is an access into c before ok is checked
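The discomfort is the classic comma-ok ordering bug: dereferencing fields of the returned value before checking ok means you may be reading a zero value with nil pointers inside. A sketch of the safe pattern, with hypothetical stand-in types (not the real GetAgentRMConfig):

```go
package main

import "fmt"

// Hypothetical stand-ins for the real config types.
type agentRM struct{ Name string }
type rmUnion struct{ AgentRM *agentRM }
type masterConfig struct{ ResourceManager rmUnion }

// getAgentRMConfig mimics a comma-ok accessor: on the false branch it
// returns a zero value whose AgentRM pointer is nil.
func getAgentRMConfig(hasAgentRM bool) (masterConfig, bool) {
	if !hasAgentRM {
		return masterConfig{}, false
	}
	return masterConfig{ResourceManager: rmUnion{AgentRM: &agentRM{Name: "default"}}}, true
}

func main() {
	// Check ok before touching c; reading c.ResourceManager.AgentRM.Name
	// first would dereference a nil pointer whenever ok is false.
	if c, ok := getAgentRMConfig(false); ok {
		fmt.Println(c.ResourceManager.AgentRM.Name)
	} else {
		fmt.Println("no agent RM configured")
	}
}
```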

err = m.checkIfRMDefaultsAreUnbound(m.config.ResourceManager)
if err != nil {
return fmt.Errorf("could not validate cluster default resource pools: %s", err.Error())
for _, r := range m.config.ResourceManagers() {
Contributor

should probably discuss defaulting with product

Contributor Author

I'm down to bring it up

I do think the simplest solution is to just default to the root level resource manager, default_compute_pool / default_aux_pool.

@NicholasBlaskey NicholasBlaskey merged commit 7bb9dbc into main Feb 23, 2024
81 of 91 checks passed
@NicholasBlaskey NicholasBlaskey deleted the nickb_multirm_config branch February 23, 2024 16:30