-
Notifications
You must be signed in to change notification settings - Fork 354
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore: master config updates for multirm [RM-3, RM-4, RM-5, RM-7, RM-29] #8831
Conversation
✅ Deploy Preview for determined-ui ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #8831 +/- ##
==========================================
+ Coverage 47.49% 47.55% +0.06%
==========================================
Files 1065 1065
Lines 170236 170298 +62
Branches 2241 2241
==========================================
+ Hits 80847 80987 +140
+ Misses 89231 89153 -78
Partials 158 158
Flags with carried forward coverage won't be shown. Click here to find out more.
|
d3c2e54
to
f56d5c5
Compare
73157b4
to
7be1e36
Compare
@@ -10,33 +11,87 @@ import ( | |||
"github.com/determined-ai/determined/master/pkg/union" | |||
) | |||
|
|||
const defaultResourcePoolName = "default" | |||
const ( | |||
defaultResourceManagerName = "defaultrm" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thoughts on this name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we abbreviate this to defaultRMName
and add a comment above it that this refers to Resource Manager?
Same below with 'defaultRPName'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh I meant the value "defaultrm"
that will be shown in the webui for single resource manager clusters
Do we think defaultrm
is fine? Or maybe just default
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see. I prefer default
-- but either is fine. @stoksc ?
} | ||
|
||
// Pools returns pools for config. | ||
func (r *ResourceManagerConfigV1) Pools() []ResourcePoolConfig { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is why I didn't just put ResourcePools
or Name
on ResourceManagerConfigV1
"side note: if you can avoid common fields on union structs, you should. E.g. I would recommend avoiding putting AdditionalFluentOutputs where it is; if it's common to both subfields, just put it in both subfields. But common fields on union structs are really hard to deal with in json-schema, and all-around just add mechanics to the system that are hard to think about."
This is a 2 year old unrelated thread from Ryan, so I don't know if it still makes sense to follow this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool that you remembered that the thread could be applicable here -- I don't see a problem with following it
@@ -58,13 +113,17 @@ type AgentResourceManagerConfig struct { | |||
// Deprecated: use DefaultComputeResourcePool instead. | |||
DefaultGPUResourcePool string `json:"default_gpu_resource_pool,omitempty"` | |||
|
|||
Name string `json:"name"` | |||
Metadata string `json:"metadata"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we enforce anything about metadata? Like it being valid json?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd default to following the precedent set by the RawAttributes field in the SCIMUser struct
Looks like RawAttributes map[string]interface{}
bun:"raw_attributes" json:"-"``, and looks like we do require it to be valid json
if err = json.Unmarshal(body, &u.RawAttributes); err != nil {
return nil, newBadRequestError(err)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
made metadata a map[string]any
5343ec6
to
8624b04
Compare
3f39ba1
to
20850d6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, I have a few questions, just for my understanding.
var configCopy Config | ||
if err := copier.CopyWithOption( | ||
&configCopy, &c, copier.Option{DeepCopy: true, IgnoreEmpty: true}, | ||
); err != nil { | ||
return nil, fmt.Errorf("copying config: %w", err) | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: why do you want to copy the config before you make it printable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way this code works before this change is kinda a minefield for a terrible bug
We are editing a shallow copy of the config so it is really easy for us to accidentally overwrite the actual config. I think masking the other pools is pretty subtle to do on a shallow copy.
// We must resolve resources before we apply pool defaults. | ||
if err := c.ResolveResource(); err != nil { | ||
return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: why did we previously have this change after applying pool defaults?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think before it just didn't matter because applyBackwardsCompatibility
would add a default config/scheduler so you would never hit the code path of ResolveResources
defaultAgentRM
.
I think this also is still the case, but it came up in my tests and I think it is more correct like this.
case rp.AgentReattachEnabled && c.ResourceManager.KubernetesRM != nil: | ||
errs = append(errs, fmt.Errorf( | ||
"agent_reattach_enabled does not impact Kubernetes resources behavior; "+ | ||
"reattach is always enabled for Kubernetes resource pools", | ||
)) | ||
case rp.AgentReattachEnabled: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we want to get rid of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The warning will still be there but it will just be
"agent_reattach_enabled is set for resource pool %s but will be ignored; "+
"as of 0.21.0 this feature is always on", rp.PoolName,
instead of a specific message for Kubernetes
I can change it back but it has been a long time so I think it unnecessary to have a different message for k8s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm modulo feedback
@@ -171,6 +171,12 @@ stringData: | |||
default_aux_resource_pool: {{.Values.defaultAuxResourcePool}} | |||
default_compute_resource_pool: {{.Values.defaultComputeResourcePool}} | |||
|
|||
|
|||
{{- if .Values.additional_resource_managers}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't forget a ticket to add this config to the Values.yaml, commented out, as an example (as you pointed out, once we are feature complete)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
master/internal/config/config.go
Outdated
} | ||
if maxSlotsPerPod := *c.ResourceManager.KubernetesRM.MaxSlotsPerPod; maxSlotsPerPod < 0 { | ||
return fmt.Errorf("max_slots_per_pod must be >= 0 got %d", maxSlotsPerPod) | ||
// TODO(multirm) change behavior on max slots per pod. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
link ticket
)) | ||
case rp.AgentReattachEnabled: | ||
for _, r := range c.ResourceManagers() { | ||
for _, rp := range r.ResourcePools { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should put the if reattach enable is true check back in so we don't print deprecations for each pool no matter the config
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is why we should write those super simple unit tests
master/internal/config/config.go
Outdated
@@ -438,7 +451,11 @@ func ReadRMPreemptionStatus(rpName string) bool { | |||
} | |||
|
|||
func readRMPreemptionStatus(config *Config, rpName string) bool { | |||
for _, rpConfig := range config.ResourcePools { | |||
// TODO(multirm) make this be correct for len(resourceManagers) > 0 by taking |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ticket this
} else { | ||
poolNames[rp.PoolName] = true | ||
agentRMCount := 0 | ||
seenResourceManagerNames := map[string]bool{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: map style and naming style not consistent with code around it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style guide agrees with make
for empty maps
https://github.com/uber-go/guide/blob/master/style.md#initializing-maps
@@ -60,6 +86,9 @@ type AgentResourceManagerConfig struct { | |||
|
|||
RequireAuthentication bool `json:"require_authentication"` | |||
ClientCA string `json:"client_ca"` | |||
|
|||
Name string `json:"name"` | |||
Metadata map[string]any `json:"metadata"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe just map[string]string
im trying to avoid eventually seeing rm.Metadata['my-field'].(type)
@@ -133,6 +163,9 @@ type KubernetesResourceManagerConfig struct { | |||
DefaultAuxResourcePool string `json:"default_aux_resource_pool"` | |||
DefaultComputeResourcePool string `json:"default_compute_resource_pool"` | |||
NoDefaultResourcePools bool `json:"no_default_resource_pools"` | |||
|
|||
Name string `json:"name"` | |||
Metadata map[string]any `json:"metadata"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
master/internal/core.go
Outdated
c, ok := m.config.GetAgentRMConfig() | ||
agentRM := c.ResourceManager.AgentRM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very uncomfortable that there is an access into c
before ok
is checked
err = m.checkIfRMDefaultsAreUnbound(m.config.ResourceManager) | ||
if err != nil { | ||
return fmt.Errorf("could not validate cluster default resource pools: %s", err.Error()) | ||
for _, r := range m.config.ResourceManagers() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should probably discuss defaulting with product
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm down to bring it up
I do think the simplest solution is to just default to the root level resource manager, default_compute_pool / default_aux_pool.
e41a90b
to
4482e95
Compare
Description
Does master config migrations for multirm
Adds
name
andmetadata
to resource managers configMoves rm config from
resource_manager
toresource_managers
and is now a listMoves
resource_pools
from root level to under each element of theresource_managers
array.Does this in a backwards compatible way.
Test Plan
Should be covered by unit tests.
For helm changes use this values and ensure that the master fails with
Commentary (optional)
Checklist
docs/release-notes/
.See Release Note for details.
Ticket