chore: check non-multiples of slots per pod for kubernetes rm [MD-403] #9393

Merged: azhou-determined merged 6 commits into main from krm-validate-slots on Jun 12, 2024

azhou-determined
Contributor

@azhou-determined azhou-determined commented May 20, 2024

Ticket

Error out earlier for Kubernetes clusters when slots requested are not a multiple of max_slots_per_pod

Description

Test Plan

On a Kubernetes cluster with max_slots_per_pod configured to a value greater than 1, submit an experiment whose slots_per_trial is not a multiple of max_slots_per_pod (e.g. slots_per_trial=2, max_slots_per_pod=3). The experiment should error out upon creation.
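
The check this test plan exercises can be sketched roughly as follows. This is a minimal illustration, not the merged code: `validateSlots` and its signature are hypothetical, and the branching is an assumption built on the `fulfillable := k.maxSlotsPerPod >= msg.Slots` line and the `if !msg.IsSingleNode` branch visible in the review diff later in this conversation.

```go
package main

import "fmt"

// validateSlots is a hypothetical stand-in for the resource pool's check;
// the name and signature are assumptions, not the merged code.
func validateSlots(slots, maxSlotsPerPod int, isSingleNode bool) error {
	// Mirrors `fulfillable := k.maxSlotsPerPod >= msg.Slots` from the diff:
	// a request that fits in one pod can always be placed.
	if slots <= maxSlotsPerPod {
		return nil
	}
	// Assumed: a single-node task cannot be split across pods.
	if isSingleNode {
		return fmt.Errorf("single-node task needs %d slots but max_slots_per_pod is %d",
			slots, maxSlotsPerPod)
	}
	// Assumed: a multi-pod task must divide evenly into full pods.
	// The maxSlotsPerPod <= 0 guard avoids a modulo-by-zero panic.
	if maxSlotsPerPod <= 0 || slots%maxSlotsPerPod != 0 {
		return fmt.Errorf("slots (%d) is not a multiple of max_slots_per_pod (%d)",
			slots, maxSlotsPerPod)
	}
	return nil
}

func main() {
	fmt.Println(validateSlots(4, 3, false)) // rejected: 4 is not a multiple of 3
	fmt.Println(validateSlots(6, 3, false)) // accepted: two pods of 3 slots each
}
```

Running the check at experiment creation, rather than at scheduling time, is what lets the request fail early instead of sitting in the queue.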

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

@azhou-determined azhou-determined requested a review from a team as a code owner May 20, 2024 16:01
@cla-bot cla-bot bot added the cla-signed label May 20, 2024

codecov bot commented May 20, 2024

Codecov Report

Attention: Patch coverage is 95.45455% with 1 line in your changes missing coverage. Please review.

Project coverage is 49.02%. Comparing base (de03909) to head (d229e78).
Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9393      +/-   ##
==========================================
+ Coverage   48.99%   49.02%   +0.02%     
==========================================
  Files        1235     1235              
  Lines      160191   160199       +8     
  Branches     2780     2780              
==========================================
+ Hits        78482    78533      +51     
+ Misses      81534    81491      -43     
  Partials      175      175              
Flag Coverage Δ
backend 43.91% <95.45%> (+0.09%) ⬆️
harness 63.96% <ø> (ø)
web 44.12% <ø> (ø)

Files Coverage Δ
master/internal/rm/kubernetesrm/resource_pool.go 73.56% <100.00%> (+1.02%) ⬆️
...nal/rm/kubernetesrm/kubernetes_resource_manager.go 30.92% <80.00%> (+2.62%) ⬆️

... and 8 files with indirect coverage changes

netlify bot commented May 20, 2024

Deploy Preview for determined-ui canceled.

Name Link
🔨 Latest commit d229e78
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/6668d9db2290e50008a08881

Contributor

@stoksc stoksc left a comment

lgtm modulo style feedback

@@ -736,6 +736,55 @@ func TestROCmPodsService(t *testing.T) {
}
}

func TestValidateResources(t *testing.T) {
Contributor

nice!

@@ -269,6 +269,10 @@ func (k *kubernetesResourcePool) ValidateResources(
k.reschedule = true

fulfillable := k.maxSlotsPerPod >= msg.Slots

if !msg.IsSingleNode {
Contributor

nit: it'd be nice to have this return specific errors rather than the caller trying to describe what could possibly go wrong in this function.

and then assignResources could reuse the error defs, too.

Contributor Author

yeah, i thought about this but didn't want to get too carried away with refactors. anyway, good call, done.

@azhou-determined azhou-determined merged commit 3afe5df into main Jun 12, 2024
84 of 97 checks passed
@azhou-determined azhou-determined deleted the krm-validate-slots branch June 12, 2024 17:12
@azhou-determined azhou-determined restored the krm-validate-slots branch June 12, 2024 19:53