Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: revert multirm refactors #8962

Merged
merged 4 commits into from
Mar 7, 2024
Merged

Conversation

carolinaecalderon
Copy link
Contributor

@carolinaecalderon carolinaecalderon commented Mar 6, 2024

This reverts commit c3012ff, 6857ecf, 7c6bec9.

Description

Since we're now requiring globally unique resource pool names, we can remove the refactors that added an extra Resource Manager field to structs.

Test Plan

Pass existing tests.

Commentary (optional)

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

Ticket

RM-73

@cla-bot cla-bot bot added the cla-signed label Mar 6, 2024
Copy link

netlify bot commented Mar 6, 2024

Deploy Preview for determined-ui canceled.

Name Link
🔨 Latest commit 12819be
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/65e9da1d051e880008eb662c

This reverts commit c3012ff.

Revert "chore: refactor ResourceManager interface for multirm (#8847)"

This reverts commit 6857ecf.

Revert "chore: refactor proto, schema, and jobservice for multiRM (#8875)"

This reverts commit 7c6bec9.
@carolinaecalderon carolinaecalderon marked this pull request as ready for review March 6, 2024 18:38
@carolinaecalderon carolinaecalderon requested review from a team as code owners March 6, 2024 18:38
Copy link
Contributor

@kkunapuli kkunapuli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link

codecov bot commented Mar 6, 2024

Codecov Report

Attention: Patch coverage is 29.50820% with 258 lines in your changes are missing coverage. Please review.

Project coverage is 47.48%. Comparing base (68017dd) to head (12819be).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8962      +/-   ##
==========================================
- Coverage   47.55%   47.48%   -0.07%     
==========================================
  Files        1168     1168              
  Lines      176706   176286     -420     
  Branches     2356     2356              
==========================================
- Hits        84026    83714     -312     
+ Misses      92522    92414     -108     
  Partials      158      158              
Flag Coverage Δ
backend 42.47% <29.90%> (-0.28%) ⬇️
harness 64.01% <0.00%> (+0.07%) ⬆️
web 42.86% <28.57%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
master/internal/api_command.go 47.08% <100.00%> (-0.26%) ⬇️
master/internal/api_tasks.go 47.83% <100.00%> (ø)
master/internal/checkpoint_gc.go 65.93% <100.00%> (-2.11%) ⬇️
master/internal/command/command.go 67.82% <100.00%> (-0.28%) ⬇️
master/internal/config/resource_config.go 94.64% <ø> (ø)
master/internal/config/resource_manager_config.go 71.59% <ø> (ø)
master/internal/core_experiment.go 61.16% <100.00%> (-0.18%) ⬇️
master/internal/rm/agentrm/agents.go 63.30% <ø> (ø)
master/internal/rm/kubernetesrm/pods.go 19.07% <ø> (ø)
master/internal/sproto/jobs.go 17.39% <ø> (ø)
... and 36 more

... and 3 files with indirect coverage changes

@@ -52,13 +52,6 @@
"maximum": 99,
"default": null
},
"resource_manager": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd have expected that this results in an invalid experiment config that had a section

resources:
  resource_manager: 'agent'

but when I try running an experiment with a section like this it seems to launch just fine. Is this patch backwards compatible in a way I don't understand?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh I'll double-check -- I ran 'git revert' on the three commits I wanted to revert, so this is unexpected

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change wasn't actually publicly released. does this cause any issues..

are there places with serialize then deserialize an experiment config that are going to pick this up and then explode?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@carolinaecalderon to test this you may want to run an experiment before this revert, pause it, shutdown your cluster, add this revert, recompile, restart the cluster and see if resuming the experiment works.

if that works i think this is safe even though it isn't actually backward compatible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wes-turner it's possible that when you tested this I hadn't re-generated the schemas, because when I run a test following @stoksc 's proposed flow, I get this error:
ERRO[2024-03-06T15:15:46-05:00] failed to restore experiment: 1 error="cannot restore experiment 1 with unparsable config"
tldr; the change is not backwards compatible since the experiment can't resume. However, since this change was not announced to the customer and is being reverted one release later, would it be fine to remove it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait no, that is exactly the error I was concerned about. You launched an experiment that did not use the multirm features and upgraded and it broke.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change wasn't actually publicly released. does this cause any issues..

Nice, nice, nice. No issues, no deserialization that I know about.

Thanks. The public deprecation was exactly my concern.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out that doing this does break experiment restore, so @carolinaecalderon is going to put up a fix. Really good you asked this @wes-turner, thanks.

Copy link
Contributor

@stoksc stoksc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Contributor

@wes-turner wes-turner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@wes-turner wes-turner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just BTW, I don't think we need backwards compatibility since this wasn't publicly released.

@carolinaecalderon
Copy link
Contributor Author

carolinaecalderon commented Mar 7, 2024

Just BTW, I don't think we need backwards compatibility since this wasn't publicly released.

I think we do, because the original PR was included in 0.29.0 release. To fix this, I added a new db migration to drop the 'resource_manager' column from any experiments that might have it

@carolinaecalderon carolinaecalderon merged commit bf21896 into main Mar 7, 2024
82 of 87 checks passed
@carolinaecalderon carolinaecalderon deleted the carolinac/revert-rm-changes branch March 7, 2024 15:35
@@ -0,0 +1,2 @@
UPDATE experiments
SET config = config #- '{resource_manager}'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this down migration doesn't make sense to me, can you explain it? i don't think the down migrations matter, but i'm curious.

maxrussell pushed a commit that referenced this pull request Mar 21, 2024
* Revert "chore: add multirm module to ResourceManager (#8857)"

This reverts commit c3012ff.

Revert "chore: refactor ResourceManager interface for multirm (#8847)"

This reverts commit 6857ecf.

Revert "chore: refactor proto, schema, and jobservice for multiRM (#8875)"

This reverts commit 7c6bec9.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants