Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Pause & Resume run #9129

Merged
merged 22 commits into from
Jun 6, 2024
Merged

feat: Pause & Resume run #9129

merged 22 commits into from
Jun 6, 2024

Conversation

AmanuelAaron
Copy link
Contributor

@AmanuelAaron AmanuelAaron commented Apr 9, 2024

Description

Adding pause & resume run endpoints to Runs API

Test Plan

e2e tests should pass

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

Ticket

ET-91

@cla-bot cla-bot bot added the cla-signed label Apr 9, 2024
Copy link

netlify bot commented Apr 9, 2024

Deploy Preview for determined-ui canceled.

Name Link
🔨 Latest commit 49eff23
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/6660b0d86d21f10008065894

@AmanuelAaron AmanuelAaron changed the title feat: Pause run feat: Pause & Resume run Apr 9, 2024
@AmanuelAaron AmanuelAaron marked this pull request as ready for review April 9, 2024 15:25
@AmanuelAaron AmanuelAaron requested a review from a team as a code owner April 9, 2024 15:25
Copy link

codecov bot commented Apr 9, 2024

Codecov Report

Attention: Patch coverage is 14.54545% with 141 lines in your changes missing coverage. Please review.

Project coverage is 48.96%. Comparing base (13a5142) to head (49eff23).
Report is 5 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9129      +/-   ##
==========================================
- Coverage   49.00%   48.96%   -0.04%     
==========================================
  Files        1234     1234              
  Lines      159671   159836     +165     
  Branches     2779     2778       -1     
==========================================
+ Hits        78248    78268      +20     
- Misses      81248    81393     +145     
  Partials      175      175              
Flag Coverage Δ
backend 43.69% <0.00%> (-0.10%) ⬇️
harness 63.99% <35.29%> (-0.05%) ⬇️
web 44.12% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
harness/determined/common/api/bindings.py 40.37% <35.29%> (-0.04%) ⬇️
master/internal/api_runs.go 62.30% <0.00%> (-11.26%) ⬇️

... and 5 files with indirect coverage changes


var results []*apiv1.RunActionResult
visibleIDs := set.New[int32]()
expIDs := set.New[int32]()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pausing a run is not the same as pausing a search.

we need to go back to product about this feature. as is it doesn't make sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussion we will only be pausing runs part of single-trial experiments

@stoksc
Copy link
Contributor

stoksc commented May 14, 2024

@AmanuelAaron are we going to close this PR?

Column("r.id").
ColumnExpr("COALESCE((r.archived OR e.archived OR p.archived OR w.archived), FALSE) AS archived").
ColumnExpr("r.experiment_id as exp_id").
ColumnExpr("((SELECT COUNT(*) FROM runs r WHERE e.id = r.experiment_id) > 1) as is_multitrial").
Copy link
Contributor

@stoksc stoksc Jun 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to look at the searcher type not at the number of runs. asha doesn't create all its trials at once. also it have a side benefit that it's not an agg and should be faster.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably make sure searchRuns also behaves like this so the value is consistent throughout. should i make a ticket?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@stoksc stoksc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm modulo final feedback (the comment about checking searcher type instead should be considered blocking but i dont feel the need to re-review how it is implemented).

i don't quite agree with the bff style here, i think we'd be better having pause/resume on the runs table do a search runs and then call bulk resume/pause experiments when the experiment ids for the single serachers directly. but we can agree to disagree.

}
expIDs.Insert(*cand.ExpID)
}
if filter == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes intentions clearer i think

Suggested change
if filter == nil {
if len(runIds) != 0 {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a shared bool as there is another similar check above

var results []*apiv1.RunActionResult
visibleIDs := set.New[int32]()
expIDs := set.New[int32]()
expToRun := make(map[int32][]int32)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be a single mapping or there is an error?

Suggested change
expToRun := make(map[int32][]int32)
expToRun := make(map[int32]int32)

making this clear in the code here would simplify code later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was left over from when we allowed multi-trial. Refactored now, including a few things below.

for _, expRes := range expResults {
val, ok := expToRun[expRes.ID]
if !ok {
val = []int32{-1}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-1 seems much less useful than a message saying "some invariant in the code was broken and we don't to an action on an experiment we didn't expect to"..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@AmanuelAaron AmanuelAaron merged commit 2588eea into main Jun 6, 2024
81 of 97 checks passed
@AmanuelAaron AmanuelAaron deleted the pause-run branch June 6, 2024 14:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants