-
Notifications
You must be signed in to change notification settings - Fork 24.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ML][Transforms] add wait_for_checkpoint flag to stop #47935
[ML][Transforms] add wait_for_checkpoint flag to stop #47935
Conversation
Pinging @elastic/ml-core (:ml/Transform) |
run elasticsearch-ci/2 |
@@ -14,6 +14,7 @@ | |||
import java.util.Locale; | |||
|
|||
public enum TransformTaskState implements Writeable { | |||
// TODO 8.x add a `STOPPING` state and BWC handling in ::fromString |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could do this earlier than 8.x given the feature is beta. It's too late for 7.5 but it would probably be worth adding STOPPING
to 7.6 before making the feature GA.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. those TODO's are in danger of being forgotten. If there is a possibility to do it now, let's do it.
However, I am not fully convinced. STOPPING
is usually used when there are technical reasons, e.g. if we need an intermediate state between OPEN/STARTED/RUNNING
and STOPPED
, usually due to resources that require time to stop. This isn't the case here, but this is a feature.
With other words, I find the current solution more appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q: Do we return something in _stats after a call to _stop?wait_for_checkpoint=true ?
TransformStats.State
returns a user readable state and I think it still returns INDEXING
instead of STOPPING
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hendrikmuhs the state
object will have a new field called should_stop_at_checkpoint
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TransformStats.State
is what we return to the user, it's a combined value made from the indexer and task state. TransformState
is only written to/ read from the index, it's not exposed to the user since the stats endpoint has been rewritten. We could also expose a new field, but I think this isn't needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you want to adjust TransformStats.State
to say STOPPING
when should_stop_at_checkpoint == true
. Or STOPPING_AT_CHECKPOINT
? Will need to think on this...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would start with STOPPING
, note that we can revisit that decision. If we expose it as STOPPING_AT_CHECKPOINT
it's harder to go back. I doubt that STOPPING_AT_CHECKPOINT
is useful from a user perspective. For similar reasons we made the decision to hide indexer and task state and provide a simplified state value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, can't wait to finally have this.
I added some thoughts I think we need to agree on to go on with this.
@@ -28,13 +28,15 @@ protected RestChannelConsumer prepareRequest(RestRequest restRequest, NodeClient | |||
boolean waitForCompletion = restRequest.paramAsBoolean(TransformField.WAIT_FOR_COMPLETION.getPreferredName(), false); | |||
boolean force = restRequest.paramAsBoolean(TransformField.FORCE.getPreferredName(), false); | |||
boolean allowNoMatch = restRequest.paramAsBoolean(TransformField.ALLOW_NO_MATCH.getPreferredName(), false); | |||
boolean waitForCheckpoint = restRequest.paramAsBoolean(TransformField.WAIT_FOR_CHECKPOINT.getPreferredName(), false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should disallow the combination force
and waitForCheckpoint
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
force
pretty much negates the need for a waitForCheckpoint
. So, having a force
but when the transform is NOT failed, should essentially be a no-op for force
. I think reasons for this are laid out in the source issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, fully agree. My argument is that something like _stop?force=true&waitForCheckpoint=true
makes no sense. You only need force
if the transform is failed but if it's failed you can not wait for a checkpoint and it does not make sense to call force
if the transform isn't in a failed state. -> The parameters are mutually exclusive.
Now we can either simply ignore that or we disallow the combination. The problem is that if we disallow we should also disallow using force
if the transform isn't failed. I am not sure if we have precedence here, I can only think of force
, e.g. in ML jobs. I think we are lenient there, too?
@@ -54,6 +54,7 @@ | |||
* progress::docs_processed, progress::docs_indexed, | |||
* stats::exponential_avg_checkpoint_duration_ms, stats::exponential_avg_documents_indexed, | |||
* stats::exponential_avg_documents_processed | |||
* version 3 (7.6): state::should_stop_at_checkpoint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needs changes in TransformInternalIndexConstants
, to keep the numbering despite the big rename I upped it there to "3" (003) for 7.5, but I haven't updated the history entry here, sorry.
So, here it should be version 4 and TransformInternalIndexConstants.INDEX_VERSION
should become 004
} | ||
|
||
public StopTransformRequest(String id, Boolean waitForCompletion, TimeValue timeout) { | ||
public StopTransformRequest(String id, Boolean waitForCompletion, TimeValue timeout, Boolean waitForCheckpoint) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly I see 1 potential problem:
- call _stop?wait_for_completion=true&wait_for_checkpoint=true
this lets the call block
- call _stop?wait_for_checkpoint=false
that's fine, you should be able to switch wait_for_checkpoint
by calling _stop again, however: the call to 1 will return after this call and there is no indication that we did not stop at a checkpoint. I think we should add a field to the response object noting whether the api has stopped at a checkpoint or not.
(In that respect I also wonder if it should be possible to revert the decission of wait_for_checkpoint
by a call to _start)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would also be in line with a general improvement of Response objects (probably better placed in a separate PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hendrikmuhs I agree on indicating if the stop was indeed at a checkpoint or not. That information would be available via a _stats call, but I suppose we can add something here.
As for negating with a call to _start
, I do not want to complicate these interactions any further. I am against it. If somebody wants to start it again, they should call _stop?wait_for_checkpoint=false
and then _start
again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will have to think about being able to alert the user on if the _stop indeed caused the transform to stop at a checkpoint or not.
when wait_for_completion
is spinning, it is checking on every cluster state update. Since we don't store this state information inside cluster state, there is nothing indicating if the value changed or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hendrikmuhs after digging around, I don't think this is possible. There are no hooks into how the task is cleared and we consider it "stopped" when it is deleted, consequently losing insight into how it was stopped.
@@ -14,6 +14,7 @@ | |||
import java.util.Locale; | |||
|
|||
public enum TransformTaskState implements Writeable { | |||
// TODO 8.x add a `STOPPING` state and BWC handling in ::fromString |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. those TODO's are in danger of being forgotten. If there is a possibility to do it now, let's do it.
However, I am not fully convinced. STOPPING
is usually used when there are technical reasons, e.g. if we need an intermediate state between OPEN/STARTED/RUNNING
and STOPPED
, usually due to resources that require time to stop. This isn't the case here, but this is a feature.
With other words, I find the current solution more appropriate.
@@ -14,6 +14,7 @@ | |||
import java.util.Locale; | |||
|
|||
public enum TransformTaskState implements Writeable { | |||
// TODO 8.x add a `STOPPING` state and BWC handling in ::fromString |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q: Do we return something in _stats after a call to _stop?wait_for_checkpoint=true ?
TransformStats.State
returns a user readable state and I think it still returns INDEXING
instead of STOPPING
.
)); | ||
} | ||
|
||
protected void doSaveState(TransformState state, ActionListener<Void> listener) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we name this somehow different and make it private to avoid confusing it with the real doSaveState?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't think that not being an Override
and having different parameters is enough?
I will look into making it private
transformTask.stop(request.isForce(), request.isWaitForCheckpoint()); | ||
listener.onResponse(new Response(true)); | ||
} catch (ElasticsearchException ex) { | ||
listener.onFailure(ex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: as this causes often trouble, an explicit return would be good although superfluous logically here.
@elasticmachine update branch |
run elasticsearch-ci/bwc |
…b.com:benwtrent/elasticsearch into feature/ml-transform-wait_for_checkpoint-flag
…orm-wait_for_checkpoint-flag
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, 1 nit: I think we decided to go without STOPPING, so we can remove the todo's?
@@ -43,6 +43,9 @@ | |||
@Nullable | |||
private NodeAttributes node; | |||
|
|||
// TODO: 8.x this needs to be deprecated and we move towards a STOPPING TASK_STATE | |||
private final boolean shouldStopAtNextCheckpoint; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the todo can be removed?
Adds `wait_for_checkpoint` for `_stop` API.
This adds the new flag
wait_for_checkpoint
to_stop
.This is a new version of the much older (now closed) PR: #45469
The state is all persisted to the index and relies on optimistic concurrency controls.
I thought a bit more about the default value being set to
true
and that choice continues to bug me a little bit. It sort of switches expectations around the current behavior and to me, it seems that for BWC, and consistent behavior, the default value should befalse
.closes #45293