Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML][Transforms] add wait_for_checkpoint flag to stop #47935

Conversation

benwtrent
Copy link
Member

This adds the new flag wait_for_checkpoint to _stop.

This is a new version of the much older (now closed) PR: #45469

The state is all persisted to the index and relies on optimistic concurrency controls.

I thought a bit more about the default value being set to true and that choice continues to bug me a little bit. It sort of switches expectations around the current behavior and to me, it seems that for BWC, and consistent behavior, the default value should be false.

closes #45293

@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (:ml/Transform)

@benwtrent
Copy link
Member Author

run elasticsearch-ci/2

@@ -14,6 +14,7 @@
import java.util.Locale;

public enum TransformTaskState implements Writeable {
// TODO 8.x add a `STOPPING` state and BWC handling in ::fromString
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do this earlier than 8.x given the feature is beta. It's too late for 7.5 but it would probably be worth adding STOPPING to 7.6 before making the feature GA.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. those TODO's are in danger of being forgotten. If there is a possibility to do it now, let's do it.

However, I am not fully convinced. STOPPING is usually used when there are technical reasons, e.g. if we need an intermediate state between OPEN/STARTED/RUNNING and STOPPED, usually due to resources that require time to stop. This isn't the case here, but this is a feature.

With other words, I find the current solution more appropriate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: Do we return something in _stats after a call to _stop?wait_for_checkpoint=true ?

TransformStats.State returns a user readable state and I think it still returns INDEXING instead of STOPPING.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hendrikmuhs the state object will have a new field called should_stop_at_checkpoint

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TransformStats.State is what we return to the user, it's a combined value made from the indexer and task state. TransformStateis only written to/ read from the index, it's not exposed to the user since the stats endpoint has been rewritten. We could also expose a new field, but I think this isn't needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hendrikmuhs

So you want to adjust TransformStats.State to say STOPPING when should_stop_at_checkpoint == true. Or STOPPING_AT_CHECKPOINT? Will need to think on this...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would start with STOPPING, note that we can revisit that decision. If we expose it as STOPPING_AT_CHECKPOINT it's harder to go back. I doubt that STOPPING_AT_CHECKPOINT is useful from a user perspective. For similar reasons we made the decision to hide indexer and task state and provide a simplified state value.

Copy link
Contributor

@hendrikmuhs hendrikmuhs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, can't wait to finally have this.

I added some thoughts I think we need to agree on to go on with this.

@@ -28,13 +28,15 @@ protected RestChannelConsumer prepareRequest(RestRequest restRequest, NodeClient
boolean waitForCompletion = restRequest.paramAsBoolean(TransformField.WAIT_FOR_COMPLETION.getPreferredName(), false);
boolean force = restRequest.paramAsBoolean(TransformField.FORCE.getPreferredName(), false);
boolean allowNoMatch = restRequest.paramAsBoolean(TransformField.ALLOW_NO_MATCH.getPreferredName(), false);
boolean waitForCheckpoint = restRequest.paramAsBoolean(TransformField.WAIT_FOR_CHECKPOINT.getPreferredName(), false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should disallow the combination force and waitForCheckpoint

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

force pretty much negates the need for a waitForCheckpoint. So, having a force but when the transform is NOT failed, should essentially be a no-op for force. I think reasons for this are laid out in the source issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, fully agree. My argument is that something like _stop?force=true&waitForCheckpoint=true makes no sense. You only need force if the transform is failed but if it's failed you can not wait for a checkpoint and it does not make sense to call force if the transform isn't in a failed state. -> The parameters are mutually exclusive.

Now we can either simply ignore that or we disallow the combination. The problem is that if we disallow we should also disallow using force if the transform isn't failed. I am not sure if we have precedence here, I can only think of force, e.g. in ML jobs. I think we are lenient there, too?

@@ -54,6 +54,7 @@
* progress::docs_processed, progress::docs_indexed,
* stats::exponential_avg_checkpoint_duration_ms, stats::exponential_avg_documents_indexed,
* stats::exponential_avg_documents_processed
* version 3 (7.6): state::should_stop_at_checkpoint
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs changes in TransformInternalIndexConstants, to keep the numbering despite the big rename I upped it there to "3" (003) for 7.5, but I haven't updated the history entry here, sorry.

So, here it should be version 4 and TransformInternalIndexConstants.INDEX_VERSION should become 004

}

public StopTransformRequest(String id, Boolean waitForCompletion, TimeValue timeout) {
public StopTransformRequest(String id, Boolean waitForCompletion, TimeValue timeout, Boolean waitForCheckpoint) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly I see 1 potential problem:

  1. call _stop?wait_for_completion=true&wait_for_checkpoint=true

this lets the call block

  1. call _stop?wait_for_checkpoint=false

that's fine, you should be able to switch wait_for_checkpoint by calling _stop again, however: the call to 1 will return after this call and there is no indication that we did not stop at a checkpoint. I think we should add a field to the response object noting whether the api has stopped at a checkpoint or not.

(In that respect I also wonder if it should be possible to revert the decission of wait_for_checkpoint by a call to _start)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would also be in line with a general improvement of Response objects (probably better placed in a separate PR)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hendrikmuhs I agree on indicating if the stop was indeed at a checkpoint or not. That information would be available via a _stats call, but I suppose we can add something here.

As for negating with a call to _start, I do not want to complicate these interactions any further. I am against it. If somebody wants to start it again, they should call _stop?wait_for_checkpoint=false and then _start again.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will have to think about being able to alert the user on if the _stop indeed caused the transform to stop at a checkpoint or not.

when wait_for_completion is spinning, it is checking on every cluster state update. Since we don't store this state information inside cluster state, there is nothing indicating if the value changed or not.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hendrikmuhs after digging around, I don't think this is possible. There are no hooks into how the task is cleared and we consider it "stopped" when it is deleted, consequently losing insight into how it was stopped.

@@ -14,6 +14,7 @@
import java.util.Locale;

public enum TransformTaskState implements Writeable {
// TODO 8.x add a `STOPPING` state and BWC handling in ::fromString
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. those TODO's are in danger of being forgotten. If there is a possibility to do it now, let's do it.

However, I am not fully convinced. STOPPING is usually used when there are technical reasons, e.g. if we need an intermediate state between OPEN/STARTED/RUNNING and STOPPED, usually due to resources that require time to stop. This isn't the case here, but this is a feature.

With other words, I find the current solution more appropriate.

@@ -14,6 +14,7 @@
import java.util.Locale;

public enum TransformTaskState implements Writeable {
// TODO 8.x add a `STOPPING` state and BWC handling in ::fromString
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: Do we return something in _stats after a call to _stop?wait_for_checkpoint=true ?

TransformStats.State returns a user readable state and I think it still returns INDEXING instead of STOPPING.

));
}

protected void doSaveState(TransformState state, ActionListener<Void> listener) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we name this somehow different and make it private to avoid confusing it with the real doSaveState?

Copy link
Member Author

@benwtrent benwtrent Oct 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't think that not being an Override and having different parameters is enough?

I will look into making it private

transformTask.stop(request.isForce(), request.isWaitForCheckpoint());
listener.onResponse(new Response(true));
} catch (ElasticsearchException ex) {
listener.onFailure(ex);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: as this causes often trouble, an explicit return would be good although superfluous logically here.

@benwtrent benwtrent added v7.6.0 and removed v7.5.0 labels Oct 16, 2019
@benwtrent
Copy link
Member Author

@elasticmachine update branch

@benwtrent
Copy link
Member Author

run elasticsearch-ci/bwc
run elasticsearch-ci/default-distro

…b.com:benwtrent/elasticsearch into feature/ml-transform-wait_for_checkpoint-flag
Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@hendrikmuhs hendrikmuhs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, 1 nit: I think we decided to go without STOPPING, so we can remove the todo's?

@@ -43,6 +43,9 @@
@Nullable
private NodeAttributes node;

// TODO: 8.x this needs to be deprecated and we move towards a STOPPING TASK_STATE
private final boolean shouldStopAtNextCheckpoint;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the todo can be removed?

@benwtrent benwtrent merged commit 451a5c0 into elastic:master Oct 28, 2019
@benwtrent benwtrent deleted the feature/ml-transform-wait_for_checkpoint-flag branch October 28, 2019 15:21
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Oct 28, 2019
benwtrent added a commit that referenced this pull request Oct 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add the option to wait until the next checkpoint when stopping a transform
5 participants