Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure no uncommitted ops when open readonly engine #41317

Closed
wants to merge 4 commits into from

Conversation

dnhatn
Copy link
Member

@dnhatn dnhatn commented Apr 17, 2019

Closing a follower index can make the cluster become red because the sanity check (i.e., max_seq_no equals to the global checkpoint) does not hold for closed follow indices. The main purpose of this sanity check is to ensure that peer recovery of closed indices can safely skip phase 2 (won't replay translog operations) without losing data. We can achieve the same goal by making sure that all existing operations are committed in the last index commit.

Relates #33888
See #38767 (comment)

@dnhatn dnhatn added >enhancement :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. v8.0.0 v7.2.0 labels Apr 17, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@tlrx tlrx mentioned this pull request Apr 17, 2019
50 tasks
Copy link
Contributor

@s1monw s1monw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments

* so peer recovery of closed indices can skip phase 2 (i.e., not replaying translog operations) without losing data.
*/
private void ensureNoUncommittedOperation(SeqNoStats seqNoStats, SegmentInfos segmentInfos) throws IOException {
// we can't enforce this check on an old index - should we prevent this engine as a recovery source?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what would that mean? Can we restrict this more ie. doesn't it only apply if we used to be a follower engine?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I pushed 8ba5ca5 to skip this check if max_seq_no equals to the global checkpoint.

this.indexCommit = Lucene.getIndexCommit(lastCommittedSegmentInfos, directory);
reader = open(indexCommit);
reader = wrapReader(reader, readerWrapperFunction);
searcherManager = new SearcherManager(reader, searcherFactory);
if (seqNoStats == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when does this happen ie. when is this null?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should run this check only when this read-only engine is the only engine opened with its shard (i.e., when obtain write lock is true).

localCheckpointTracker = createLocalCheckpointTracker(engineConfig, segmentInfos, logger,
() -> new Searcher("build_checkpoint_tracker", new IndexSearcher(reader), () -> {}), LocalCheckpointTracker::new);
}
try (Translog translog = new Translog(engineConfig.getTranslogConfig(), translogUUID, translogDeletionPolicy,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit afraid of what would happen if that translog is very large? I mean we are reading the entire thing off disk no? Also isn't this expected to be empty or at least small?

Copy link
Member Author

@dnhatn dnhatn Apr 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With 8ba5ca5, we now only execute this check if we were a follower before and have gaps in sequence numbers. Even in that situation, we would read a few last translog generations since we need to read only operations since the local checkpoint of the index commit.

@dnhatn
Copy link
Member Author

dnhatn commented Apr 26, 2019

@s1monw Thanks for looking. I've addressed your comments.

@dnhatn dnhatn requested a review from s1monw April 26, 2019 02:24
@dnhatn
Copy link
Member Author

dnhatn commented Apr 29, 2019

Discussed with @ywelsch on another channel, we preferred not to proceed with this change for it might not be sufficient for closed follower indices. We will explore another option using a read-only marker.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. >enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants