Adapt CloseFollowerIndexIT for replicated closed indices #38767

tlrx · 2019-02-12T09:26:32Z

Note: this pull request targets the replicated-closed-indices feature branch

Now that the test CloseFollowerIndexIT has been added in #38702, it needs to be adapted for replicated closed indices.

The test closes the follower index while it is lagging behind the leader index. When it's closed, no sanity checks are executed because it's a follower index (this is a consequence of #38702). But with replicated closed indices, the index is reinitialized as a closed index with a NoOpEngine, and such engines make strong assertions on the values of the maximum sequence number and the global checkpoint. Since the values do not match, the shards cannot be created, they fail, and the cluster health turns RED.

This is the expected behavior, but the test now needs to be adapted to catch the following uncaught exception:

WARNING: Uncaught exception in thread: Thread[elasticsearch[follower1][generic][T#3],5,TGRP-CloseFollowerIndexIT]
java.lang.AssertionError: max seq. no. [-1] does not match [31]
	at __randomizedtesting.SeedInfo.seed([A19FDEFBFCF2B7B1]:0)
	at org.elasticsearch.index.engine.ReadOnlyEngine.assertMaxSeqNoEqualsToGlobalCheckpoint(ReadOnlyEngine.java:141)
	at org.elasticsearch.index.engine.ReadOnlyEngine.<init>(ReadOnlyEngine.java:115)
	at org.elasticsearch.index.engine.NoOpEngine.<init>(NoOpEngine.java:40)
	at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1438)
	at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1391)
	at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:424)
	at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:95)
	at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:302)
	at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:93)
	at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1685)
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$8(IndexShard.java:2267)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)
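
To make the failure above easier to picture, here is a minimal sketch of the kind of check that trips. It is not the code in ReadOnlyEngine: the class name `SeqNoCheck` and the method name `checkMaxSeqNoEqualsGlobalCheckpoint` are hypothetical, and the message format is only copied from the trace, so which bracket holds which value is an assumption.

```java
/** Hypothetical stand-in for the engine's consistency check; not Elasticsearch code. */
final class SeqNoCheck {

    private SeqNoCheck() {}

    /**
     * Fails when the maximum sequence number found in the shard differs from
     * the global checkpoint, which is what happens for a lagging follower.
     */
    static void checkMaxSeqNoEqualsGlobalCheckpoint(long maxSeqNo, long globalCheckpoint) {
        if (maxSeqNo != globalCheckpoint) {
            // Message format mirrors the trace above (assumed argument order).
            throw new AssertionError(
                "max seq. no. [" + maxSeqNo + "] does not match [" + globalCheckpoint + "]");
        }
    }

    public static void main(String[] args) {
        checkMaxSeqNoEqualsGlobalCheckpoint(31, 31); // values agree, no error
        try {
            checkMaxSeqNoEqualsGlobalCheckpoint(-1, 31); // values disagree
        } catch (AssertionError e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In the test, the equivalent failure is raised on a generic thread pool thread during shard recovery, which is why it surfaces as an uncaught exception rather than as a test failure.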

This pull request adapts the CloseFollowerIndexIT test so that it wraps the default UncaughtExceptionHandler with a handler that tolerates any exception thrown by ReadOnlyEngine.assertMaxSeqNoEqualsToGlobalCheckpoint(). Replacing the default uncaught exception handler requires specific permissions, and instead of creating another gradle project I chose to duplicate the internalClusterTest task so that it runs without the security manager for this specific test only.

I tried to come up with other ways to adapt this test (disabling allocation for closed shards, wrapping ReadOnlyEngine in a factory that swallows the assertion error, a dedicated gradle project, etc.) but this is the only solution I found that works correctly and checks the real assertion error.
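
As an illustration of the wrapping idea described above, here is a minimal sketch. It is not the code from this pull request: the class `TolerantUncaughtExceptionHandler` and its methods are hypothetical names, and matching on stack frames is an assumption about how "thrown by" could be detected.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch of a wrapping handler; not the code from this pull request. */
final class TolerantUncaughtExceptionHandler implements Thread.UncaughtExceptionHandler {

    private final Thread.UncaughtExceptionHandler delegate; // previous handler, may be null
    private final String toleratedClass;
    private final String toleratedMethod;
    private final AtomicReference<Throwable> lastTolerated = new AtomicReference<>();

    TolerantUncaughtExceptionHandler(Thread.UncaughtExceptionHandler delegate,
                                     String toleratedClass, String toleratedMethod) {
        this.delegate = delegate;
        this.toleratedClass = toleratedClass;
        this.toleratedMethod = toleratedMethod;
    }

    /** True if any stack frame of the throwable belongs to the tolerated method. */
    boolean isTolerated(Throwable t) {
        for (StackTraceElement frame : t.getStackTrace()) {
            if (toleratedClass.equals(frame.getClassName())
                    && toleratedMethod.equals(frame.getMethodName())) {
                return true;
            }
        }
        return false;
    }

    @Override
    public void uncaughtException(Thread thread, Throwable t) {
        if (isTolerated(t)) {
            lastTolerated.set(t); // remember it so a test can still inspect the real error
            return;
        }
        if (delegate != null) {
            delegate.uncaughtException(thread, t); // everything else is handled as before
        }
    }

    Throwable lastTolerated() {
        return lastTolerated.get();
    }

    public static void main(String[] args) throws InterruptedException {
        TolerantUncaughtExceptionHandler handler = new TolerantUncaughtExceptionHandler(
            null, "org.elasticsearch.index.engine.ReadOnlyEngine",
            "assertMaxSeqNoEqualsToGlobalCheckpoint");
        Thread worker = new Thread(() -> {
            AssertionError error = new AssertionError("max seq. no. [-1] does not match [31]");
            // Fake the frame for the demo, since the real engine is not on the classpath.
            error.setStackTrace(new StackTraceElement[] {
                new StackTraceElement("org.elasticsearch.index.engine.ReadOnlyEngine",
                    "assertMaxSeqNoEqualsToGlobalCheckpoint", "ReadOnlyEngine.java", 141) });
            throw error;
        });
        worker.setUncaughtExceptionHandler(handler); // per-thread, so no extra permission
        worker.start();
        worker.join();
        System.out.println("tolerated: " + handler.lastTolerated().getMessage());
    }
}
```

The demo installs the handler on a single thread. Installing it process-wide with `Thread.setDefaultUncaughtExceptionHandler` is the operation that needs the `setDefaultUncaughtExceptionHandler` runtime permission when a security manager is active, which is the constraint mentioned above.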

elasticmachine · 2019-02-12T09:26:34Z

Pinging @elastic/es-distributed

tlrx · 2019-02-12T12:02:38Z

Unrelated test UnicastZenPingTests.testSimplePings failed, let's try again:
@elasticmachine run elasticsearch-ci/1

tlrx · 2019-02-12T15:29:58Z

@elasticmachine run elasticsearch-ci/1

martijnvg

LGTM - assuming that changing the default UncaughtExceptionHandler is the cleanest way.

x-pack/plugin/ccr/build.gradle

ywelsch · 2019-02-13T07:53:34Z

We need to think about how to integrate closed replicated indices with closed follower indices. In the current form, closing follower indices will result in them not being allocated, making the cluster go red. Requiring the max sequence number to be equal to the global checkpoint for these shards is not guaranteed by the CCR following logic. Perhaps it would instead be sufficient to check that the max sequence number and global checkpoint agree with the primary and to require a sync id marker (i.e. perform a synced flush during the VerifyShardBeforeCloseAction) so that peer recovery will be instantaneous.

tlrx · 2019-02-26T13:15:37Z

Thanks @martijnvg and @ywelsch.

Yannick and I talked via another channel and we decided to merge this change so that the CloseFollowerIndexIT test will still run after the feature branch is merged.

We still agree that we need to think about how to integrate closed replicated indices with closed follower indices, and this is tracked as part of #33888.

Before this change, closed indexes were simply not replicated. It was therefore possible to close an index and then decommission a data node without knowing that this data node contained shards of the closed index, potentially leading to data loss. Shards of closed indices were not completely taken into account when balancing the shards within the cluster, or automatically replicated through shard copies, and they were not easily movable from node A to node B using APIs like Cluster Reroute without being fully reopened and closed again.

This commit changes the logic executed when closing an index, so that its shards are not just removed and forgotten but are instead reinitialized and reallocated on data nodes using an engine implementation which does not allow searching or indexing, which has a low memory overhead (compared with searchable/indexable opened shards) and which allows shards to be recovered from peer or promoted as primaries when needed.

This new closing logic is built on top of the new Close Index API introduced in 6.7.0 (#37359). Some pre-closing sanity checks are executed on the shards before closing them, and closing an index on an 8.0 cluster will reinitialize the index shards and therefore impact the cluster health.

Some APIs have been adapted to make them work with closed indices:
- Cluster Health API
- Cluster Reroute API
- Cluster Allocation Explain API
- Recovery API
- Cat Indices
- Cat Shards
- Cat Health
- Cat Recovery

This commit contains all the following changes (most recent first):
* c6c42a1 Adapt NoOpEngineTests after #39006
* 3f9993d Wait for shards to be active after closing indices (#38854)
* 5e7a428 Adapt the Cluster Health API to closed indices (#39364)
* 3e61939 Adapt CloseFollowerIndexIT for replicated closed indices (#38767)
* 71f5c34 Recover closed indices after a full cluster restart (#39249)
* 4db7fd9 Adapt the Recovery API for closed indices (#38421)
* 4fd1bb2 Adapt more tests suites to closed indices (#39186)
* 0519016 Add replica to primary promotion test for closed indices (#39110)
* b756f6c Test the Cluster Shard Allocation Explain API with closed indices (#38631)
* c484c66 Remove index routing table of closed indices in mixed versions clusters (#38955)
* 00f1828 Mute CloseFollowerIndexIT.testCloseAndReopenFollowerIndex()
* e845b0a Do not schedule Refresh/Translog/GlobalCheckpoint tasks for closed indices (#38329)
* cf9a015 Adapt testIndexCanChangeCustomDataPath for replicated closed indices (#38327)
* b9becdd Adapt testPendingTasks() for replicated closed indices (#38326)
* 02cc730 Allow shards of closed indices to be replicated as regular shards (#38024)
* e53a9be Fix compilation error in IndexShardIT after merge with master
* cae4155 Relax NoOpEngine constraints (#37413)
* 54d110b [RCI] Adapt NoOpEngine to latest FrozenEngine changes
* c63fd69 [RCI] Add NoOpEngine for closed indices (#33903)

Relates to #33888

Now that the test `CloseFollowerIndexIT` has been added in elastic#38702, it needs to be adapted for replicated closed indices.

The test closes the follower index while it is lagging behind the leader index. When it's closed, no sanity checks are executed because it's a follower index (this is a consequence of elastic#38702). But with replicated closed indices, the index is reinitialized as a closed index with a `NoOpEngine`, and such engines make strong assertions on the values of the maximum sequence number and the global checkpoint. Since the values do not match, the shards cannot be created, they fail, and the cluster health turns RED.

This commit adapts the `CloseFollowerIndexIT` test so that it wraps the default `UncaughtExceptionHandler` with a handler that tolerates any exception thrown by `ReadOnlyEngine.assertMaxSeqNoEqualsToGlobalCheckpoint()`. Replacing the default uncaught exception handler requires specific permissions, and instead of creating another gradle project it duplicates the `internalClusterTest` task so that it runs without the security manager for this specific test only.

Relates to elastic#33888

Backport support for replicating closed indices (#39499)

Adapt CloseFollowerIndexIT for replicated closed indices

ed64fe4

tlrx added the >test and :Distributed/Engine labels Feb 12, 2019

tlrx requested review from martijnvg and ywelsch February 12, 2019 09:26

martijnvg approved these changes Feb 12, 2019

x-pack/plugin/ccr/build.gradle

Merge branch 'replicated-closed-indices' into adapt-CloseFollowerIndexIT

dca824a

tlrx mentioned this pull request Feb 26, 2019

Replicate closed indices #33888

Closed

50 tasks

tlrx merged commit 3e61939 into elastic:replicated-closed-indices Feb 26, 2019

tlrx deleted the adapt-CloseFollowerIndexIT branch February 26, 2019 13:13

tlrx mentioned this pull request Feb 28, 2019

Add support for replicating closed indices #39499

Merged

dnhatn mentioned this pull request Apr 17, 2019

Ensure no uncommitted ops when open readonly engine #41317

Closed

dnhatn mentioned this pull request May 23, 2019

Integrate closed replicated indices with closed follower indices #42442

Open
