
Replicate write failures #23314

Merged: 18 commits merged into elastic:master from the enhancement/write-failure-replication branch on Apr 19, 2017

Conversation

@areek areek (Contributor) commented Feb 22, 2017

Currently, when a primary write operation fails after generating
a sequence number, the failure is not communicated to the replicas.
Ideally, every operation which generates a sequence number on the primary
should be recorded in all replicas.

In this change, a sequence number is associated with a write operation
failure. When a failure with an assigned sequence number arrives at a
replica, the failure cause and sequence number are recorded in the translog
and the sequence number is marked as completed by executing Engine.noOp
on the replica engine.
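
To make the description concrete, here is a minimal, hedged sketch of the replica-side flow. It is pieced together from the diff excerpts quoted later in this conversation (BulkItemResponse.Failure#getSeqNo, SequenceNumbersService.UNASSIGNED_SEQ_NO, and the prepareNoOpOnReplica/noOp pair on IndexShard); the variables item, docWriteRequest, and replica are assumed from the surrounding bulk loop, and some method names changed during review, so treat this as an illustration rather than the merged code.

    // Sketch: handling one bulk item on the replica when the primary reported a failure.
    BulkItemResponse.Failure failure = item.getPrimaryResponse().getFailure();
    if (failure.getSeqNo() != SequenceNumbersService.UNASSIGNED_SEQ_NO) {
        // The primary assigned a sequence number before failing, so the replica must account
        // for it: record the failure reason as a no-op in the translog and mark the seq no done.
        final Engine.NoOp noOp = replica.prepareNoOpOnReplica(docWriteRequest.type(),
            docWriteRequest.id(), failure.getSeqNo(), failure.getMessage());
        replica.noOp(noOp);
    } else {
        // No sequence number was generated on the primary (the request failed up front),
        // so there is nothing to record on the replica for this item.
    }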

@elasticmachine (Collaborator) commented:

Since this is a community-submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

@@ -184,6 +192,11 @@ public Failure(StreamInput in) throws IOException {
id = in.readOptionalString();
cause = in.readException();
status = ExceptionsHelper.status(cause);
if (in.getVersion().onOrAfter(Version.V_6_0_0_alpha1_UNRELEASED)) {
seqNo = in.readLong();

Contributor:
I think @jasontedor wants us to standardize on readZLong. @jasontedor correct me if I'm wrong.
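
For reference, a hedged sketch of what the suggested zig-zag serialization could look like (writeZLong/readZLong are the existing StreamOutput/StreamInput helpers; the version constant and field name follow the diff above, but the surrounding code is assumed, not quoted from the PR):

    // Write side (StreamOutput out), mirroring the read side shown in the diff above.
    if (out.getVersion().onOrAfter(Version.V_6_0_0_alpha1_UNRELEASED)) {
        out.writeZLong(seqNo);  // zig-zag encoding stays compact for negative sentinels such as UNASSIGNED_SEQ_NO
    }

    // Read side (StreamInput in) with the reviewer's suggestion applied.
    if (in.getVersion().onOrAfter(Version.V_6_0_0_alpha1_UNRELEASED)) {
        seqNo = in.readZLong();
    } else {
        seqNo = SequenceNumbersService.UNASSIGNED_SEQ_NO;  // pre-6.0 senders never send a seq no
    }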

@@ -309,36 +311,47 @@ private UpdateResultHolder executeUpdateRequest(UpdateRequest updateRequest, Ind
for (int i = 0; i < request.items().length; i++) {
BulkItemRequest item = request.items()[i];
assert item.getPrimaryResponse() != null : "expected primary response to be set for item [" + i + "] request ["+ item.request() +"]";
if (item.getPrimaryResponse().isFailed() == false &&
item.getPrimaryResponse().getResponse().getResult() != DocWriteResponse.Result.NOOP) {
if (item.getPrimaryResponse().getResponse().getResult() != DocWriteResponse.Result.NOOP) {

Contributor:
I wonder (though not sure) if we can check for the existence of a seqNo and that's it.

if (operationResult.hasFailure()) {
// check if any transient write operation failures should be bubbled up
Exception failure = operationResult.getFailure();
assert failure instanceof VersionConflictEngineException

Contributor:
FYI we don't throw version conflicts on replicas any more... (not related to your change)

@@ -445,4 +458,12 @@ private UpdateResultHolder executeUpdateRequest(UpdateRequest updateRequest, Ind
primaryResponse.getSeqNo(), request.primaryTerm(), version, versionType);
return replica.delete(delete);
}

private Engine.NoOpResult executeNoOpRequestOnReplica(BulkItemResponse.Failure primaryFailure, DocWriteRequest docWriteRequest, IndexShard replica) throws IOException {

Contributor:
I'm wondering if we should call this executeFailedSeqNoOnReplica... it might be a clearer name at this level?

private Engine.NoOpResult executeNoOpRequestOnReplica(BulkItemResponse.Failure primaryFailure, DocWriteRequest docWriteRequest, IndexShard replica) throws IOException {
final long version = docWriteRequest.version();
final VersionType versionType = docWriteRequest.versionType().versionTypeForReplicationAndRecovery();
final Engine.NoOp noOp = replica.prepareNoOpOnReplica(docWriteRequest.type(), docWriteRequest.id(),

Contributor:
why are versions relevant here?

@bleskes bleskes (Contributor) commented Feb 27, 2017

Thx @areek. I think this is in the right direction. I left some comments. I also think it needs some work in InternalEngine, as the primary also needs to add a noop to its translog.

@areek areek force-pushed the enhancement/write-failure-replication branch from a2e4e4d to 535168b on March 2, 2017 at 20:29
@areek areek force-pushed the enhancement/write-failure-replication branch 4 times, most recently from 5e89b72 to e904db7 on March 29, 2017 at 16:40
@areek areek (Contributor, Author) commented Mar 29, 2017

Thanks @bleskes for the review. I updated the PR adding:

  • tests for replicating document failures (with seq no generated) as no-ops in the translog on primary and replica shards
  • tests that request failures (without seq no generated) are only reported and not logged in the translog
  • changes to ensure ESIndexLevelReplicationTestCase uses the same static functions used by TransportShardBulkAction's shardOperationOnPrimary and shardOperationOnReplica

@bleskes bleskes (Contributor) left a comment:

Thx @areek. Basics look good. I left some comments.

}

/** Result holder of executing shard bulk on primary */
public static class BulkShardResult {

Contributor:
Why do we need a new class? Can't we use WritePrimaryResult?

Contributor Author:

I initially used WritePrimaryResult for this, but that changed the visibility of WritePrimaryResult from protected to public. Hence I introduced BulkShardResult, which allows us to use TransportShardBulkAction#performOnPrimary in ESIndexLevelReplicationTestCase without exposing WritePrimaryResult.

Contributor Author:

I removed the new class and exposed WritePrimaryResult instead, as discussed

if (primaryResponse.isFailed()) {
return primaryResponse.getFailure().getSeqNo() != SequenceNumbersService.UNASSIGNED_SEQ_NO;
} else {
// NOTE: for bwc as pre-6.0 write requests has unassigned seq no

Contributor:

I don't get this comment, can you clarify?

return new WriteReplicaResult<>(request, location, null, replica, logger);
}

public static Translog.Location performOnReplica(BulkShardRequest request, IndexShard replica) throws Exception {
Translog.Location location = null;
for (int i = 0; i < request.items().length; i++) {
BulkItemRequest item = request.items()[i];
if (shouldExecuteReplicaItem(item, i)) {

Contributor:
I wonder how much this check buys us now. We basically have 3 options: either we index normally, index a failure/noop, or skip altogether. This is now hard to understand because it's partially in this method and partially here in the first if clause. Can we bring it all together?

Contributor Author:

I attempted to bring it all together through an enum (ReplicaItemExecutionMode) with three modes: NORMAL, NOOP, and FAILURE (sketched below).
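
For readers skimming the thread, a hedged sketch of what such a dispatch enum might look like (the constant names come from the comment above; the actual javadoc and ordering in the merged change may differ):

    /** Sketch of the dispatch enum mentioned above; illustrative, not the verbatim merged code. */
    enum ReplicaItemExecutionMode {
        /** primary succeeded with a normal result: execute the original index/delete on the replica */
        NORMAL,
        /** nothing to replay on the replica, e.g. the primary failed before a seq no was assigned */
        NOOP,
        /** primary failed after a seq no was assigned: record the failure as a translog no-op */
        FAILURE
    }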

return new Engine.NoOp(uid, seqNo, primaryTerm, Engine.Operation.Origin.REPLICA, startTime, reason);
}

public Engine.NoOpResult noOp(Engine.NoOp noOp) throws IOException {

Contributor:
I think we should call these something like markSeqNoAsNoOp. It feels weird to tell the shard to do nothing ;) The previous method can maybe be called preparingMarkingSeqNoAsNoOp?

@@ -2615,10 +2615,13 @@ public void testHandleDocumentFailure() throws Exception {
}
Engine.IndexResult indexResult = engine.index(indexForDoc(doc1));
assertNotNull(indexResult.getFailure());

// document failures should be recorded in translog
assertNotNull(indexResult.getTranslogLocation());

Contributor:
Commenting here because this is the only change in this file. Shouldn't we have some test around the noop used on replicas?

Contributor Author:

There is already a test for the engine noOp operation here, which the replica uses to record failures. I added a unit test (testNoOpReplicationOnPrimaryDocumentFailure) in TransportShardBulkActionTests to verify the noOp parameters when a primary document failure is seen.

final BulkItemResponse response = index(indexRequest);
if (response.isFailed()) {
throw response.getFailure().getCause();
} else if (response.isFailed() == false) {

Contributor:
This is always true, no?

*/
public void testDocumentFailureReplication() throws Exception {
IndexMetaData metaData = buildIndexMetaData(1);
final ReplicationGroup replicationGroupWithDocumentFailureOnPrimary = new ReplicationGroup(metaData) {

Contributor:
Can we call it something short and put it in a try/finally?

Contributor Author:
done

IndexMetaData metaData = buildIndexMetaData(1);
final ReplicationGroup replicationGroupWithDocumentFailureOnPrimary = new ReplicationGroup(metaData) {
@Override
protected EngineFactory getEngineFactory(ShardRouting routing) {

Contributor:
Can we encapsulate this in a utility class (similar to what I did)? There's a lot of craft here.

Contributor Author:

Thanks for the pointer, I added a utility class.

@areek areek force-pushed the enhancement/write-failure-replication branch 5 times, most recently from d15d1f6 to ed82ee6 on April 6, 2017 at 20:20
@areek areek (Contributor, Author) commented Apr 6, 2017

Thanks for the feedback @bleskes. I addressed all the comments. Would appreciate another review.

@areek areek force-pushed the enhancement/write-failure-replication branch from ed82ee6 to 1c943df on April 6, 2017 at 20:28

@bleskes bleskes (Contributor) left a comment:

This LGTM. I left some minor comments. I also notice I've gone through this code too often now and have "grown too used to it". It would be great if @dakrone could give it a review as well with some fresh eyes.

@@ -533,6 +596,12 @@ static boolean shouldExecuteReplicaItem(final BulkItemRequest request, final int
return replica.delete(delete);
}

private static Engine.NoOpResult executeFailedSeqNoOnReplica(BulkItemResponse.Failure primaryFailure, DocWriteRequest docWriteRequest, IndexShard replica) throws IOException {
final Engine.NoOp noOp = replica.preparingMarkingSeqNoAsNoOp(docWriteRequest.type(), docWriteRequest.id(),

Contributor:
This is typically called prepareX.

final long startTime,
final String reason) {
super(uid, seqNo, primaryTerm, version, versionType, origin, startTime);
final Term uid,

Contributor:
Any chance we can fold them into one line now?

final Origin origin,
final long startTime,
final String reason) {
super(uid, seqNo, primaryTerm, 0, null, origin, startTime);

Contributor:
Does it work to use an illegal version value? -1?

@@ -567,28 +567,35 @@ private IndexShardState changeState(IndexShardState newState, String reason) {
return result;
}

public Engine.NoOp preparingMarkingSeqNoAsNoOp(String type, String id, long seqNo, String reason) {
verifyReplicationTarget();
final Term uid = extractUid(type, id);

Contributor:
Why does a noop need a doc id? It's weird, no?

@@ -567,28 +567,35 @@ private IndexShardState changeState(IndexShardState newState, String reason) {
return result;
}

public Engine.NoOp preparingMarkingSeqNoAsNoOp(String type, String id, long seqNo, String reason) {
verifyReplicationTarget();

Contributor:
++

@areek areek force-pushed the enhancement/write-failure-replication branch from 1c943df to 180d654 on April 13, 2017 at 16:37

@dakrone dakrone (Member) left a comment:

LGTM, I left a few minor comments, thanks @areek!

* When primary execution failed before sequence no was generated
* or primary execution was a noop (only possible when request is originating from pre-6.0 nodes)
*/
NOOP,

Member:
Can you add an assert that will fail on 7.0.0, so we remember to remove the NOOP mode and all handling code for it?
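
A hedged sketch of the kind of guard being asked for (the assertion message is illustrative, not the wording that was actually merged):

    // Trips once the codebase moves to 7.x, reminding us to drop the NOOP mode and its
    // handling, which exist purely for requests coming from pre-6.0 nodes.
    assert Version.CURRENT.major < 7
        : "NOOP is only needed for BWC with pre-6.0 nodes and should be removed in 7.0.0";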

operationResult = executeFailedSeqNoOnReplica(failure, docWriteRequest, replica);
assert operationResult != null : "operation result must never be null when primary response has no failure";
location = syncOperationResultOrThrow(operationResult, location);
break;

Member:
Can you add a default clause for this switch statement?
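
For example, a default branch along these lines would make an unexpected mode fail loudly (a sketch to be placed inside that switch; the exact message is an assumption, and item.request() follows the surrounding loop):

    // Sketch of the requested default clause for the replica item dispatch switch.
    default:
        throw new IllegalStateException("illegal replica item execution mode for: " + item.request());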

@@ -533,6 +596,12 @@ static boolean shouldExecuteReplicaItem(final BulkItemRequest request, final int
return replica.delete(delete);
}

private static Engine.NoOpResult executeFailedSeqNoOnReplica(BulkItemResponse.Failure primaryFailure, DocWriteRequest docWriteRequest, IndexShard replica) throws IOException {

Member:
I think this should be named something a little closer to what it does (the name makes it sound like it executes the failing request on the replica), so maybe executeFailureNoOpOnReplica(..)?

@@ -569,28 +569,34 @@ private IndexShardState changeState(IndexShardState newState, String reason) {
return result;
}

public Engine.NoOp prepareMarkingSeqNoAsNoOp(long seqNo, String reason) {

Member:
To follow the standard that we use here, I think this should be named prepareNoOpOnReplica(...) (since we have prepareDeleteOnReplica and prepareIndexOnReplica).

return new Engine.NoOp(seqNo, primaryTerm, Engine.Operation.Origin.REPLICA, startTime, reason);
}

public Engine.NoOpResult markSeqNoAsNoOp(Engine.NoOp noOp) throws IOException {

Member:
Most of the other IndexShard methods correspond to their Engine implementations, so I think this should be called just noOp

Contributor:

This was my ask. I find IndexShard.noOp to be confusing (why are we doing nothing?). I would be good with renaming the engine method to match the one here. @dakrone, would you be good with that?

Member:

That's reasonable to me!

@areek areek force-pushed the enhancement/write-failure-replication branch from 180d654 to a53a58f on April 18, 2017 at 15:05
areek added 18 commits on April 19, 2017 at 01:20
Currently, when a primary write operation fails after generating
a sequence number, the failure is not communicated to the replicas.
Ideally, every operation which generates a sequence number on the primary
should be recorded in all replicas.

In this change, a sequence number is associated with a write operation
failure. When a failure with an assigned sequence number arrives at a
replica, the failure cause and sequence number are recorded in the translog
and the sequence number is marked as completed by executing `Engine.noOp`
on the replica engine.

Test that document failures (w/ seq no generated) are recorded
as no-ops in the translog for primary and replica shards
@areek areek force-pushed the enhancement/write-failure-replication branch from 01400ad to d153fdc on April 19, 2017 at 05:21
@areek areek merged commit 4f773e2 into elastic:master Apr 19, 2017
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Apr 19, 2017
* master:
  Add BucketMetricValue interface (elastic#24188)
  Enable index-time sorting (elastic#24055)
  Clarify elasticsearch user uid:gid mapping in Docker docs
  Update field-names-field.asciidoc (elastic#24178)
  ElectMasterService.hasEnoughMasterNodes should return false if no masters were found
  Remove Ubuntu 12.04 (elastic#24161)
  [Test] Add unit tests for InternalHDRPercentilesTests (elastic#24157)
  Replicate write failures (elastic#23314)
  Rename variable in translog simple commit test
  Strengthen translog commit with open view test
  Stronger check in translog prepare and commit test
  Fix translog prepare commit and commit test
  ingest-node.asciidoc - Clarify json processor (elastic#21876)
  Painless: more testing for script_stack (elastic#24168)
@clintongormley clintongormley added the :Distributed/Engine label (Anything around managing Lucene and the Translog in an open shard) and removed the :Sequence IDs label on Feb 14, 2018
Labels
:Distributed/Engine (Anything around managing Lucene and the Translog in an open shard), >enhancement, v6.0.0-alpha1
5 participants