Add BWC layer to seq no infra and enable BWC tests #22185

bleskes · 2016-12-15T08:05:29Z

Sequence BWC logic consists of two elements:

Wire level BWC using stream versions.
A change to the global checkpoint maintenance semantics.

For the sequence number infra to work in a mixed version cluster, we have to consider the situation where the primary is on an old node and replicas are on new ones (i.e., the replicas will receive operations without seq#) and also the reverse (i.e., the primary sends operations to a replica but the replica can't process the seq# and respond with a local checkpoint). A new primary with an old replica is rare because we do not allow a replica to recover from a new primary. However, it can occur if the old primary failed and a new replica was promoted, or during primary relocation, where the source primary is treated as a replica until the master starts the target.

Old Primary & New Replica - this case is easy as it is taken care of by the wire level BWC. All incoming requests will have their seq# set to UNASSIGNED_SEQ_NO, which doesn't confuse the local checkpoint logic (keeping it at NO_OPS_PERFORMED).
New Primary & Old Replica - this one is trickier as the global checkpoint service currently takes all in-sync replicas into consideration for the global checkpoint calculation. In order to deal with old replicas, we change the semantics to say all in-sync replicas on new nodes. That means the replicas on old nodes don't count for the global checkpointing. In this state the seq# infra is not fully operational (you can't search on it, because copies may miss it) but it is maintained on shards that can support it. The old replicas will have to go through a file based recovery at some point and will get the seq# information at that point. There is still an edge case where a new primary fails and an old replica takes over. I'll discuss this one with @ywelsch as I prefer to avoid it completely.
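To make the changed semantics concrete, here is a minimal sketch of a global checkpoint computed only over in-sync copies that live on new nodes. The `GlobalCheckpointSketch` and `Copy` types are hypothetical stand-ins and not the classes touched by this PR, and the value `-1` for `NO_OPS_PERFORMED` is an assumption made for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch only: hypothetical stand-in types, not the Elasticsearch classes. */
final class GlobalCheckpointSketch {
    /** Mirrors the NO_OPS_PERFORMED constant named above; the value -1 is assumed. */
    static final long NO_OPS_PERFORMED = -1L;

    /** One shard copy as the primary sees it (hypothetical). */
    static final class Copy {
        final String name;
        final boolean inSync;
        final boolean onNewNode;
        final long localCheckpoint;

        Copy(String name, boolean inSync, boolean onNewNode, long localCheckpoint) {
            this.name = name;
            this.inSync = inSync;
            this.onNewNode = onNewNode;
            this.localCheckpoint = localCheckpoint;
        }
    }

    /** Minimum local checkpoint over in-sync copies on new nodes; old-node copies are skipped. */
    static long globalCheckpoint(List<Copy> copies) {
        long min = Long.MAX_VALUE;
        boolean found = false;
        for (Copy copy : copies) {
            if (copy.inSync && copy.onNewNode) {
                min = Math.min(min, copy.localCheckpoint);
                found = true;
            }
        }
        return found ? min : NO_OPS_PERFORMED;
    }

    public static void main(String[] args) {
        List<Copy> copies = Arrays.asList(
            new Copy("primary", true, true, 7L),
            new Copy("new-replica", true, true, 5L),
            // An old replica cannot report a local checkpoint, so it must not hold the minimum back.
            new Copy("old-replica", true, false, NO_OPS_PERFORMED));
        System.out.println(globalCheckpoint(copies));
    }
}
```

With the old replica excluded, the checkpoint can advance to the slowest new-node copy instead of staying pinned at `NO_OPS_PERFORMED`.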

This PR also re-enables the BWC tests which were disabled. As such it had to fix any BWC issue that had crept in. Most notably an issue with the removal of the timestamp field in #21670, which I marked with nocommit. The problem with the removal is that old nodes still expect a non-null field value. I considered serializing the current timestamp but that causes indexing requests to different shards to be potentially different. The two other options I saw were to always send a crazy but valid value (i.e., 0) or add the logic to 5.x to deal with null properly. I currently opted for the first one. I'll reach out to @jpountz to discuss this one further.

Last - I added some debugging tools like more sane node names and forcing replication requests to implement toString.

bleskes · 2016-12-15T10:09:58Z

I spoke to @jpountz and indices created on 5.x do not index the timestamp field. Therefore passing a 0 for the transport layer is the simplest solution (as the value doesn't have any effect on the 5.x side but also conforms to 5.x semantics).

s1monw

I left a bunch of comments - this looks great!

s1monw · 2016-12-15T13:17:16Z

.gitignore

@@ -43,6 +43,5 @@ html_docs

 # random old stuff that we should look at the necessity of...
 /tmp/
-backwards/


why? this is where our bwc index creation python tool stores their versions?

hmm... maybe we need a different solution then - the problem is that with this line this file was ignored by git.

call the folder bwc?

@s1monw I pushed another commit updating gitignore to be more explicit about the backwards folder (rather than using a global glob pattern). I also updated the comments around it.

s1monw · 2016-12-15T13:17:52Z

core/src/main/java/org/elasticsearch/action/admin/indices/flush/ShardFlushRequest.java

@@ -58,6 +58,6 @@ public void writeTo(StreamOutput out) throws IOException {

    @Override
    public String toString() {
-        return "flush {" + super.toString() + "}";


s1monw · 2016-12-15T13:18:56Z

core/src/main/java/org/elasticsearch/action/support/replication/ReplicatedWriteRequest.java

@@ -36,6 +38,9 @@
 public abstract class ReplicatedWriteRequest<R extends ReplicatedWriteRequest<R>> extends ReplicationRequest<R> implements WriteRequest<R> {
    private RefreshPolicy refreshPolicy = RefreshPolicy.NONE;

+    long seqNo;
+


extra whitespaces? can this be private or protected?

s1monw · 2016-12-15T13:19:09Z

core/src/main/java/org/elasticsearch/action/support/replication/ReplicatedWriteRequest.java

+     * Returns the sequence number for this operation. The sequence number is assigned while the operation
+     * is performed on the primary shard.
+     */
+    public long seqNo() {


get/set prefix if possible :)

s1monw · 2016-12-15T13:19:53Z

core/src/main/java/org/elasticsearch/cluster/metadata/MappingMetaData.java

@@ -204,7 +204,7 @@ public void writeTo(StreamOutput out) throws IOException {
            // timestamp
            out.writeBoolean(false); // enabled
            out.writeString(DateFieldMapper.DEFAULT_DATE_TIME_FORMATTER.format());
-            out.writeOptionalString(null);
+            out.writeOptionalString("now"); // old default


s1monw · 2016-12-15T13:20:36Z

qa/backwards-5.0/build.gradle

-    bwcVersion = "6.0.0-alpha1-SNAPSHOT"
+    numNodes = 4
+    numBwcNodes = 2
+    bwcVersion = "5.2.0-SNAPSHOT"


bleskes · 2016-12-15T15:35:58Z

retest this please

bleskes · 2016-12-15T16:00:52Z

I'll discuss this one with @ywelsch as I prefer to avoid it completely.

@ywelsch and I decided to think about it some more - I noted it down in #10708 and we will address it later. At the moment we do not test the situation where a primary on a new node fails while having two replicas - one on the old node and one on the new.

bleskes · 2016-12-15T17:07:41Z

retest this please

bleskes · 2016-12-15T22:15:07Z

@s1monw can you take another look? CI is happy.

ywelsch · 2016-12-15T22:23:40Z

core/src/main/java/org/elasticsearch/action/support/replication/ReplicatedWriteRequest.java

@@ -36,6 +38,8 @@
 public abstract class ReplicatedWriteRequest<R extends ReplicatedWriteRequest<R>> extends ReplicationRequest<R> implements WriteRequest<R> {
    private RefreshPolicy refreshPolicy = RefreshPolicy.NONE;

+    private long seqNo;


should this be initialized with SequenceNumbersService.UNASSIGNED_SEQ_NO?

I just copied over code, but good catch I'll adapt. The problem is that serialization requires a positive number. I'll adapt that too.
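The two points in this thread (default to an unassigned value, and serialize a value that may be negative) can be sketched together. This is only an illustration under stated assumptions: `SeqNoWireSketch`, its integer version ids, and the helper names are hypothetical, the `-2` value follows the later comment in this conversation, and zig-zag encoding is shown as one common way to push a negative number through a variable-length format, not necessarily what the PR does.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Sketch only: hypothetical names, plain java.io streams instead of the real transport layer. */
final class SeqNoWireSketch {
    static final long UNASSIGNED_SEQ_NO = -2L;      // default discussed in this conversation
    static final int FIRST_VERSION_WITH_SEQ_NO = 6; // hypothetical version id

    /** Zig-zag maps small negative numbers to small positive ones, then writes 7 bits per byte. */
    static void writeZLong(DataOutputStream out, long value) throws IOException {
        long v = (value << 1) ^ (value >> 63);
        while ((v & ~0x7FL) != 0) {
            out.writeByte((int) ((v & 0x7FL) | 0x80L));
            v >>>= 7;
        }
        out.writeByte((int) v);
    }

    static long readZLong(DataInputStream in) throws IOException {
        long v = 0;
        int shift = 0;
        byte b;
        do {
            b = in.readByte();
            v |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (v >>> 1) ^ -(v & 1);
    }

    /** Only peers that understand sequence numbers get the field on the wire. */
    static byte[] write(long seqNo, int peerVersion) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            if (peerVersion >= FIRST_VERSION_WITH_SEQ_NO) {
                writeZLong(out, seqNo);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** A request from an old peer carries no seq#, so the reader falls back to the unassigned value. */
    static long read(byte[] payload, int peerVersion) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload))) {
            return peerVersion >= FIRST_VERSION_WITH_SEQ_NO ? readZLong(in) : UNASSIGNED_SEQ_NO;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(read(write(UNASSIGNED_SEQ_NO, 6), 6)); // round trip between new nodes
        System.out.println(read(write(42L, 5), 5));               // old peer: field absent, unassigned
    }
}
```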

ywelsch · 2016-12-15T22:35:26Z

core/src/main/java/org/elasticsearch/action/index/IndexRequest.java

@@ -524,7 +524,8 @@ public void writeTo(StreamOutput out) throws IOException {
        out.writeOptionalString(routing);
        out.writeOptionalString(parent);
        if (out.getVersion().before(Version.V_6_0_0_alpha1_UNRELEASED)) {
-            out.writeOptionalString(null);
+            // timestamp, at this point #proccess was called which for 5.x meant this was set


I don't really understand this explanation. why is 0 ok?

I'll copy over the explanation from the discussion on the PR

I spoke to @jpountz and indices created on 5.x do not index the timestamp field. Therefore passing a 0 for the transport layer is the simplest solution (as the value doesn't have any effect on the 5.x side but also conforms to 5.x semantics).
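As a rough illustration of that choice, the sketch below writes a fixed placeholder into the timestamp slot for old peers and nothing for new ones. `TimestampBwcSketch`, the integer version id, and the optional-string layout (a presence flag followed by the string) are assumptions for this example and not the real stream format.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Sketch only: hypothetical names and wire layout, built on plain java.io streams. */
final class TimestampBwcSketch {
    static final int FIRST_VERSION_WITHOUT_TIMESTAMP = 6; // hypothetical version id

    /** Assumed layout: presence flag, then the string when present. */
    static void writeOptionalString(DataOutputStream out, String value) throws IOException {
        out.writeBoolean(value != null);
        if (value != null) {
            out.writeUTF(value);
        }
    }

    /** Old peers expect a non-null timestamp, so they get a constant; new peers get no slot at all. */
    static byte[] writeTimestampSlot(int peerVersion) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            if (peerVersion < FIRST_VERSION_WITHOUT_TIMESTAMP) {
                // A constant keeps the bytes identical for every shard, unlike the current time.
                writeOptionalString(out, "0");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(writeTimestampSlot(5).length); // placeholder written for an old peer
        System.out.println(writeTimestampSlot(6).length); // nothing written for a new peer
    }
}
```

Using a constant rather than the current time is what keeps requests sent to different shards byte-for-byte identical, which was the concern raised in the description.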

ywelsch · 2016-12-15T22:37:52Z

core/src/main/java/org/elasticsearch/action/support/replication/ReplicationOperation.java

@@ -283,7 +283,7 @@ protected String checkActiveShardCount() {
    }

    private void decPendingAndFinishIfNeeded() {
-        assert pendingActions.get() > 0;
+        assert pendingActions.get() > 0 : "pending action count goes bellow 0 for request [" + request + "]";


s/bellow/below

fixed. thx.

ywelsch · 2016-12-15T22:42:10Z

core/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java

+    protected void sendReplicaRequest(ConcreteShardRequest<ReplicaRequest> concreteShardRequest, DiscoveryNode node,
+                                      ActionListener<ReplicationOperation.ReplicaResponse> listener) {
+        transportService.sendRequest(node, transportReplicaAction, concreteShardRequest, transportOptions,
+            // Eclipse can't handle when this is <> so we specify the type here.


is this comment still valid? I see that you removed the type..

good catch. @jpountz maybe you can help verify eclipse is happy with this?

indeed it is not happy about it

thx @jpountz, I'll revert the change

ywelsch · 2016-12-15T22:46:47Z

core/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java

+                allocationId = in.readString();
+            } else {
+                // we use to read empty responses
+                Empty.INSTANCE.readFrom(in);


just remove this? it's doing nothing so a comment is fine?

ywelsch · 2016-12-15T22:48:10Z

core/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java

@@ -1060,6 +1070,14 @@ public void onFailure(Exception shardFailedError) {
        }
    }

+    /** sends the give replica request to the supplied nodes */


s/give/given

ywelsch · 2016-12-15T23:02:22Z

qa/backwards-5.0/build.gradle

-    numBwcNodes = 1
-    bwcVersion = "6.0.0-alpha1-SNAPSHOT"
+    numNodes = 4
+    numBwcNodes = 2


I wonder if this is a bit too much to just run the new IndexingIT. The YAML tests will then also execute with that many nodes and the Windows VMs will take forever. It might also double the heap size needed to run the tests (4 * 2GB per ES instance?).

Yeah, I was doubting on that one too. 4 nodes total is really required to have a good bwc case for the global checkpoint management (we need two replicas and one primary). I see what you are saying but I was doubting whether it's worth introducing a whole new BWC cluster. Setting it up might be slower than just running the yaml tests.

bleskes · 2016-12-16T13:02:19Z

retest this please

bleskes · 2016-12-16T20:28:50Z

@ywelsch the build is now failing because of your suggestion to default the seqno to -2 (which is great), as it now triggers assertions for what #22212 also fixes, but only because Jason found it after chasing subtle test failures. I'll wait for #22212 to be merged before putting this one in.

bleskes · 2016-12-19T09:57:24Z

@s1monw @ywelsch I think this is ready. Can you take a final look?

s1monw

LGTM thanks so much for doing this

bleskes · 2016-12-19T12:09:26Z

Thx @s1monw @ywelsch. This is now merged.

bleskes added 14 commits December 10, 2016 20:30

wire level compatibility

a9d297d

bwc test

adabd89

strengthen test

4cacc8b

only account for shards on new nodes for global checkpoints

c2cb646

fix timestamp for now as it makes assertion fail

02b223c

linting

2db7210

improve assertion message

634d92f

line length

4ea5dd9

Merge remote-tracking branch 'upstream/master' into seq_no_bwc

26596ff

add skip version to cat.shards help test

001f7de

force all replication requests to have toString

cc9a486

missing else :(

f2ca825

Merge remote-tracking branch 'upstream/master' into seq_no_bwc

a945030

Merge remote-tracking branch 'upstream/master' into seq_no_bwc

f4dada9

bleskes added :Sequence IDs >enhancement v6.0.0-alpha1 labels Dec 15, 2016

s1monw reviewed Dec 15, 2016

feedback and nocommit removal

059638b

ywelsch reviewed Dec 15, 2016

feedback

2a15b15

bleskes added 2 commits December 16, 2016 17:07

update gitignore to explicitly point to backwards release folder (ins…

647d5ee

…tead of a general glob)

Merge remote-tracking branch 'upstream/master' into seq_no_bwc

76338bf

bleskes added 6 commits December 17, 2016 16:08

Merge remote-tracking branch 'upstream/master' into seq_no_bwc

896482f

fix compilation

f73d165

don't replicate on failures (like version conflicts)

2cc2610

Merge remote-tracking branch 'upstream/master' into seq_no_bwc

bc60297

Merge remote-tracking branch 'upstream/master' into seq_no_bwc

2ecbac0

put back diamond operator for ecplise

7577b12

s1monw approved these changes Dec 19, 2016

bleskes merged commit b857b31 into elastic:master Dec 19, 2016

bleskes deleted the seq_no_bwc branch December 19, 2016 12:08

bleskes mentioned this pull request Jan 16, 2017

Add Sequence Numbers to write operations #10708

Closed

64 tasks

clintongormley added :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. and removed :Sequence IDs labels Feb 14, 2018
