
Promote replica on the highest version node #25277

Merged: 11 commits merged into elastic:master on Jun 29, 2017

Conversation

@dakrone (Member) commented Jun 16, 2017

This changes replica selection to prefer replicas on the node with the highest version
when choosing a replacement to promote after the primary shard fails.

Consider this situation:

  • A replica on a 5.6 node
  • Another replica on a 6.0 node
  • The primary on a 6.0 node

The primary shard sends sequence numbers to the replica on the 6.0 node but skips
sending them to the replica on the 5.6 node. Now assume the primary shard fails and
(prior to this change) the replica on the 5.6 node is promoted to primary: it has no
knowledge of sequence numbers, while the replica on the 6.0 node expects sequence
numbers that it will never receive.

Relates to #10708
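
For illustration, the selection boils down to a comparator over node versions; a minimal sketch of the resulting RoutingNodes method (mirroring the stream-based form discussed in the review below, not necessarily the exact final patch) is:

// Pick the active, non-primary copy whose node has the highest version, treating an
// unknown (null) node/version as lowest so such a replica is only chosen as a last resort.
// Uses java.util.Comparator; assumes this lives in RoutingNodes, where node(nodeId) may be null.
public ShardRouting activeReplicaWithHighestVersion(ShardId shardId) {
    return assignedShards(shardId).stream()
        .filter(shr -> !shr.primary() && shr.active())
        // ignore shards whose node has already left the RoutingNodes
        .filter(shr -> node(shr.currentNodeId()) != null)
        .max(Comparator.comparing(shr -> node(shr.currentNodeId()).node(),
            Comparator.nullsFirst(Comparator.comparing(DiscoveryNode::getVersion))))
        .orElse(null);
}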

@dakrone (Member Author) commented Jun 19, 2017

@jasontedor could you take a look at this please?

@ywelsch (Contributor) left a comment:

I've left some comments and suggestions. This also needs to go into 5.6, as a 5.6 master in a mixed 5.6/6.x cluster should be running this code as well.

// calls this method with an out-of-date RoutingNodes, where the version might not
// be accessible. Therefore, we need to protect against the version being null
// (meaning the node will be going away).
Version replicaNodeVersion = nodesToVersions.get(shardRouting.currentNodeId());
ywelsch (Contributor):

The nodesToVersions map is not needed. You can get the version using node(shardRouting.currentNodeId()).node().getVersion().
Not having this extra nodesToVersions map also solves consistency issues where entries are removed from nodesToShards but not nodesToVersions.

dakrone (Member Author):

Ahh thanks, I didn't know about that, I removed the map.

if (replicaNodeVersion == null && candidate == null) {
// Only use this replica if there are no other candidates
candidate = shardRouting;
} else if (highestVersionSeen == null || (replicaNodeVersion != null && replicaNodeVersion.after(highestVersionSeen))) {
ywelsch (Contributor):

This method looks like it could enjoy Java 8 lambdas, for example something along the lines of:

return assignedShards(shardId).stream()
    .filter(shr -> !shr.primary() && shr.active())
    .max(Comparator.comparing(shr -> node(shr.currentNodeId()).node(),
        Comparator.nullsFirst(Comparator.comparing(DiscoveryNode::getVersion))))
    .orElse(null);

dakrone (Member Author):

mutters something about Java pretending to be a functional language

I don't agree with the word "enjoy" (I think the lambda version is messier than the non-lambda version since there are no monads in Java), but I did this because you asked for it.

// add a single node
clusterState = ClusterState.builder(clusterState).nodes(
DiscoveryNodes.builder()
.add(newNode("node1-5.x", Version.V_5_6_0)))
ywelsch (Contributor):

Can you generalize the test to use two arbitrary (but distinct) versions? i.e. VersionUtils.randomVersion()
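
For reference, a hedged sketch of picking two arbitrary but distinct random versions in the test; the exact VersionUtils.randomVersion signature used here is an assumption about the test framework:

// assumption: VersionUtils.randomVersion(random()) returns an arbitrary known Version
Version primaryNodeVersion = VersionUtils.randomVersion(random());
Version replicaNodeVersion = VersionUtils.randomVersion(random());
while (replicaNodeVersion.equals(primaryNodeVersion)) {
    replicaNodeVersion = VersionUtils.randomVersion(random());
}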

dakrone (Member Author):

No? Currently the only situation that is valid for a mixed-major-version cluster is 5.6 and 6.0; we don't support mixed clusters of any other versions, and 5.6.1 isn't out yet. I'm not sure how randomization would help here, other than triggering some other version-related failures :)

ywelsch (Contributor):

this PR does more than just order 5.6 versus 6.x nodes. It also orders among 6.0 and 6.1 nodes, which is left untested here. Either we restrict the "Promote replica on the highest version node" logic to only order 6.x nodes before 5.6 (and leave 6.0 and 6.1 unordered), or we test that this logic also properly orders 6.0 and 6.1. I agree there is no need to test 5.1 and 6.2.

dakrone (Member Author):

Okay, I randomized the versions

ShardRouting startedReplica = clusterState.getRoutingNodes().activeReplicaWithHighestVersion(shardId);
logger.info("--> all shards allocated, replica that should be promoted: {}", startedReplica);

// fail the primary shard, check replicas get removed as well...
ywelsch (Contributor):

replicas should not be removed? Looks like copy pasta from another test

dakrone (Member Author):

Yep, I fixed this, thanks

@@ -556,4 +558,118 @@ public void testFailAllReplicasInitializingOnPrimaryFailWhileHavingAReplicaToEle
ShardRouting newPrimaryShard = clusterState.routingTable().index("test").shard(0).primaryShard();
assertThat(newPrimaryShard, not(equalTo(primaryShardToFail)));
}

public void testReplicaOnNewestVersionIsPromoted() {
ywelsch (Contributor):

This test checks one specific scenario. I think it can easily be generalized in the way of IndicesClusterStateServiceRandomUpdatesTests so that it simulates a large range of scenarios.
Essentially it boils down to: creating a few nodes with random versions (see randomInitialClusterState of IndicesClusterStateServiceRandomUpdatesTests), allocating a few shards to those nodes (see ClusterStateChanges.createIndex), then failing some of the shards including the primary (see ClusterStateChanges.applyFailedShards) or failing some of the nodes including the one holding the primary (see ClusterStateChanges.deassociateDeadNodes), and finally checking that the new primary is on the node with the highest version.
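
A rough sketch of that flow; the cluster helper plus the randomInitialClusterState and randomCreateIndexRequest names are placeholders/assumptions here, mirroring the snippets quoted later in this review:

// hedged skeleton of the generalized test described above
public void testReplicaOnNewestVersionIsPromotedRandomized() {
    // 1) a few nodes on random (possibly different) versions
    ClusterState state = randomInitialClusterState();
    // 2) allocate a few shards to those nodes (cf. ClusterStateChanges.createIndex)
    state = cluster.createIndex(state, randomCreateIndexRequest());
    // 3) start the primaries, then (some of) the replicas
    state = cluster.applyStartedShards(state, state.getRoutingNodes().shardsWithState(INITIALIZING));
    state = cluster.applyStartedShards(state, state.getRoutingNodes().shardsWithState(INITIALIZING));
    // 4) fail some shards (incl. primaries) via ClusterStateChanges.applyFailedShards,
    //    or some nodes via ClusterStateChanges.deassociateDeadNodes
    // 5) assert that each new primary is on a node whose version is at least as high
    //    as that of every remaining active replica of the same shard
}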

@dakrone (Member Author) commented Jun 22, 2017

@ywelsch I added a test similar to what we talked about, as well as addressing your other feedback, please take another look!

@ywelsch (Contributor) left a comment:

I left a few more comments about the tests.

}

logger.info("--> starting shards");
state = cluster.applyStartedShards(state, state.getRoutingNodes().shardsWithState(INITIALIZING));;
ywelsch (Contributor):

extra semicolon


logger.info("--> starting shards");
state = cluster.applyStartedShards(state, state.getRoutingNodes().shardsWithState(INITIALIZING));;
state = cluster.reroute(state, new ClusterRerouteRequest());
ywelsch (Contributor):

reroute happens as part of applyStartedShards in the line above

state = cluster.applyStartedShards(state, state.getRoutingNodes().shardsWithState(INITIALIZING));;
state = cluster.reroute(state, new ClusterRerouteRequest());
logger.info("--> starting replicas");
state = cluster.applyStartedShards(state, state.getRoutingNodes().shardsWithState(INITIALIZING));;
ywelsch (Contributor):

there is no guarantee that all replicas are started (as we have throttling). It's good to test the situation where not all replicas are started though, so maybe we can call applyStartedShards a random number of times.
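
For example, roughly (randomIntBetween comes from the test base class):

// start shards in a random number of rounds so that not every replica is necessarily started
for (int i = 0; i < randomIntBetween(1, 3); i++) {
    state = cluster.applyStartedShards(state, state.getRoutingNodes().shardsWithState(INITIALIZING));
}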

for (ShardRouting shardRouting : state.getRoutingNodes().shardsWithState(STARTED)) {
if (shardRouting.primary() && randomBoolean()) {
ShardRouting replicaToBePromoted = state.getRoutingNodes()
.activeReplicaWithHighestVersion(shardRouting.shardId());
ywelsch (Contributor):

you're testing the method activeReplicaWithHighestVersion here using the method itself? I see no checks here that the primary is indeed on the node with the highest version. I think for the purpose of the test it is sufficient to check that:

  1. if there was at least one active replica when the primary was failed, a new active primary got assigned; and
  2. the new active primary is on a node with a version higher than or equal to that of the remaining replicas.

dakrone (Member Author):

I changed the test to verify candidates without using the activeReplicaWithHighestVersion method

Settings.Builder settingsBuilder = Settings.builder()
.put(SETTING_NUMBER_OF_SHARDS, randomIntBetween(1, 3))
.put(SETTING_NUMBER_OF_REPLICAS, randomIntBetween(1, 3))
.put("index.routing.allocation.total_shards_per_node", 1);
ywelsch (Contributor):

why this?

dakrone (Member Author):

I removed this :)

ClusterState previousState = state;
// apply cluster state to nodes (incl. master)
for (DiscoveryNode node : state.nodes()) {
IndicesClusterStateService indicesClusterStateService = clusterStateServiceMap.get(node);
ywelsch (Contributor):

This test does not require IndicesClusterStateService, only the ClusterStateChanges class. All the code in this block can go away; it does not add anything to the test.
The test can be put into FailedShardsRoutingTests.

dakrone (Member Author) commented Jun 27, 2017:

it does use the randomInitialClusterState method, which I'm not sure we want to duplicate; is it worth coupling the tests just to put it in the other location? (edit: I misread and thought two methods were used; only one is)

ywelsch (Contributor):

randomInitialClusterState is 5 lines. I think we can duplicate that :)

dakrone (Member Author):

Okay, I moved it.

List<FailedShard> shardsToFail = new ArrayList<>();
logger.info("--> found replica that should be promoted: {}", replicaToBePromoted);
logger.info("--> failing shard {}", shardRouting);
shardsToFail.add(new FailedShard(shardRouting, "failed primary", new Exception()));
ywelsch (Contributor):

you're testing only one failure at a time.
Instead, the test could select a subset of the primary shards at random (and also a few replica shards) and fail them in one go.
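
Roughly, as a hedged sketch building on the existing loop (rarely() comes from the randomized-testing base class):

// collect a random subset of started primaries, plus occasionally a replica, and fail them together
List<FailedShard> shardsToFail = new ArrayList<>();
for (ShardRouting shardRouting : state.getRoutingNodes().shardsWithState(STARTED)) {
    if ((shardRouting.primary() && randomBoolean()) || (shardRouting.primary() == false && rarely())) {
        shardsToFail.add(new FailedShard(shardRouting, "failed shard", new Exception()));
    }
}
state = cluster.applyFailedShards(state, shardsToFail);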

.filter(shr -> !shr.primary() && shr.active())
.filter(shr -> node(shr.currentNodeId()) != null)
.max(Comparator.comparing(shr -> node(shr.currentNodeId()).node(),
Comparator.nullsFirst(Comparator.comparing(DiscoveryNode::getVersion))))
ywelsch (Contributor):

can you re-add the comment explaining why we need to consider "null" here?

dakrone (Member Author):

Re-added this comment

logger.info("--> found replica that should be promoted: {}", replicaToBePromoted);
logger.info("--> failing shard {}", shardRouting);
shardsToFail.add(new FailedShard(shardRouting, "failed primary", new Exception()));
state = cluster.applyFailedShards(state, shardsToFail);
ywelsch (Contributor):

an alternative to explicit shard failing is to remove nodes where the shards are allocated (i.e. when a node disconnects from the cluster).
This would also test the scenario where DiscoveryNode is null in the RoutingNode.
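
A hedged sketch of that alternative; the primaryNodeId variable and the exact deassociateDeadNodes signature are assumptions:

// simulate a node leaving the cluster instead of explicitly failing its shards
DiscoveryNodes newNodes = DiscoveryNodes.builder(state.nodes()).remove(primaryNodeId).build();
state = ClusterState.builder(state).nodes(newNodes).build();
// let allocation react to the dead node (signature assumed for ClusterStateChanges.deassociateDeadNodes)
state = cluster.deassociateDeadNodes(state, true, "node left");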

@@ -24,6 +24,7 @@
import org.apache.logging.log4j.Logger;
import org.apache.lucene.util.CollectionUtil;
import org.elasticsearch.Assertions;
import org.elasticsearch.Version;
ywelsch (Contributor):

unused import?

@dakrone (Member Author) commented Jun 27, 2017

@ywelsch I pushed a few commits addressing your feedback, thanks again for taking a look at this.

@ywelsch (Contributor) left a comment:

Left a few more minor comments. Address as you see fit.

DiscoveryNodes newNodes = DiscoveryNodes.builder(state.nodes())
.add(createNode()).build();
state = ClusterState.builder(state).nodes(newNodes).build();
state = cluster.reroute(state, new ClusterRerouteRequest()); // always reroute after node leave
ywelsch (Contributor):

this comment is stale (there are no nodes removed here)

state = cluster.reroute(state, new ClusterRerouteRequest()); // always reroute after node leave
}

// Log the shard versions (for debugging if necessary)
ywelsch (Contributor):

Log the node versions?
Can also be done directly in the loop where you are adding the nodes :-)

state = cluster.createIndex(state, request);
assertTrue(state.metaData().hasIndex(name));
}
state = cluster.reroute(state, new ClusterRerouteRequest());
ywelsch (Contributor):

this is not needed. createIndex automatically reroutes.

state = cluster.applyStartedShards(state, state.getRoutingNodes().shardsWithState(INITIALIZING));
}

logger.info("--> state before failing shards: {}", state);
ywelsch (Contributor):

👍

}

private static Version getNodeVersion(ShardRouting shardRouting, ClusterState state) {
for (ObjectObjectCursor<String, DiscoveryNode> entry : state.getNodes().getDataNodes()) {
ywelsch (Contributor):

no need for iteration here; you can get the node directly by calling state.getNodes().get(shardRouting.currentNodeId()) (which will return null if no node is found)
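
i.e. the helper could shrink to something like this sketch:

private static Version getNodeVersion(ShardRouting shardRouting, ClusterState state) {
    // direct lookup; returns null if the node is no longer in the cluster state
    DiscoveryNode node = state.getNodes().get(shardRouting.currentNodeId());
    return node == null ? null : node.getVersion();
}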

.filter(s -> currentState.getRoutingNodes().node(s.currentNodeId()) != null)
.collect(Collectors.toSet());
// If we find a replica and at least another candidate
if (replicaToBePromoted != null && candidates.size() > 0) {
ywelsch (Contributor):

we don't need to determine replicaToBePromoted. candidates also does not need to be filtered with !s.equals(replicaToBePromoted). It's ok to just check candidates.size() > 0 here to see whether there is going to be a new primary. In that case, we fail the primary plus random(0, candidates.size() - 1) replicas and check afterwards that the new primary is on a node whose version is at least as high as that of all remaining replicas.
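
A hedged sketch of that final check; indexName, shardId, and the getNodeVersion helper are placeholders from the surrounding test:

// after applying the failures: the promoted primary must not be on an older node
// than any remaining active replica of the same shard
ShardRouting newPrimary = state.routingTable().index(indexName).shard(shardId.id()).primaryShard();
Version newPrimaryVersion = getNodeVersion(newPrimary, state);
for (ShardRouting replica : state.routingTable().index(indexName).shard(shardId.id()).replicaShards()) {
    if (replica.active()) {
        assertTrue(newPrimaryVersion.onOrAfter(getNodeVersion(replica, state)));
    }
}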

@dakrone (Member Author) commented Jun 28, 2017

Thanks @ywelsch, I rewrote the check to be more in line with what we discussed.

@dakrone dakrone merged commit 22ff76d into elastic:master Jun 29, 2017
dakrone added a commit that referenced this pull request Jun 29, 2017
* Promote replica on the highest version node

This changes the replica selection to prefer to return replicas on the highest
version when choosing a replacement to promote when the primary shard fails.

Consider this situation:

- A replica on a 5.6 node
- Another replica on a 6.0 node
- The primary on a 6.0 node

The primary shard is sending sequence numbers to the replica on the 6.0 node and
skipping sending them for the 5.6 node. Now assume that the primary shard fails
and (prior to this change) the replica on 5.6 node gets promoted to primary, it
now has no knowledge of sequence numbers and the replica on the 6.0 node will be
expecting sequence numbers but will never receive them.

Relates to #10708

* Switch from map of node to version to retrieving the version from the node

* Remove uneeded null check

* You can pretend you're a functional language Java, but you're not fooling me.

* Randomize node versions

* Add test with random cluster state with multiple versions that fails shards

* Re-add comment and remove extra import

* Remove unneeded stuff, randomly start replicas a few more times

* Move test into FailedNodeRoutingTests

* Make assertions actually test replica version promotion

* Rewrite test, taking Yannick's feedback into account
jasontedor added a commit that referenced this pull request Jun 30, 2017
* master: (129 commits)
  Add doc note regarding explicit publish host
  Fix typo in name of test
  Add additional test for sequence-number recovery
  WrapperQueryBuilder should also rewrite the parsed query.
  Remove dead code and stale Javadoc
  Update defaults in documentation (#25483)
  [DOCS] Add docs-dir to Painless (#25482)
  Add concurrent deprecation logger test
  [DOCS] Update shared attributes for Elasticsearch (#25479)
  Use LRU set to reduce repeat deprecation messages
  Add NioTransport threads to thread name checks (#25477)
  Add shortcut for AbstractQueryBuilder.parseInnerQueryBuilder to QueryShardContext
  Prevent channel enqueue after selector close (#25478)
  Fix Java 9 compilation issue
  Remove unregistered `transport.netty.*` settings (#25476)
  Handle ping correctly in NioTransport (#25462)
  Tests: Remove platform specific assertion in NioSocketChannelTests
  Remove QueryParseContext from parsing QueryBuilders (#25448)
  Promote replica on the highest version node (#25277)
  test: added not null assertion
  ...
@lcawl added the :Distributed/Distributed label and removed the :Allocation label on Feb 13, 2018
@clintongormley added the :Distributed/Engine label and removed the :Sequence IDs label on Feb 14, 2018
Labels: blocker, :Distributed/Distributed, :Distributed/Engine, >enhancement, v5.6.0, v6.0.0-beta1