Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add BWC layer to seq no infra and enable BWC tests #22185

Merged
merged 24 commits into from
Dec 19, 2016

Conversation

bleskes
Copy link
Contributor

@bleskes bleskes commented Dec 15, 2016

Sequence BWC logic consists of two elements:

  1. Wire level BWC using stream versions.
  2. A changed to the global checkpoint maintenance semantics.

For the sequence number infra to work with a mixed version clusters, we have to consider situation where the primary is on an old node and replicas are on new ones (i.e., the replicas will receive operations without seq#) and also the reverse (i.e., the primary sends operations to a replica but the replica can't process the seq# and respond with local checkpoint). An new primary with an old replica is a rare because we do not allow a replica to recover from a new primary. However, it can occur if the old primary failed and a new replica was promoted or during primary relocation where the source primary is treated as a replica until the master starts the target.

  1. Old Primary & New Replica - this case is easy as is taken care of by the wire level BWC. All incoming requests will have their seq# set to UNASSIGNED_SEQ_NO, which doesn't confuse the local checkpoint logic (keeping it at NO_OPS_PERFORMED)
  2. New Primary & Old replica - this one is trickier as the global checkpoint service currently takes all in sync replicas into consideration for the global checkpoint calculation. In order to deal with old replicas, we change the semantics to say all new node in sync replicas. That means the replicas on old nodes don't count for the global checkpointing. In this state the seq# infra is not fully operational (you can't search on it, because copies may miss it) but it is maintained on shards that can support it. The old replicas will have to go through a file based recovery at some point and will get the seq# information at that point. There is still an edge case where a new primary fails and an old replica takes over. I'lll discuss this one with @ywelsch as I prefer to avoid it completely.

This PR also re-enables the BWC tests which were disabled. As such it had to fix any BWC issue that had crept in. Most notably an issue with the removal of the timestamp field in #21670, which I marked with nocommit. The problem with the removal is that old nodes still expect a non-null field value . I considered serializing the current time stamp but that causes indexing requests to different shards to be potentially different. The two other options I saw was to always send a crazy but valid value (i.e., 0) or add the logic to 5.x to deal with null properly. I currently opted for the first one. I'll reach out to @jpountz to discuss this one further.

Last - I added some debugging tools like more sane node names and forcing replication request to implement a toString

@bleskes
Copy link
Contributor Author

bleskes commented Dec 15, 2016

I spoke to @jpountz and indices created on 5.x do not index the timestamp field. Therefore passing a 0 for the transport layer is the simplest solution (as the value doesn't have any effect on the 5.x size but also conforms to 5.x semantics)

Copy link
Contributor

@s1monw s1monw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a bunch of comments - this looks great!

@@ -43,6 +43,5 @@ html_docs

# random old stuff that we should look at the necessity of...
/tmp/
backwards/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why? this is where our bwc index creation python tool stores their versions?

Copy link
Contributor Author

@bleskes bleskes Dec 15, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm... maybe we need a different solution then - the problem is that with this line the this file was ignored by git.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call the folder bwc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@s1monw I pushed another commit updating gitignore to be more explicit about the backwords folder (rather than using a global glob pattern). I also updated the comments around it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

@@ -58,6 +58,6 @@ public void writeTo(StreamOutput out) throws IOException {

@Override
public String toString() {
return "flush {" + super.toString() + "}";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

@@ -36,6 +38,9 @@
public abstract class ReplicatedWriteRequest<R extends ReplicatedWriteRequest<R>> extends ReplicationRequest<R> implements WriteRequest<R> {
private RefreshPolicy refreshPolicy = RefreshPolicy.NONE;

long seqNo;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra whitespaces? can this be private or protected?

* Returns the sequence number for this operation. The sequence number is assigned while the operation
* is performed on the primary shard.
*/
public long seqNo() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get/set prefix if possible :)

@@ -204,7 +204,7 @@ public void writeTo(StreamOutput out) throws IOException {
// timestamp
out.writeBoolean(false); // enabled
out.writeString(DateFieldMapper.DEFAULT_DATE_TIME_FORMATTER.format());
out.writeOptionalString(null);
out.writeOptionalString("now"); // old default
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/old/5.x/

bwcVersion = "6.0.0-alpha1-SNAPSHOT"
numNodes = 4
numBwcNodes = 2
bwcVersion = "5.2.0-SNAPSHOT"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

w00t

@bleskes
Copy link
Contributor Author

bleskes commented Dec 15, 2016

retest this please

@bleskes
Copy link
Contributor Author

bleskes commented Dec 15, 2016

I'lll discuss this one with @ywelsch as I prefer to avoid it completely.

@ywelsch and I decided to think about it some more - I noted it down in #10708 and we will adress it later. At the moment we do not test the situation where a primary on a new node fails while have two replicas - one on the old node and one on the new.

@bleskes
Copy link
Contributor Author

bleskes commented Dec 15, 2016

retest this please

@bleskes
Copy link
Contributor Author

bleskes commented Dec 15, 2016

@s1monw can you take another look? CI is happy.

@@ -36,6 +38,8 @@
public abstract class ReplicatedWriteRequest<R extends ReplicatedWriteRequest<R>> extends ReplicationRequest<R> implements WriteRequest<R> {
private RefreshPolicy refreshPolicy = RefreshPolicy.NONE;

private long seqNo;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be initialized with SequenceNumbersService.UNASSIGNED_SEQ_NO?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just copied over code, but good catch I'll adapt. The problem is that serialization requires a positive number. I'll adapt that too.

@@ -524,7 +524,8 @@ public void writeTo(StreamOutput out) throws IOException {
out.writeOptionalString(routing);
out.writeOptionalString(parent);
if (out.getVersion().before(Version.V_6_0_0_alpha1_UNRELEASED)) {
out.writeOptionalString(null);
// timestamp, at this point #proccess was called which for 5.x meant this was set
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand this explanation. why is 0 ok?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll copy over the explanation from the discussion on the PR

I spoke to @jpountz and indices created on 5.x do not index the timestamp field. Therefore passing a 0 for the transport layer is the simplest solution (as the value doesn't have any effect on the 5.x size but also conforms to 5.x semantics)

@@ -283,7 +283,7 @@ protected String checkActiveShardCount() {
}

private void decPendingAndFinishIfNeeded() {
assert pendingActions.get() > 0;
assert pendingActions.get() > 0 : "pending action count goes bellow 0 for request [" + request + "]";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/bellow/below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed. thx.

protected void sendReplicaRequest(ConcreteShardRequest<ReplicaRequest> concreteShardRequest, DiscoveryNode node,
ActionListener<ReplicationOperation.ReplicaResponse> listener) {
transportService.sendRequest(node, transportReplicaAction, concreteShardRequest, transportOptions,
// Eclipse can't handle when this is <> so we specify the type here.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this comment still valid? I see that you removed the type..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch. @jpountz maybe you can help verify eclipse is happy with this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed it is not happy about it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx @jpountz, I'll revert the change

allocationId = in.readString();
} else {
// we use to read empty responses
Empty.INSTANCE.readFrom(in);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just remove this? it's doing nothing so a comment is fine?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

@@ -1060,6 +1070,14 @@ public void onFailure(Exception shardFailedError) {
}
}

/** sends the give replica request to the supplied nodes */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/give/given

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check

numBwcNodes = 1
bwcVersion = "6.0.0-alpha1-SNAPSHOT"
numNodes = 4
numBwcNodes = 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this is a bit too much to just run the new IndexingIT. The YAML tests will then also execute with that many nodes and the Windows VMs will take forever. It might also double the heap size needed to run the tests (4 * 2GB per ES instance?).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was doubting on that one too. 4 nodes total is really required to have a good bwc case for the global checkpoint management (we need two replicas and one primary). I see what you are saying but I was doubting whether it's worth introducing a whole now BWC cluster. Setting it up might be slower than just running the yaml tests..

@bleskes
Copy link
Contributor Author

bleskes commented Dec 16, 2016

retest this please

@bleskes
Copy link
Contributor Author

bleskes commented Dec 16, 2016

@ywelsch the build is now failing because of your suggestion to default the seqno to -2 (which is great) as it now triggers assertions for on what #22212 also fixes, but only because Jason found it after chasing subtle test failures. I'll wait for #22212 to be merged before putting this one in.

@bleskes
Copy link
Contributor Author

bleskes commented Dec 19, 2016

@s1monw @ywelsch I think this is ready. Can you take a final look?

Copy link
Contributor

@s1monw s1monw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks so much for doing this

@bleskes bleskes merged commit b857b31 into elastic:master Dec 19, 2016
@bleskes bleskes deleted the seq_no_bwc branch December 19, 2016 12:08
@bleskes
Copy link
Contributor Author

bleskes commented Dec 19, 2016

Thx @s1monw @ywelsch . This is now merged..

@clintongormley clintongormley added :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. and removed :Sequence IDs labels Feb 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Engine Anything around managing Lucene and the Translog in an open shard. >enhancement v6.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants