Introduce a History UUID as a requirement for ops based recovery #26577

Merged
merged 20 commits into elastic:master on Sep 14, 2017

Conversation

@bleskes bleskes commented Sep 11, 2017

The new ops based recovery, introduced as part of #10708, is based on the assumption that all operations below the global checkpoint known to the replica do not need to be synced with the primary. This relies on the guarantee that all ops below it are available on the primary and that they are equal. Under normal operations this guarantee holds. Sadly, it can be violated when a primary is restored from an old snapshot. At that point the restored primary can be missing operations below the replica's global checkpoint, or, even worse, may have totally different operations at the same spots. This PR introduces the notion of a history uuid to make it possible to detect such a divergence from a restored primary (in a follow-up PR).

The History UUID is generated by a primary when it is first created and is synced to replicas that are recovered via a file based recovery. The PR adds a requirement to ops based recovery that the history uuid of the source and the target be equal. Under normal operations, all shard copies keep that history uuid for the rest of the index's lifetime, so this is a noop. However, it gives us a place to guarantee that we fall back to file based syncing in special events like a restore from snapshot (to be done as a follow-up) and when someone runs the truncate translog command, which can go wrong when combined with primary recovery (handled in this PR).
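
A minimal sketch of the kind of check this requirement adds on the ops based recovery path (method and parameter names here are illustrative assumptions, not the PR's exact code):

    // Illustrative only: ops based recovery is allowed solely when source and target share the same
    // history; otherwise the ops below the replica's global checkpoint cannot be trusted and the
    // recovery falls back to copying files.
    static boolean canUseOpsBasedRecovery(String sourceHistoryUUID, String targetHistoryUUID) {
        return sourceHistoryUUID != null && sourceHistoryUUID.equals(targetHistoryUUID);
    }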

In the past we considered using the translog uuid for this purpose (i.e., syncing it across copies) and thus avoiding an extra identifier. That idea was rejected because it would remove the ability to verify that a specific translog really belongs to a specific lucene index. We also feel that having a history uuid will serve us well in the future.

Lastly, the PR tightens up the connection between the checkpoint file, its translog and its lucene index by adding both the translog uuid and the history uuid to it and verifying them on read.

PS I still want to go through the tests and make sure the coverage is good enough. I also want to validate the BWC logic, which will only run properly on CI once this is backported. That said, I think we can start reviewing.

@bleskes bleskes added :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. :Sequence IDs >enhancement v6.0.0 v7.0.0 labels Sep 11, 2017

@jasontedor jasontedor left a comment

This is great. The one question I have is regarding isTranslogReadyForSequenceNumberBasedRecovery, where we used to have a norelease for the case where the local checkpoint on the target shard is greater than the maximum sequence number on the source shard. I think that should not be possible any more. Would you double check and make this a hard assertion?
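
For reference, a hedged sketch of what turning that norelease branch into a hard assertion might look like (variable names are assumed from this discussion, not taken from the diff):

    // hypothetical sketch: the target's local checkpoint should never exceed the source's max seq no
    assert targetLocalCheckpoint <= sourceMaxSeqNo :
        "target local checkpoint [" + targetLocalCheckpoint + "] is above the source's maximum sequence number [" + sourceMaxSeqNo + "]";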

translogUUID = loadTranslogUUIDFromCommit(writer);
// We expect that this shard already exists, so it must already have an existing translog else something is badly wrong!
if (translogUUID == null) {
throw new IndexFormatTooOldException("translog", "translog has no generation nor a UUID - this might be an index from a previous version consider upgrading to N-1 first");
Member

I recognize that this is only moving code but can you use Version.CURRENT to specify a more precise version in the error message rather than a generic message?

if (checkpoint.translogUUID.equals(TRANSLOG_UUID_NA) == false &&
checkpoint.translogUUID.equals(translogUUID) == false
) {
throw new TranslogCorruptedException("expected translog UUID " + translogUUID+ " but got: " + checkpoint.translogUUID +
Member

Nit: translogUUID+ -> translogUUID +

if (checkpoint.historyUUID.equals(HISTORY_UUID_NA) == false &&
checkpoint.historyUUID.equals(historyUUID) == false
) {
throw new TranslogCorruptedException("expected history UUID " + historyUUID+ " but got: " + checkpoint.historyUUID +
Member

Nit: historyUUID+ -> historyUUID +

long globalCheckpoint,
long minTranslogGeneration,
Path translogFile,
long generation, String translogUUID, String historyUUID) throws IOException {
Member

Can you wrap these to the next two lines like the rest of the parameters?

s1monw commented Sep 12, 2017

I am trying to follow the logic here and I am not fully behind why we need this on the transaction log level. From my perspective (and I might be missing something) an ID on the lucene commit level would be enough (it could even just be a simple number starting with 0, which allows us to see how often it happened). We would create / update the ID every time except when we use open mode OPEN_INDEX_AND_TRANSLOG or when we use the translog tool. This way we can simply bake it into the commit and we know if we are compatible with the primary? I would really like us to keep the translog as simple as possible and do most of the "can we do X" checks on the commit level.

bleskes commented Sep 12, 2017

From my perspective (and I might be missing something) an ID on the lucene commit level

I started with that - just a uuid in the commit. That is sadly not enough - we currently store the global checkpoint in the translog (ckp file) and the recovery logic needs to rely on it and make sure it was written for the same history as on the primary and the same history as the replica's lucene index. Since there are stages in the recovery where we pave over the lucene index but don't yet touch the translog, we need to make sure that the checkpoint we read actually belongs to the lucene index. We currently don't really know which translog the checkpoint belongs to, and this is why I added the translog uuid to it. I think it's a good integrity check. We can use that translog uuid to tie the checkpoint back to lucene, which contains the history uuid. However, adding the history uuid to the checkpoint prepares the ground for 7.0 (where we won't need to deal with indices that don't have sequence numbers). At that point we want to move to using the history uuid as the primary identifier and use the translog uuid as an internal integrity marker for the translog. This will allow us to fully move to a "history based model" and drop the current hybrid of file generations and seq# (i.e., generation files become an internal implementation detail of the translog, with better isolation). If this doesn't feel natural to you, I can remove it and we can talk about it later.
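
Put differently, the intended integrity chain reads roughly like this (a conceptual sketch; the commit user-data keys, variable names and exception messages are assumptions based on this discussion, not the PR's exact code):

    // 1. the checkpoint file must belong to this translog
    if (checkpoint.translogUUID.equals(translogUUID) == false) {
        throw new TranslogCorruptedException("checkpoint does not belong to this translog");
    }
    // 2. the translog must belong to this lucene index (the commit stores the translog uuid)
    if (translogUUID.equals(commitUserData.get("translog_uuid")) == false) {
        throw new TranslogCorruptedException("translog does not belong to this lucene index");
    }
    // 3. the lucene commit carries the history uuid that is compared against the primary's
    final String historyUUID = commitUserData.get("history_uuid");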

simple number starting with 0

That's appealing but we want to use the id for out of cluster semantics as well.

s1monw commented Sep 12, 2017

ok let's iterate on this for a moment. The translog ID is fixed for the entire transaction log lifecycle, so I don't understand why it's baked into the checkpoint. I really really dislike the fact that we write variable length strings into such a fragile file. I don't think this is necessary by any means. We can just write it into the file header or at some place like this since it never changes?

bleskes commented Sep 12, 2017

I really really dislike the fact that we write variable length strings into such a fragile file.

As far as I can tell these are fixed length (22 bytes + 1 byte for length). I agree that's important.
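
(For context: the 22 comes from base64-encoding a 16 byte random UUID without padding, i.e. ceil(128 / 6) = 22 characters. A quick standalone check, assuming a standard URL-safe base64 encoder rather than the exact encoding Elasticsearch uses:)

    import java.util.Base64;

    public class UuidLength {
        public static void main(String[] args) {
            // 16 random bytes -> 22 base64 characters once padding is stripped (ceil(128 bits / 6 bits per char) = 22)
            String encoded = Base64.getUrlEncoder().withoutPadding().encodeToString(new byte[16]);
            System.out.println(encoded.length()); // prints 22
        }
    }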

We can just write it into the file header or at some place like this since it never changes

It's written into the main translog file headers, so we can open those up and validate. I felt that was more complicated (now we open multiple files, first the ckp file and then the translog file generation it refers to). Also, we currently have no protection against people copying ckp files from one place to another. I feel such a protection is a nice bonus to add?

s1monw commented Sep 12, 2017

this checkpoint file was designed to be as small as possible and not the sink for all kinds of information. We tried to fight for every bit in there and now we add a bunch of random bytes to it; I feel like it's not the right thing. I tried to convince you about this the last time since there are so many unknowns here and this is such a fragile area. I personally feel this translog gets a bit out of control and I am very very worried about it. I don't think we should design our transaction log around some crazy stuff we do with S/R and tools to recover transaction logs. We should rather fix S/R and the tools instead of dumping all kinds of information into a super critical file. This entire PR feels wrong to me to be honest and I have a hard time with it. If you want to add anything to it I think it should be at most a long. Even that feels wrong to me for this purpose. We can think about protecting against the situation where somebody copies over checkpoints, but let's do this separately; it's an impl detail of a translog and it should be a checkpoint tied to a single .tlog file, which is not the case. Also the file must have the right name as well here, which is only a single step away from painting the hash into the file if you really want to trick us. I would really like us to look at each individual issue and try to solve it. I don't think this is the right way of doing this.

bleskes commented Sep 12, 2017

this checkpoint file was designed to be as small as possible and not the sink for all kinds of information. We tried to fight for every bit in there and now we add a bunch of random bytes to it; I feel like it's not the right thing. I tried to convince you about this the last time since there are so many unknowns here and this is such a fragile area.

We settled on deciding that the checkpoint file needs to stay below 512 bytes. I was aiming for the simplest possible code change while staying well below this limit. I can see your argument and I can move the history uuid to the translog file headers (i.e., smaller checkpoint, more code). If that is what we end up doing, I'll move it to a separate change and only use the translog uuid for the validation here (i.e., opening the translog.ckp file and the associated translog generation file for the header).

I don't think we should design our transaction log around some crazy stuff we do with S/R and tools to recover transaction logs. We should rather fix S/R and the tools instead of dumping all kinds of information into a super critical file. This entire PR feels wrong to me to be honest and I have a hard time with it.

There is more to this than S/R. I will reach out to discuss this on a different channel.

We can think about protecting against the situation where somebody copies over checkpoints, but let's do this separately; it's an impl detail of a translog and it should be a checkpoint tied to a single .tlog file, which is not the case

As said, this was just a nice-to-have - by no means a watertight scheme.

s1monw commented Sep 12, 2017

We settled on deciding that the checkpoint file needs to stay below 512 bytes.

It wasn't my intention to give the impression that we should make this the goal. The goal, to me, is to try very hard to keep it as small as possible. 512 bytes is a hard stop IMO.

bleskes commented Sep 13, 2017

I think that should not be possible any more. Would you double check and make this a hard assertion?

@jasontedor re ^^ - I might be missing something, but we can only do this when we actually do the snapshot/restore change?

@@ -174,15 +177,23 @@ public InternalEngine(EngineConfig engineConfig) throws EngineException {
switch (openMode) {
case OPEN_INDEX_AND_TRANSLOG:
writer = createWriter(false);
String existingHistoryUUID = loadHistoryUUIDFromCommit(writer);
Contributor

Determining whether to generate a fresh historyUUID or reuse the existing one should be done based on ShardRouting.recoverySource() and can be set in the engine config (see the sketch below):

EMPTY_STORE, SNAPSHOT, LOCAL_SHARDS => create a fresh history ID
EXISTING_STORE, PEER => use the existing history ID (if available, otherwise create a fresh one)
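
A rough sketch of what that mapping could look like (the plumbing and the helper itself are assumptions based on this comment, not the final implementation):

    // Hypothetical helper: decide whether to create a fresh history uuid based on the recovery source.
    static boolean needsFreshHistoryUUID(RecoverySource.Type recoveryType, String existingHistoryUUID) {
        switch (recoveryType) {
            case EMPTY_STORE:
            case SNAPSHOT:
            case LOCAL_SHARDS:
                return true; // always start a new history
            case EXISTING_STORE:
            case PEER:
                return existingHistoryUUID == null; // reuse if available, otherwise create a fresh one
            default:
                throw new AssertionError("unknown recovery source type [" + recoveryType + "]");
        }
    }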

Contributor Author

That's a great suggestion. I prefer to do it in a follow-up PR when I actually fix the snapshot issue.

Contributor

I am ok with a followup. good catch @ywelsch

@s1monw s1monw left a comment

I left some questions; looks great so far.

private String loadHistoryUUIDFromCommit(final IndexWriter writer) throws IOException {
String uuid = commitDataAsMap(writer).get(HISTORY_UUID_KEY);
if (uuid == null) {
assert config().getIndexSettings().getIndexVersionCreated().before(Version.V_6_0_0_rc1) :
Contributor

should we make this a hard fail? I wonder if we should return a null uuid here; it will just fail later?!

Contributor Author

yeah, can do - it shouldn't happen anyway. I tend to communicate this "if this happens it's a bug" with assertions but I'm happy to convert to an IllegalState or something.

Contributor

yeah ++

Contributor Author

On second thought - assertions cause nodes to die, which means our integration tests will be stopped on the spot - with an illegal state, the nodes will recover by doing another recovery and we'll never see the failure.

Contributor

fair enough

@@ -366,7 +367,7 @@ public static long getStartingSeqNo(final RecoveryTarget recoveryTarget) {
} else {
return SequenceNumbers.UNASSIGNED_SEQ_NO;
}
} catch (final IOException e) {
} catch (final IOException|TranslogCorruptedException e) {
Contributor

hmm is this a bug?

Contributor Author

it's a leftover from my previous change. Will remove.

Contributor

thx

@@ -340,6 +351,12 @@ private void recoverFromTranslogInternal() throws IOException {
flush(true, true);
} else if (translog.isCurrent(translogGeneration) == false) {
commitIndexWriter(indexWriter, translog, lastCommittedSegmentInfos.getUserData().get(Engine.SYNC_COMMIT_ID));
refreshLastCommittedSegmentInfos();
} else if (lastCommittedSegmentInfos.getUserData().containsKey(HISTORY_UUID_KEY) == false) {
Contributor

do we also have to put it in the commit if the existing UUID is different to the one we have in the local var?

Contributor Author

with the current logic it would be a bug - the history should stay the same for the index lifetime and only go from null -> a value. With Yannick's suggestion it might be needed, but I prefer to do that as a follow-up.

Contributor

ok

@s1monw s1monw left a comment

LGTM

bleskes commented Sep 14, 2017

@jasontedor can you take a second look too when you have a few moments?

@jasontedor jasontedor left a comment

I left one comment, you can push when addressed.

@@ -162,10 +164,11 @@ void addIndices(
* document-level semantics.
*/
writer.setLiveCommitData(() -> {
final HashMap<String, String> liveCommitData = new HashMap<>(2);
final HashMap<String, String> liveCommitData = new HashMap<>(3);
Member

It looks like this should be four (it was already "wrong" at two). (It doesn't really matter, but if we are going to size the map, let's size it correctly and avoid a resize).

Contributor Author

good one. will fix.

@bleskes bleskes merged commit 1ca0b5e into elastic:master Sep 14, 2017

bleskes commented Sep 14, 2017

Thx @jasontedor @s1monw

bleskes added a commit that referenced this pull request Sep 14, 2017

bleskes added a commit that referenced this pull request Sep 14, 2017

jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Sep 16, 2017
* master:
  fix testSniffNodes to use the new error message
  Add check for invalid index in WildcardExpressionResolver (elastic#26409)
  Docs: Use single-node discovery.type for dev example
  Filter unsupported relation for range query builder (elastic#26620)
  Fix kuromoji default stoptags (elastic#26600)
  [Docs] Add description for missing fields in Reindex/Update/Delete By Query (elastic#26618)
  [Docs] Update ingest.asciidoc (elastic#26599)
  Better message text for ResponseException
  [DOCS] Remove edit link from ML node
  enable bwc testing
  fix StartRecoveryRequestTests.testSerialization
  Add bad_request to the rest-api-spec catch params (elastic#26539)
  Introduce a History UUID as a requirement for ops based recovery  (elastic#26577)
  Add missing catch arguments to the rest api spec (elastic#26536)
bleskes added a commit that referenced this pull request Sep 19, 2017
…#26694)

Restoring a shard from snapshot throws the primary back in time, violating assumptions and bringing the validity of global checkpoints into question. To avoid problems, we should make sure that a shard that was restored will never be the source of an ops based recovery to a shard that existed before the restore. To this end we have introduced the notion of `history_uuid` in #26577 and required that both source and target have the same history to allow ops based recoveries. This PR makes sure that a shard gets a new uuid after restore.

As suggested by @ywelsch, I derived the creation of a `history_uuid` from the `RecoverySource` of the shard. Store recovery will only generate a uuid if it doesn't already exist (we can make this stricter when we don't need to deal with 5.x indices). Peer recovery follows the same logic (note that this is different from the approach in #26557; I went this way as it means that shards always have a history uuid after being recovered on a 6.x node, and it also means that a rolling restart is enough for old indices to step over to the new seq no model). Local shards and snapshot force the generation of a new translog uuid.

Relates #10708
Closes #26544
bleskes added a commit that referenced this pull request Sep 19, 2017
…#26694)

bleskes added a commit that referenced this pull request Sep 19, 2017
…#26694)
@clintongormley clintongormley added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. and removed :Translog :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. labels Feb 13, 2018
@jpountz jpountz removed the :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. label Jan 29, 2019
Labels
:Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. >enhancement v6.0.0-rc1 v7.0.0-beta1

7 participants