
[Segment Replication] Should consider using RAFT consensus algorithm for Segment replication #6369

Open
Jeevananthan-23 opened this issue Feb 18, 2023 · 4 comments
Labels
enhancement (Enhancement or improvement to existing feature or request), Indexing:Replication (Issues and PRs related to core replication framework, e.g. segrep)

Comments


Jeevananthan-23 commented Feb 18, 2023

Hello @mikemccand / @mch2, I would like to understand, in the context of the proposals below, how shard promotion (leader election) works. Also, why not consider a distributed consensus algorithm like Raft?

  1. Send a _{n} (where n is larger than the largest segment counter in the current SegmentInfos) to the master node before segment replication progresses, so that the master can tell the newly promoted primary shard not to generate any segment less than _{n}. To reduce the pressure on the master node, we don't need to send this information every time. For example, if the max segment is _4.si in the primary's current SegmentInfos, we can send _rw (or 1004 in decimal) to the master node. Once the segments reach _rw.si, we send _1jo (or 2004 in decimal) to the master node.
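For context, Lucene derives segment names from the SegmentInfos counter written in base 36 with a leading underscore, which is where values like _rw = 1004 and _1jo = 2004 come from. A minimal sketch of that mapping (the helper class and the bump amount in main are illustrative, not Lucene or OpenSearch APIs):

```java
// Segment names relate to the SegmentInfos counter as base-36 strings with a
// leading underscore, e.g. _rw = 1004 and _1jo = 2004. Illustrative helper only.
public final class SegmentNameCounter {

    // "_rw" -> 1004, "_1jo" -> 2004
    static long counterFromSegmentName(String name) {
        return Long.parseLong(name.substring(1), Character.MAX_RADIX); // radix 36
    }

    // 1004 -> "_rw", 2004 -> "_1jo"
    static String segmentNameFromCounter(long counter) {
        return "_" + Long.toString(counter, Character.MAX_RADIX);
    }

    public static void main(String[] args) {
        long current = counterFromSegmentName("_4");          // 4
        long reserved = current + 1000;                       // the "known amount" from the proposal
        System.out.println(segmentNameFromCounter(reserved)); // _rw
        System.out.println(counterFromSegmentName("_1jo"));   // 2004
    }
}
```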

What happens when the primary shard dies first inside the cluster and the newly promoted primary shard has the same segment numbers? How does promotion happen?

  1. Before choosing a replica for the promotion process, the master node must ask all replicas which one has the newest replication state.

Using a distributed consensus algorithm like Raft seems like a great choice, because it would cover copying merged segments to replicas and support leader election, as @mikemccand mentioned in his blog post on segment replication and cluster state.

Originally posted by @Jeevananthan-23 in #2212 (comment)

anasalkouz added the distributed framework and enhancement labels Feb 21, 2023
Jeevananthan-23 changed the title from "[Q] Should consider using RAFT consensus algorithm for Segment replication" to "[Segment Replication] Should consider using RAFT consensus algorithm for Segment replication" Feb 21, 2023

Jeevananthan-23 commented Feb 24, 2023


mch2 commented Feb 24, 2023

@Jeevananthan-23 Thanks for raising this! Consensus is useful for electing cluster manager nodes, but I don't think it's required on primary failure.

#2212 is about handling failover within a replication group with segment replication enabled. In the failover case today, the cluster manager node decides which replica should be elected as the new primary within the replication group here, by considering only whether the candidate is active and selecting the one furthest ahead in terms of OpenSearch version. With segment replication, we also want to take into account the candidate's latest SegmentInfos version. We want to do this to ensure that we 1) do not reindex documents that have already been indexed and 2) do not create new segments with the same name as segments that already exist somewhere within the replication group.
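As a rough illustration of that selection (using simplified stand-in types rather than the actual OpenSearch routing and allocation classes), the choice could look something like this:

```java
// A sketch of the failover choice described above: among active replicas,
// prefer the highest node version and then the newest SegmentInfos version.
// Candidate is a simplified stand-in, not an OpenSearch class.
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

record Candidate(String allocationId, boolean active, int nodeVersionId, long segmentInfosVersion) {}

final class PrimaryPromotion {

    static Optional<Candidate> choosePrimary(List<Candidate> replicas) {
        return replicas.stream()
                .filter(Candidate::active)                                   // only active shard copies are eligible
                .max(Comparator.comparingInt(Candidate::nodeVersionId)       // furthest ahead in OpenSearch version
                        .thenComparingLong(Candidate::segmentInfosVersion)); // then highest SegmentInfos version
    }
}
```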

What happens when the primary shard dies first inside the cluster and the newly promoted primary shard has the same segment numbers? How does promotion happen?

This is the desired case with segment replication. The newly promoted primary would have previously been syncing segments with the old primary, so it will have up to the old primary's latest segments at the time of failover. The new primary will continue indexing and create new segments that no other replica in the group has.

If the newly elected primary is behind the old primary but another replica in the replication group is up to date, this is where the conflict occurs. The newly elected primary will in this case replay from its translog after promotion and create new segments with the same names as existing segments on the other replica. #4365 was an attempt to prevent the newly elected primary from creating new segments with a name higher than that of any segment on a pre-existing replica. However, this solution is not foolproof: we only bump the counter (which drives the segment name) by an arbitrary amount, so if the newly elected primary was behind the old one by more than that amount, we could still see conflicts. If this happens, the newly elected primary will continue, yet the replicas would fail and need recovery.
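A minimal sketch of the counter-bump idea behind #4365 (the constant and names are illustrative, not the actual change), which also shows why a fixed bump is not foolproof:

```java
// Sketch: on promotion, advance the new primary's local SegmentInfos counter
// by a fixed amount so newly written segments are unlikely to reuse names that
// already exist on other replicas. The bump value and names are hypothetical.
final class SegmentCounterBump {

    // If the old primary was ahead of the new one by more than this amount,
    // segment-name collisions are still possible -- the gap noted above.
    private static final long PROMOTION_COUNTER_BUMP = 1_000;

    static long counterAfterPromotion(long localCounter) {
        return localCounter + PROMOTION_COUNTER_BUMP;
    }
}
```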

Send a _{n} (where n is larger than the largest segment counter in the current SegmentInfos) to the master node before segment replication progresses, so that the master can tell the newly promoted primary shard not to generate any segment less than _{n}. To reduce the pressure on the master node, we don't need to send this information every time. For example, if the max segment is _4.si in the primary's current SegmentInfos, we can send _rw (or 1004 in decimal) to the master node. Once the segments reach _rw.si, we send _1jo (or 2004 in decimal) to the master node.

This was a suggestion to store the former primary's state within cluster state, so that we increase the counter by a known amount instead of some arbitrary long.

IMO we should update this logic that executes on cluster managers to fetch the latest checkpoint from all candidate replicas and select the one with the highest value. This would add some latency to fetch from each replica, but I can't imagine it being too expensive in exchange for correctness. Alternatively, we could store the state in cluster state after each replica updates to a new set of segments, so that cluster managers already have it, but this would be a frequent update.
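A sketch of the first option (fetch, then select), where fetchCheckpoint is a hypothetical stand-in for a real per-replica transport call:

```java
// Sketch: ask every candidate replica for its latest replication checkpoint
// and promote the one furthest ahead. fetchCheckpoint is a hypothetical async
// call, not an existing OpenSearch transport action.
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;
import java.util.stream.Collectors;

final class CheckpointBasedPromotion {

    static String selectNewPrimary(List<String> candidateAllocationIds,
                                   Function<String, CompletableFuture<Long>> fetchCheckpoint) {
        // Kick off all fetches up front so the per-replica round trips overlap;
        // the added latency is roughly one round trip, not one per replica.
        Map<String, CompletableFuture<Long>> inFlight = candidateAllocationIds.stream()
                .collect(Collectors.toMap(Function.identity(), fetchCheckpoint));

        // Wait for the responses, then pick the candidate with the highest checkpoint.
        Map<String, Long> checkpoints = inFlight.entrySet().stream()
                .collect(Collectors.toMap(Map.Entry::getKey, e -> e.getValue().join()));
        return checkpoints.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow(() -> new IllegalStateException("no candidate replicas"));
    }
}
```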

Jeevananthan-23 (Author) commented

@Jeevananthan-23 Thanks for raising this! Consensus is useful for electing cluster manager nodes, but I don't think it's required on primary failure.

@mch2 Sorry for the late reply; I did some research on existing Elasticsearch consensus solutions, and they also don't rely on Raft consensus.

#2212 is about handling failover within a replication group with segment replication enabled. In the failover case today, the cluster manager node decides which replica should be elected as the new primary within the replication group here, by considering only whether the candidate is active and selecting the one furthest ahead in terms of OpenSearch version. With segment replication, we also want to take into account the candidate's latest SegmentInfos version. We want to do this to ensure that we 1) do not reindex documents that have already been indexed and 2) do not create new segments with the same name as segments that already exist somewhere within the replication group.

As you mentioned here, the new primary promotion must take the latest SegmentInfos version into account.

What happens when the primary shard dies first inside the cluster and the newly promoted primary shard has the same segment numbers? How does promotion happen?

This is the desired case with segment replication. The newly promoted primary would have previously been syncing segments with the old primary, so it will have up to the old primary's latest segments at the time of failover. The new primary will continue indexing and create new segments that no other replica in the group has.

If the newly elected primary is behind the old primary but another replica in the replication group is up to date, this is where the conflict occurs. The newly elected primary will in this case replay from its translog after promotion and create new segments with the same names as existing segments on the other replica. #4365 was an attempt to prevent the newly elected primary from creating new segments with a name higher than that of any segment on a pre-existing replica. However, this solution is not foolproof: we only bump the counter (which drives the segment name) by an arbitrary amount, so if the newly elected primary was behind the old one by more than that amount, we could still see conflicts. If this happens, the newly elected primary will continue, yet the replicas would fail and need recovery.

My proposal here is that we should first consider the translog when promoting the newly elected primary, by using Raft.

Send a _{n} (where n is larger than the largest segment counter in the current SegmentInfos) to the master node before segment replication progresses, so that the master can tell the newly promoted primary shard not to generate any segment less than _{n}. To reduce the pressure on the master node, we don't need to send this information every time. For example, if the max segment is _4.si in the primary's current SegmentInfos, we can send _rw (or 1004 in decimal) to the master node. Once the segments reach _rw.si, we send _1jo (or 2004 in decimal) to the master node.

This was a suggestion to store the former primary's state within cluster state, so that we increase the counter by a known amount instead of some arbitrary long.

IMO we should update this logic that executes on cluster managers to fetch the latest checkpoint from all candidate replicas and select the one with the highest value. This would add some latency to fetch from each replica, but I can't imagine it being too expensive in exchange for correctness. Alternatively, we could store the state in cluster state after each replica updates to a new set of segments, so that cluster managers already have it, but this would be a frequent update.

How is the latest checkpoint fetched under sequence-number-based replication? As you mentioned, this has some latency, and this is the point where a proper implementation of Raft for coordination should help.

I know that this may be difficult to implement, but I'm looking forward to the benchmarking results from #2583.

Thanks!

shwetathareja (Member) commented

IMO we should update this logic that executes on cluster managers to fetch the latest checkpoint from all candidate replicas and select the one with the highest value. This would add some latency to fetch from each replica, but I can't imagine it being too expensive in exchange for correctness.

+1 to adding the logic to fetch the latest checkpoint before promoting a replica to primary. Whenever you choose to implement it, a note on the code reference for RoutingNodes: that logic is executed when processing a new cluster state, which runs in the single-threaded executor for cluster state updates

maybeUpdatedState = applyFailedShards(currentState, failedShardsToBeApplied, staleShardsToBeApplied);
so don't update that logic directly. Rather, add separate transport logic to first fetch this information and then promote the replica for segrep indices.
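Roughly, the suggested flow separates the fetch from the cluster state update (all names below are hypothetical stand-ins; the selection itself was sketched earlier in the thread):

```java
// Two-step promotion for segrep indices, per the suggestion above:
// 1) fetch the latest checkpoints through a dedicated transport action,
//    off the single-threaded cluster state applier,
// 2) only then submit the cluster state update that promotes the chosen replica.
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Function;

final class SegrepFailoverFlow {

    static void handlePrimaryFailure(List<String> candidates,
                                     Function<List<String>, Map<String, Long>> fetchLatestCheckpoints,
                                     Consumer<String> submitPromotionClusterStateUpdate) {
        // Step 1: gather checkpoints outside the cluster state update executor.
        Map<String, Long> checkpoints = fetchLatestCheckpoints.apply(candidates);

        // Step 2: pick the furthest-ahead replica and hand off to the
        // single-threaded cluster state update task.
        checkpoints.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .ifPresent(submitPromotionClusterStateUpdate);
    }
}
```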

Alternatively, we could store the state in cluster state after each replica updates to a new set of segments, so that cluster managers already have it, but this would be a frequent update.

This might turn out to be expensive if segments are created every few seconds; it could result in too many requests to the ClusterManager and wouldn't be preferred. The ClusterManager shouldn't be in the direct indexing path.

Bukhtawar added the Indexing:Replication label Jul 27, 2023