Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Segment Replication] [BUG] Document update action fails segment replication #3787

Closed
Tracked by #3969
dreamer-89 opened this issue Jul 6, 2022 · 4 comments
Closed
Tracked by #3969
Assignees
Labels
bug Something isn't working distributed framework

Comments

@dreamer-89
Copy link
Member

Describe the bug
Existing update documents (update/delete) cause segment replication failures.

To Reproduce
Steps to reproduce the behavior:

  1. Go to mch2/wireAfterCopy branch
  2. Run testDeleteOperations test
  3. Fails with below stack trace.
[2022-07-06T12:13:25,926][DEBUG][o.o.i.r.SegmentReplicationTarget] [node_t0] [test-idx-1][0] Replication diff RecoveryDiff{identical=[name [_1.cfe], length [479], checksum [me8w6i], writtenBy [9.3.0], name [_1.si], length [363], checksum [1tjxs3x], writtenBy [9.3.0], name [_1.cfs], length [13827], checksum [1q6sqie], writtenBy [9.3.0], name [segments_2], length [208], checksum [13zhw9i], writtenBy [9.3.0]], different=[name [_0.si], length [363], checksum [zcklpw], writtenBy [9.3.0], name [_0.cfs], length [47268], checksum [1pdi3gw], writtenBy [9.3.0], name [_0.cfe], length [479], checksum [f87ghx], writtenBy [9.3.0]], missing=[name [_0_1_Lucene90_0.dvm], length [160], checksum [zmgvqe], writtenBy [9.3.0], name [_0_1_Lucene90_0.dvd], length [87], checksum [smuqs], writtenBy [9.3.0], name [_0_1.fnm], length [1205], checksum [pd6goq], writtenBy [9.3.0], name [_2.si], length [363], checksum [zl0at7], writtenBy [9.3.0], name [_2.cfs], length [2372], checksum [16hup6q], writtenBy [9.3.0], name [_2.cfe], length [405], checksum [1uzd7pf], writtenBy [9.3.0]]}
[2022-07-06T12:13:25,927][ERROR][o.o.i.r.SegmentReplicationTargetService] [node_t0] replication failure
org.opensearch.OpenSearchException: Segment Replication failed
	at org.opensearch.indices.replication.SegmentReplicationTargetService$4.onFailure(SegmentReplicationTargetService.java:292) [main/:?]
	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) [main/:?]
	at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:103) [main/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) [main/:?]
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:327) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [main/:?]
	at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [main/:?]
	at org.opensearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:178) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:149) [main/:?]
	at org.opensearch.action.StepListener.innerOnFailure(StepListener.java:82) [main/:?]
	at org.opensearch.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:62) [main/:?]
	at org.opensearch.indices.replication.SegmentReplicationTarget.getFiles(SegmentReplicationTarget.java:175) [main/:?]
	at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$startReplication$2(SegmentReplicationTarget.java:156) [main/:?]
	at org.opensearch.action.ActionListener$1.onResponse(ActionListener.java:80) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126) [main/:?]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [main/:?]
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:327) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [main/:?]
	at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [main/:?]
	at org.opensearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:160) [main/:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:141) [main/:?]
	at org.opensearch.action.StepListener.innerOnResponse(StepListener.java:77) [main/:?]
	at org.opensearch.action.NotifyOnceListener.onResponse(NotifyOnceListener.java:55) [main/:?]
	at org.opensearch.action.ActionListener$4.onResponse(ActionListener.java:180) [main/:?]
	at org.opensearch.action.ActionListener$6.onResponse(ActionListener.java:299) [main/:?]
	at org.opensearch.action.support.RetryableAction$RetryingListener.onResponse(RetryableAction.java:161) [main/:?]
	at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:69) [main/:?]
	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1360) [main/:?]
	at org.opensearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:393) [main/:?]
	at org.opensearch.transport.InboundHandler.lambda$handleResponse$1(InboundHandler.java:387) [main/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:739) [main/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: java.lang.IllegalStateException: Shard [test-idx-1][0] has local copies of segments that differ from the primary [name [_0.si], length [363], checksum [zcklpw], writtenBy [9.3.0], name [_0.cfs], length [47268], checksum [1pdi3gw], writtenBy [9.3.0], name [_0.cfe], length [479], checksum [f87ghx], writtenBy [9.3.0]]
	at org.opensearch.indices.replication.SegmentReplicationTarget.getFiles(SegmentReplicationTarget.java:181) ~[main/:?]
	... 24 more
[2022-07-06T12:13:25,933][DEBUG][o.o.i.t.Translog         ] [node_t0] [test-idx-1][0] translog closed
[2022-07-06T12:13:25,933][DEBUG][o.o.i.e.Engine           ] [node_t0] [test-idx-1][0] engine closed [engine failed on: [replication failure]]
[2022-07-06T12:13:25,933][WARN ][o.o.i.e.Engine           ] [node_t0] [test-idx-1][0] failed engine [replication failure]

Expected behavior
Delete operations related tests should pass

This is a part of the process of merging our feature branch - feature/segment-replication - back into main by re-PRing our changes from the feature branch.
The breakdown of the plan to merge to main is detailed here: #2355
For added context on segment replication - here's the design proposal #2229

@dreamer-89 dreamer-89 added bug Something isn't working untriaged and removed untriaged labels Jul 6, 2022
@dreamer-89
Copy link
Member Author

The tests fails during segment replication step. Existing flow has one validation step to verify that MetadataSnapshot on replica should match exactly with primary for already present segments on replica.

@dreamer-89
Copy link
Member Author

dreamer-89 commented Jul 6, 2022

From failure trace, Replication diff is misleading as the underlying method used for evaluate difference MetaDataSnapshot.recoveryDiff surfaces identical files to caller rather than the one which are missing in replica snapshot here.

[2022-07-06T12:13:25,926][DEBUG][o.o.i.r.SegmentReplicationTarget] [node_t0] [test-idx-1][0] Replication diff RecoveryDiff{identical=[name [_1.cfe], length [479], checksum [me8w6i], writtenBy [9.3.0], name [_1.si], length [363], checksum [1tjxs3x], writtenBy [9.3.0], name [_1.cfs], length [13827], checksum [1q6sqie], writtenBy [9.3.0], name [segments_2], length [208], checksum [13zhw9i], writtenBy [9.3.0]], different=[name [_0.si], length [363], checksum [zcklpw], writtenBy [9.3.0], name [_0.cfs], length [47268], checksum [1pdi3gw], writtenBy [9.3.0], name [_0.cfe], length [479], checksum [f87ghx], writtenBy [9.3.0]], missing=[name [_0_1_Lucene90_0.dvm], length [160], checksum [zmgvqe], writtenBy [9.3.0], name [_0_1_Lucene90_0.dvd], length [87], checksum [smuqs], writtenBy [9.3.0], name [_0_1.fnm], length [1205], checksum [pd6goq], writtenBy [9.3.0], name [_2.si], length [363], checksum [zl0at7], writtenBy [9.3.0], name [_2.cfs], length [2372], checksum [16hup6q], writtenBy [9.3.0], name [_2.cfe], length [405], checksum [1uzd7pf], writtenBy [9.3.0]]}

@dreamer-89
Copy link
Member Author

dreamer-89 commented Jul 6, 2022

The actual cause of failure is missing segment files (created by lucene to track live/deleted documents) on replica specific to a segment number. The existing checkpoint comparator fails when it sees new files from primary's checkpoint info for a specific segment during comparison.

Wrote a quick test in lucene library TestIndexWriter which confirms that delete operations creates .liv files (storing info about live documents). The _0_1.liv file is generated with commit post deleteDocuments action. Though, the behavior is slightly different on OpenSearch engine which generates .dvd & .tvd files instead of .liv on primary post commit operation (IMMEDIATE referesh policy). I think we need to change existing mechanism to evaluate diff between replica & primary; which currently prevents replica from accepting new segment number specific files (generated to track deleted documents).

  public void testDeleteDocuments throws Exception {
    Path path = createTempDir();
    Directory d = newFSDirectory(path);
    IndexWriter w = new IndexWriter(d, new IndexWriterConfig(new MockAnalyzer(random())));
    for (int i = 0; i < 20; i++) {
      addDocWithIndex(w, i);
    }
    w.commit();
    w.deleteDocuments(new Term("id", "0"));
    w.deleteDocuments(new Term("id", "1"));
    w.commit();
  }

Screen Shot 2022-07-06 at 10 43 16 AM

@dreamer-89
Copy link
Member Author

dreamer-89 commented Aug 11, 2022

Picked up @Poojita-Raj changes from #4117. Raised new PR:

  1. Addressed review comments from @mch2
  2. Added unit test which creates actual store directory & perform comparison when deletions are performed.

PR #4185

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working distributed framework
Projects
None yet
3 participants