[BUG] [Segment Replication] local checkpoint falling behind global checkpoint #3832

Closed
Poojita-Raj opened this issue Jul 8, 2022 · 5 comments
Labels: bug, distributed framework

@Poojita-Raj
Contributor

Describe the bug
Occasionally, with segment replication enabled, we see the following assertion failure while adding, refreshing, or deleting documents:

java.lang.AssertionError: supposedly in-sync shard copy received a global checkpoint [0] that is higher than its local checkpoint [-1]

To Reproduce
Steps to reproduce the behavior:

  1. Create an index with segment replication enabled (a sketch of this step follows the list).
  2. Try out different operations on the index - adding, deleting, refreshing, etc.
  3. Occasionally, we will see a failure caused by the local checkpoint falling behind the global checkpoint - which is supposed to be the global minimum checkpoint.
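
For reference, step 1 could look roughly like the sketch below, using the same integration-test helpers that appear later in this thread. The index name and shard/replica counts are placeholders, and the "index.replication.type" setting key is my assumption for how segment replication is enabled per index.

// Sketch only: create an index with segment replication enabled.
// "test-index" and the shard/replica counts are placeholders.
createIndex(
    "test-index",
    Settings.builder()
        .put(IndexMetadata.SETTING_NUMBER_OF_SHARDS, 1)
        .put(IndexMetadata.SETTING_NUMBER_OF_REPLICAS, 1)
        .put("index.replication.type", "SEGMENT") // assumed setting key for segment replication
        .build()
);
ensureGreen("test-index");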

Expected behavior
The global checkpoint calculation must always take all primaries and replicas into account, since it is the global minimum checkpoint guaranteed to have been processed by all copies. We need to ensure this error is not produced by regular operations on an index with segment replication enabled.
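
For context, the failing assertion enforces roughly the following invariant (a simplified sketch, not the actual OpenSearch implementation; the method name and signature here are hypothetical): an in-sync copy should never be handed a global checkpoint that is higher than its own local checkpoint, because the global checkpoint is computed as the minimum local checkpoint across all in-sync copies.

// Simplified sketch of the invariant behind the assertion; not the real ReplicationTracker code.
void onGlobalCheckpointUpdate(long globalCheckpoint, long localCheckpoint) {
    // The global checkpoint is min(local checkpoints of all in-sync copies), so an
    // in-sync copy's local checkpoint must never be below it.
    assert globalCheckpoint <= localCheckpoint
        : "supposedly in-sync shard copy received a global checkpoint ["
            + globalCheckpoint + "] that is higher than its local checkpoint [" + localCheckpoint + "]";
    // ... apply the new global checkpoint ...
}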

@dreamer-89
Member

Looking

@dreamer-89
Member

I wasn't able to reproduce this when starting a server and manually performing ingestion, deletions, and occasional refreshes.

Writing a test to perform these operations at a larger scale.

@dreamer-89
Member

dreamer-89 commented Jul 12, 2022

@Poojita-Raj: I am not able to reproduce this failure locally. I tried running a server manually and wrote an integration test (below), without success. Can you share more insight into how to reproduce this?

public void testRestoreOnSegRep_issue3832() throws Exception {
    initSetup(addShardSettings(segRepEnableIndexSettings()));
    // Perform indexing, deletions & refreshes in a loop
    final int numDocs = scaledRandomIntBetween(10000, 100000);
    int delCount = 0, refreshCount = 0;
    for (int i = 0; i < numDocs; i++) {
        client().prepareIndex().setIndex(INDEX_NAME).setId(String.valueOf(i)).setSource("{\"foo\": \"bar\"}", XContentType.JSON).get();
        if (i % 5 == 0 && randomBoolean()) {
            delCount++;
            // .get() is needed here so the delete request is actually executed
            client().prepareDelete(INDEX_NAME, String.valueOf(i)).get();
        }
        if (i % 10 == 0 && randomBoolean()) {
            refreshCount++;
            client().admin().indices().prepareRefresh(INDEX_NAME).get();
        }
    }
    logger.info("Del count {}, refresh count {}", delCount, refreshCount);
    ensureGreen(INDEX_NAME);
}

Branch: https://github.com/dreamer-89/OpenSearch/commits/segrep_snapshot (note: I needed to pull in the delete-doc-related fix).
OS: macOS

@dreamer-89
Member

dreamer-89 commented Jul 14, 2022

Thanks to @mch2. The issue is reproducible when a replica is started during an ongoing indexing operation on the primary. The fix is tracked in PR #3743.

public void testReplicaRecover() throws Exception {
    // Start a single node that will host the primary, with no replicas yet.
    final String primary = internalCluster().startNode();
    createIndex(INDEX_NAME, Settings.builder().put(indexSettings()).put(IndexMetadata.SETTING_NUMBER_OF_REPLICAS, 0).build());
    ensureGreen(INDEX_NAME);

    final int initialDocCount = scaledRandomIntBetween(0, 200);
    try (
        BackgroundIndexer indexer = new BackgroundIndexer(
            INDEX_NAME,
            "_doc",
            client(),
            -1,
            RandomizedTest.scaledRandomIntBetween(2, 5),
            false,
            random()
        )
    ) {
        // Start indexing on the primary, then bring up a replica while indexing is in flight.
        indexer.start(initialDocCount);
        refresh(INDEX_NAME);
        final String replica = internalCluster().startNode();
        assertAcked(
            client().admin()
                .indices()
                .prepareUpdateSettings(INDEX_NAME)
                .setSettings(Settings.builder().put(IndexMetadata.SETTING_NUMBER_OF_REPLICAS, 1))
        );
        ensureGreen(INDEX_NAME);
        waitForDocs(initialDocCount, indexer);
    }
}

@Rishikesh1159
Member

Closing this as it is not reproducible. Please feel free to reopen it if there are solid steps to reproduce.
