[BUG] [Segment Replication] local checkpoint falling behind global checkpoint #3832

Closed
Poojita-Raj opened this issue Jul 8, 2022 · 5 comments
Labels: bug, distributed framework

@Poojita-Raj
Contributor

Describe the bug
Occasionally, with segment replication enabled, we see the following assertion failure while adding, refreshing, or deleting documents:

java.lang.AssertionError: supposedly in-sync shard copy received a global checkpoint [0] that is higher than its local checkpoint [-1]

To Reproduce
Steps to reproduce the behavior:

  1. Create an index with segment replication enabled (a sketch of this step follows the list).
  2. Try out different operations on the index - adding, deleting, refreshing, etc.
  3. Occasionally, we will see a failure caused by the local checkpoint falling behind the global checkpoint - which is supposed to be the global minimum checkpoint.
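
For reference, step 1 could look roughly like the sketch below, using the same integration-test helpers that appear later in this thread. The index name and shard/replica counts are placeholders, and the "index.replication.type" setting key is my assumption for how segment replication is enabled per index.

// Sketch only: create an index with segment replication enabled.
// "test-index" and the shard/replica counts are placeholders.
createIndex(
    "test-index",
    Settings.builder()
        .put(IndexMetadata.SETTING_NUMBER_OF_SHARDS, 1)
        .put(IndexMetadata.SETTING_NUMBER_OF_REPLICAS, 1)
        .put("index.replication.type", "SEGMENT") // assumed setting key for segment replication
        .build()
);
ensureGreen("test-index");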

Expected behavior
The global checkpoint calculation must always take all primaries and replicas into account, since it is the global minimum checkpoint guaranteed to have been processed by all copies. We need to ensure this error is not produced by regular operations on an index with segment replication enabled.
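
For context, the failing assertion enforces roughly the following invariant (a simplified sketch, not the actual OpenSearch implementation; the method name and signature here are hypothetical): an in-sync copy should never be handed a global checkpoint that is higher than its own local checkpoint, because the global checkpoint is computed as the minimum local checkpoint across all in-sync copies.

// Simplified sketch of the invariant behind the assertion; not the real ReplicationTracker code.
void onGlobalCheckpointUpdate(long globalCheckpoint, long localCheckpoint) {
    // The global checkpoint is min(local checkpoints of all in-sync copies), so an
    // in-sync copy's local checkpoint must never be below it.
    assert globalCheckpoint <= localCheckpoint
        : "supposedly in-sync shard copy received a global checkpoint ["
            + globalCheckpoint + "] that is higher than its local checkpoint [" + localCheckpoint + "]";
    // ... apply the new global checkpoint ...
}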

@dreamer-89
Member

Looking

@dreamer-89
Member

I wasn't able to reproduce this when starting a server and manually performing ingestion, deletions, and occasional refreshes.

Writing a test to perform these operations at a larger scale.

@dreamer-89
Member

dreamer-89 commented Jul 12, 2022

@Poojita-Raj: I am not able to reproduce this failure locally. I tried running a server manually and wrote an integration test (below), without success. Can you share more insight into how to reproduce this?

public void testRestoreOnSegRep_issue3832() throws Exception {
    initSetup(addShardSettings(segRepEnableIndexSettings()));
    // Perform indexing, deletions & refreshes in a loop
    final int numDocs = scaledRandomIntBetween(10000, 100000);
    int delCount = 0, refreshCount = 0;
    for (int i = 0; i < numDocs; i++) {
        client().prepareIndex().setIndex(INDEX_NAME).setId(String.valueOf(i)).setSource("{\"foo\": \"bar\"}", XContentType.JSON).get();
        if (i % 5 == 0 && randomBoolean()) {
            delCount++;
            // .get() is needed here so the delete request is actually executed
            client().prepareDelete(INDEX_NAME, String.valueOf(i)).get();
        }
        if (i % 10 == 0 && randomBoolean()) {
            refreshCount++;
            client().admin().indices().prepareRefresh(INDEX_NAME).get();
        }
    }
    logger.info("Del count {}, refresh count {}", delCount, refreshCount);
    ensureGreen(INDEX_NAME);
}

Branch: https://github.com/dreamer-89/OpenSearch/commits/segrep_snapshot (note: I needed to pull in the delete-doc-related fix).
OS: macOS

@dreamer-89
Member

dreamer-89 commented Jul 14, 2022

Thanks to @mch2. The issue is reproducible when a replica is started during an ongoing indexing operation on the primary. The fix is tracked in PR #3743.

public void testReplicaRecover() throws Exception {
    // Start a single node that will host the primary, with no replicas yet.
    final String primary = internalCluster().startNode();
    createIndex(INDEX_NAME, Settings.builder().put(indexSettings()).put(IndexMetadata.SETTING_NUMBER_OF_REPLICAS, 0).build());
    ensureGreen(INDEX_NAME);

    final int initialDocCount = scaledRandomIntBetween(0, 200);
    try (
        BackgroundIndexer indexer = new BackgroundIndexer(
            INDEX_NAME,
            "_doc",
            client(),
            -1,
            RandomizedTest.scaledRandomIntBetween(2, 5),
            false,
            random()
        )
    ) {
        // Start indexing on the primary, then bring up a replica while indexing is in flight.
        indexer.start(initialDocCount);
        refresh(INDEX_NAME);
        final String replica = internalCluster().startNode();
        assertAcked(
            client().admin()
                .indices()
                .prepareUpdateSettings(INDEX_NAME)
                .setSettings(Settings.builder().put(IndexMetadata.SETTING_NUMBER_OF_REPLICAS, 1))
        );
        ensureGreen(INDEX_NAME);
        waitForDocs(initialDocCount, indexer);
    }
}

@Rishikesh1159
Member

Closing this as it is not reproducible. Please feel free to reopen it if there are solid steps to reproduce.
