Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Segment Replication] Fix bug where replica shows stale doc count during engine reset. #9495

Merged
merged 4 commits into from
Aug 28, 2023

Conversation

mch2
Copy link
Member

@mch2 mch2 commented Aug 22, 2023

Description

This change fixes an issue where replica shards can temporarily return stale results during promotion to primary. This happens in resetEngineToGlobalCheckpoint because the shard temporarily flips to a ReadOnly engine and initializes its reader from disk. This is possible because NRTReplicationEngine did not previously implement flush and the freshest data is only active on the reader. Fixed by implementing flush and also honoring acquireLatestCommit's flushFirst parameter.

Related Issues

related #8985

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@codecov
Copy link

codecov bot commented Aug 23, 2023

Codecov Report

Merging #9495 (3606e06) into main (569d5c2) will decrease coverage by 0.04%.
Report is 2 commits behind head on main.
The diff coverage is 65.71%.

@@             Coverage Diff              @@
##               main    #9495      +/-   ##
============================================
- Coverage     71.15%   71.11%   -0.04%     
+ Complexity    57536    57451      -85     
============================================
  Files          4781     4781              
  Lines        271197   271227      +30     
  Branches      39595    39599       +4     
============================================
- Hits         192975   192891      -84     
- Misses        62011    62119     +108     
- Partials      16211    16217       +6     
Files Changed Coverage Δ
.../opensearch/telemetry/tracing/noop/NoopTracer.java 57.14% <0.00%> (-9.53%) ⬇️
...rg/opensearch/telemetry/tracing/WrappedTracer.java 78.57% <50.00%> (-6.05%) ⬇️
.../opensearch/index/engine/NRTReplicationEngine.java 77.14% <58.82%> (-2.24%) ⬇️
...elemetry/tracing/OTelTracingContextPropagator.java 73.33% <70.00%> (-1.67%) ⬇️
...rg/opensearch/telemetry/tracing/DefaultTracer.java 93.10% <100.00%> (+0.51%) ⬆️
...racing/ThreadContextBasedTracerContextStorage.java 74.19% <100.00%> (+0.86%) ⬆️

... and 503 files with indirect coverage changes

@opensearch-trigger-bot
Copy link
Contributor

Compatibility status:

Checks if related components are compatible with change d1678ba

Incompatible components

Incompatible components: [https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/security-analytics.git]

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git]

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

Copy link
Member

@dreamer-89 dreamer-89 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because the shard temporarily flips to a ReadOnly engine and initializes its reader from disk.

Is it convenient to add unit test which mimics this state

@maosuhan
Copy link

maosuhan commented Aug 28, 2023

@mch2 I have copied the MR to my local repo, it does solve my problem now. The search result on bumping replica will not produce stale data now.

https://github.com/maosuhan/OpenSearch/tree/fix_sr
org.opensearch.indices.replication.SegmentReplicationPrimaryPromotionIT

@mch2 mch2 requested a review from msfroh as a code owner August 28, 2023 20:03
@mch2
Copy link
Member Author

mch2 commented Aug 28, 2023

because the shard temporarily flips to a ReadOnly engine and initializes its reader from disk.

Is it convenient to add unit test which mimics this state

added a better test to SegmentReplicationIndexShardTests

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

This change fixes an issue where replica shards can temporarily return stale results while converting to a RO engine during an engine reset.  This is possible because NRTReplicationEngine did not previously implement flush and the freshest data is only active on the reader. Fixed by implementing flush and also honoring acquireLatestCommit's flushFirst parameter.

Signed-off-by: Marc Handalian <handalm@amazon.com>
Signed-off-by: Marc Handalian <handalm@amazon.com>
Signed-off-by: Marc Handalian <handalm@amazon.com>
Signed-off-by: Marc Handalian <handalm@amazon.com>
@mch2
Copy link
Member Author

mch2 commented Aug 28, 2023

pushed a rebase - test failures look unrelated.

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@mch2 mch2 merged commit 012c4fa into opensearch-project:main Aug 28, 2023
10 of 12 checks passed
@mch2 mch2 added the backport 2.x Backport to 2.x branch label Aug 28, 2023
@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-9495-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 012c4fa4ec8b719cecd4e163c5fdc0e4f42679d3
# Push it to GitHub
git push --set-upstream origin backport/backport-9495-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-9495-to-2.x.

mch2 added a commit to mch2/OpenSearch that referenced this pull request Aug 28, 2023
…ing engine reset. (opensearch-project#9495)

* Fix bug where replica shows stale doc count during engine reset.

This change fixes an issue where replica shards can temporarily return stale results while converting to a RO engine during an engine reset.  This is possible because NRTReplicationEngine did not previously implement flush and the freshest data is only active on the reader. Fixed by implementing flush and also honoring acquireLatestCommit's flushFirst parameter.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Add changelog entry.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Add unit test for search during engine reset.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Remove useless test.

Signed-off-by: Marc Handalian <handalm@amazon.com>

---------

Signed-off-by: Marc Handalian <handalm@amazon.com>
mch2 added a commit that referenced this pull request Aug 29, 2023
…ing engine reset. (#9495) (#9595)

* Fix bug where replica shows stale doc count during engine reset.

This change fixes an issue where replica shards can temporarily return stale results while converting to a RO engine during an engine reset.  This is possible because NRTReplicationEngine did not previously implement flush and the freshest data is only active on the reader. Fixed by implementing flush and also honoring acquireLatestCommit's flushFirst parameter.



* Add changelog entry.



* Add unit test for search during engine reset.



* Remove useless test.



---------

Signed-off-by: Marc Handalian <handalm@amazon.com>
kaushalmahi12 pushed a commit to kaushalmahi12/OpenSearch that referenced this pull request Sep 12, 2023
…ing engine reset. (opensearch-project#9495)

* Fix bug where replica shows stale doc count during engine reset.

This change fixes an issue where replica shards can temporarily return stale results while converting to a RO engine during an engine reset.  This is possible because NRTReplicationEngine did not previously implement flush and the freshest data is only active on the reader. Fixed by implementing flush and also honoring acquireLatestCommit's flushFirst parameter.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Add changelog entry.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Add unit test for search during engine reset.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Remove useless test.

Signed-off-by: Marc Handalian <handalm@amazon.com>

---------

Signed-off-by: Marc Handalian <handalm@amazon.com>
Signed-off-by: Kaushal Kumar <ravi.kaushal97@gmail.com>
brusic pushed a commit to brusic/OpenSearch that referenced this pull request Sep 25, 2023
…ing engine reset. (opensearch-project#9495)

* Fix bug where replica shows stale doc count during engine reset.

This change fixes an issue where replica shards can temporarily return stale results while converting to a RO engine during an engine reset.  This is possible because NRTReplicationEngine did not previously implement flush and the freshest data is only active on the reader. Fixed by implementing flush and also honoring acquireLatestCommit's flushFirst parameter.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Add changelog entry.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Add unit test for search during engine reset.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Remove useless test.

Signed-off-by: Marc Handalian <handalm@amazon.com>

---------

Signed-off-by: Marc Handalian <handalm@amazon.com>
Signed-off-by: Ivan Brusic <ivan.brusic@flocksafety.com>
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
…ing engine reset. (opensearch-project#9495)

* Fix bug where replica shows stale doc count during engine reset.

This change fixes an issue where replica shards can temporarily return stale results while converting to a RO engine during an engine reset.  This is possible because NRTReplicationEngine did not previously implement flush and the freshest data is only active on the reader. Fixed by implementing flush and also honoring acquireLatestCommit's flushFirst parameter.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Add changelog entry.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Add unit test for search during engine reset.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Remove useless test.

Signed-off-by: Marc Handalian <handalm@amazon.com>

---------

Signed-off-by: Marc Handalian <handalm@amazon.com>
Signed-off-by: Shivansh Arora <hishiv@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch backport-failed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants