Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added timing data and more granular stages to SegmentReplicationState #4222

Merged
merged 4 commits into from
Aug 17, 2022

Conversation

kartg
Copy link
Member

@kartg kartg commented Aug 16, 2022

Description

This change introduces instrumentation logging that measures the latency of the various stages of segment replication as seen by each replica. Logs have also been added to the source node for checkpoint publishing and checkpoint metadata responses. All logging is currently at the TRACE level.

Signed-off-by: Kartik Ganesh gkart@amazon.com

Issues Resolved

Part of #2583

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@kartg kartg requested review from a team and reta as code owners August 16, 2022 02:23
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@dreamer-89
Copy link
Member

Looks unrelated, refiring.

 REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.index.fielddata.SortedSetDVStringFieldDataTests.testSortMissingFirstReverse" -Dtests.seed=9E522164F9706EE1 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=sv -Dtests.timezone=Pacific/Kwajalein -Druntime.java=17
  2> java.lang.AssertionError: expected:<1201> but was:<1198>

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@dreamer-89
Copy link
Member

Looks like a legit failure. @kartg you may want to look

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.indices.replication.SegmentReplicationTargetTests.testFailureResponse_getSegmentFiles" -Dtests.seed=C313BB5FDBB52A99 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=ms-MY -Dtests.timezone=Asia/Bangkok -Druntime.java=17

org.opensearch.indices.replication.SegmentReplicationTargetTests > testFailureResponse_getSegmentFiles FAILED
    java.lang.AssertionError: can't move replication to stage [DONE]. current stage: [GET_FILES] (expected [FINALIZE_REPLICATION])
        at __randomizedtesting.SeedInfo.seed([C313BB5FDBB52A99:7591A80679BD22D6]:0)
        at org.opensearch.indices.replication.SegmentReplicationState.validateAndSetStage(SegmentReplicationState.java:113)
        at org.opensearch.indices.replication.SegmentReplicationState.setStage(SegmentReplicationState.java:150)
        at org.opensearch.indices.replication.SegmentReplicationTarget.onDone(SegmentReplicationTarget.java:88)
        at org.opensearch.indices.replication.common.ReplicationTarget.markAsDone(ReplicationTarget.java:141)
        at org.opensearch.indices.replication.SegmentReplicationTargetTests.tearDown(SegmentReplicationTargetTests.java:501)

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@kartg
Copy link
Member Author

kartg commented Aug 16, 2022

Looks like two flaky test failures:

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.action.admin.cluster.node.tasks.ResourceAwareTasksTests.testTaskIdPersistsInThreadContext" -Dtests.seed=EFD06E521F630F36 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=de-DE -Dtests.timezone=America/Santarem -Druntime.java=17

org.opensearch.action.admin.cluster.node.tasks.ResourceAwareTasksTests > testTaskIdPersistsInThreadContext FAILED
    java.lang.AssertionError: 
    Expected: iterable with items [<1L>] in any order
         but: no match for: <2L>
        at __randomizedtesting.SeedInfo.seed([EFD06E521F630F36:9F72BFDDEAB30CF]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:964)
        at org.junit.Assert.assertThat(Assert.java:930)
        at org.opensearch.action.admin.cluster.node.tasks.ResourceAwareTasksTests.testTaskIdPersistsInThreadContext(ResourceAwareTasksTests.java:620)
REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.index.fielddata.SortedSetDVStringFieldDataTests.testSortMissingLast" -Dtests.seed=EFD06E521F630F36 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=ar-AE -Dtests.timezone=Asia/Muscat -Druntime.java=17

org.opensearch.index.fielddata.SortedSetDVStringFieldDataTests > testSortMissingLast FAILED
    java.lang.AssertionError: expected:<1105> but was:<1085>
        at __randomizedtesting.SeedInfo.seed([EFD06E521F630F36:7AFE1DBA9C25A37]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)

and one legitimate test failure related to segment replication -

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.indices.replication.SegmentReplicationTargetServiceTests.testNewCheckpointBehindCurrentCheckpoint" -Dtests.seed=EFD06E521F630F36 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=sv -Dtests.timezone=Africa/Maputo -Druntime.java=17

org.opensearch.indices.replication.SegmentReplicationTargetServiceTests > testNewCheckpointBehindCurrentCheckpoint FAILED
    com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=135, name=opensearch[org.opensearch.indices.replication.SegmentReplicationTargetServiceTests][generic][T#3], state=RUNNABLE, group=TGRP-SegmentReplicationTargetServiceTests]
        at __randomizedtesting.SeedInfo.seed([EFD06E521F630F36:7513D7183489253A]:0)

        Caused by:
        java.lang.AssertionError: can't move replication to stage [DONE]. current stage: [REPLICATING] (expected [FINALIZE_REPLICATION])
            at __randomizedtesting.SeedInfo.seed([EFD06E521F630F36]:0)
            at org.opensearch.indices.replication.SegmentReplicationState.validateAndSetStage(SegmentReplicationState.java:113)
            at org.opensearch.indices.replication.SegmentReplicationState.setStage(SegmentReplicationState.java:150)
            at org.opensearch.indices.replication.SegmentReplicationTarget.onDone(SegmentReplicationTarget.java:88)
            at org.opensearch.indices.replication.common.ReplicationTarget.markAsDone(ReplicationTarget.java:141)
            at org.opensearch.indices.replication.common.ReplicationCollection.markAsDone(ReplicationCollection.java:202)
            at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onResponse(SegmentReplicationTargetService.java:246)
            at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onResponse(SegmentReplicationTargetService.java:243)
            at org.opensearch.indices.replication.SegmentReplicationTargetServiceTests.lambda$testTargetReturnsSuccess_listenerCompletes$0(SegmentReplicationTargetServiceTests.java:106)

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@kartg
Copy link
Member Author

kartg commented Aug 16, 2022

One failing test, unrelated to segment replication, but i can consistently repro the failure locally with the given seed

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.index.fielddata.SortedSetDVStringFieldDataTests.testSortMissingLast" -Dtests.seed=BC7C1431A4F70C5D -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=ga-IE -Dtests.timezone=Europe/Belgrade -Druntime.java=17

org.opensearch.index.fielddata.SortedSetDVStringFieldDataTests > testSortMissingLast FAILED
    java.lang.AssertionError: expected:<1105> but was:<1023>

@reta
Copy link
Collaborator

reta commented Aug 16, 2022

One failing test, unrelated to segment replication, but i can consistently repro the failure locally with the given seed

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.index.fielddata.SortedSetDVStringFieldDataTests.testSortMissingLast" -Dtests.seed=BC7C1431A4F70C5D -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=ga-IE -Dtests.timezone=Europe/Belgrade -Druntime.java=17

org.opensearch.index.fielddata.SortedSetDVStringFieldDataTests > testSortMissingLast FAILED
    java.lang.AssertionError: expected:<1105> but was:<1023>

@kartg could you please create an issue for this flaky test, I suspect it has been introduced in [1], first time spotted there [2]

[1] #4183
[2] #4183 (comment)

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

This change introduces instrumentation logging that measures the latency of the various stages of segment replication as seen by each replica. Logs have also been added to the source node for checkpoint publishing and checkpoint metadata responses. All logging is currently at the TRACE level.

Signed-off-by: Kartik Ganesh <gkart@amazon.com>
Signed-off-by: Kartik Ganesh <gkart@amazon.com>
Signed-off-by: Kartik Ganesh <gkart@amazon.com>
Signed-off-by: Kartik Ganesh <gkart@amazon.com>
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@kartg kartg requested a review from mch2 August 17, 2022 17:19
@kartg
Copy link
Member Author

kartg commented Aug 17, 2022

@reta @dreamer-89 @mch2 test issues resolved, ready for your 👀

* Segments are immutable. So if the replica has any segments with the same name that differ from the one in the incoming
* snapshot from source that means the local copy of the segment has been corrupted/changed in some way and we throw an
* IllegalStateException to fail the shard
*/
if (diff.different.isEmpty() == false) {
getFilesListener.onFailure(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Certainly an existing code, but it seems like there is a bug here: we don't thrown an exception but call the listener direclty and move on ... @kartg wdyt?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@reta Good catch! We shouldn't be proceeding if the failure listener is invoked. Would you mind if i incorporated the fix for this as a follow-up PR? I'd like to keep the scope of this PR focused on the seg-rep stages and timing data.

// Get list of files to copy from this checkpoint.
state.setStage(SegmentReplicationState.Stage.GET_CHECKPOINT_INFO);
source.getCheckpointMetadata(getId(), checkpoint, checkpointInfoListener);

checkpointInfoListener.whenComplete(checkpointInfo -> getFiles(checkpointInfo, getFilesListener), listener::onFailure);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is an issue between stage demarcation and handlng the unhappy path. @kartg please correct me if I am wrong but in case of failure checkpointInfoListener (and others) would delegate to listener::onFailure without marking the current stage as "finished", correct?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's a conscious choice. We're treating the DONE stage as a successful replication run. I opted to avoid an explicit FAILED stage since the target and state objects are local to each replication, and the replication stages are currently only relevant when tracking timing data

Copy link
Collaborator

@reta reta Aug 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine (I think), the dangling stage is what concerns me: since we publish timing on success only, we do nothing on failure (we don't even know if stage was run, which could be even more important than success for troubleshooting). In any case, leaving the decision up to you.

Copy link
Member

@mch2 mch2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kartg kartg merged commit a2ba3a8 into opensearch-project:main Aug 17, 2022
@kartg kartg deleted the segRepBenchmarking branch August 17, 2022 22:28
@mch2 mch2 added pending backport Identifies an issue or PR that still needs to be backported backport 2.x Backport to 2.x branch labels Sep 1, 2022
opensearch-trigger-bot bot pushed a commit that referenced this pull request Sep 1, 2022
…#4222)

* Added timing data and more granular stages to SegmentReplicationState

This change introduces instrumentation logging that measures the latency of the various stages of segment replication as seen by each replica. Logs have also been added to the source node for checkpoint publishing and checkpoint metadata responses. All logging is currently at the TRACE level.

Signed-off-by: Kartik Ganesh <gkart@amazon.com>

* Fixing SegmentReplicationTarget tests

Signed-off-by: Kartik Ganesh <gkart@amazon.com>

* Incorporated PR feedback

Signed-off-by: Kartik Ganesh <gkart@amazon.com>

* Fixing SegmentReplicationTargetService tests

Signed-off-by: Kartik Ganesh <gkart@amazon.com>

Signed-off-by: Kartik Ganesh <gkart@amazon.com>
(cherry picked from commit a2ba3a8)
mch2 added a commit that referenced this pull request Sep 1, 2022
…eplicationState (#4367)

* Added timing data and more granular stages to SegmentReplicationState (#4222)

* Added timing data and more granular stages to SegmentReplicationState

This change introduces instrumentation logging that measures the latency of the various stages of segment replication as seen by each replica. Logs have also been added to the source node for checkpoint publishing and checkpoint metadata responses. All logging is currently at the TRACE level.

Signed-off-by: Kartik Ganesh <gkart@amazon.com>

* Fixing SegmentReplicationTarget tests

Signed-off-by: Kartik Ganesh <gkart@amazon.com>

* Incorporated PR feedback

Signed-off-by: Kartik Ganesh <gkart@amazon.com>

* Fixing SegmentReplicationTargetService tests

Signed-off-by: Kartik Ganesh <gkart@amazon.com>

Signed-off-by: Kartik Ganesh <gkart@amazon.com>
(cherry picked from commit a2ba3a8)

* Update changelog for backport pr.

Signed-off-by: Marc Handalian <handalm@amazon.com>

Signed-off-by: Marc Handalian <handalm@amazon.com>
Co-authored-by: Kartik Ganesh <gkart@amazon.com>
Co-authored-by: Marc Handalian <handalm@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch pending backport Identifies an issue or PR that still needs to be backported
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants