
Adding retries to update the metadata store instead of failure #15141

Merged
21 commits merged into apache:master on Jan 10, 2024

Conversation

@Pankaj260100 (Contributor) commented Oct 12, 2023

Fixes #15054.

Description

  • Currently, if 2 tasks that are consuming from the same partitions try to publish the segment and update the metadata, the second task can fail because the end offset stored in the metadata store doesn't match the start offset of the second task. We can fix this by retrying instead of failing.

  • AFAIK, apart from the above issue, the metadata mismatch can happen in 2 scenarios:

    1. when we update the input topic name for the data source
    2. when we run 2 replicas of ingestion tasks (1 replica will publish and 1 will fail, as the first replica has already updated the metadata).
  • Implemented a compare function to compare the last committed end offset with the new sequence's start offset, and return a specific error message for this case.

  • Added retry logic on indexers to retry on this specific error message (a minimal sketch follows this list).

  • Updated the existing test case.
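
The sketch below illustrates the indexer-side retry loop described in these bullets. It is a minimal, self-contained example rather than the actual patch: the exception type, method name, and backoff are hypothetical stand-ins for Druid's internal identifiers.

import java.util.concurrent.Callable;

// Illustrative sketch only: RetryableMetadataMismatchException and publishWithRetries
// are hypothetical names, not the identifiers used in the actual patch.
final class PublishRetrySketch
{
  static final class RetryableMetadataMismatchException extends RuntimeException
  {
    RetryableMetadataMismatchException(String message)
    {
      super(message);
    }
  }

  // Runs the publish attempt and retries only when it fails with the specific
  // "committed end offsets are ahead of this task's start offsets" error;
  // any other failure is rethrown immediately.
  static <T> T publishWithRetries(Callable<T> publishAttempt, int maxRetries) throws Exception
  {
    for (int attempt = 0; ; attempt++) {
      try {
        return publishAttempt.call();
      }
      catch (RetryableMetadataMismatchException e) {
        if (attempt >= maxRetries) {
          throw e;                              // give up after the configured number of attempts
        }
        Thread.sleep(1000L * (attempt + 1));    // simple linear backoff before retrying
      }
    }
  }
}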

This PR has:

  • been self-reviewed.
  • using the concurrency checklist (Remove this item if the PR doesn't have any relation to concurrency.)
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

AtomicReference<Boolean> res = new AtomicReference<>(false);
partitionSequenceNumberMap.forEach(
(partitionId, sequenceOffset) -> {
if (otherStart.partitionSequenceNumberMap.get(partitionId) != null && Long.parseLong(String.valueOf(sequenceOffset)) > Long.parseLong(String.valueOf(otherStart.partitionSequenceNumberMap.get(partitionId)))) {
Member:

I'm not sure we can assume sequenceOffsets can always be parsed as long. This seems to be a fairly broad assumption. Someone more familiar with the Kinesis supervisor should chime in.

Contributor:

Yes. Kinesis sequence offsets are treated as opaque strings.

Contributor:

This could be done. Two methods should be added to DataSourceMetadata:

isComparable (default return false)
compare() (default return 0)

These two methods would be implemented only by KafkaDataSourceMetadata. In that class, you can parse the offsets to long.

In the IndexerSQLMetadataStorageCoordinator, you would first check isComparable before calling the compare method.

How does this sound?
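
A rough sketch of what this suggestion could look like; the interface and class below are simplified stand-ins (note the Sketch suffix), not Druid's actual DataSourceMetadata or KafkaDataSourceMetadata, and the offset map shape and comparison semantics are assumptions for illustration.

import java.util.Map;

// Sketch only: simplified interface, not Druid's actual DataSourceMetadata.
interface DataSourceMetadataSketch
{
  // Default: offsets are opaque and cannot be ordered (e.g. Kinesis sequence numbers).
  default boolean isComparable()
  {
    return false;
  }

  // Default: no ordering information available.
  default int compare(DataSourceMetadataSketch other)
  {
    return 0;
  }
}

final class KafkaMetadataSketch implements DataSourceMetadataSketch
{
  private final Map<Integer, Long> endOffsets;  // partition -> committed end offset

  KafkaMetadataSketch(Map<Integer, Long> endOffsets)
  {
    this.endOffsets = endOffsets;
  }

  @Override
  public boolean isComparable()
  {
    return true;  // Kafka offsets are plain longs, so they can be ordered numerically
  }

  @Override
  public int compare(DataSourceMetadataSketch other)
  {
    if (!(other instanceof KafkaMetadataSketch)) {
      return 0;
    }
    Map<Integer, Long> otherOffsets = ((KafkaMetadataSketch) other).endOffsets;
    // Report 1 ("ahead") if any shared partition has a strictly greater offset.
    for (Map.Entry<Integer, Long> entry : endOffsets.entrySet()) {
      Long otherOffset = otherOffsets.get(entry.getKey());
      if (otherOffset != null && entry.getValue() > otherOffset) {
        return 1;
      }
    }
    return 0;
  }
}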

Contributor:

Or you can just have KafkaDataSourceMetadata implement Comparable. Then you would check whether the metadata implements Comparable and call its compare method. That way you don't need an isComparable method and the interface doesn't change either. That's actually better and easier to change in the future.
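
A hedged sketch of the coordinator-side check under this alternative; the class, method, and parameter names are hypothetical, not the actual code in IndexerSQLMetadataStorageCoordinator.

final class ComparableMetadataCheckSketch
{
  // Hypothetical coordinator-side check; names are illustrative only.
  @SuppressWarnings({"rawtypes", "unchecked"})
  static boolean startIsAheadOfExisting(Object startMetadata, Object existingMetadata)
  {
    // Only the Kafka metadata class would implement Comparable under this alternative;
    // any other metadata type is simply treated as "not ahead", i.e. no retry.
    if (startMetadata instanceof Comparable && startMetadata.getClass().isInstance(existingMetadata)) {
      return ((Comparable) startMetadata).compareTo(existingMetadata) > 0;
    }
    return false;
  }
}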

Contributor Author:

@abhishekagarwal87, I have implemented the compare method only in KafkaDataSourceMetadata, and KinesisDataSourceMetadata will return 0 (the default). PTAL.

Member:

how would we handle this potential failure scenario in Kinesis then? It seems that the issue is not specific to Kafka and could happen there as well.

Contributor:

It could, yes. The solution here is either to retry no matter how the current and new offsets compare, or to add the comparison to Kinesis too - https://docs.aws.amazon.com/kinesis/latest/APIReference/API_SequenceNumberRange.html - the sequence numbers appear to be numbers in string form but can contain up to 128 digits, so it should be possible to compare them without converting them to a number.
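
One way to order such numeric strings without converting them is to compare lengths first and then compare lexicographically. A minimal sketch, assuming the sequence numbers are non-negative decimal strings without leading zeros:

final class NumericStringCompareSketch
{
  // Orders two non-negative decimal strings (no leading zeros, possibly longer than a long,
  // e.g. Kinesis sequence numbers of up to 128 digits) without converting them to numbers:
  // a longer string is larger; equal-length strings compare lexicographically.
  static int compareNumericStrings(String a, String b)
  {
    if (a.length() != b.length()) {
      return Integer.compare(a.length(), b.length());
    }
    return a.compareTo(b);
  }
}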

Contributor:

The class OrderedSequenceNumber may already have the implementation to compare Kafka and Kinesis sequence numbers.

Contributor Author:

No, OrderedSequenceNumber does not have an implementation to compare them.

Contributor:

@Pankaj260100, don't KafkaSequenceNumber and KinesisSequenceNumber, which extend this class, handle the comparison at the level of an individual sequence number?

@abhishekagarwal87 (Contributor) left a comment:

Added some code suggestions to better describe what I am trying to suggest.

{
import java.util.Comparator;

public class KafkaDataSourceMetadata extends SeekableStreamDataSourceMetadata<KafkaTopicPartition, Long> implements Comparable<KafkaDataSourceMetadata> {

Check warning — Code scanning / CodeQL — Inconsistent compareTo: This class declares compareTo but inherits equals; the two could be inconsistent.
(SeekableStreamEndSequenceNumbers<PartitionIdType, SequenceOffsetType>) other;

if (stream.equals(otherStart.stream)) {
//Same stream, compare the offset
Contributor:

Could you also please add a check to compare the partitions to account for repartitioning?


/**
* Compare this and the other sequence offsets using comparator.
* Returns 1, if this sequence is ahead of the other.
@AmatyaAvadhanula (Contributor) commented Oct 30, 2023:

Could you please elaborate what a sequence being ahead of the other means?

I think that it means that the partition-wise sequence number of the first is greater than or equal to the other's with a strict greater than condition for at least one of the partitions. WDYT?

Contributor Author:

@AmatyaAvadhanula, We will only retry when the first is greater than the other's; if both are equal, the publish will not fail, so there is no point in retrying, right?

Contributor:

Isn't it possible that when there are 10 partitions, 6 have strictly greater sequence numbers while the remaining 4 have equal sequence numbers because no new data was added to them?
I think we should retry in this case as well

Contributor Author:

@AmatyaAvadhanula, This case is covered. We will retry when at least one partition sequence number is strictly greater than the other's. We compare all partition offsets, and if we find one of the partition offsets greater than the corresponding offset, we set the res variable to true and retry in that case.

Contributor:

I was wondering if it's possible that one partition is strictly greater while another is strictly lower.

Contributor Author:

Hey @AmatyaAvadhanula, as per our discussion I have added a check to verify that the task's partitions are contained within the total set of partitions in the metadata.

Contributor Author:

I have reverted this: the test case where a new Kafka partition gets added and publishes for the first time would fail, as that partition is not in the old committed offsets.


if (stream.equals(otherStart.stream)) {
//Same stream, compare the offset
AtomicReference<Boolean> res = new AtomicReference<>(false);
Contributor:

Does this need to change slightly according to https://github.com/apache/druid/pull/15141/files#r1375643913?

@AmatyaAvadhanula (Contributor):

@Pankaj260100, since #15054 is reproducible, could you please test this patch on a Druid cluster and indicate the same in the PR description as well?

@Pankaj260100 (Contributor Author) commented Nov 16, 2023:

@AmatyaAvadhanula @abhishekagarwal87 @xvrl, I tried to test this patch on Druid 25 in one of the dev Druid clusters, and whenever a task failed to update the metadata store, it started retrying. However, the retry logic executes in sequence: first it completes 10 retries for one task, then for the other, and meanwhile the ingestion lag starts increasing. I was expecting the retries to happen in parallel. Do you have any idea why this is happening?
For example:
Retries for the 1st task start at 13:34:23.173 UTC and end at 13:38:40.378 UTC.
Retries for the 2nd task start at 13:38:40.423 UTC.
During this time the CPU usage of the Overlord went very low, as it's just retrying to update the metadata.

@abhishekagarwal87 (Contributor):

@Pankaj260100 - can you elaborate a bit further? What is the 1st task and what is the 2nd task? Can you use the same terms as you used in the issue you filed (#15054)?

@AmatyaAvadhanula (Contributor):

It appears that retries on the Overlord happen within a transaction, which would explain the observation.
Perhaps the OverlordResource could return a status code other than 400 for certain exceptions, and the task could retry submitting the task action in such cases.
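
Purely as an illustration of that idea (this is not Druid's OverlordResource or task-action client code; the endpoint, status codes, and names are assumptions), a task-side submission loop could retry on a retryable HTTP status instead of treating every failure as terminal:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Illustrative only: not Druid's task-action API.
final class TaskActionRetrySketch
{
  private static final HttpClient CLIENT = HttpClient.newHttpClient();

  // Submits an action and retries on a retryable status (503 here) instead of
  // giving up as it would on a terminal 400-style error.
  static HttpResponse<String> submitWithRetries(URI endpoint, String body, int maxRetries) throws Exception
  {
    HttpRequest request = HttpRequest.newBuilder(endpoint)
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();
    for (int attempt = 0; ; attempt++) {
      HttpResponse<String> response = CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
      if (response.statusCode() != 503 || attempt >= maxRetries) {
        return response;  // success, a non-retryable status, or retries exhausted
      }
      Thread.sleep(1000L * (attempt + 1));  // back off before retrying the submission
    }
  }
}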

@abhishekagarwal87 (Contributor):

@Pankaj260100 - did you verify this patch in any test cluster? Does it fix the problem?

@Pankaj260100 (Contributor Author):

@abhishekagarwal87, Yes, I tested this patch on a test Druid cluster. I submitted the config twice, so there were 2 sets of tasks publishing simultaneously. From the logs, I verified that retries happened and no task failed.

@abhishekagarwal87 (Contributor) left a comment:

Awesome. I just realized that I had some pending comments that I forgot to publish, so I'm doing that now. There are also some comments from Amatya on the PR.

segmentsAndCommitMetadata.getSegments(),
"Failed publish, not removing segments"
);
Throwables.propagateIfPossible(e);

Check notice — Code scanning / CodeQL — Deprecated method or constructor invocation: Invoking Throwables.propagateIfPossible should be avoided because it has been deprecated.
}

return segmentsAndCommitMetadata;
Throwables.propagateIfPossible(e);

Check notice — Code scanning / CodeQL — Deprecated method or constructor invocation: Invoking Throwables.propagateIfPossible should be avoided because it has been deprecated.
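
As background on the CodeQL note: Guava deprecated the single-argument Throwables.propagateIfPossible in favor of Throwables.throwIfUnchecked, which has the same rethrow-if-unchecked behavior but rejects null. A minimal sketch of that replacement pattern (whether to make this change is a separate question from this PR):

import com.google.common.base.Throwables;

// Sketch of the non-deprecated Guava equivalent.
final class RethrowSketch
{
  static void rethrowIfUnchecked(Throwable t)
  {
    if (t != null) {
      // Same behavior as the deprecated propagateIfPossible(Throwable): rethrows
      // RuntimeExceptions and Errors as-is, otherwise falls through.
      Throwables.throwIfUnchecked(t);
    }
  }
}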
@AmatyaAvadhanula (Contributor):

Thank you @Pankaj260100. LGTM!

@@ -2054,18 +2054,37 @@ protected DataStoreMetadataUpdateResult updateDataSourceMetadataWithHandle(
}

final boolean startMetadataMatchesExisting;
int startMetadataGreaterThanExisting = 0;
Contributor:

Can this be a boolean instead?

Contributor Author:

Yes, we can have a boolean. I have implemented Comparable, whose compareTo() method returns an int, so I didn't change it.

Contributor Author:

The compareTo() function returns +1, -1, and 0 for greater than, less than, and equal respectively.
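
As a small illustration of the point above, the int result of compareTo() can be collapsed into a boolean with a trivial helper; the names here are hypothetical, not the variables used in the patch.

final class CompareToBooleanSketch
{
  // Collapse compareTo's +1 / 0 / -1 result into a boolean "is ahead" flag.
  static <T extends Comparable<T>> boolean isAhead(T first, T second)
  {
    return first.compareTo(second) > 0;
  }
}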

@abhishekagarwal87 merged commit 047c734 into apache:master on Jan 10, 2024
83 checks passed
@abhishekagarwal87 (Contributor):

@kfaraz - I merged this without looking at your comment. But feel free to comment and @Pankaj260100 can hopefully address those in a follow-up PR.

@Pankaj260100 (Contributor Author):

Thanks, @abhishekagarwal87, @AmatyaAvadhanula & @xvrl for the help.

But feel free to comment and @Pankaj260100 can hopefully address those in a follow-up PR.

Sure, I will do that.

@LakshSingla added this to the 29.0.0 milestone on Jan 29, 2024
airlock-confluentinc bot pushed a commit to confluentinc/druid that referenced this pull request on Sep 23, 2024.
Successfully merging this pull request may close these issues.

Failed to publish segments because of [java.lang.RuntimeException: Aborting transaction!].