Do not emit negative lag because of stale offsets #14292

AmatyaAvadhanula · 2023-05-16T12:37:11Z

Do not emit negative lag because of stale offsets.

Description

The latest topic offsets are polled frequently and compared against the current offsets to determine the lag. However, when the fetched offsets are stale (commonly because of connection issues), the computed lag can be negative.

This PR prevents emission of metrics when the offsets are stale and at least one of the partitions has a negative lag.
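The condition described above can be sketched in isolation as follows. This is a minimal illustration and not the code in `SeekableStreamSupervisor`: the class name, method names, and the plain `long` timestamps are hypothetical stand-ins for `sequenceLastUpdated`, `tuningConfig.getOffsetFetchPeriod()`, and `partitionLags` from the diff.

```java
import java.util.Map;

public class StaleLagCheckSketch {
    // Assumption: offsets count as stale when the last successful fetch
    // is older than one offset fetch period.
    static boolean areOffsetsStale(long lastUpdatedMillis, long nowMillis, long offsetFetchPeriodMillis) {
        return lastUpdatedMillis < nowMillis - offsetFetchPeriodMillis;
    }

    // Emission is skipped only when offsets are stale AND at least one partition has negative lag.
    static boolean shouldEmitLag(Map<Integer, Long> partitionLags, boolean offsetsStale) {
        boolean anyNegative = partitionLags.values().stream().anyMatch(lag -> lag < 0);
        return !(offsetsStale && anyNegative);
    }

    public static void main(String[] args) {
        long now = 100_000L;
        long fetchPeriod = 30_000L;
        // Last fetch happened 45 seconds ago, which is older than the 30 second period.
        boolean stale = areOffsetsStale(now - 45_000L, now, fetchPeriod);

        System.out.println(shouldEmitLag(Map.of(0, 12L, 1, -7L), stale)); // false: stale and negative
        System.out.println(shouldEmitLag(Map.of(0, 12L, 1, 3L), stale));  // true: stale but non-negative
    }
}
```

Under these assumptions, stale offsets alone do not suppress emission; only the combination of staleness and a negative per-partition lag does.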

Release note

This PR prevents emission of negative streaming ingestion lag when the fetched latest offsets are stale.

This PR has:

been self-reviewed.
added documentation for new or modified features or behaviors.
a release note entry in the PR description.
added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
added or updated version, license, or notice information in licenses.yaml
added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
added integration tests.
been tested in a test Druid cluster.

abhishekagarwal87 · 2023-05-16T14:55:41Z

.../main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java

+            System.currentTimeMillis() - tuningConfig.getOffsetFetchPeriod().getMillis()
+        );
+        if (areOffsetsStale && partitionLags.values().stream().anyMatch(x -> x < 0)) {
+          log.warn("Skipping negative lag emission as fetched offsets are stale");


Let's rephrase it in a way that is more informative for someone reading this.
Lag is negative and will not be emitted because topic offsets have become stale. This will not impact data processing. Offsets become stale because....

abhishekrb19 · 2023-05-19T00:16:58Z

.../main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java

+            && sequenceLastUpdated.getMillis()
+               < System.currentTimeMillis() - tuningConfig.getOffsetFetchPeriod().getMillis();
+        if (areOffsetsStale && partitionLags.values().stream().anyMatch(x -> x < 0)) {
+          log.warn("Lag is negative and will not be emitted because topic offsets have become stale. "


For troubleshooting, I think it'll also be good to log the topic:partition info where the offsets may potentially be stale

That info can bloat the log a lot. We can just say "Check the task report for more details around lag".

abhishekrb19 · 2023-05-19T00:17:38Z

.../main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java

@@ -4220,6 +4220,18 @@ protected void emitLag()
          return;
        }

+        // Try emitting lag even with stale metrics provided that none of the partitions has negative lag


Suggested change

-        // Try emitting lag even with stale metrics provided that none of the partitions has negative lag
+        // Try emitting lag even with stale metrics provided that none of the partitions have negative lag

abhishekrb19 · 2023-05-19T00:43:46Z

.../main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java

+          log.warn("Lag is negative and will not be emitted because topic offsets have become stale. "
+                   + "This will not impact data processing. "
+                   + "Offsets may become stale because of connectivity issues.");
+          return;


Should we skip emitting lag metrics only for the stale partitions? I think in general, it'll be helpful to emit metrics for partitions that have non-negative lag. For example, if a topic's partitions are spread across multiple brokers and only some have connectivity issues. Or for a topic where some partitions receive little to no data, those may selectively be considered "stale".

If we do that, it will be very easy to get into a wrong debugging trail where the overall lag might appear lower than it actually is. I am in favor of not emitting lag for any partition at all. The partition level lag would still be available in the task reports.

I think there is a separate metric which we can emit for partition-level lag, without actually reporting/affecting the overall lag at all. But I guess having them in the report should be enough too.

Yeah, a per-partition lag metric would complement the existing metrics. My main concern with not reporting any lag for a topic in this scenario is that we'd have periods of missing lag data for as long as there's at least one stale partition in a topic. The missing metrics data can hide problems silently and affect how existing downstream consumers alert on the data, present metrics for visualization, etc. What do you think?

Missing metric data for a topic is easier to detect and be notified about than metric data that is missing some partitions. @AmatyaAvadhanula, do we already emit a lag metric for each partition in the topic?

Yes, we do emit metrics for every partition.

These partitions usually go stale because the supervisor can't connect to Kafka. We can revisit later if not having any metric becomes a pain point. Ideally, users should also be alerting on missing metrics.
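The concern raised in this thread about partial emission can be illustrated with a small sketch. It assumes the overall lag is aggregated as a sum over partitions; the class, the method names, and all lag values are made up for illustration and are not taken from the PR.

```java
import java.util.Map;

public class PartialLagSketch {
    // Assumed aggregation for the overall metric: sum of per-partition lag.
    static long totalLag(Map<Integer, Long> partitionLags) {
        return partitionLags.values().stream().mapToLong(Long::longValue).sum();
    }

    // Hypothetical alternative from the discussion: drop negative-lag partitions before summing.
    static long totalLagSkippingNegative(Map<Integer, Long> partitionLags) {
        return partitionLags.values().stream()
                .filter(lag -> lag >= 0)
                .mapToLong(Long::longValue)
                .sum();
    }

    public static void main(String[] args) {
        // Invented "true" lags, as they would look with fresh offsets.
        Map<Integer, Long> actual = Map.of(0, 100L, 1, 400L, 2, 50L);
        // Same topic as observed with stale offsets for partition 1.
        Map<Integer, Long> observed = Map.of(0, 100L, 1, -20L, 2, 50L);

        System.out.println(totalLag(actual));                   // 550
        System.out.println(totalLagSkippingNegative(observed)); // 150
    }
}
```

With these numbers, the partially emitted total (150) is well below the actual total (550), which is the misleading debugging trail described above. Skipping emission entirely leaves a visible gap instead.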

abhishekagarwal87 · 2023-05-22T06:25:33Z

.../main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java

+        if (areOffsetsStale && partitionLags.values().stream().anyMatch(x -> x < 0)) {
+          log.warn("Lag is negative and will not be emitted because topic offsets have become stale. "
+                   + "This will not impact data processing. "
+                   + "Offsets may become stale because of connectivity issues.");


"Offsets may become stale because of connectivity issues." - This isn't very helpful.

abhishekagarwal87 · 2023-05-22T06:26:23Z

.../main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java

+        if (areOffsetsStale && partitionLags.values().stream().anyMatch(x -> x < 0)) {
+          log.warn("Lag is negative and will not be emitted because topic offsets have become stale. "
+                   + "This will not impact data processing. "
+                   + "Offsets may become stale because of connectivity issues.");


Suggested change

-                   + "Offsets may become stale because of connectivity issues.");
+                   + "Offsets usually become stale when tasks cannot connect to Kafka cluster.");

abhishekagarwal87 · 2023-05-22T06:27:41Z

.../main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java

+            && sequenceLastUpdated.getMillis()
+               < System.currentTimeMillis() - tuningConfig.getOffsetFetchPeriod().getMillis();
+        if (areOffsetsStale && partitionLags.values().stream().anyMatch(x -> x < 0)) {
+          log.warn("Lag is negative and will not be emitted because topic offsets have become stale. "


That info can bloat the log a lot. We can just say "Check the task report for more details around lag".

abhishekagarwal87 · 2023-05-22T06:31:06Z

.../main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java

+          log.warn("Lag is negative and will not be emitted because topic offsets have become stale. "
+                   + "This will not impact data processing. "
+                   + "Offsets may become stale because of connectivity issues.");
+          return;


If we do that, it will be very easy to get into a wrong debugging trail where the overall lag might appear lower than it actually is. I am in favor of not emitting lag for any partition at all. The partition level lag would still be available in the task reports.

AmatyaAvadhanula added 2 commits May 16, 2023 17:58

Do not emit negative lag because of stale offsets

13df220

Revert accidental deletion

2ef4257

abhishekagarwal87 added the Area - Streaming Ingestion label May 16, 2023

abhishekagarwal87 reviewed May 16, 2023

AmatyaAvadhanula added 3 commits May 17, 2023 11:21

Fix forbidden API and log message

45d86b1

Remove previous log message

4b2dab6

Fix forbidden api in test

e98101c

abhishekrb19 reviewed May 19, 2023

abhishekagarwal87 reviewed May 22, 2023

AmatyaAvadhanula added 3 commits May 29, 2023 15:14

Merge remote-tracking branch 'upstream/master' into fixNegativeLag

1f7512e

Merge remote-tracking branch 'upstream/master' into fixNegativeLag

d2688ef

Less noisy logs and use UTC times in tests

bd8aaa1

AmatyaAvadhanula requested a review from kfaraz June 28, 2023 10:32

abhishekagarwal87 approved these changes Jul 5, 2023

abhishekagarwal87 merged commit 609833c into apache:master Jul 5, 2023

AmatyaAvadhanula added this to the 27.0 milestone Jul 19, 2023

AmatyaAvadhanula mentioned this pull request Aug 6, 2023

[DRAFT] 27.0.0 release notes #14761

Closed

abhishekrb19 mentioned this pull request Sep 18, 2024

fix negative lag metircs issue + improve API design for parition lag #17060

Open
