-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not emit negative lag because of stale offsets #14292
Do not emit negative lag because of stale offsets #14292
Conversation
System.currentTimeMillis() - tuningConfig.getOffsetFetchPeriod().getMillis() | ||
); | ||
if (areOffsetsStale && partitionLags.values().stream().anyMatch(x -> x < 0)) { | ||
log.warn("Skipping negative lag emission as fetched offsets are stale"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets rephrase it in a way that is more informative for someone reading this.
Lag is negative and will not be emitted because topic offsets have become stale. This will not impact data processing. Offsets become stale because....
&& sequenceLastUpdated.getMillis() | ||
< System.currentTimeMillis() - tuningConfig.getOffsetFetchPeriod().getMillis(); | ||
if (areOffsetsStale && partitionLags.values().stream().anyMatch(x -> x < 0)) { | ||
log.warn("Lag is negative and will not be emitted because topic offsets have become stale. " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For troubleshooting, I think it'll also be good to log the topic:partition info where the offsets may potentially be stale
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that info can bloat the log a lot. We can just say that "Check the task report for more details around lag".
@@ -4220,6 +4220,18 @@ protected void emitLag() | |||
return; | |||
} | |||
|
|||
// Try emitting lag even with stale metrics provided that none of the partitions has negative lag |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Try emitting lag even with stale metrics provided that none of the partitions has negative lag | |
// Try emitting lag even with stale metrics provided that none of the partitions have negative lag |
log.warn("Lag is negative and will not be emitted because topic offsets have become stale. " | ||
+ "This will not impact data processing. " | ||
+ "Offsets may become stale because of connectivity issues."); | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we skip emitting lag metrics only for the stale partitions? I think in general, it'll be helpful to emit metrics for partitions that have non-negative lag. For example, if a topic's partitions are spread across multiple brokers and only some have connectivity issues. Or for a topic where some partitions receive little to no data, those may selectively be considered "stale".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do that, it will be very easy to get into a wrong debugging trail where the overall lag might appear lower than it actually is. I am in favor of not emitting lag for any partition at all. The partition level lag would still be available in the task reports.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is a separate metric which we can emit for partition-level lag, without actually reporting/affecting the overall lag at all. But I guess having them in the report should be enough too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, a per-partition lag metric would complement the existing metrics. My main concern with not reporting any lag for a topic in this scenario is we'd have periods of missing lag data for as long as there's at least one stale partition in a topic. The missing metrics data can hide problems silently and affect existing downstream consumers of the data on how they alert, present metrics for visualization, etc. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a missing metric data for a topic is easier to detect and be notified about than metric data missing some partitions. @AmatyaAvadhanula - do we already emit a lag metric for each partition in the topic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we do emit metrics for every partition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These partitions usually go stale because supervisor can't connect to Kafka. We can revisit later if not having any metric becomes a pain point. ideally, users should also be alerting on missing metric.
if (areOffsetsStale && partitionLags.values().stream().anyMatch(x -> x < 0)) { | ||
log.warn("Lag is negative and will not be emitted because topic offsets have become stale. " | ||
+ "This will not impact data processing. " | ||
+ "Offsets may become stale because of connectivity issues."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Offsets may become stale because of connectivity issues." - This isn't very helpful.
if (areOffsetsStale && partitionLags.values().stream().anyMatch(x -> x < 0)) { | ||
log.warn("Lag is negative and will not be emitted because topic offsets have become stale. " | ||
+ "This will not impact data processing. " | ||
+ "Offsets may become stale because of connectivity issues."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+ "Offsets may become stale because of connectivity issues."); | |
+ "Offsets usually become stale when tasks cannot connect to Kafka cluster."); |
&& sequenceLastUpdated.getMillis() | ||
< System.currentTimeMillis() - tuningConfig.getOffsetFetchPeriod().getMillis(); | ||
if (areOffsetsStale && partitionLags.values().stream().anyMatch(x -> x < 0)) { | ||
log.warn("Lag is negative and will not be emitted because topic offsets have become stale. " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that info can bloat the log a lot. We can just say that "Check the task report for more details around lag".
log.warn("Lag is negative and will not be emitted because topic offsets have become stale. " | ||
+ "This will not impact data processing. " | ||
+ "Offsets may become stale because of connectivity issues."); | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do that, it will be very easy to get into a wrong debugging trail where the overall lag might appear lower than it actually is. I am in favor of not emitting lag for any partition at all. The partition level lag would still be available in the task reports.
The latest topic offsets are polled frequently and used to determine the lag based on the current offsets. However, when the offsets are stale (which can happen due to connection issues commonly), we may see a negative lag . This PR prevents emission of metrics when the offsets are stale and at least one of the partitions has a negative lag.
Do not emit negative lag because of stale offsets.
Description
The latest topic offsets are polled frequently and used to determine the lag based on the current offsets. However, when the offsets are stale (which can happen due to connection issues commonly), we may see a negative lag .
This PR prevents emission of metrics when the offsets are stale and at least one of the partitions has a negative lag.
Release note
This PR prevents emission of negative streaming ingestion lag when the fetched latest offsets are stale
This PR has: