Separated Metrics Handling for Throughput Violating Topics #930

shrinandthakkar · 2023-04-04T18:53:15Z

The EventProducer of every DatastreamTask reports SLA and latency metrics for every datastream record. However, when a topic (at least one of its partitions) has higher throughput than Brooklin's permissible thresholds, it introduces latency and SLA misses in the mirroring pipeline.

This pull request is the second part of changes to handle the metrics and reporting of throughput-violating topics separately. It introduces the following changes:

Reports latency and SLA metrics separately for these throughput-violating topics within EventProducer.
Adds a per-datastream gauge to give insight into the frequency of these throughput violations.

Handling Metrics and SLA Reporting for Throughput Violating Topics via Datastream Update API (Part 1 of this work) is merged and can be referenced here.

vmaheshw · 2023-04-10T17:54:38Z

...ream-server-restli/src/main/java/com/linkedin/datastream/server/dms/DatastreamResources.java

@@ -306,6 +306,9 @@ private void doUpdateDatastreams(Map<String, Datastream> datastreamMap) {
          datastreamMap.get(key)
              .getMetadata()
              .put(DatastreamMetadataConstants.THROUGHPUT_VIOLATING_TOPICS, StringUtils.EMPTY);
+          LOG.info(
+              "Feature handling throughput violations disabled. Flushed throughput violating topics for datastream {}",


What does "Flushed" mean here?

Would it make more sense to use "Discarded" instead of "Flushed"?

vmaheshw · 2023-04-10T17:56:12Z

datastream-server-restli/src/test/java/com/linkedin/datastream/server/TestCoordinator.java

@@ -3613,6 +3613,8 @@ public void testThroughputViolatingTopicsHandlingForSingleDatastream() throws Ex
    String testCluster = "testThroughputViolatingTopicsHandlingForSingleDatastream";
    String connectorType = "connectorType";
    String streamName = "testThroughputViolatingTopicsHandlingForSingleDatastream";
+    String numThroughputViolatingTopicsMetric =


You should reference the actual metric rather than redefining it.

Is this just for testing purposes?

Yes, only for testing purposes. I originally went this route because other tests do the same, since the metrics subclass is private. Instead, I have now created a visible-for-testing function to retrieve the metric name.

vmaheshw · 2023-04-10T17:57:31Z

datastream-server/src/main/java/com/linkedin/datastream/server/Coordinator.java

-            _throughputViolatingTopicsMap.put(datastream.getName(), new HashSet<>(Arrays.asList(violatingTopics)));
-          }
+            .getOrDefault(DatastreamMetadataConstants.THROUGHPUT_VIOLATING_TOPICS, StringUtils.EMPTY);
+        String[] violatingTopics = Arrays.stream(commaSeparatedViolatingTopics.split(","))


What if the parsing goes wrong because the message got trimmed or is malformed?

Would it make sense to have a clean try-catch block?

The function populateThroughputViolatingTopicsMap already sits within a try-catch.
Also, I added another unit test for the malformed-metadata scenario, so I don't think we need a new try-catch.
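
For readers following the malformed-metadata discussion, here is a minimal sketch of tolerant parsing for a comma-separated topic list. The class `ViolatingTopicsParser` and its `parse` method are hypothetical names used only for illustration; this is not the body of populateThroughputViolatingTopicsMap, which the thread does not show in full.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of Brooklin.
final class ViolatingTopicsParser {
  private ViolatingTopicsParser() { }

  // Splits on commas, trims whitespace, and drops empty entries so that
  // values such as "a,,b," or " , " do not produce blank topic names.
  static Set<String> parse(String commaSeparated) {
    if (commaSeparated == null || commaSeparated.trim().isEmpty()) {
      return Collections.emptySet();
    }
    return Arrays.stream(commaSeparated.split(","))
        .map(String::trim)
        .filter(topic -> !topic.isEmpty())
        .collect(Collectors.toCollection(LinkedHashSet::new));
  }
}
```

With this shape, a trimmed or partially empty value degrades to a smaller (possibly empty) set rather than throwing, which is one way the outer try-catch could remain the only guard.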

vmaheshw · 2023-04-10T17:59:12Z

datastream-server/src/main/java/com/linkedin/datastream/server/Coordinator.java

@@ -2462,6 +2470,13 @@ private void registerGauge(String metricName, Supplier<?> valueSupplier) {
      _metricInfos.add(new BrooklinGaugeInfo(_coordinator.buildMetricName(MODULE, metricName)));
    }

+    // registers a new gauge or updates the supplier for the gauge if it already exists


This looks like a generic call. Why is this required? We have other instances of Gauge and do not have this.

Based on that reasoning, we would have to do it for all the Gauge instances and move this method to a generic location.

I added this new method because I wanted to report the change in the number of violations per datastream, and hence needed a gauge for dynamic keys (datastream names).

registerGauge registers a new gauge for a new key, but for an existing key it returns the already registered gauge. This function therefore updates the supplier function of an already registered gauge metric.
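
The register-or-update behaviour described above can be sketched with a small stand-alone registry. Everything here (`GaugeRegistrySketch`, `ResettableGauge`, `size`) is a hypothetical stand-in for illustration; it does not reproduce the DynamicMetricsManager API beyond the idea that registering an existing key returns the existing gauge.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical registry for illustration; not Brooklin's DynamicMetricsManager.
final class GaugeRegistrySketch {

  // A gauge whose value supplier can be replaced after registration.
  static final class ResettableGauge<T> {
    private volatile Supplier<T> supplier;

    ResettableGauge(Supplier<T> supplier) {
      this.supplier = supplier;
    }

    void setSupplier(Supplier<T> supplier) {
      this.supplier = supplier;
    }

    T getValue() {
      return supplier.get();
    }
  }

  private final Map<String, ResettableGauge<?>> gauges = new ConcurrentHashMap<>();

  // Registers a gauge for a new key; for an existing key, reuses the
  // registered gauge and points it at the latest supplier.
  @SuppressWarnings("unchecked")
  <T> ResettableGauge<T> registerOrSetGauge(String metricName, Supplier<T> supplier) {
    ResettableGauge<T> gauge = (ResettableGauge<T>) gauges.computeIfAbsent(
        metricName, key -> new ResettableGauge<T>(supplier));
    gauge.setSupplier(supplier);
    return gauge;
  }

  int size() {
    return gauges.size();
  }
}
```

Calling `registerOrSetGauge` twice with the same datastream-derived name leaves one gauge in the registry, and reads of that gauge reflect the most recent supplier.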

vmaheshw · 2023-04-10T18:01:25Z

datastream-server/src/main/java/com/linkedin/datastream/server/Coordinator.java

@@ -2462,6 +2470,13 @@ private void registerGauge(String metricName, Supplier<?> valueSupplier) {
      _metricInfos.add(new BrooklinGaugeInfo(_coordinator.buildMetricName(MODULE, metricName)));
    }

+    // registers a new gauge or updates the supplier for the gauge if it already exists
+    private <T> void registerOrSetGauge(String metricName, Supplier<T> valueSupplier) {
+      _dynamicMetricsManager.setGauge(_dynamicMetricsManager.registerGauge(MODULE, metricName, valueSupplier),


Will this metric get emitted for zero value as well?

This can be a concern if enabled for the Change capture cluster with many datastreams. It would make sense to have aggregate-level metrics and selectively enable datastream-level metrics, because X datastreams means X gauges.

This will only be emitted where the feature of handling bad actors is enabled.
Also, I updated the logic to only handle and register/update a gauge when this metadata field exists for that datastream.
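
As a rough illustration of the "only when the metadata field exists" rule, the sketch below computes gauge values for just those datastreams that carry the field. The class name, the method `gaugeValues`, and the key string are assumptions made for this example; the real key is DatastreamMetadataConstants.THROUGHPUT_VIOLATING_TOPICS, whose value is not shown in this thread. Reporting 0 for a present-but-empty field is also an assumption of the sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; names and key string are not from Brooklin.
final class ViolationGaugeSelectionSketch {
  // Assumed placeholder for the real metadata key.
  static final String VIOLATING_TOPICS_KEY = "hypothetical.throughputViolatingTopics";

  private ViolationGaugeSelectionSketch() { }

  // Returns datastream name -> number of violating topics, but only for
  // datastreams whose metadata contains the violating-topics field.
  static Map<String, Integer> gaugeValues(Map<String, Map<String, String>> metadataByDatastream) {
    Map<String, Integer> result = new LinkedHashMap<>();
    for (Map.Entry<String, Map<String, String>> entry : metadataByDatastream.entrySet()) {
      String value = entry.getValue().get(VIOLATING_TOPICS_KEY);
      if (value == null) {
        continue; // field absent: no gauge for this datastream
      }
      int count = 0;
      for (String topic : value.split(",")) {
        if (!topic.trim().isEmpty()) {
          count++;
        }
      }
      result.put(entry.getKey(), count);
    }
    return result;
  }
}
```

Under these assumptions the number of gauges is bounded by the number of datastreams that have the field set, not by the total number of datastreams in the cluster.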

vmaheshw · 2023-04-10T18:06:28Z

datastream-server/src/main/java/com/linkedin/datastream/server/EventProducer.java

+      reportEventLatencyMetrics(metadata, eventsSourceTimestamp, THROUGHPUT_VIOLATING_EVENTS_LATENCY_MS_STRING);
+      _dynamicMetricsManager.createOrUpdateCounter(MODULE, AGGREGATE, TOTAL_EVENTS_PRODUCED, 1);
+      _dynamicMetricsManager.createOrUpdateCounter(MODULE, _datastreamTask.getConnectorType(), TOTAL_EVENTS_PRODUCED,


This requires more precise documentation explaining that X events come from the throughput-violating topics, Y events come from the regular topics, and the total record count is X+Y.
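
One way to picture the X, Y, and X+Y accounting requested here is the sketch below. It is a simplified stand-in, not the EventProducer code: the class `EventAccountingSketch` and its methods are hypothetical, and the string for TOTAL_EVENTS_PRODUCED is assumed since the thread shows only the constant name. The two latency metric names are taken from the diff.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of the accounting; not Brooklin's EventProducer.
final class EventAccountingSketch {
  static final String EVENTS_LATENCY_MS_STRING = "eventsLatencyMs";
  static final String THROUGHPUT_VIOLATING_EVENTS_LATENCY_MS_STRING = "throughputViolatingEventsLatencyMs";
  // Assumed value; only the constant name appears in the diff.
  static final String TOTAL_EVENTS_PRODUCED = "totalEventsProduced";

  private final Set<String> violatingTopics;
  private final Map<String, Long> counts = new HashMap<>();

  EventAccountingSketch(Set<String> violatingTopics) {
    this.violatingTopics = new HashSet<>(violatingTopics);
  }

  // Every event counts toward the total (X + Y); its latency sample goes
  // to exactly one of the two latency metrics (X or Y).
  void onEventProduced(String topic) {
    String latencyMetric = violatingTopics.contains(topic)
        ? THROUGHPUT_VIOLATING_EVENTS_LATENCY_MS_STRING
        : EVENTS_LATENCY_MS_STRING;
    counts.merge(latencyMetric, 1L, Long::sum);
    counts.merge(TOTAL_EVENTS_PRODUCED, 1L, Long::sum);
  }

  long count(String metric) {
    return counts.getOrDefault(metric, 0L);
  }
}
```

In this sketch the sample counts of the two latency metrics always sum to the total-events counter, which is the invariant the documentation would need to state.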

vmaheshw · 2023-04-10T18:09:30Z

datastream-server/src/main/java/com/linkedin/datastream/server/EventProducer.java

@@ -58,6 +58,8 @@ public class EventProducer implements DatastreamEventProducer {

  static final String EVENTS_LATENCY_MS_STRING = "eventsLatencyMs";
  static final String EVENTS_SEND_LATENCY_MS_STRING = "eventsSendLatencyMs";
+  static final String THROUGHPUT_VIOLATING_EVENTS_LATENCY_MS_STRING = "throughputViolatingEventsLatencyMs";
+  static final String THROUGHPUT_VIOLATING_EVENTS_SEND_LATENCY_MS_STRING = "throughputViolatingEventsSendLatencyMs";


What about the other metrics from lines 67-70? Does the reporting need to be split for those as well?

private static final String EVENTS_PRODUCED_WITHIN_SLA = "eventsProducedWithinSla";
private static final String EVENTS_PRODUCED_WITHIN_ALTERNATE_SLA = "eventsProducedWithinAlternateSla";

For these latency violations, we can skip the EVENTS_PRODUCED_WITHIN_SLA metric, but I updated the code to emit the EVENTS_PRODUCED_WITHIN_ALTERNATE_SLA metric, as it would still be applicable.
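
The rule in this reply (skip the regular SLA counter for throughput-violating topics, still emit the alternate SLA counter) could look roughly like the sketch below. The class, the method `countersToIncrement`, and the threshold parameters are hypothetical; the thread does not show how the thresholds are configured or whether the real change reuses the same counter name for violating topics.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the SLA counter selection; not Brooklin code.
final class SlaMetricSelectionSketch {
  static final String EVENTS_PRODUCED_WITHIN_SLA = "eventsProducedWithinSla";
  static final String EVENTS_PRODUCED_WITHIN_ALTERNATE_SLA = "eventsProducedWithinAlternateSla";

  private SlaMetricSelectionSketch() { }

  // slaMs and alternateSlaMs are assumed threshold parameters.
  static List<String> countersToIncrement(boolean fromViolatingTopic, long latencyMs,
      long slaMs, long alternateSlaMs) {
    List<String> counters = new ArrayList<>();
    // The regular SLA counter is skipped entirely for violating topics.
    if (!fromViolatingTopic && latencyMs <= slaMs) {
      counters.add(EVENTS_PRODUCED_WITHIN_SLA);
    }
    // The alternate SLA counter applies to both kinds of topics.
    if (latencyMs <= alternateSlaMs) {
      counters.add(EVENTS_PRODUCED_WITHIN_ALTERNATE_SLA);
    }
    return counters;
  }
}
```

For example, with an assumed SLA of 1000 ms and an alternate SLA of 5000 ms, an event from a violating topic with 500 ms latency would bump only the alternate counter.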

vmaheshw · 2023-04-10T18:10:41Z

datastream-server/src/main/java/com/linkedin/datastream/server/EventProducer.java

@@ -58,6 +58,8 @@ public class EventProducer implements DatastreamEventProducer {

  static final String EVENTS_LATENCY_MS_STRING = "eventsLatencyMs";
  static final String EVENTS_SEND_LATENCY_MS_STRING = "eventsSendLatencyMs";
+  static final String THROUGHPUT_VIOLATING_EVENTS_LATENCY_MS_STRING = "throughputViolatingEventsLatencyMs";


Do you need additional metric emission checks?

Separated Metrics Handling for Throughput Violations

b76110d

shrinandthakkar requested review from vmaheshw, jzakaryan and thomaslaw April 4, 2023 19:33

vmaheshw reviewed Apr 10, 2023

Reporting Alt SLA metrics for throughout violators + and minor logic …

f7f9e39

…handling updates based on the comments to the PR

vmaheshw approved these changes May 2, 2023

atoomula approved these changes May 2, 2023

shrinandthakkar merged commit 892b740 into linkedin:master May 2, 2023
