Add WARN Logging on Slow Transport Message Handling #62444
Conversation
Add simple WARN logging on slow inbound tcp messages.
Pinging @elastic/es-distributed (:Distributed/Network)
@@ -71,6 +74,10 @@ void setMessageListener(TransportMessageListener listener) {
        }
    }

    void setSlowLogThreshold(TimeValue slowLogThreshold) {
Admittedly the chain of setters here is a little hacky but:
- Doing it this way made this a much, much smaller change than passing the cluster settings directly to the InboundHandler.
- I'd do a follow-up of this logic for the outbound path and that would require handling the threshold setting in the TransportService anyway.
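For illustration, the setter chain described above might look roughly like this (a minimal sketch with hypothetical, simplified names — the real classes use Elasticsearch's TimeValue and cluster-settings machinery):

```java
// Sketch of the setter chain (hypothetical names): the transport reacts to the
// dynamic setting and pushes the new threshold down, so the inbound handler
// never needs to see the cluster settings itself.
class InboundHandlerSketch {
    // volatile so transport threads observe updates made from the settings-update thread
    private volatile long slowLogThresholdMs;

    void setSlowLogThreshold(long thresholdMs) {
        this.slowLogThresholdMs = thresholdMs;
    }

    long slowLogThresholdMs() {
        return slowLogThresholdMs;
    }
}

class TcpTransportSketch {
    final InboundHandlerSketch inboundHandler = new InboundHandlerSketch();

    // one link in the chain: simply forwards the new value to the inbound handler
    void setSlowLogThreshold(long thresholdMs) {
        inboundHandler.setSlowLogThreshold(thresholdMs);
    }
}
```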
@@ -132,6 +132,12 @@
        Arrays.asList("internal:coordination/fault_detection/*"),
        Function.identity(), Setting.Property.Dynamic, Setting.Property.NodeScope);

    // Time that processing an inbound message on a transport thread may take at the most before a warning is logged
    public static final Setting<TimeValue> SLOW_OPERATION_THRESHOLD_SETTING =
        Setting.positiveTimeSetting("transport.slow_operation_logging_threshold", TimeValue.timeValueMillis(300),
300ms, since our default resolution on the timer thread is 200ms. This is already plenty IMO (3 requests per second isn't great on a transport thread) but not so low that any cgroup throttling or other system slowness instantly results in massive log spam.
I'm still a bit concerned about log spam -- we aren't aware of anything that blocks the transport threads for so long today, but that might just be because we don't surface it yet, and some configs might be hitting this a lot. #processors threads times 3 messages per second is too much for my taste. WDYT about a simple rate limit to avoid logging more than one of these per ten seconds or something? That'd be enough to point us in the right direction.
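The rate-limit idea could be sketched like this (illustrative only, not the actual Elasticsearch code; log4j's BurstFilter would be an off-the-shelf alternative). The time is passed in explicitly to keep the sketch deterministic:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: allows at most one warning per interval, lock-free,
// safe to call concurrently from many transport threads.
class RateLimitedWarning {
    private final long intervalMillis;
    private final AtomicLong lastLogged = new AtomicLong(Long.MIN_VALUE);

    RateLimitedWarning(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    /** Returns true iff a warning may be emitted at {@code nowMillis}. */
    boolean shouldLog(long nowMillis) {
        final long last = lastLogged.get();
        // MIN_VALUE sentinel means "never logged"; checked explicitly to avoid overflow
        return (last == Long.MIN_VALUE || nowMillis - last >= intervalMillis)
            && lastLogged.compareAndSet(last, nowMillis);
    }
}
```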
I think it might be kind of nice to be able to see bursts here (thinking about spotting something like CPU throttling), i.e. if you have a time period where there's a bunch of logging across multiple threads for all kinds of messages without a clear pattern or so. I guess we could use something like the log4j burst filter here to capture that kind of thing and still rate limit, but what about this: we could just do something like a hard 5s timeout above which we always WARN, and make this a DEBUG at 300ms?
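The two-level scheme floated here boils down to something like the following (a hypothetical helper for illustration, not the code that was eventually merged):

```java
// Sketch of a two-level slow-log policy: WARN above a hard upper limit,
// DEBUG above the lower threshold, nothing below it. Thresholds mirror the
// values discussed above (5s / 300ms); both are illustrative.
class TwoLevelThreshold {
    static final long WARN_MS = 5_000;
    static final long DEBUG_MS = 300;

    /** Returns "WARN", "DEBUG", or null when no logging is warranted. */
    static String levelFor(long tookMs) {
        if (tookMs > WARN_MS) {
            return "WARN";
        }
        if (tookMs > DEBUG_MS) {
            return "DEBUG";
        }
        return null;
    }
}
```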
I'd be happy with 5s -- I think this would surface problems bad enough to drop nodes from the cluster without risking too much junk. No need for a separate debug logging threshold IMO -- given that it would need user involvement to see more events in the logs, they may as well just reduce transport.slow_operation_logging_threshold.
Fair point -> pushed the change to 5s :)
Thanks @original-brownbear this is pretty straightforward. I left a few comments.
    final long took = threadPool.relativeTimeInMillis() - startTime;
    final long logThreshold = slowLogThresholdMs;
    if (logThreshold > 0 && took > logThreshold) {
        logger.warn("Slow handling of transport message [{}] took [{}ms]", message, took);
It'd be useful to include the phrase "warn threshold" as this is a useful search term for the logs when investigating general slowness/instability problems:

- logger.warn("Slow handling of transport message [{}] took [{}ms]", message, took);
+ logger.warn("handling inbound transport message [{}] took [{}ms] which is above the warn threshold of [{}ms]", message, took, logThreshold);
++
    void setSlowLogThreshold(TimeValue slowLogThreshold) {
        this.slowLogThresholdMs = slowLogThreshold.getMillis();
    }
    void inboundMessage(TcpChannel channel, InboundMessage message) throws Exception {
        channel.getChannelStats().markAccessed(threadPool.relativeTimeInMillis());
        TransportLogger.logInboundMessage(channel, message);
Could we move the timing up to this level, since the logging on this line may also be a source of slowness?
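The shape of "moving the timing up" could look roughly like this (a hedged sketch with hypothetical names; the Runnable parameters stand in for TransportLogger.logInboundMessage and the real message dispatch, and a LongSupplier stands in for the thread pool clock):

```java
import java.util.function.LongSupplier;

// Hypothetical sketch: the start timestamp is taken BEFORE the trace logging
// call, so slowness in the logging itself is also covered by the measurement.
class TimedInboundSketch {
    private volatile long slowLogThresholdMs = 5_000;
    long lastMeasuredMs = -1; // exposed for the sketch's sake

    void handle(Runnable traceLogging, Runnable dispatch, LongSupplier clockMs) {
        final long startTime = clockMs.getAsLong(); // taken at the top level
        try {
            traceLogging.run(); // stand-in for TransportLogger.logInboundMessage
            dispatch.run();     // stand-in for the actual message handling
        } finally {
            final long took = clockMs.getAsLong() - startTime;
            lastMeasuredMs = took;
            if (slowLogThresholdMs > 0 && took > slowLogThresholdMs) {
                System.err.printf("handling inbound transport message took [%dms]%n", took);
            }
        }
    }
}
```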
++
I tend to think that if we merge this on by default at 300ms, the primary outcome is that we are going to slow-log whenever GCs happen and disrupt time ticks or request handling.
I wonder how bad this can really get? Even if we run into a complete freeze of the JVM of more than 300ms, we will only log a burst of #processors lines once that's over in a one-off manner; that shouldn't be a big deal? It's not like long freezes would actually result in larger bursts here. That said, if we're actually concerned about this maybe #62444 (comment) is the simplest fix for now (or just putting the thing at debug in general)?
I think that adding WARN logging to production that is designed for us to track down bugs but will mostly trigger in non-bug scenarios (GCs combined with an incredibly coarse user-space timing mechanism) is harmful. I think that setting it to a shockingly high number (
LGTM, thanks Armin & Tim
Thanks David & Tim :)
Same as #62444 but for REST requests.
Similar to #62444 but for the outbound path. This does not log the concrete message that was slow to send like we do on the inbound path, and it does not detect slowness in individual transport handler logic (this is done via the inbound handler logging already); instead it warns if it takes a long time to hand off the message to the relevant transport thread and then transfer the message over the wire. This gives some visibility into the stability of the network connection itself and into the reasons for slow network responses (if they are the result of slow networking on the sender).
Add simple WARN logging on slow inbound TCP messages.
This approach differs from the one used by the test slow logging in that it is more high level, logging the concrete request that was slow instead of the stack trace.
It allows tracking down slowness on TCP inbound (which is the most likely route to experience problematic slowness IMO). The same approach can be added to REST inbound handling and with additional effort also to TCP outbound handling.
I think this approach is much more suitable for finding things like #57937 quickly, has negligible overhead, and is a small change that can go into 7.10 already. Building the same logic we use in tests would require much larger changes to our Netty code-base and would not add much beyond the ability to track down dead-locked transport threads.