Add WARN Logging on Slow Transport Message Handling #62444
Conversation
Add simple WARN logging on slow inbound tcp messages.
Pinging @elastic/es-distributed (:Distributed/Network)
@@ -71,6 +74,10 @@ void setMessageListener(TransportMessageListener listener) {
        }
    }

    void setSlowLogThreshold(TimeValue slowLogThreshold) {
Admittedly the chain of setters here is a little hacky but:
- Doing it this way made this a much, much smaller change than passing the cluster settings directly to the InboundHandler.
- I'd do a follow-up of this logic for the outbound path and that would require handling the threshold setting in the TransportService anyway.
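For illustration, the setter chain described above might look roughly like this (a minimal sketch with hypothetical, simplified names — the real classes use Elasticsearch's TimeValue and cluster-settings machinery):

```java
// Sketch of the setter chain (hypothetical names): the transport reacts to the
// dynamic setting and pushes the new threshold down, so the inbound handler
// never needs to see the cluster settings itself.
class InboundHandlerSketch {
    // volatile so transport threads observe updates made from the settings-update thread
    private volatile long slowLogThresholdMs;

    void setSlowLogThreshold(long thresholdMs) {
        this.slowLogThresholdMs = thresholdMs;
    }

    long slowLogThresholdMs() {
        return slowLogThresholdMs;
    }
}

class TcpTransportSketch {
    final InboundHandlerSketch inboundHandler = new InboundHandlerSketch();

    // one link in the chain: simply forwards the new value to the inbound handler
    void setSlowLogThreshold(long thresholdMs) {
        inboundHandler.setSlowLogThreshold(thresholdMs);
    }
}
```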
@@ -132,6 +132,12 @@
        Arrays.asList("internal:coordination/fault_detection/*"),
        Function.identity(), Setting.Property.Dynamic, Setting.Property.NodeScope);

    // Time that processing an inbound message on a transport thread may take at the most before a warning is logged
    public static final Setting<TimeValue> SLOW_OPERATION_THRESHOLD_SETTING =
        Setting.positiveTimeSetting("transport.slow_operation_logging_threshold", TimeValue.timeValueMillis(300),
300ms, since our default resolution on the timer thread is 200ms. This is already plenty IMO (3 requests per second isn't great on a transport thread) but not so low that any cgroup throttling or other system slowness instantly results in massive log spam.
I'm still a bit concerned about log spam -- we aren't aware of anything that blocks the transport threads for so long today, but that might just be because we don't surface it yet, and some configs might be hitting this a lot. #processors threads times 3 messages per second is too much for my taste. WDYT about a simple rate limit to avoid logging more than one of these per ten seconds or something? That'd be enough to point us in the right direction.
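The rate-limit idea could be sketched like this (illustrative only, not the actual Elasticsearch code; log4j's BurstFilter would be an off-the-shelf alternative). The time is passed in explicitly to keep the sketch deterministic:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: allows at most one warning per interval, lock-free,
// safe to call concurrently from many transport threads.
class RateLimitedWarning {
    private final long intervalMillis;
    private final AtomicLong lastLogged = new AtomicLong(Long.MIN_VALUE);

    RateLimitedWarning(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    /** Returns true iff a warning may be emitted at {@code nowMillis}. */
    boolean shouldLog(long nowMillis) {
        final long last = lastLogged.get();
        // MIN_VALUE sentinel means "never logged"; checked explicitly to avoid overflow
        return (last == Long.MIN_VALUE || nowMillis - last >= intervalMillis)
            && lastLogged.compareAndSet(last, nowMillis);
    }
}
```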
I think it might be kind of nice to be able to see bursts here (thinking about spotting something like CPU throttling), i.e. if you have a time period where there's a bunch of logging across multiple threads for all kinds of messages without a clear pattern or so. I guess we could use something like the log4j burst filter here to capture that kind of thing and still rate limit, but what about this: we could just do something like a hard 5s timeout above which we always WARN, and make this a DEBUG at 300ms?
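The two-level scheme floated here boils down to something like the following (a hypothetical helper for illustration, not the code that was eventually merged):

```java
// Sketch of a two-level slow-log policy: WARN above a hard upper limit,
// DEBUG above the lower threshold, nothing below it. Thresholds mirror the
// values discussed above (5s / 300ms); both are illustrative.
class TwoLevelThreshold {
    static final long WARN_MS = 5_000;
    static final long DEBUG_MS = 300;

    /** Returns "WARN", "DEBUG", or null when no logging is warranted. */
    static String levelFor(long tookMs) {
        if (tookMs > WARN_MS) {
            return "WARN";
        }
        if (tookMs > DEBUG_MS) {
            return "DEBUG";
        }
        return null;
    }
}
```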
I'd be happy with 5s -- I think this would surface problems bad enough to drop nodes from the cluster without risking too much junk. No need for a separate debug logging threshold IMO -- given that it would need user involvement to see more events in the logs, they may as well just reduce transport.slow_operation_logging_threshold.
Fair point -> pushed the change to 5s :)
Thanks @original-brownbear this is pretty straightforward. I left a few comments.
    final long took = threadPool.relativeTimeInMillis() - startTime;
    final long logThreshold = slowLogThresholdMs;
    if (logThreshold > 0 && took > logThreshold) {
        logger.warn("Slow handling of transport message [{}] took [{}ms]", message, took);
It'd be useful to include the phrase "warn threshold" as this is a useful search term for the logs when investigating general slowness/instability problems:

- logger.warn("Slow handling of transport message [{}] took [{}ms]", message, took);
+ logger.warn("handling inbound transport message [{}] took [{}ms] which is above the warn threshold of [{}ms]", message, took, logThreshold);
++
    void setSlowLogThreshold(TimeValue slowLogThreshold) {
        this.slowLogThresholdMs = slowLogThreshold.getMillis();
    }
    void inboundMessage(TcpChannel channel, InboundMessage message) throws Exception {
        channel.getChannelStats().markAccessed(threadPool.relativeTimeInMillis());
        TransportLogger.logInboundMessage(channel, message);
Could we move the timing up to this level, since the logging on this line may also be a source of slowness?
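The shape of "moving the timing up" could look roughly like this (a hedged sketch with hypothetical names; the Runnable parameters stand in for TransportLogger.logInboundMessage and the real message dispatch, and a LongSupplier stands in for the thread pool clock):

```java
import java.util.function.LongSupplier;

// Hypothetical sketch: the start timestamp is taken BEFORE the trace logging
// call, so slowness in the logging itself is also covered by the measurement.
class TimedInboundSketch {
    private volatile long slowLogThresholdMs = 5_000;
    long lastMeasuredMs = -1; // exposed for the sketch's sake

    void handle(Runnable traceLogging, Runnable dispatch, LongSupplier clockMs) {
        final long startTime = clockMs.getAsLong(); // taken at the top level
        try {
            traceLogging.run(); // stand-in for TransportLogger.logInboundMessage
            dispatch.run();     // stand-in for the actual message handling
        } finally {
            final long took = clockMs.getAsLong() - startTime;
            lastMeasuredMs = took;
            if (slowLogThresholdMs > 0 && took > slowLogThresholdMs) {
                System.err.printf("handling inbound transport message took [%dms]%n", took);
            }
        }
    }
}
```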
++
I tend to think that if we merge this on by default at 300ms, the primary outcome is that we are going to slow-log whenever GCs happen and disrupt time ticks or request handling.
I wonder how bad this can really get? Even if we run into a complete freeze of the JVM of more than 300ms, we will only log a burst of #processors lines once that's over in a one-off manner; that shouldn't be a big deal? It's not like long freezes would actually result in larger bursts here. That said, if we're actually concerned about this maybe #62444 (comment) is the simplest fix for now (or just putting the thing at debug in general)?
I think that adding WARN logging to production that is designed for us to track down bugs but will mostly trigger in non-bug scenarios (GCs combined with an incredibly coarse user-space timing mechanism) is harmful. I think that setting it to a shockingly high number (
LGTM, thanks Armin & Tim
Thanks David & Tim :)
Same as #62444 but for REST requests.
Similar to #62444 but for the outbound path. This does not log the concrete message that was slow to send like we do on the inbound path, and it does not detect slowness in individual transport handler logic (this is done via the inbound handler logging already); instead it warns if it takes a long time to hand off the message to the relevant transport thread and then transfer the message over the wire. This gives some visibility into the stability of the network connection itself and into the reasons for slow network responses (if they are the result of slow networking on the sender).
Add simple WARN logging on slow inbound TCP messages.
This approach differs from the one used by the test slow logging in that it is more high level, logging the concrete request that was slow instead of the stack trace.
It allows tracking down slowness on TCP inbound (which is the most likely route to experience problematic slowness IMO). The same approach can be added to REST inbound handling and with additional effort also to TCP outbound handling.
I think this approach is much more suitable for finding things like #57937 quickly, has negligible overhead, and is a small change that can go into 7.10 already. Building the same logic we use in tests would require much larger changes to our Netty code-base and would not add much beyond the ability to track down dead-locked transport threads.