Add WARN Logging on Slow Transport Message Handling #62444

Conversation

original-brownbear (Member)

Add simple WARN logging on slow inbound TCP messages.

This approach differs from the one used by the test slow logging in that it is more high-level, logging the concrete request that was slow instead of a stack trace.
It allows tracking down slowness on TCP inbound handling (which is the most likely route to experience problematic slowness IMO). The same approach can be added to REST inbound handling and, with additional effort, also to TCP outbound handling.
I think this approach is much more suitable for finding things like #57937 quickly, has negligible overhead, and is a small change that can go into 7.10 already. Building the same logic we use in tests would require much larger changes to our Netty codebase and would not add much beyond the ability to track down dead-locked transport threads.

Add simple WARN logging on slow inbound tcp messages.
original-brownbear added >non-issue, :Distributed/Network (Http and internode communication implementations), v8.0.0, v7.10.0 labels Sep 16, 2020
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (:Distributed/Network)

elasticmachine added the Team:Distributed (Meta label for distributed team) label Sep 16, 2020
@@ -71,6 +74,10 @@ void setMessageListener(TransportMessageListener listener) {
}
}

void setSlowLogThreshold(TimeValue slowLogThreshold) {
original-brownbear (Member Author)

Admittedly the chain of setters here is a little hacky, but:

  1. Doing it this way made this a much, much smaller change than passing the cluster settings directly to the InboundHandler (see the sketch below).
  2. I'd do a follow-up of this logic for the outbound path, and that would require handling the threshold setting in the TransportService anyway.
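
Roughly what I mean, as a minimal sketch (where exactly the update consumer is registered and the TransportSettings location of the setting are assumptions here, not the literal diff):

// in TransportService: register a consumer for the dynamic setting and push updates
// down through plain setters instead of handing ClusterSettings to the InboundHandler
clusterSettings.addSettingsUpdateConsumer(
    TransportSettings.SLOW_OPERATION_THRESHOLD_SETTING, transport::setSlowLogThreshold);

// in TcpTransport: forward to the inbound handler, which only stores the raw millis value
void setSlowLogThreshold(TimeValue slowLogThreshold) {
    inboundHandler.setSlowLogThreshold(slowLogThreshold);
}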

@@ -132,6 +132,12 @@
Arrays.asList("internal:coordination/fault_detection/*"),
Function.identity(), Setting.Property.Dynamic, Setting.Property.NodeScope);

// Time that processing an inbound message on a transport thread may take at the most before a warning is logged
public static final Setting<TimeValue> SLOW_OPERATION_THRESHOLD_SETTING =
Setting.positiveTimeSetting("transport.slow_operation_logging_threshold", TimeValue.timeValueMillis(300),
original-brownbear (Member Author)

300ms since our default resolution on the timer thread is 200ms. This is already plenty IMO (3 requests per second isn't great on a transport thread) but not so low that any cgroup throttling or other system slowness instantly results in massive log spam.
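
For context, the coarse resolution comes from the relative-time clock being a cached value that a background task refreshes periodically. A self-contained illustration of that pattern (plain Java, not Elasticsearch's actual ThreadPool code; the 200ms interval is just the default mentioned above):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

class CachedTimer {
    // refreshed roughly every 200ms by default, so any duration derived from it
    // has about 200ms granularity
    private final AtomicLong relativeTimeMillis = new AtomicLong(System.nanoTime() / 1_000_000);

    CachedTimer() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(
            () -> relativeTimeMillis.set(System.nanoTime() / 1_000_000), 200, 200, TimeUnit.MILLISECONDS);
    }

    long relativeTimeInMillis() {
        return relativeTimeMillis.get();
    }
}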

Contributor

I'm still a bit concerned about log spam -- we aren't aware of anything that blocks the transport threads for so long today but that might just be because we don't surface it yet, and some configs might be hitting this a lot. #processors threads times 3 messages per second is too much for my taste. WDYT about a simple rate limit to avoid logging more than one of these per ten seconds or something? That'd be enough to point us in the right direction.

original-brownbear (Member Author)

I think it might be kind of nice to be able to see bursts here (thinking about spotting something like CPU throttling), i.e. a time period where there's a bunch of logging across multiple threads for all kinds of messages without a clear pattern. I guess we could use something like the log4j burst filter here to capture that kind of thing and still rate limit, but what about this:
We could just do something like a hard 5s threshold above which we always WARN and make this a DEBUG at 300ms?

Contributor

I'd be happy with 5s -- I think this would surface problems bad enough to drop nodes from the cluster without risking too much junk.

No need for a separate debug logging threshold IMO -- given that it would need user involvement to see more events in the logs, they may as well just reduce transport.slow_operation_logging_threshold.

original-brownbear (Member Author)

Fair point -> pushed the change to 5s :)

@DaveCTurner (Contributor) left a comment

Thanks @original-brownbear this is pretty straightforward. I left a few comments.

final long took = threadPool.relativeTimeInMillis() - startTime;
final long logThreshold = slowLogThresholdMs;
if (logThreshold > 0 && took > logThreshold) {
logger.warn("Slow handling of transport message [{}] took [{}ms]", message, took);
Contributor

It'd be useful to include the phrase "warn threshold", as this is a useful search term for the logs when investigating general slowness/instability problems:

Suggested change
logger.warn("Slow handling of transport message [{}] took [{}ms]", message, took);
logger.warn("handling inbound transport message [{}] took [{}ms] which is above the warn threshold of [{}ms]", message, took, logThreshold);

original-brownbear (Member Author)

++

void setSlowLogThreshold(TimeValue slowLogThreshold) {
this.slowLogThresholdMs = slowLogThreshold.getMillis();
}

void inboundMessage(TcpChannel channel, InboundMessage message) throws Exception {
channel.getChannelStats().markAccessed(threadPool.relativeTimeInMillis());
TransportLogger.logInboundMessage(channel, message);
Contributor

Could we move the timing up to this level, since the logging on this line may also be a source of slowness?

original-brownbear (Member Author)

++
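
For illustration, moving the measurement up to inboundMessage would look roughly like this (a sketch pieced together from the snippets above, not the exact merged code; the messageReceived call and the try/finally placement are assumptions):

void inboundMessage(TcpChannel channel, InboundMessage message) throws Exception {
    final long startTime = threadPool.relativeTimeInMillis();
    try {
        channel.getChannelStats().markAccessed(startTime);
        // the trace logging itself can be slow, so it now falls inside the measured section
        TransportLogger.logInboundMessage(channel, message);
        // dispatch to the registered request/response handlers
        messageReceived(channel, message);
    } finally {
        final long took = threadPool.relativeTimeInMillis() - startTime;
        final long logThreshold = slowLogThresholdMs;
        if (logThreshold > 0 && took > logThreshold) {
            logger.warn("handling inbound transport message [{}] took [{}ms] which is above the warn threshold of [{}ms]",
                message, took, logThreshold);
        }
    }
}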


@Tim-Brooks (Contributor)

I tend to think that if we merge this on by default at 300ms, the primary outcome is that we are going to slow-log whenever GCs happen and disrupt time ticks or request handling.

@original-brownbear (Member Author)

I tend to think that if we merge this on by default at 300ms, the primary outcome is that we are going to slow-log whenever GCs happen and disrupt time ticks or request handling.

I wonder how bad this can really get? Even if we run into a complete freeze of the JVM for more than 300ms, we will only log a one-off burst of #processors lines once it's over; that shouldn't be a big deal? It's not like longer freezes would actually result in larger bursts here.

That said, if we're actually concerned about this, maybe #62444 (comment) is the simplest fix for now (or just putting the thing at DEBUG in general)?

@Tim-Brooks (Contributor) commented Sep 16, 2020

I wonder how bad this can really get? Even if we run into a complete freeze of the JVM for more than 300ms, we will only log a one-off burst of #processors lines once it's over; that shouldn't be a big deal?

I think that adding WARN logging to production that is designed for us to track down bugs but will mostly trigger in non-bug scenarios (GCs combined with an incredibly coarse user-space timing mechanism) is harmful. I think that setting it to a shockingly high number (5s), with the ability to adjust it dynamically to a lower number as necessary, seems fine.

@DaveCTurner (Contributor) left a comment

LGTM, thanks Armin & Tim

@original-brownbear (Member Author)

Thanks David & Tim :)

original-brownbear merged commit cb5f104 into elastic:master Sep 17, 2020
original-brownbear deleted the add-slow-warning-transport-threads branch September 17, 2020 06:46
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Sep 17, 2020
Add simple WARN logging on slow inbound TCP messages.
original-brownbear added a commit that referenced this pull request Sep 17, 2020
Add simple WARN logging on slow inbound TCP messages.
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Dec 2, 2020
original-brownbear added a commit that referenced this pull request Dec 2, 2020
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Dec 2, 2020
original-brownbear added a commit that referenced this pull request Dec 3, 2020
original-brownbear restored the add-slow-warning-transport-threads branch December 6, 2020 18:55
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Jan 18, 2021
Similar to elastic#62444 but for the outbound path.

This does not log the concrete message that was slow to send
like we do on the inbound path.
This does not detect slowness in individual transport handler logic
(that is already covered by the inbound handler logging); instead, it
warns if it takes a long time to hand off the message to the relevant
transport thread and then transfer the message over the wire.
This gives some visibility into the stability of the network
connection itself and into the reasons for slow network
responses.
original-brownbear added a commit that referenced this pull request Jan 19, 2021
Similar to #62444 but for the outbound path.

This does not detect slowness in individual transport handler logic
(that is already covered by the inbound handler logging); instead, it
warns if it takes a long time to hand off the message to the relevant
transport thread and then transfer the message over the wire.
This gives some visibility into the stability of the network
connection itself and into the reasons for slow network
responses (if they are the result of slow networking on the sender).
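
Not part of this PR, but as a rough illustration of the outbound-path idea described in that commit message (the sendMessage signature, the ActionListener.runBefore usage, and the field names here are assumptions, not the follow-up's actual code):

void sendMessage(TcpChannel channel, BytesReference message, ActionListener<Void> listener) {
    final long startTime = threadPool.relativeTimeInMillis();
    // measure from handing the bytes to the channel until the write has completed on the wire
    channel.sendMessage(message, ActionListener.runBefore(listener, () -> {
        final long took = threadPool.relativeTimeInMillis() - startTime;
        final long logThreshold = slowLogThresholdMs;
        if (logThreshold > 0 && took > logThreshold) {
            logger.warn("sending transport message of size [{}] took [{}ms] which is above the warn threshold of [{}ms]",
                message.length(), took, logThreshold);
        }
    }));
}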
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Jan 19, 2021
original-brownbear added a commit that referenced this pull request Jan 19, 2021
Labels
:Distributed/Network (Http and internode communication implementations), >non-issue, Team:Distributed (Meta label for distributed team), v7.10.0, v8.0.0-alpha1
5 participants