promtail: Retry 429 rate limit errors from Loki, increase default retry limits #1840

Merged
merged 1 commit into master from retry-429 on Mar 23, 2020

Conversation

@slim-bean (Collaborator) commented Mar 23, 2020

Currently, promtail will only retry 500 errors from Loki, but we send rate limit errors as 429s.

There has been discussion about this behavior a few times, with the current implementation basically following this logic:

If a client is sending so many logs that the server is rate limiting, retrying only makes the problem worse as now you have the original volume plus retry volume.

Through discussion, other valid cases came up that would benefit from retrying 429s, such as a rate overage that lasts longer than our burst limit allows, or recovering from rate limits after a Loki server has been down.

At any rate, this change mostly just moves where the logs get dropped if you hit rate limits and are never successful in sending below the threshold.

Now the behavior will be to sit and retry sending a batch while reading from the log file stalls; if the 429s clear, promtail should be able to catch up and send all logs.

If the 429s do not clear, eventually the underlying file will roll, and when promtail reads again it will miss what was in the rolled file (with some caveats: we do try to send one last time from a rolled file, but this may or may not succeed based on the response from the server).

This PR also introduces larger backoff and retry defaults in promtail, allowing up to about 8.5 minutes of attempts before giving up on the batch and discarding it.

Signed-off-by: Edward Welch edward.welch@grafana.com
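
For reference, a minimal Go sketch of the retry window those defaults imply, assuming a 500ms initial backoff that doubles on each retry for up to 10 retries (the 5-minute cap is an assumption here and is never reached within 10 retries; any jitter the real client may apply is ignored):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Assumed defaults from this PR: 500ms minimum backoff, doubling per
	// retry, 10 retries; the 5m cap is an assumption and is not hit here.
	minBackoff := 500 * time.Millisecond
	maxBackoff := 5 * time.Minute
	maxRetries := 10

	total := time.Duration(0)
	backoff := minBackoff
	for i := 1; i <= maxRetries; i++ {
		fmt.Printf("retry %2d: wait %v\n", i, backoff)
		total += backoff
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
	// Total comes out to 8m31.5s (511.5s), i.e. roughly the 8.5 minutes of
	// attempts described above before the batch is dropped.
	fmt.Println("total:", total)
}
```

With these numbers the cap never comes into play, which matches the "we would never get to the max backoff (although close)" note later in the thread.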

… configuring multiple client sections in promtail, also increased the backoff and retry settings in promtail.

Signed-off-by: Edward Welch <edward.welch@grafana.com>
@owen-d (Member) left a comment

Looks good, not sure what’s contentious here.

However, I’d probably start the minimum retry at 1s. Maybe increase the maximum backoff, too, assuming we won’t overbuffer in memory (do we have backpressure so we won’t continue reading files when we can’t push?)

We definitely seem vulnerable to network partitions, but that can't be helped without some sort of WAL and I don't want to go down that route (at least not yet).

flag.DurationVar(&c.BackoffConfig.MinBackoff, "client.min-backoff", 100*time.Millisecond, "Initial backoff time between retries.")
flag.DurationVar(&c.BackoffConfig.MaxBackoff, "client.max-backoff", 5*time.Second, "Maximum backoff time between retries.")
// Default backoff schedule: 0.5s, 1s, 2s, 4s, 8s, 16s, 32s, 64s, 128s, 256s(4.267m) For a total time of 511.5s(8.5m) before logs are lost
flag.IntVar(&c.BackoffConfig.MaxRetries, "client.max-retries", 10, "Maximum number of retires when sending batches.")
Contributor


Suggested change
flag.IntVar(&c.BackoffConfig.MaxRetries, "client.max-retries", 10, "Maximum number of retires when sending batches.")
flag.IntVar(&c.BackoffConfig.MaxRetries, "client.max-retries", 10, "Maximum number of retries when sending batches.")

@@ -68,6 +68,11 @@ Supported contents and default values of `config.yaml`:

# Describes how Promtail connects to multiple instances
# of Loki, sending logs to each.
# WARNING: If one of the remote Loki servers fails to respond or responds
# with any error which is retriable, this will impact sending logs to any
Member


Suggested change
# with any error which is retriable, this will impact sending logs to any
# with any error which is retryable, this will impact sending logs to any

@slim-bean (Collaborator, Author)

However, I’d probably start the minimum retry at 1s. Maybe increase the maximum backoff, too, assuming we won’t overbuffer in memory (do we have backpressure so we won’t continue reading files when we can’t push?)

I went with 0.5s because our batch send interval is 1s; this gives at least one retry before the next batch would be ready to send, in case of minor network interruptions, etc.

With the current settings of 10 retries we would never get to the max backoff (although close).

And yeah, basically everything on the read-to-send path is synchronous. Each reader does hand entries to the batcher over a channel, but it's an unbuffered channel, so it blocks while a batch is being sent, which in turn blocks the readers from reading more from the file.
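
As a rough illustration of that backpressure (the names and structure here are illustrative, not the actual promtail reader/client code), an unbuffered channel makes the producer block whenever the consumer is busy:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Unbuffered channel: a send blocks until the consumer receives, so the
	// "reader" below cannot run ahead of the "sender" while it is retrying.
	entries := make(chan string)

	go func() {
		for i := 1; i <= 5; i++ {
			line := fmt.Sprintf("log line %d", i)
			fmt.Println("read from file:", line)
			entries <- line // blocks here until the sender is ready, stalling further reads
		}
		close(entries)
	}()

	for line := range entries {
		// Stand-in for building a batch and sending it with backoff/retries.
		time.Sleep(200 * time.Millisecond)
		fmt.Println("sent:", line)
	}
}
```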

@cyriltovena (Contributor) left a comment

LGTM. However, I'm not sure if the docker driver and fluentbit will use those defaults; worth double checking how they are set up.

@slim-bean (Collaborator, Author)

However, I'm not sure if the docker driver and fluentbit will use those defaults; worth double checking how they are set up.

From what I could see, they do not override these defaults anywhere, nor in any of the example config files, so I think we are good to go here.

@slim-bean merged commit 6841c41 into master Mar 23, 2020
@slim-bean deleted the retry-429 branch March 23, 2020 20:58
@thejvmshid commented Aug 23, 2021

promtail_1  | level=warn ts=2021-08-23T07:59:26.712127914Z caller=client.go:344 component=client host=10.10.12.189:18080 msg="error sending batch, will retry" status=429 error="server returned HTTP status 429 Too Many Requests (429): entry with timestamp 2021-08-23 07:59:26.621124043 +0000 UTC ignored, reason: 'stream rate limit exceeded' for stream: {filename=\"/var/log/b/a.log\", host=\"10.10.12.128\", job=\"a\"},"
promtail_1  | level=warn ts=2021-08-23T07:59:27.369168249Z caller=client.go:344 component=client host=10.10.12.189:18080 msg="error sending batch, will retry" status=429 error="server returned HTTP status 429 Too Many Requests (429): entry with timestamp 2021-08-23 07:59:26.621124043 +0000 UTC ignored, reason: 'stream rate limit exceeded' for stream: {filename=\"/var/log/b/a.log\", host=\"10.10.12.128\", job=\"a\"},"

Promtail version: latest
