promtail: Retry 429 rate limit errors from Loki, increase default retry limits #1840

Merged
merged 1 commit into master from retry-429 on Mar 23, 2020

Conversation

@slim-bean (Collaborator) commented Mar 23, 2020

Currently, promtail will only retry 500 errors from Loki, but we send rate limit errors as 429s.

There has been discussion about this behavior a few times, with the current implementation basically following this logic:

If a client is sending so many logs that the server is rate limiting, retrying only makes the problem worse as now you have the original volume plus retry volume.

Through discussion, other valid cases came up that would benefit from retrying 429s, such as a rate overage that lasts longer than our burst limit allows, or recovering from rate limits after a Loki server has been down.

At any rate, this change mostly just moves where the logs get dropped if you hit rate limits and are never successful in sending below the threshold.

Now the behavior will be to sit and retry sending a batch while reading from the log file stalls; if the 429s clear, promtail should be able to catch up and send all logs.

If the 429s do not clear, eventually the underlying file will roll, and when promtail reads again it will miss what was in the rolled file (with some caveats: we do try to send one last time from a rolled file, but this may or may not succeed based on the response from the server).

This PR also introduces larger backoff and retry defaults in promtail, allowing up to about 8.5 minutes of attempts before giving up on the batch and discarding it.

Signed-off-by: Edward Welch edward.welch@grafana.com
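
For reference, a minimal Go sketch of the retry window those defaults imply, assuming a 500ms initial backoff that doubles on each retry for up to 10 retries (the 5-minute cap is an assumption here and is never reached within 10 retries; any jitter the real client may apply is ignored):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Assumed defaults from this PR: 500ms minimum backoff, doubling per
	// retry, 10 retries; the 5m cap is an assumption and is not hit here.
	minBackoff := 500 * time.Millisecond
	maxBackoff := 5 * time.Minute
	maxRetries := 10

	total := time.Duration(0)
	backoff := minBackoff
	for i := 1; i <= maxRetries; i++ {
		fmt.Printf("retry %2d: wait %v\n", i, backoff)
		total += backoff
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
	// Total comes out to 8m31.5s (511.5s), i.e. roughly the 8.5 minutes of
	// attempts described above before the batch is dropped.
	fmt.Println("total:", total)
}
```

With these numbers the cap never comes into play, which matches the "we would never get to the max backoff (although close)" note later in the thread.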

… configuring multiple client sections in promtail, also increased the backoff and retry settings in promtail.

Signed-off-by: Edward Welch <edward.welch@grafana.com>
@owen-d (Member) left a comment

Looks good, not sure what’s contentious here.

However, I’d probably start the minimum retry at 1s. Maybe increase the maximum backoff, too, assuming we won’t overbuffer in memory (do we have backpressure so we won’t continue reading files when we can’t push?)

We definitely seem vulnerable to network partitions, but that can't be helped without some sort of WAL and I don't want to go down that route (at least not yet).

flag.DurationVar(&c.BackoffConfig.MinBackoff, "client.min-backoff", 100*time.Millisecond, "Initial backoff time between retries.")
flag.DurationVar(&c.BackoffConfig.MaxBackoff, "client.max-backoff", 5*time.Second, "Maximum backoff time between retries.")
// Default backoff schedule: 0.5s, 1s, 2s, 4s, 8s, 16s, 32s, 64s, 128s, 256s(4.267m) For a total time of 511.5s(8.5m) before logs are lost
flag.IntVar(&c.BackoffConfig.MaxRetries, "client.max-retries", 10, "Maximum number of retires when sending batches.")
Contributor


Suggested change
flag.IntVar(&c.BackoffConfig.MaxRetries, "client.max-retries", 10, "Maximum number of retires when sending batches.")
flag.IntVar(&c.BackoffConfig.MaxRetries, "client.max-retries", 10, "Maximum number of retries when sending batches.")

@@ -68,6 +68,11 @@ Supported contents and default values of `config.yaml`:

# Describes how Promtail connects to multiple instances
# of Loki, sending logs to each.
# WARNING: If one of the remote Loki servers fails to respond or responds
# with any error which is retriable, this will impact sending logs to any
Member


Suggested change
# with any error which is retriable, this will impact sending logs to any
# with any error which is retryable, this will impact sending logs to any

@slim-bean (Collaborator, Author)

However, I’d probably start the minimum retry at 1s. Maybe increase the maximum backoff, too, assuming we won’t overbuffer in memory (do we have backpressure so we won’t continue reading files when we can’t push?)

I went with 0.5s because our batch send interval is 1s; this gives at least one retry before the next batch would be ready to send, in case of minor network interruptions, etc.

With the current settings of 10 retries we would never get to the max backoff (although close).

And yeah, basically everything on the read-to-send path is synchronous. Each reader does hand entries to the batcher over a channel, but it's an unbuffered channel, so it blocks while a batch is being sent, which in turn blocks the readers from reading more from the file.
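
As a rough illustration of that backpressure (the names and structure here are illustrative, not the actual promtail reader/client code), an unbuffered channel makes the producer block whenever the consumer is busy:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Unbuffered channel: a send blocks until the consumer receives, so the
	// "reader" below cannot run ahead of the "sender" while it is retrying.
	entries := make(chan string)

	go func() {
		for i := 1; i <= 5; i++ {
			line := fmt.Sprintf("log line %d", i)
			fmt.Println("read from file:", line)
			entries <- line // blocks here until the sender is ready, stalling further reads
		}
		close(entries)
	}()

	for line := range entries {
		// Stand-in for building a batch and sending it with backoff/retries.
		time.Sleep(200 * time.Millisecond)
		fmt.Println("sent:", line)
	}
}
```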

@cyriltovena (Contributor) left a comment

LGTM. However, I'm not sure if the docker driver and fluentbit will use those defaults; worth double checking how they are set up.

@slim-bean (Collaborator, Author)

However, I'm not sure if the docker driver and fluentbit will use those defaults; worth double checking how they are set up.

From what I could see, they do not override these defaults anywhere, nor in any of the example config files, so I think we are good to go here.

@slim-bean merged commit 6841c41 into master Mar 23, 2020
@slim-bean deleted the retry-429 branch March 23, 2020 20:58
@thejvmshid commented Aug 23, 2021

promtail_1  | level=warn ts=2021-08-23T07:59:26.712127914Z caller=client.go:344 component=client host=10.10.12.189:18080 msg="error sending batch, will retry" status=429 error="server returned HTTP status 429 Too Many Requests (429): entry with timestamp 2021-08-23 07:59:26.621124043 +0000 UTC ignored, reason: 'stream rate limit exceeded' for stream: {filename=\"/var/log/b/a.log\", host=\"10.10.12.128\", job=\"a\"},"
promtail_1  | level=warn ts=2021-08-23T07:59:27.369168249Z caller=client.go:344 component=client host=10.10.12.189:18080 msg="error sending batch, will retry" status=429 error="server returned HTTP status 429 Too Many Requests (429): entry with timestamp 2021-08-23 07:59:26.621124043 +0000 UTC ignored, reason: 'stream rate limit exceeded' for stream: {filename=\"/var/log/b/a.log\", host=\"10.10.12.128\", job=\"a\"},"

Promtail version: latest
