
fix(docker-driver): Propagate promtail's client.Stop properly #2898

Merged
merged 5 commits into master from fix/docker-driver
Dec 1, 2020

Conversation

kavirajk
Contributor

@kavirajk kavirajk commented Nov 8, 2020

The promtail client's Stop is not propagated properly to its batchSend method,
which is called from the client's run method.

Although the run method multiplexes on the c.quit channel, the other branch can
block inside batchSend without ever listening on c.quit. This fixes it by
multiplexing on c.quit inside batchSend as well.
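
Roughly, the pattern looks like this minimal, self-contained sketch (illustrative names and timings, not the actual diff): the retry loop also selects on the quit channel, so Stop can interrupt an in-flight backoff instead of waiting out every retry.

package main

import (
	"errors"
	"fmt"
	"time"
)

type client struct {
	quit chan struct{}
}

// send pretends the Loki endpoint is unreachable, forcing retries.
func (c *client) send() error { return errors.New("endpoint unreachable") }

// batchSend retries a failed send, but also watches quit so Stop is honoured
// even while the retry/backoff loop is running.
func (c *client) batchSend(maxRetries int, delay time.Duration) {
	for i := 0; i < maxRetries; i++ {
		if err := c.send(); err == nil {
			return
		}
		select {
		case <-c.quit: // closed by Stop: abandon remaining retries
			fmt.Println("stop requested, abandoning retries")
			return
		case <-time.After(delay): // back off, then retry
		}
	}
}

func (c *client) Stop() { close(c.quit) }

func main() {
	c := &client{quit: make(chan struct{})}
	go func() {
		time.Sleep(100 * time.Millisecond)
		c.Stop() // e.g. the docker driver shutting the client down
	}()
	c.batchSend(10, time.Second) // returns promptly instead of retrying for minutes
}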

Another option: expose client.run to callers and let them control it via a
context. But that requires API changes, such as accepting a ctx in a client.Run method.

fixes: #2361

Before

-bash5.0$ time docker kill 0eb109b4e588
0eb109b4e588

real	12m56.192s
user	0m0.206s
sys	0m0.119s

After

-bash5.0$ time docker kill e67c1a43000a
e67c1a43000a

real	0m0.298s
user	0m0.108s
sys	0m0.058s

The following plugin has the fix, for anyone who wants to experiment:

-bash5.0$ docker plugin install kavirajk/loki-docker-driver:latest --alias loki-fix

@kavirajk kavirajk changed the title fix(docker-driver): Multiplex the client.batchSend fix(docker-driver): Propagate promtail's client.Stop properly Nov 8, 2020
@kavirajk kavirajk force-pushed the fix/docker-driver branch 2 times, most recently from 6d67178 to 2416feb on November 9, 2020 08:53
@pull-request-size pull-request-size bot added size/L and removed size/M labels Nov 9, 2020
@kavirajk kavirajk requested review from slim-bean and a team November 9, 2020 11:26
@cyriltovena
Contributor

cyriltovena commented Nov 9, 2020

@slim-bean are you OK with stopping retries early? I think you had opinions?

I think you could remove the quit channel, use a context.WithCancel instead, and call the cancel in the close. This way we have only one synchronization mechanism and you don't have to change anything in the retry logic. WDYT?

@kavirajk
Contributor Author

kavirajk commented Nov 9, 2020

@cyriltovena, I also don't love the idea of having to listen on the quit channel in multiple places!

But I don't understand what you mean by calling cancel() in close. Do you mean in the client.Stop() method?

The core problem is that the ctx we pass into batchSend needs to be cancelled when client.Stop() is called. So you want to store the context and cancel func on the client struct, use the same ctx in every batchSend call, and call cancel inside client.Stop()?

@cyriltovena
Contributor

My idea was:

type client struct {
	...
	ctx    context.Context
	cancel context.CancelFunc
	...
}

func NewClient(ctx, ...) {
	ctx, cancel := context.WithCancel(ctx)
	return &client{
		ctx:    ctx,
		cancel: cancel,
	}
}

func (c *client) Stop() {
	c.cancel()
	...
}

Then you can use the client context in the backoff, and you don't need to change the rest of the code.

@kavirajk
Contributor Author

@cyriltovena, yeah, right! I actually thought about that approach as well.

I just felt like having ctx as part of the struct isn't the best approach, but here I think it's OK, since the batchSend call is private to the client.

One other thing: in that case we can remove the quit channel entirely and replace it with <-ctx.Done(). Having both ctx and quit becomes unnecessary then! WDYT?
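
A self-contained sketch of that combination (illustrative names, not the real promtail client): the context and its cancel func live on the struct, Stop just calls cancel, and the retry loop waits on ctx.Done() instead of a separate quit channel.

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

type client struct {
	ctx    context.Context
	cancel context.CancelFunc
}

func newClient(ctx context.Context) *client {
	ctx, cancel := context.WithCancel(ctx)
	return &client{ctx: ctx, cancel: cancel}
}

// send pretends the endpoint is unreachable, forcing retries.
func (c *client) send(ctx context.Context) error { return errors.New("endpoint unreachable") }

// batchSend passes the client's own context to send and to the backoff wait,
// so Stop ends both the in-flight request and any pending retry.
func (c *client) batchSend(maxRetries int, delay time.Duration) {
	for i := 0; i < maxRetries; i++ {
		if err := c.send(c.ctx); err == nil {
			return
		}
		select {
		case <-c.ctx.Done(): // cancelled by Stop: no quit channel needed
			fmt.Println("context cancelled, abandoning retries")
			return
		case <-time.After(delay):
		}
	}
}

func (c *client) Stop() { c.cancel() }

func main() {
	c := newClient(context.Background())
	go func() {
		time.Sleep(100 * time.Millisecond)
		c.Stop()
	}()
	c.batchSend(10, time.Second)
}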

@cyriltovena
Contributor

Yes, I want quit to go. But more importantly, I think @slim-bean has strong opinions about this whole PR. While this helps docker, it introduces some possibility of losing data for other use cases. I'm wondering if this behaviour could be optional and used just for the docker driver.

@slim-bean
Collaborator

I think what I'd like is for SIGINT to initiate shutdown while still respecting the retries and timeouts, and then SIGKILL exits no matter what.

It might also be nice to trap a second SIGINT to shut down regardless of retries (after the first SIGINT we should log a message saying that a second SIGINT or a SIGKILL will force quit and abandon retries).
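
A hedged, standalone sketch of that two-stage handling (not the driver's actual main): the first SIGINT/SIGTERM starts a graceful shutdown and logs a warning, and a second one abandons retries and force-quits.

package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGINT, syscall.SIGTERM)

	<-sig // first signal: begin graceful shutdown, retries still respected
	fmt.Println("shutting down; send SIGINT/SIGTERM again to abandon retries and force quit")

	go func() {
		<-sig // second signal: force quit regardless of pending retries
		fmt.Println("second signal received, forcing shutdown")
		os.Exit(1)
	}()

	gracefulShutdown()
}

// gracefulShutdown stands in for draining batches with retries and timeouts.
func gracefulShutdown() {
	time.Sleep(2 * time.Second)
	fmt.Println("all batches drained")
}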

@kavirajk
Contributor Author

@slim-bean I see what you mean. It can be done for SIGINT the way you want, by watching for the signal in the driver's main and doing a graceful shutdown in the driver's StopLogging method (I have to check what can be done for the retries though).

But nothing can be done about SIGKILL, as it cannot be caught!

I'll update the PR accordingly. For cancelling the batchSend context, I'll go with storing the context and cancel func on the client struct. I won't delete quit for now!

@slim-bean
Collaborator

But nothing can be done about SIGKILL, as it cannot be caught!

Yeah, sorry, SIGTERM is probably the correct signal instead of SIGKILL.

@slim-bean
Collaborator

Interesting, it looks like docker only sends a SIGTERM; I think the general idea still applies though.

We handle the SIGINT/SIGTERM and initiate a shutdown, but wait for any active sends to complete.

A second SIGINT/SIGTERM would interrupt any active sends and immediately shut down the process.

I realize there may be some controversy about what the correct behavior is here; I'm trying to strike a balance between data durability and intuitiveness, but I am interested in other opinions.

@kavirajk
Contributor Author

Hmm, SIGTERM is also for graceful shutdown, so we should handle it the same way as SIGINT IMO!

@kavirajk
Contributor Author

Ah! Saw your latest comment after posting mine!

So yeah! I agree with:

  1. Handling both SIGINT and SIGTERM with the same behaviour
  2. Graceful shutdown on the first and immediate shutdown on the second!

@codecov-io

Codecov Report

Merging #2898 (984246b) into master (fd451d9) will decrease coverage by 0.07%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2898      +/-   ##
==========================================
- Coverage   61.63%   61.56%   -0.08%     
==========================================
  Files         181      181              
  Lines       14712    14715       +3     
==========================================
- Hits         9068     9059       -9     
- Misses       4812     4828      +16     
+ Partials      832      828       -4     
Impacted Files Coverage Δ
pkg/promtail/client/client.go 77.85% <100.00%> (+0.48%) ⬆️
pkg/promtail/positions/positions.go 46.80% <0.00%> (-11.71%) ⬇️
pkg/promtail/targets/file/filetarget.go 63.38% <0.00%> (-0.71%) ⬇️

@kavirajk
Contributor Author

@cyriltovena @slim-bean Turns out docker doesn't send any signals (neither SIGINT nor SIGTERM) to the driver when docker stop or docker kill is executed. Docker just calls the driver's StopLogging method during these actions.

I verified this by adding signal handlers to the driver's main, but no signal is received (even though the docker daemon log says a signal is sent to the container). Something like:

	// graceful shutdown: log any SIGINT/SIGTERM the driver receives
	sig := make(chan os.Signal, 1) // buffered, since signal.Notify will not block sending to it
	signal.Notify(sig, syscall.SIGINT, syscall.SIGTERM)

	go func() {
		s := <-sig
		level.Info(util.Logger).Log("msg", "signal received", "signal", s)
	}()

So the best we can do currently is make sure the client's Stop cancels batchSend's context, which is what this PR does.

The only pending question is whether we remove c.quit in this PR or just deprecate it and remove it later. @cyriltovena @slim-bean WDYT?

@cyriltovena
Contributor

cyriltovena commented Nov 19, 2020

I think Ed is reluctant about the idea of stopping sending batches, which this PR currently does.

My suggestion would be to have two methods:

  • Stop(): would wait for all batches and retries to finish.
  • StopNow(): would also wait for the last batch to be sent, but not apply any retries in case of failure.

Then we could call StopNow() instead of Stop() when the driver asks for termination. This way the special behaviour is only for docker.

In addition, I would make the default client timeout 5s for docker instead of 10s. This way we have a guarantee that the container will stop after 5s in the worst case, and that it actually tried to send the last batch.
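
A minimal standalone sketch of the two stop paths (hypothetical field and method names; the merged code may differ): both let the current batch be attempted, but StopNow skips the backoff wait so a failing send is not retried.

package main

import (
	"errors"
	"fmt"
	"time"
)

type client struct {
	noRetry chan struct{} // closed by StopNow: attempt each batch once, skip retries
	done    chan struct{} // closed when the send loop has finished
}

// send pretends the endpoint is unreachable, so every attempt fails.
func (c *client) send(batch string) error { return errors.New("endpoint unreachable") }

func (c *client) sendWithRetries(batch string, maxRetries int, delay time.Duration) {
	for attempt := 1; attempt <= maxRetries; attempt++ {
		if err := c.send(batch); err == nil {
			return
		}
		select {
		case <-c.noRetry: // StopNow requested: the batch was attempted, don't retry
			fmt.Println("StopNow: abandoning retries for", batch)
			return
		case <-time.After(delay):
		}
	}
}

// Stop waits for all batches, including their retries, to finish.
func (c *client) Stop() { <-c.done }

// StopNow still lets the last batch be attempted, but abandons retries.
func (c *client) StopNow() {
	close(c.noRetry)
	<-c.done
}

func main() {
	c := &client{noRetry: make(chan struct{}), done: make(chan struct{})}
	go func() {
		defer close(c.done)
		c.sendWithRetries("last-batch", 10, time.Second)
	}()
	c.StopNow() // what the docker driver would call from StopLogging
}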

@kavirajk
Contributor Author

@cyriltovena thanks :) nice idea 👍. I actually thought about the same thing before, but didn't do it because adding StopNow() changes the client interface, and I wasn't sure at the time whether that might break other client implementations.

Looking into it. I can definitely add it if it's a minor change!

@kavirajk kavirajk force-pushed the fix/docker-driver branch 2 times, most recently from 718de29 to 54472cb on November 19, 2020 10:52
@codecov-io

codecov-io commented Nov 19, 2020

Codecov Report

Merging #2898 (820ec24) into master (14a5e22) will decrease coverage by 0.10%.
The diff coverage is 50.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2898      +/-   ##
==========================================
- Coverage   61.86%   61.76%   -0.11%     
==========================================
  Files         182      182              
  Lines       14870    14880      +10     
==========================================
- Hits         9200     9190      -10     
- Misses       4822     4843      +21     
+ Partials      848      847       -1     
Impacted Files Coverage Δ
cmd/docker-driver/config.go 9.34% <ø> (ø)
cmd/docker-driver/loki.go 0.00% <0.00%> (ø)
cmd/docker-driver/main.go 0.00% <ø> (ø)
cmd/fluent-bit/dque.go 0.00% <0.00%> (ø)
pkg/promtail/client/logger.go 75.00% <0.00%> (-2.78%) ⬇️
pkg/promtail/client/multi.go 80.00% <0.00%> (-5.72%) ⬇️
pkg/promtail/client/client.go 76.92% <77.77%> (-0.45%) ⬇️
pkg/promtail/positions/positions.go 46.80% <0.00%> (-11.71%) ⬇️
pkg/promtail/targets/file/filetarget.go 64.08% <0.00%> (-2.12%) ⬇️

@kavirajk kavirajk force-pushed the fix/docker-driver branch 2 times, most recently from 0f3a235 to 002c23d on November 19, 2020 12:28
@kavirajk
Contributor Author

@slim-bean can you take a look when you get a chance? thanks :)

stores `context` and its `cancel` in the client and uses them to
make upstream calls that need a `ctx`.

Also properly cancels all the upstream calls that use `ctx` when
the client's Stop is called.
 	var status int
 	for backoff.Ongoing() {
 		start := time.Now()
-		status, err = c.send(ctx, tenantID, buf)
+		status, err = c.send(c.ctx, tenantID, buf)
Contributor

I think you're also cancelling the current batch and the last batch with this.

Maybe you should add a test to verify that StopNow() always tries to send.

Use `StopNow()` from the docker-driver
@pull-request-size pull-request-size bot added size/L and removed size/M labels Nov 24, 2020
-		Timeout: client.Timeout,
+		// Avoid blocking the docker-driver on the worst case
+		// https://github.com/grafana/loki/pull/2898#issuecomment-730218963
+		Timeout: 5 * time.Second,
Collaborator

I want to avoid this; I'd like all clients to have the same timeout and backoff behavior.

Contributor

@cyriltovena cyriltovena left a comment

LGTM

@cyriltovena cyriltovena merged commit 410b9d9 into master Dec 1, 2020
@cyriltovena cyriltovena deleted the fix/docker-driver branch December 1, 2020 10:18
cyriltovena pushed a commit to cyriltovena/loki that referenced this pull request Jun 11, 2021
* Support Snappy compression for gRPC
fix: grafana#2898

Signed-off-by: Wing924 <weihe924stephen@gmail.com>
