
Fix cancel issue between Query Frontend and Query Scheduler #5113

Merged
merged 12 commits on Jan 13, 2022

Conversation

@kavirajk (Contributor) commented Jan 12, 2022

What this PR does / why we need it:
In some special cases (explained below), the Query Frontend fails to deliver the cancel signal even to the Query Frontend Worker.
That makes the querier run the query to completion (even after cancellation), wasting lots of resources.

This scenario is analogous to this simple Go program (runnable in the Go playground). The gist is that the default branch of a Go select always takes precedence when the other branches would block (sending or receiving on a channel). In our case, that is how the sending of the cancel signal was missed.
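
A minimal sketch (not the original playground snippet) of that gist: the unbuffered send would block because no receiver is ready, so the default branch wins and the value is silently dropped.

package main

import "fmt"

func main() {
	ch := make(chan int) // unbuffered, and nobody is receiving

	// The send would block, so select falls through to default
	// and the value is dropped without any error.
	select {
	case ch <- 1:
		fmt.Println("sent")
	default:
		fmt.Println("dropped")
	}
}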

Now to the original issue.

The main observation is that for every Querier that hadn't received the cancel, the corresponding Scheduler hadn't received it either.

  1. A single query on the QF was split into 128 subqueries.
  2. All of them were scheduled and picked up by queriers.
  3. I added logs at two points: where the QF sends the cancel to the Scheduler, and where the Scheduler loop forwards it to the querier.

Among those 128 subqueries, only 8 had their cancel forwarded from the QF to the Scheduler; the remaining 120 were dropped in the QF itself and never even reached the Scheduler.

The root cause: whenever the frontend worker loop is busy servicing a request on its main select branch (listening on w.requestCh), a cancel request sent by the QF via cancelCh has nowhere to go for a while (the channel is unbuffered, so the send would block). The sending side then falls into its default case and drops the cancel signal before it is ever forwarded to the Scheduler.

Basically, we need to make sure the cancel signal reaches the Scheduler, no matter how busy the frontend worker is to pick it up from cancelCh.
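
To make the failure mode concrete, here is a simplified, self-contained sketch of the pattern described above (the channel names follow the description; the rest is illustrative only, not the actual Loki worker code):

package main

import (
	"fmt"
	"time"
)

func main() {
	requestCh := make(chan string)
	cancelCh := make(chan uint64) // unbuffered, as in the original code

	// Frontend scheduler worker loop (simplified).
	go func() {
		for {
			select {
			case req := <-requestCh:
				// Simulate being busy enqueueing the request with the
				// Scheduler; while this runs, cancelCh is not being read.
				fmt.Println("enqueueing", req)
				time.Sleep(50 * time.Millisecond)
			case id := <-cancelCh:
				fmt.Println("forwarding cancel for query", id)
			}
		}
	}()

	requestCh <- "subquery-1"

	// The frontend now tries to forward a cancellation while the worker
	// is still busy with the request above.
	select {
	case cancelCh <- 1:
		fmt.Println("cancel delivered to worker")
	default:
		// With an unbuffered cancelCh and a busy worker, this branch is
		// taken and the cancellation never reaches the Scheduler.
		fmt.Println("cancel dropped")
	}

	time.Sleep(100 * time.Millisecond)
}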

Which issue(s) this PR fixes:
Fixes #5132

Special notes for your reviewer:

Checklist

  • Documentation added
  • Tests updated
  • Add an entry in the CHANGELOG.md about the changes.

Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
@kavirajk kavirajk changed the title from "Fix cancel issue between Query Frontend -> Query Scheduler" to "Fix cancel issue between Query Frontend and Query Scheduler" on Jan 12, 2022
Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
@cyriltovena cyriltovena marked this pull request as ready for review January 13, 2022 09:47
@cyriltovena cyriltovena requested a review from a team as a code owner January 13, 2022 09:47
@cyriltovena (Contributor) left a comment

LGTM

@colega PTAL I want to make sure we didn't miss anything.

Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
@colega (Contributor) commented Jan 13, 2022

I agree that this fixes the issue, but I think we're overcomplicating it a little bit. Cancelling the querier request isn't a strict need, just politeness that avoids wasting resources, so it would be great to ensure it, but it's not application-critical.

If there's no need to ensure that cancellation has happened, I think just adding some sensible capacity to each worker's cancelCh should be enough (while keeping the default: pass section, to avoid blocking if that cancellation channel is really full).

Let's say make(chan uint64, 1024): if we have more than 1024 cancelled requests waiting for the worker to be free, then we probably have bigger issues to worry about anyway.

WDYT?

Edit, I tried with this:

diff --git pkg/lokifrontend/frontend/v2/frontend_scheduler_worker.go pkg/lokifrontend/frontend/v2/frontend_scheduler_worker.go
index fa6395783..7d32f2f70 100644
--- pkg/lokifrontend/frontend/v2/frontend_scheduler_worker.go
+++ pkg/lokifrontend/frontend/v2/frontend_scheduler_worker.go
@@ -191,7 +191,7 @@ func newFrontendSchedulerWorker(conn *grpc.ClientConn, schedulerAddr string, fro
                schedulerAddr: schedulerAddr,
                frontendAddr:  frontendAddr,
                requestCh:     requestCh,
-               cancelCh:      make(chan uint64),
+               cancelCh:      make(chan uint64, 1024),
        }
        w.ctx, w.cancel = context.WithCancel(context.Background())

And the test from this PR also passes 👍
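
For context, a minimal sketch (illustrative only, not the Loki code) of why the added capacity helps: a buffered channel accepts sends even when no receiver is ready, so the non-blocking send only falls through to default once the buffer is actually full.

package main

import "fmt"

func main() {
	cancelCh := make(chan uint64, 1024) // buffered, as suggested above

	// Even with no goroutine currently receiving, this send succeeds
	// immediately as long as fewer than 1024 cancellations are queued,
	// so the default branch is only taken when the buffer is truly full.
	select {
	case cancelCh <- 42:
		fmt.Println("cancel buffered; the worker will drain it when free")
	default:
		fmt.Println("buffer full, cancel dropped")
	}
}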

cyriltovena and others added 2 commits January 13, 2022 11:38
Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>
Co-authored-by: Oleg Zaytsev <mail@olegzaytsev.com>
@cyriltovena (Contributor) commented Jan 13, 2022

Yeah this is interesting 🤔 + the lock could be expensive.

Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>
Comment on lines 326 to 327
default:
	req.enqueue <- enqueueResult{status: failed}
I would add a warning here, or even an error: this shouldn't ever happen and indicates a bug.

Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
@cyriltovena cyriltovena merged commit 37d0c6c into grafana:main Jan 13, 2022
colega added a commit to grafana/mimir that referenced this pull request Jan 13, 2022
With the previous implementation, if the worker was busy talking to the scheduler, we didn't push the cancellation, keeping that query running.

When cancelling a query, all its subqueries are cancelled at the same time, so this was most likely happening all the time (the first subquery scheduled on this worker was cancelled; the rest were not, because the worker was busy cancelling the first one).

Also removed the `<-ctx.Done()` escape point when waiting for the
enqueueing ACK and modified the enqueueing method to ensure that it
always responds something.

Fixes: #740
Inspired by: grafana/loki#5113

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
pstibrany added a commit to grafana/mimir that referenced this pull request Jan 14, 2022
* Increase scheduler worker cancellation chan cap

* Update CHANGELOG.md

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>

* Remove comment about chan memory usage

Co-authored-by: Peter Štibraný <peter.stibrany@grafana.com>

* Update test comment

Co-authored-by: Peter Štibraný <peter.stibrany@grafana.com>

* Add resp.Error to the log when response is unknown

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>

* Log the entire unknown response

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>

Co-authored-by: Peter Štibraný <peter.stibrany@grafana.com>

Successfully merging this pull request may close these issues.

Request cancel issue between loki QueryFrontend and QueryFrontendWorker