
fix: Connection retries when scheduler restarts for dataflow and controller #5292

Merged
5 commits merged into SeldonIO:v2 on Feb 5, 2024

Conversation

@sakoush (Member) commented on Feb 5, 2024

What this PR does / why we need it:

In some cases during testing, the dataflow-engine didn't reconnect to the new scheduler pod after a rolling update. The controller also suffers from the same issue. The fix is to retry (for now, indefinitely), which is the same pattern used in the agent (link), model-gateway (link), and pipeline-gateway (link); a sketch of this retry pattern is included after the summary below.

Summary of changes:

  • Add indefinite retry to subscribePipelines (dataflow)
  • Add indefinite retry to startEventHanders (controller)
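As a minimal sketch of the retry pattern described above, assuming the cenkalti/backoff/v4 helper (the logFailure callback visible in the diff matches its backoff.Notify signature), the subscription loop could look roughly like this. The package, function, and callback names are illustrative only, not the PR's actual code:

```go
package scheduler

import (
	"context"
	"log"
	"time"

	"github.com/cenkalti/backoff/v4"
	"google.golang.org/grpc"
)

// subscribeWithRetry keeps re-establishing a scheduler subscription until it
// succeeds or the context is cancelled, logging every failed attempt.
func subscribeWithRetry(
	ctx context.Context,
	conn *grpc.ClientConn,
	subscribe func(ctx context.Context, conn *grpc.ClientConn) error,
) error {
	// called by backoff before each retry with the error and the next delay
	logFailure := func(err error, delay time.Duration) {
		log.Printf("scheduler not ready, retrying in %s: %v", delay, err)
	}

	// MaxElapsedTime = 0 makes the exponential backoff retry indefinitely.
	b := backoff.NewExponentialBackOff()
	b.MaxElapsedTime = 0

	op := func() error { return subscribe(ctx, conn) }
	return backoff.RetryNotify(op, backoff.WithContext(b, ctx), logFailure)
}
```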

Which issue(s) this PR fixes:

Fixes INFRA-713 (internal)

Special notes for your reviewer:

@sakoush requested a review from lc525 as a code owner on February 5, 2024 14:46
@sakoush changed the title from "fix(dataflow, controller): Connection retries when scheduler restarts" to "fix(dataflow+controller): Connection retries when scheduler restarts" on Feb 5, 2024
@sakoush changed the title from "fix(dataflow+controller): Connection retries when scheduler restarts" to "fix(dataflow): Connection retries when scheduler restarts" on Feb 5, 2024
@sakoush changed the title from "fix(dataflow): Connection retries when scheduler restarts" to "fix: Connection retries when scheduler restarts for dataflow and controller" on Feb 5, 2024
@lc525 (Member) left a comment


lgtm; this clearly fixes a large class of the restart issues we've been seeing. I'm still not entirely sure about grpc.Dial and failures there, but if anything pops up it can be solved in a different PR without delaying this one further.

A Member commented:

I had a question regarding whether grpc.Dial on line 218 will stop retrying, or not retry on certain errors. I'm still looking at grpc_retry to understand the exact behaviour.

```go
func (s *SchedulerClient) startEventHanders(namespace string, conn *grpc.ClientConn) {
	retryFn := func(fn func(context context.Context, conn *grpc.ClientConn) error, context context.Context, conn *grpc.ClientConn) error {
		logFailure := func(err error, delay time.Duration) {
			s.logger.Error(err, "Scheduler not ready")
```
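For context on the grpc_retry question: below is a hedged sketch of how the interceptor is typically wired into grpc.Dial, not the actual dial code on line 218. The interceptor retries each call only up to WithMax attempts and only for the configured status codes (to my understanding, the library's defaults are Unavailable and ResourceExhausted), and for streams only the initial stream establishment is retried, so it does not by itself reconnect forever; that is the gap the outer retry loop in this PR covers.

```go
package scheduler

import (
	"time"

	grpc_retry "github.com/grpc-ecosystem/go-grpc-middleware/retry"
	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/credentials/insecure"
)

// dialScheduler is an illustrative grpc_retry setup: retries fire only for
// the listed status codes and only up to WithMax attempts per call, so the
// interceptor alone does not keep reconnecting indefinitely.
func dialScheduler(addr string) (*grpc.ClientConn, error) {
	retryOpts := []grpc_retry.CallOption{
		grpc_retry.WithMax(10), // finite number of attempts per call
		grpc_retry.WithBackoff(grpc_retry.BackoffLinear(100 * time.Millisecond)),
		grpc_retry.WithCodes(codes.Unavailable, codes.ResourceExhausted),
	}
	return grpc.Dial(addr,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		// for streams, only the initial stream establishment is retried
		grpc.WithStreamInterceptor(grpc_retry.StreamClientInterceptor(retryOpts...)),
		grpc.WithUnaryInterceptor(grpc_retry.UnaryClientInterceptor(retryOpts...)),
	)
}
```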
A Member commented:

One of the issues here is that we'll probably be seeing each error 4 times when the scheduler is down (once from each of the coroutines below), but I'm not sure if we can sort that out easily.
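To illustrate the duplication described above (a hypothetical sketch, not the PR's code): each handler goroutine runs its own retry loop with its own failure logger, so a single scheduler outage is reported once per handler.

```go
package scheduler

import (
	"context"

	"google.golang.org/grpc"
)

// subscription stands in for one of the event-handler subscriptions; the
// specific handlers are not shown in this excerpt.
type subscription func(ctx context.Context, conn *grpc.ClientConn) error

// startHandlers launches one goroutine per subscription. Because every
// goroutine retries and logs independently, one scheduler outage produces
// one "Scheduler not ready" message per handler.
func startHandlers(
	ctx context.Context,
	conn *grpc.ClientConn,
	retryFn func(fn subscription, ctx context.Context, conn *grpc.ClientConn) error,
	handlers []subscription,
) {
	for _, handler := range handlers {
		handler := handler
		go func() {
			_ = retryFn(handler, ctx, conn)
		}()
	}
}
```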

@sakoush merged commit a0093db into SeldonIO:v2 on Feb 5, 2024
4 of 7 checks passed
@sakoush added the v2 label on Feb 12, 2024