
fix: Connection retries when scheduler restarts for dataflow and controller #5292

Merged
5 commits merged into SeldonIO:v2 on Feb 5, 2024

Conversation

@sakoush (Member) commented on Feb 5, 2024

What this PR does / why we need it:

In some cases during testing, the dataflow-engine didn't reconnect to the new scheduler pod after a rolling update. The controller also suffers from the same issue. The fix is to retry (for now, indefinitely), which is the same pattern used in the agent (link), model-gateway (link), and pipeline-gateway (link); a sketch of this retry pattern is included after the summary below.

Summary of changes:

  • Add indefinite retry to subscribePipelines (dataflow)
  • Add indefinite retry to startEventHanders (controller)
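As a minimal sketch of the retry pattern described above, assuming the cenkalti/backoff/v4 helper (the logFailure callback visible in the diff matches its backoff.Notify signature), the subscription loop could look roughly like this. The package, function, and callback names are illustrative only, not the PR's actual code:

```go
package scheduler

import (
	"context"
	"log"
	"time"

	"github.com/cenkalti/backoff/v4"
	"google.golang.org/grpc"
)

// subscribeWithRetry keeps re-establishing a scheduler subscription until it
// succeeds or the context is cancelled, logging every failed attempt.
func subscribeWithRetry(
	ctx context.Context,
	conn *grpc.ClientConn,
	subscribe func(ctx context.Context, conn *grpc.ClientConn) error,
) error {
	// called by backoff before each retry with the error and the next delay
	logFailure := func(err error, delay time.Duration) {
		log.Printf("scheduler not ready, retrying in %s: %v", delay, err)
	}

	// MaxElapsedTime = 0 makes the exponential backoff retry indefinitely.
	b := backoff.NewExponentialBackOff()
	b.MaxElapsedTime = 0

	op := func() error { return subscribe(ctx, conn) }
	return backoff.RetryNotify(op, backoff.WithContext(b, ctx), logFailure)
}
```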

Which issue(s) this PR fixes:

Fixes INFRA-713 (internal)

Special notes for your reviewer:

@sakoush requested a review from lc525 as a code owner on February 5, 2024 14:46
@sakoush changed the title from "fix(dataflow, controller): Connection retries when scheduler restarts" to "fix(dataflow+controller): Connection retries when scheduler restarts" on Feb 5, 2024
@sakoush changed the title from "fix(dataflow+controller): Connection retries when scheduler restarts" to "fix(dataflow): Connection retries when scheduler restarts" on Feb 5, 2024
@sakoush changed the title from "fix(dataflow): Connection retries when scheduler restarts" to "fix: Connection retries when scheduler restarts for dataflow and controller" on Feb 5, 2024
@lc525 (Member) left a comment


lgtm; this clearly fixes a large class of the restart issues we've been seeing. I'm still not entirely sure about grpc.Dial and failures there, but if anything pops up it can be solved in a different PR without delaying this one further.

A Member commented:

I had a question regarding whether grpc.Dial on line 218 will stop retrying, or not retry on certain errors. I'm still looking at grpc_retry to understand the exact behaviour.

```go
func (s *SchedulerClient) startEventHanders(namespace string, conn *grpc.ClientConn) {
	retryFn := func(fn func(context context.Context, conn *grpc.ClientConn) error, context context.Context, conn *grpc.ClientConn) error {
		logFailure := func(err error, delay time.Duration) {
			s.logger.Error(err, "Scheduler not ready")
```
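For context on the grpc_retry question: below is a hedged sketch of how the interceptor is typically wired into grpc.Dial, not the actual dial code on line 218. The interceptor retries each call only up to WithMax attempts and only for the configured status codes (to my understanding, the library's defaults are Unavailable and ResourceExhausted), and for streams only the initial stream establishment is retried, so it does not by itself reconnect forever; that is the gap the outer retry loop in this PR covers.

```go
package scheduler

import (
	"time"

	grpc_retry "github.com/grpc-ecosystem/go-grpc-middleware/retry"
	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/credentials/insecure"
)

// dialScheduler is an illustrative grpc_retry setup: retries fire only for
// the listed status codes and only up to WithMax attempts per call, so the
// interceptor alone does not keep reconnecting indefinitely.
func dialScheduler(addr string) (*grpc.ClientConn, error) {
	retryOpts := []grpc_retry.CallOption{
		grpc_retry.WithMax(10), // finite number of attempts per call
		grpc_retry.WithBackoff(grpc_retry.BackoffLinear(100 * time.Millisecond)),
		grpc_retry.WithCodes(codes.Unavailable, codes.ResourceExhausted),
	}
	return grpc.Dial(addr,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		// for streams, only the initial stream establishment is retried
		grpc.WithStreamInterceptor(grpc_retry.StreamClientInterceptor(retryOpts...)),
		grpc.WithUnaryInterceptor(grpc_retry.UnaryClientInterceptor(retryOpts...)),
	)
}
```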
A Member commented:

One of the issues here is that we'll probably be seeing each error 4 times when the scheduler is down (once from each of the coroutines below), but I'm not sure if we can sort that out easily.
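To illustrate the duplication described above (a hypothetical sketch, not the PR's code): each handler goroutine runs its own retry loop with its own failure logger, so a single scheduler outage is reported once per handler.

```go
package scheduler

import (
	"context"

	"google.golang.org/grpc"
)

// subscription stands in for one of the event-handler subscriptions; the
// specific handlers are not shown in this excerpt.
type subscription func(ctx context.Context, conn *grpc.ClientConn) error

// startHandlers launches one goroutine per subscription. Because every
// goroutine retries and logs independently, one scheduler outage produces
// one "Scheduler not ready" message per handler.
func startHandlers(
	ctx context.Context,
	conn *grpc.ClientConn,
	retryFn func(fn subscription, ctx context.Context, conn *grpc.ClientConn) error,
	handlers []subscription,
) {
	for _, handler := range handlers {
		handler := handler
		go func() {
			_ = retryFn(handler, ctx, conn)
		}()
	}
}
```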

@sakoush merged commit a0093db into SeldonIO:v2 on Feb 5, 2024
4 of 7 checks passed
@sakoush added the v2 label on Feb 12, 2024