
fix: Experiments and Models state fixes when reconnecting to scheduler #5320

Merged · 16 commits · Feb 16, 2024

Conversation

@sakoush (Member) commented Feb 15, 2024

What this PR does / why we need it:
Sometimes, when the scheduler restarts before sending an Experiment termination event to the controller, the deleted-experiment event is lost. To avoid this, we now keep a list of all experiments even after they are terminated, so that we can replay these events when the controller reconnects to the scheduler. This follows the pattern used for pipelines in #5298.

For models we cannot apply the same pattern, largely because the scheduler itself does not keep a persistent store of the list of models and instead relies on the model servers to supply this information when they reconnect. This is an issue that we are going to sort out in follow-up work. Until then, the solution proposed in this PR is that when the controller reconnects to the scheduler (e.g. after a scheduler restart) we check whether any models in the deleting / terminating state can actually be deleted (by removing the finalizer). This is done by re-issuing an unload gRPC call to the scheduler; if the scheduler replies with "model not found" we can remove the finalizer. Obviously this has some drawbacks, and it is not always certain that the scheduler has up-to-date information from the model servers, but it may be good enough for now.
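For illustration, here is a minimal sketch of the "re-issue unload, then remove the finalizer on NotFound" idea described above. The names schedulerClient, UnloadModel and removeFinalizer are hypothetical placeholders, not the actual seldon-core controller API:

package controller

import (
	"context"
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// schedulerClient abstracts the unload gRPC call to the scheduler.
type schedulerClient interface {
	UnloadModel(ctx context.Context, name string) error
}

// handlePendingDeleteModel re-issues an unload for a model stuck in the
// terminating state. If the scheduler replies NotFound, the model is no
// longer known to it, so the finalizer can be removed and k8s can finish
// the deletion.
func handlePendingDeleteModel(
	ctx context.Context,
	client schedulerClient,
	name string,
	removeFinalizer func(name string) error,
) error {
	err := client.UnloadModel(ctx, name)
	if err == nil {
		// Scheduler accepted the unload; a terminated event should follow.
		return nil
	}
	if status.Code(err) == codes.NotFound {
		// Scheduler has no record of the model: unblock the deletion.
		return removeFinalizer(name)
	}
	// Any other error: keep the finalizer and retry later.
	return fmt.Errorf("unload of model %s failed: %w", name, err)
}

The caveat from above applies here too: a NotFound reply is only as good as the scheduler's current view of the model servers, so this is a best-effort check until the scheduler keeps a persistent model store.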

Which issue(s) this PR fixes:

Fixes INFRA-745 (internal)

Special notes for your reviewer:

Changes

  • Do not delete experiments from scheduler local store (BadgerDB)
  • Return all experiments (including terminated ones) when the controller connects to the scheduler
  • Upon the controller reconnecting to the scheduler, run handlePendingDeleteModels to check whether any terminating models can have their finalizers removed.
  • Add exponential backoff to all gRPC calls from the controller to the scheduler, to help ensure that operations are sent successfully (otherwise a load/unload event can be missed if the scheduler is down); see the retry sketch after this list.
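As a rough sketch of the backoff behaviour, assuming the github.com/cenkalti/backoff/v4 library (consistent with the backoff.NewExponentialBackOff call in the diff); loadModel is an illustrative stand-in for the actual scheduler client call:

package controller

import (
	"context"
	"log"
	"time"

	"github.com/cenkalti/backoff/v4"
)

func callWithBackoff(ctx context.Context, loadModel func(context.Context) error) error {
	// Log each failed attempt together with the delay before the next retry.
	logFailure := func(err error, delay time.Duration) {
		log.Printf("scheduler not ready, retrying in %s: %v", delay, err)
	}
	backOffExp := backoff.NewExponentialBackOff()
	// The library default MaxElapsedTime is 15 minutes; callers that need
	// indefinite retries wrap this call in an outer for loop (see the
	// review discussion below).
	operation := func() error { return loadModel(ctx) }
	return backoff.RetryNotify(operation, backoff.WithContext(backOffExp, ctx), logFailure)
}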

Notes on models

We have also looked at whether we need to do something similar for models:

  • If the scheduler restarts (losing its model state), it might no longer have any reference to models that are terminating; if the restart happened before the scheduler sent a model-terminated event to the controller, the controller can get stuck deleting the model (because of the model finalizer). We re-issue the unload from the controller in this case.
  • If the controller restarts, it is fine: it will reconcile its state with the scheduler and remove the models that are no longer in the scheduler's state.

Tests

  • Added a long unload latency to the model, called unload (kubectl) and then killed the scheduler -> the model is gone from the controller.
  • Deleted the scheduler, called load (kubectl), brought the scheduler back -> the model is loaded.

Note: as follow-up work we should fix the issue where the controller connects to the scheduler before any model server does, in which case the controller can delete models from k8s that are still available in the scheduler.

@sakoush sakoush requested a review from lc525 as a code owner February 15, 2024 09:37
@sakoush sakoush changed the title from "Fix(controller): Experiments state fixes when reconnecting to scheduler" to "fix(controller): Experiments state fixes when reconnecting to scheduler" Feb 15, 2024
@sakoush sakoush marked this pull request as draft February 15, 2024 10:38
@sakoush (Member, Author) commented Feb 15, 2024

converting to draft as we still need to deal with the model finalizer edge case gracefully.

@sakoush sakoush changed the title from "fix(controller): Experiments state fixes when reconnecting to scheduler" to "fix(controller): Experiments and Models state fixes when reconnecting to scheduler" Feb 16, 2024
@sakoush sakoush marked this pull request as ready for review February 16, 2024 11:46
@sakoush sakoush changed the title from "fix(controller): Experiments and Models state fixes when reconnecting to scheduler" to "fix: Experiments and Models state fixes when reconnecting to scheduler" Feb 16, 2024
@sakoush sakoush added the v2 label Feb 16, 2024
logFailure := func(err error, delay time.Duration) {
	logger.Error(err, "Scheduler not ready")
}
backOffExp := backoff.NewExponentialBackOff()
@lc525 (Member) commented Feb 16, 2024

I see the setting of backOffExp.MaxElapsedTime = 0 is no longer present. The default is 15 minutes; I just wanted to make sure the removal was intentional, as the retries will now no longer go on forever (as in the comment on startEventHandlers). I assume this got removed because there are now additional retries at the gRPC level?

@sakoush (Member, Author) replied:

All calls via retryFn have an outer for loop that makes them run indefinitely, which also resets the backoff state.

e.g.

		for {
			err := retryFn(s.SubscribeModelEvents, conn, namespace, s.logger)
			if err != nil {
				s.logger.Error(err, "Subscribe ended for model events", "namespace", namespace)
			} else {
				s.logger.Info("Subscribe ended for model events", "namespace", namespace)
			}
		}

if conn == nil {
	conn, err = s.getConnection(model.Namespace)
	if err != nil {
		return err, true
Member commented:

Just two style nits:

  • it might be clearer to have a variable of type bool called something like retryUnload and return that, or to return a type alias of bool, just to make it clearer what that value represents.
  • also, perhaps put error as the last return value (to match typical Go returns); see the sketch after this list.
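A hypothetical illustration of the suggested style (retryUnload and unloadModel are placeholder names, not the actual code): a named bool type documents what the value means, and error comes last to match idiomatic Go signatures.

package controller

// retryUnload names the bool so callers can see what it represents.
type retryUnload bool

// unloadModel is a placeholder showing the suggested return ordering:
// the domain-specific value first, error last.
func unloadModel(name string) (retryUnload, error) {
	// ... issue the unload gRPC call here ...
	return retryUnload(false), nil
}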

@sakoush (Member, Author) replied:

done in b62bb19 and f90ea17

@lc525 (Member) left a comment

LGTM; some minor nits. It would be good to have one more design session on this to look, in general, at how we design the state sync between the controller and the scheduler. We might need to use something like distributed consensus (Raft).

@sakoush sakoush merged commit 3b873ab into SeldonIO:v2 Feb 16, 2024
2 of 3 checks passed