
fix: Experiments and Models state fixes when reconnecting to scheduler #5320

Merged · 16 commits · Feb 16, 2024

Conversation

@sakoush (Member) commented Feb 15, 2024

What this PR does / why we need it:
Sometimes, when the scheduler restarts before sending an Experiment termination event to the controller, the deleted-experiment event is lost. To avoid this, we now keep a list of all experiments even after they are terminated, so that we can replay these events when the controller reconnects to the scheduler. This follows the pattern used for pipelines in #5298.

For models we cannot apply the same pattern, largely because the scheduler itself does not keep a persistent store of the list of models and instead relies on the model servers to supply this information when they reconnect. This is an issue that we are going to sort out in follow-up work. Until then, the solution proposed in this PR is that when the controller reconnects to the scheduler (e.g. after a scheduler restart) we check whether any models in the deleting / terminating state can actually be deleted (by removing the finalizer). This is done by re-issuing an unload gRPC call to the scheduler; if the scheduler replies with "model not found" we can remove the finalizer. Obviously this has some drawbacks, and it is not always certain that the scheduler has up-to-date information from the model servers, but it may be good enough for now.
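For illustration, here is a minimal sketch of the "re-issue unload, then remove the finalizer on NotFound" idea described above. The names schedulerClient, UnloadModel and removeFinalizer are hypothetical placeholders, not the actual seldon-core controller API:

package controller

import (
	"context"
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// schedulerClient abstracts the unload gRPC call to the scheduler.
type schedulerClient interface {
	UnloadModel(ctx context.Context, name string) error
}

// handlePendingDeleteModel re-issues an unload for a model stuck in the
// terminating state. If the scheduler replies NotFound, the model is no
// longer known to it, so the finalizer can be removed and k8s can finish
// the deletion.
func handlePendingDeleteModel(
	ctx context.Context,
	client schedulerClient,
	name string,
	removeFinalizer func(name string) error,
) error {
	err := client.UnloadModel(ctx, name)
	if err == nil {
		// Scheduler accepted the unload; a terminated event should follow.
		return nil
	}
	if status.Code(err) == codes.NotFound {
		// Scheduler has no record of the model: unblock the deletion.
		return removeFinalizer(name)
	}
	// Any other error: keep the finalizer and retry later.
	return fmt.Errorf("unload of model %s failed: %w", name, err)
}

The caveat from above applies here too: a NotFound reply is only as good as the scheduler's current view of the model servers, so this is a best-effort check until the scheduler keeps a persistent model store.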

Which issue(s) this PR fixes:

Fixes INFRA-745 (internal)

Special notes for your reviewer:

Changes

  • Do not delete experiments from scheduler local store (BadgerDB)
  • Return all experiments (including terminated ones) when the controller connects to the scheduler
  • Upon the controller reconnecting to the scheduler, run handlePendingDeleteModels to check whether any terminating models can have their finalizers removed.
  • Add exponential backoff to all gRPC calls from the controller to the scheduler, to help ensure that operations are sent successfully (otherwise a load/unload event can be missed if the scheduler is down); see the retry sketch after this list.
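As a rough sketch of the backoff behaviour, assuming the github.com/cenkalti/backoff/v4 library (consistent with the backoff.NewExponentialBackOff call in the diff); loadModel is an illustrative stand-in for the actual scheduler client call:

package controller

import (
	"context"
	"log"
	"time"

	"github.com/cenkalti/backoff/v4"
)

func callWithBackoff(ctx context.Context, loadModel func(context.Context) error) error {
	// Log each failed attempt together with the delay before the next retry.
	logFailure := func(err error, delay time.Duration) {
		log.Printf("scheduler not ready, retrying in %s: %v", delay, err)
	}
	backOffExp := backoff.NewExponentialBackOff()
	// The library default MaxElapsedTime is 15 minutes; callers that need
	// indefinite retries wrap this call in an outer for loop (see the
	// review discussion below).
	operation := func() error { return loadModel(ctx) }
	return backoff.RetryNotify(operation, backoff.WithContext(backOffExp, ctx), logFailure)
}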

Notes on models

We have also looked at whether we need to do something similar for models:

  • If the scheduler restarts (losing its model state), it might no longer have any reference to models that are terminating; if the restart happened before the scheduler sent a model-terminated event to the controller, the controller can get stuck deleting the model (because of the model finalizer). We re-issue the unload from the controller in this case.
  • If the controller restarts, it is fine: it will reconcile its state with the scheduler and remove the models that are no longer in the scheduler's state.

Tests

  • Added a long unload latency to the model, called unload (kubectl) and then killed the scheduler -> the model is gone from the controller.
  • Deleted the scheduler, called load (kubectl), brought the scheduler back -> the model is loaded.

Note: as follow-up work we should fix the issue where the controller connects to the scheduler before any model server does, in which case the controller can delete models from k8s that are still available in the scheduler.

@sakoush sakoush requested a review from lc525 as a code owner February 15, 2024 09:37
@sakoush sakoush changed the title from "Fix(controller): Experiments state fixes when reconnecting to scheduler" to "fix(controller): Experiments state fixes when reconnecting to scheduler" Feb 15, 2024
@sakoush sakoush marked this pull request as draft February 15, 2024 10:38
@sakoush (Member, Author) commented Feb 15, 2024

converting to draft as we still need to deal with the model finalizer edge case gracefully.

@sakoush sakoush changed the title from "fix(controller): Experiments state fixes when reconnecting to scheduler" to "fix(controller): Experiments and Models state fixes when reconnecting to scheduler" Feb 16, 2024
@sakoush sakoush marked this pull request as ready for review February 16, 2024 11:46
@sakoush sakoush changed the title from "fix(controller): Experiments and Models state fixes when reconnecting to scheduler" to "fix: Experiments and Models state fixes when reconnecting to scheduler" Feb 16, 2024
@sakoush sakoush added the v2 label Feb 16, 2024
logFailure := func(err error, delay time.Duration) {
	logger.Error(err, "Scheduler not ready")
}
backOffExp := backoff.NewExponentialBackOff()
@lc525 (Member) commented Feb 16, 2024

I see the setting of backOffExp.MaxElapsedTime = 0 is no longer present. The default is 15 minutes; I just wanted to make sure the removal was intentional, as the retries will now no longer go on forever (as in the comment on startEventHandlers). I assume this got removed because there are now additional retries at the gRPC level?

@sakoush (Member, Author) replied:

All calls via retryFn have an outer for loop that makes them run indefinitely, which also resets the backoff state.

e.g.

		for {
			err := retryFn(s.SubscribeModelEvents, conn, namespace, s.logger)
			if err != nil {
				s.logger.Error(err, "Subscribe ended for model events", "namespace", namespace)
			} else {
				s.logger.Info("Subscribe ended for model events", "namespace", namespace)
			}
		}

if conn == nil {
	conn, err = s.getConnection(model.Namespace)
	if err != nil {
		return err, true
Member commented:

Just two style nits:

  • it might be clearer to have a variable of type bool called something like retryUnload and return that, or to return a type alias of bool, just to make it clearer what that value represents.
  • also, perhaps put error as the last return value (to match typical Go returns); see the sketch after this list.
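A hypothetical illustration of the suggested style (retryUnload and unloadModel are placeholder names, not the actual code): a named bool type documents what the value means, and error comes last to match idiomatic Go signatures.

package controller

// retryUnload names the bool so callers can see what it represents.
type retryUnload bool

// unloadModel is a placeholder showing the suggested return ordering:
// the domain-specific value first, error last.
func unloadModel(name string) (retryUnload, error) {
	// ... issue the unload gRPC call here ...
	return retryUnload(false), nil
}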

@sakoush (Member, Author) replied:

done in b62bb19 and f90ea17

@lc525 (Member) left a comment

LGTM; some minor nits. It would be good to have one more design session on this to look, in general, at how we design the state sync between the controller and the scheduler. We might need to use something like distributed consensus (Raft).

@sakoush sakoush merged commit 3b873ab into SeldonIO:v2 Feb 16, 2024
2 of 3 checks passed