fix: Experiments and Models state fixes when reconnecting to scheduler #5320
Conversation
Converting to draft as we still need to deal with the model finalizer edge case gracefully.
logFailure := func(err error, delay time.Duration) {
	logger.Error(err, "Scheduler not ready")
}
backOffExp := backoff.NewExponentialBackOff()
I see that the setting of backOffExp.MaxElapsedTime = 0 is no longer present. The default is 15 minutes; I just wanted to make sure the removal was intentional, as the retries will now no longer go on forever (as described in the comment on startEventHandlers). I assume this got removed because there are now additional retries at the grpc level?
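For reference, a sketch of the default in question, assuming the cenkalti/backoff package (not the PR's actual code):

backOffExp := backoff.NewExponentialBackOff()
// backoff.DefaultMaxElapsedTime is 15 * time.Minute, so a single
// backoff.Retry / backoff.RetryNotify run now gives up after ~15 minutes;
// the previous behaviour of retrying forever came from:
// backOffExp.MaxElapsedTime = 0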
All calls via retryFn have an outer for loop that makes them effectively indefinite, also resetting the backoff state on each iteration, e.g.
for {
	err := retryFn(s.SubscribeModelEvents, conn, namespace, s.logger)
	if err != nil {
		s.logger.Error(err, "Subscribe ended for model events", "namespace", namespace)
	} else {
		s.logger.Info("Subscribe ended for model events", "namespace", namespace)
	}
}
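A plausible shape for retryFn itself, assuming the cenkalti/backoff/v4 package (the exact signature in the PR may differ):

func retryFn(
	fn func(conn *grpc.ClientConn, namespace string, logger logr.Logger) error,
	conn *grpc.ClientConn, namespace string, logger logr.Logger,
) error {
	logFailure := func(err error, delay time.Duration) {
		logger.Error(err, "Scheduler not ready")
	}
	backOffExp := backoff.NewExponentialBackOff()
	// a single retryFn call gives up after the default MaxElapsedTime
	// (15 minutes), but the caller's outer for loop re-enters with a fresh
	// backoff, so the subscription is retried indefinitely overall
	return backoff.RetryNotify(
		func() error { return fn(conn, namespace, logger) },
		backOffExp,
		logFailure,
	)
}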
operator/scheduler/model.go
if conn == nil {
	conn, err = s.getConnection(model.Namespace)
	if err != nil {
		return err, true
Just two style nits:
- It might be clearer to have a variable of type bool called something like retryUnload and return that, or to return a type alias to bool, just to make it more obvious what that value represents (see the sketch after this list).
- Also, perhaps have error as the last return value (just to match typical Go returns).
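For illustration, the two nits combined might look something like this (a sketch with hypothetical names, assuming google.golang.org/grpc; not the PR's actual code):

// a named bool result and error-last ordering to match typical Go returns
func unloadModel(conn *grpc.ClientConn) (retryUnload bool, err error) {
	if conn == nil {
		// tell the caller to retry once a connection is re-established
		return true, errors.New("no scheduler connection")
	}
	// ... issue the unload grpc call here ...
	return false, nil
}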
lgtm; some minor nits. It would be good to have one more design session on this to look at how we design the state sync between the controller and the scheduler in general. We might need to use something like distributed consensus (raft).
What this PR does / why we need it:
Sometimes when the scheduler restarts before sending an Experiment termination event to the controller, the deleted-experiment event is lost. Instead, we now keep a list of all experiments even after they are terminated, so that we are able to replay these events when the controller reconnects to the scheduler. This follows the pattern used with pipelines in #5298.
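A minimal sketch of this tombstoning idea, with hypothetical types and field names (the actual store lives in the scheduler and will differ):

// terminated experiments are marked Deleted rather than removed from the
// map, so a reconnecting controller can still be sent their termination events
type Experiment struct {
	Name    string
	Deleted bool
}

type ExperimentStore struct {
	mu          sync.Mutex
	experiments map[string]*Experiment
}

func (s *ExperimentStore) SetDeleted(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if exp, ok := s.experiments[name]; ok {
		exp.Deleted = true // tombstone instead of delete(s.experiments, name)
	}
}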
For models we cannot apply the same pattern, largely because the scheduler itself does not keep a persistent store of the list of models and instead relies on the model servers to supply this information when they reconnect (i.e. the list of models that they have). This is an issue that we are going to sort out in follow-up work. Until then, the solution proposed in this PR is that when the controller reconnects to the scheduler (e.g. after a scheduler restart), we check whether any models in the deleting / terminating state can be deleted (by removing the finalizer). This is done by re-issuing an unload grpc call to the scheduler; if the scheduler replies with model not found, then we can remove the finalizer. Obviously this has some drawbacks, and it is not always certain that the scheduler has up-to-date information from the model servers, but it may be good enough for now.
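A minimal sketch of the not-found check, assuming grpc's status and codes packages and controller-runtime's controllerutil; the function name and finalizer string are illustrative, not the PR's actual identifiers:

// only an explicit NotFound from the scheduler is treated as proof that the
// model is gone; any other unload error keeps the finalizer (and the k8s
// object) in place for a later retry
func removeFinalizerIfUnknown(ctx context.Context, c client.Client, model client.Object, unloadErr error) error {
	if status.Code(unloadErr) != codes.NotFound {
		return unloadErr
	}
	controllerutil.RemoveFinalizer(model, "seldon.model.finalizer")
	return c.Update(ctx, model)
}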
Which issue(s) this PR fixes:
Fixes INFRA-745 (internal)
Special notes for your reviewer:
Changes
- handlePendingDeleteModels to check whether any terminating models can have their finalizer removed.
Notes on models
We have also looked at whether we need to do something similar for models.
Tests
Note: as follow-up work we should fix the issue where the controller connects to the scheduler before any model server does, and therefore the controller can delete models from k8s that are still available in the scheduler.