
Retry on server load fail #4647

Merged · 9 commits merged into SeldonIO:v2 on Feb 6, 2023
Conversation

@sakoush (Member) commented Feb 6, 2023

What this PR does / why we need it:
This PR adds the ability to retry loading/unloading models onto a server (control plane) with a maximum number of attempts. This differs from the existing backoff-based retries in that we also cap the number of attempts, since different models have different loading/unloading latencies.

We only do this on the control plane, to work around the issue where an inference model and its associated explainer are loaded concurrently: if the explainer tries to load first, it can fail because the underlying inference model has not loaded yet. With retries, the explainer should eventually load once the inference model is available.

Which issue(s) this PR fixes:

Fixes #4623

Special notes for your reviewer:

@sakoush sakoush marked this pull request as draft February 6, 2023 11:41
@sakoush sakoush added the v2 label Feb 6, 2023
@sakoush sakoush changed the title from "Issue 4623/retry server load fail" to "Retry on server load fail" Feb 6, 2023
@sakoush sakoush marked this pull request as ready for review February 6, 2023 12:35
@@ -58,6 +58,10 @@ const (
maxElapsedTimeReadySubServiceBeforeStart = 15 * time.Minute // 15 mins is the default MaxElapsedTime
// period for subservice ready "cron"
periodReadySubService = 60 * time.Second
// number of retries for loading a model onto a server
maxLoadRetryCount = 6
Contributor:
nit: we could think about making this configurable in the future

@RafalSkolasinski (Contributor) left a comment:

LGTM

@RafalSkolasinski RafalSkolasinski merged commit dbbf351 into SeldonIO:v2 Feb 6, 2023
RafalSkolasinski pushed a commit that referenced this pull request Feb 6, 2023
* fix type conversion and add test

* Add a policy to retry with max count

* wire up retry logic

* change scope of helper functions

* pass down defaults for retries

* up number of retries on load

* wire up correct unload const

* remove unnecessary set of default value

* bump maxLoadRetryCount to 10
2 participants