
Retry on server load fail #4647

Merged · 9 commits merged into SeldonIO:v2 on Feb 6, 2023
Conversation

@sakoush (Member) commented Feb 6, 2023

What this PR does / why we need it:
This PR adds the ability to retry loading/unloading models onto a server (control plane) with a maximum number of attempts. This differs from the existing backoff-based retries in that we also cap the number of attempts, since different models have different loading/unloading latencies.

We only do this on the control plane, to work around the issue where an inference model and its associated explainer are loaded concurrently: if the explainer tries to load first, it can fail because the underlying inference model has not loaded yet. With retries, the explainer should eventually load once the inference model is available.

Which issue(s) this PR fixes:

Fixes #4623

Special notes for your reviewer:

@sakoush sakoush marked this pull request as draft February 6, 2023 11:41
@sakoush sakoush added the v2 label Feb 6, 2023
@sakoush sakoush changed the title from "Issue 4623/retry server load fail" to "Retry on server load fail" Feb 6, 2023
@sakoush sakoush marked this pull request as ready for review February 6, 2023 12:35
@@ -58,6 +58,10 @@ const (
maxElapsedTimeReadySubServiceBeforeStart = 15 * time.Minute // 15 mins is the default MaxElapsedTime
// period for subservice ready "cron"
periodReadySubService = 60 * time.Second
// number of retries for loading a model onto a server
maxLoadRetryCount = 6
Contributor:
nit: we could think about making this configurable in the future

@RafalSkolasinski (Contributor) left a comment:

LGTM

@RafalSkolasinski RafalSkolasinski merged commit dbbf351 into SeldonIO:v2 Feb 6, 2023
RafalSkolasinski pushed a commit that referenced this pull request Feb 6, 2023
* fix type conversion and add test

* Add a policy to retry with max count

* wire up retry logic

* change scope of helper functions

* pass down defaults for retries

* up number of retries on load

* wire up correct unload const

* remove unnecessary set of default value

* bump maxLoadRetryCount to 10
2 participants