-
Notifications
You must be signed in to change notification settings - Fork 831
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(operator): Cluster wide operator for V2 #4847
Conversation
…supported in CRDs - also trace endpoint has namespace added
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
kafka/strimzi/values.yaml
Outdated
@@ -1,3 +1,5 @@ | |||
watchAnyNamespace: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check whether we need this still. Secrets are copied.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
has this been checked?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are some comments so far.
DataflowEngineName = "dataflow-engine" | ||
HodometerName = "hodometer" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I understand this is not part of this PR but these 2 services do not have seldon-
prefix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
future pr then?
@@ -0,0 +1,56 @@ | |||
package common |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
insert copyright. perhaps also check that all new files have copyrights? it is probably useful at some point in a future PR to add a check for the existence of copyrights.
@@ -0,0 +1,236 @@ | |||
/* | |||
Copyright 2022 Seldon Technologies Ltd. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change to 2023?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
separate PR to update date, or add a range. Not actually sure if this is needed or not - will need to check
const ( | ||
StatefulSetReadyReason = "StatefulSet replicas matches desired replicas" | ||
StatefulSetNotReadyReason = "StatefulSet replicas does not match desired replicas" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this section be at the start of the document?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will move
operator/scheduler/experiment.go
Outdated
req := &scheduler.StartExperimentRequest{ | ||
Experiment: experiment.AsSchedulerExperimentRequest(), | ||
} | ||
logger.Info("Start", "experiment name", experiment.Name) | ||
_, err := grcpClient.StartExperiment(ctx, req, grpc_retry.WithMax(2)) | ||
_, err = grcpClient.StartExperiment(ctx, req, grpc_retry.WithMax(2)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why specifically 2 retries? also extract as const?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will refactor to const
operator/scheduler/server.go
Outdated
if err != nil { | ||
return err | ||
} | ||
return nil | ||
} | ||
|
||
func (s *SchedulerClient) SubscribeServerEvents(ctx context.Context) error { | ||
func (s *SchedulerClient) SubscribeServerEvents(ctx context.Context, namespace string, conn *grpc.ClientConn) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we using namespace
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will remove
operator/scheduler/model.go
Outdated
if err != nil { | ||
return err, s.checkErrorRetryable(model.Kind, model.Name, err) | ||
} | ||
|
||
return nil, false | ||
} | ||
|
||
func (s *SchedulerClient) SubscribeModelEvents(ctx context.Context) error { | ||
func (s *SchedulerClient) SubscribeModelEvents(ctx context.Context, namespace string, conn *grpc.ClientConn) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will remove
operator/scheduler/experiment.go
Outdated
return err, s.checkErrorRetryable(experiment.Kind, experiment.Name, err) | ||
} | ||
|
||
func (s *SchedulerClient) SubscribeExperimentEvents(ctx context.Context) error { | ||
func (s *SchedulerClient) SubscribeExperimentEvents(ctx context.Context, namespace string, conn *grpc.ClientConn) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will remove
if err != nil { | ||
return nil, err | ||
} | ||
err = s.smokeTestConnection(conn) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps add a note to describe why we do a smoke test to scheduler? Make sure it is up before proceeding further?
expectedConfig: &TracingConfig{ | ||
Enable: true, | ||
OtelExporterEndpoint: "0.0.0.0:1234", | ||
Ratio: 0.5, | ||
Ratio: "0.5", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add test for invalid ratio?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added test to loading config and added go test for this check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are the rest of my comments.
operator/main.go
Outdated
watchNamespace := namespace | ||
if clusterwide { | ||
watchNamespace = "" | ||
} | ||
setupLog.Info("Starting manager", "clusterwide", clusterwide) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is slightly not clear to me why this is enough to enable clusterwide
perhaps add a note with a description?
can we have clusterwide
and per namespace
install?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If its namespaced then the controller won't watch for changes outside its namespace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added note
{ | ||
Port: DefaultXdsPort, | ||
TargetPort: intstr.FromString(DefaultXdsPortName), | ||
Name: fmt.Sprintf("%s%s", serviceConfig.GrpcServicePrefix, DefaultXdsPortName), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is GrpcServicePrefix
something new in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its part of the original Helm setup of services which we need to include when creating V1 SVCs to allow proper operation in some service meshes such as istio
annotator) | ||
g.Expect(err).To(BeNil()) | ||
err = sr.Reconcile() | ||
if test.error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont think we have test cases with errors? perhaps add if relevant or simplify the test code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed err check as only k8s errors would happen and this won't happen with FakeClient
@@ -0,0 +1,142 @@ | |||
package server |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no copyrights
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will add
|
||
opentelemetry: | ||
endpoint: seldon-collector:4317 | ||
endpoint: seldon-collector.seldon-mesh:4317 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would this work still with per namespace install? I am assuming that users would need to provide another value instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assumption its in the manager install namespace but agree this needs to be changed to be based on install namespace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will change so it uses namespace of install
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, the reason its hardwired is we don't know where you will install opentelemetry or kafka so its up to you to change it really.
meta) | ||
g.Expect(err).To(BeNil()) | ||
err = sr.Reconcile() | ||
if test.error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no test cases with errors
error bool | ||
} | ||
|
||
tests := []test{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we want to add the other components here as extra test cases?
annotator) | ||
g.Expect(err).To(BeNil()) | ||
err = sr.Reconcile() | ||
if test.error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no test cases with errors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will fix
if test.error { | ||
g.Expect(err).ToNot(BeNil()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will fix
@@ -4,7 +4,7 @@ We provide several Helm charts. | |||
|
|||
* `seldon-core-v2-crds` : cluster wide install of custom resources |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we want to add docs on how to install clusterwide and the implications?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
One minor comment whether we need to fix the raw yaml manifests to seldon-mesh
namespace. Although I think they need all to be recreated and checked in?
k8s/Makefile
Outdated
mv ${HELM_SERVERS_BASE}/.seldon-v2-servers.yaml ${HELM_SERVERS_BASE}/seldon-v2-servers.yaml | ||
|
||
.PHONY: create-yaml | ||
create-yaml: | ||
helm template seldon-core-v2-certs ./helm-charts/seldon-core-v2-certs > yaml/certs.yaml | ||
helm template seldon-core-v2-certs ./helm-charts/seldon-core-v2-certs --namespace seldon-mesh > yaml/certs.yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively I have removed the namespace
check this PR for more info.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we want the raw yaml to have seldon-mesh
namespace hardcoded in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the PR needs to be checked further as not sure how best to handle Roles which I think require namespaces but maybe not.
k8s/yaml/components.yaml
Outdated
@@ -28,20 +28,6 @@ metadata: | |||
--- | |||
# Source: seldon-core-v2-setup/templates/seldon-v2-components.yaml | |||
apiVersion: rbac.authorization.k8s.io/v1 | |||
kind: ClusterRoleBinding |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to recreate all yaml manifests?
…rther docs for helm and new docs for upgrading
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments regarding the creation of raw yaml manifests.
k8s/Makefile
Outdated
mv ${HELM_SERVERS_BASE}/.seldon-v2-servers.yaml ${HELM_SERVERS_BASE}/seldon-v2-servers.yaml | ||
|
||
.PHONY: create-yaml | ||
create-yaml: | ||
helm template seldon-core-v2-certs ./helm-charts/seldon-core-v2-certs > yaml/certs.yaml | ||
helm template seldon-core-v2-certs ./helm-charts/seldon-core-v2-certs --namespace seldon-mesh > yaml/certs.yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we want the raw yaml to have seldon-mesh
namespace hardcoded in?
k8s/Makefile
Outdated
helm template seldon-core-v2-components ./helm-charts/seldon-core-v2-setup > yaml/components.yaml | ||
helm template seldon-core-v2-servers ./helm-charts/seldon-core-v2-servers > yaml/servers.yaml | ||
helm template seldon-core-v2-components ./helm-charts/seldon-core-v2-setup --namespace seldon-mesh > yaml/components.yaml | ||
helm template seldon-core-v2-servers ./helm-charts/seldon-core-v2-runtime --namespace seldon-mesh > yaml/servers.yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs to be reflected in the new change of having runtime
and servers
?
Co-authored-by: Sherif Akoush <sherif.akoush@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Provides a cluster wide operator.
Fixes: #4735
ImagePullSecrets
Updates also imagePullSecrets yaml processing to replace the existing single imagePullSecret in v2 branch which caused issues for handling StatefulSet statehandling via Banzai as that was adding an empty list element which failed the 3-way merge creation. The update allows a full list or nothing to be added via Helm. This will be improved when we remove Kustomize from the creation of Helm chart and instread use the Helm chart as the source of truth.