Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(operator): Cluster wide operator for V2 #4847

Merged
merged 37 commits into from
Jun 19, 2023
Merged

feat(operator): Cluster wide operator for V2 #4847

merged 37 commits into from
Jun 19, 2023

Conversation

ukclivecox
Copy link
Contributor

@ukclivecox ukclivecox commented May 17, 2023

Provides a cluster wide operator.

Fixes: #4735

  • Adds SeldonConfig and SeldonRuntime custom resources
  • create a new seldon-core-v2-runtime Helm chart that install components via a new SeldonRuntime into a namespace
  • Tests
  • Docs for helm and uprading

ImagePullSecrets

Updates also imagePullSecrets yaml processing to replace the existing single imagePullSecret in v2 branch which caused issues for handling StatefulSet statehandling via Banzai as that was adding an empty list element which failed the 3-way merge creation. The update allows a full list or nothing to be added via Helm. This will be improved when we remove Kustomize from the creation of Helm chart and instread use the Helm chart as the source of truth.

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@ukclivecox ukclivecox changed the title DRAFT : Cluster wide operator for V2 Cluster wide operator for V2 May 19, 2023
@ukclivecox ukclivecox requested a review from sakoush May 19, 2023 14:28
@@ -1,3 +1,5 @@
watchAnyNamespace: true
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check whether we need this still. Secrets are copied.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

has this been checked?

@ukclivecox ukclivecox added the v2 label May 24, 2023
@ukclivecox ukclivecox changed the title Cluster wide operator for V2 feat[operator]: Cluster wide operator for V2 Jun 7, 2023
Copy link
Member

@sakoush sakoush left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are some comments so far.

Comment on lines +35 to +36
DataflowEngineName = "dataflow-engine"
HodometerName = "hodometer"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I understand this is not part of this PR but these 2 services do not have seldon- prefix

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

future pr then?

@@ -0,0 +1,56 @@
package common
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

insert copyright. perhaps also check that all new files have copyrights? it is probably useful at some point in a future PR to add a check for the existence of copyrights.

@@ -0,0 +1,236 @@
/*
Copyright 2022 Seldon Technologies Ltd.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change to 2023?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

separate PR to update date, or add a range. Not actually sure if this is needed or not - will need to check

Comment on lines 219 to 222
const (
StatefulSetReadyReason = "StatefulSet replicas matches desired replicas"
StatefulSetNotReadyReason = "StatefulSet replicas does not match desired replicas"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this section be at the start of the document?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will move

req := &scheduler.StartExperimentRequest{
Experiment: experiment.AsSchedulerExperimentRequest(),
}
logger.Info("Start", "experiment name", experiment.Name)
_, err := grcpClient.StartExperiment(ctx, req, grpc_retry.WithMax(2))
_, err = grcpClient.StartExperiment(ctx, req, grpc_retry.WithMax(2))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why specifically 2 retries? also extract as const?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will refactor to const

if err != nil {
return err
}
return nil
}

func (s *SchedulerClient) SubscribeServerEvents(ctx context.Context) error {
func (s *SchedulerClient) SubscribeServerEvents(ctx context.Context, namespace string, conn *grpc.ClientConn) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we using namespace?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will remove

if err != nil {
return err, s.checkErrorRetryable(model.Kind, model.Name, err)
}

return nil, false
}

func (s *SchedulerClient) SubscribeModelEvents(ctx context.Context) error {
func (s *SchedulerClient) SubscribeModelEvents(ctx context.Context, namespace string, conn *grpc.ClientConn) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will remove

return err, s.checkErrorRetryable(experiment.Kind, experiment.Name, err)
}

func (s *SchedulerClient) SubscribeExperimentEvents(ctx context.Context) error {
func (s *SchedulerClient) SubscribeExperimentEvents(ctx context.Context, namespace string, conn *grpc.ClientConn) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will remove

if err != nil {
return nil, err
}
err = s.smokeTestConnection(conn)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps add a note to describe why we do a smoke test to scheduler? Make sure it is up before proceeding further?

expectedConfig: &TracingConfig{
Enable: true,
OtelExporterEndpoint: "0.0.0.0:1234",
Ratio: 0.5,
Ratio: "0.5",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add test for invalid ratio?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added test to loading config and added go test for this check

Copy link
Member

@sakoush sakoush left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are the rest of my comments.

operator/main.go Outdated
Comment on lines 71 to 75
watchNamespace := namespace
if clusterwide {
watchNamespace = ""
}
setupLog.Info("Starting manager", "clusterwide", clusterwide)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is slightly not clear to me why this is enough to enable clusterwide perhaps add a note with a description?
can we have clusterwide and per namespace install?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If its namespaced then the controller won't watch for changes outside its namespace

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added note

{
Port: DefaultXdsPort,
TargetPort: intstr.FromString(DefaultXdsPortName),
Name: fmt.Sprintf("%s%s", serviceConfig.GrpcServicePrefix, DefaultXdsPortName),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is GrpcServicePrefix something new in this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its part of the original Helm setup of services which we need to include when creating V1 SVCs to allow proper operation in some service meshes such as istio

annotator)
g.Expect(err).To(BeNil())
err = sr.Reconcile()
if test.error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont think we have test cases with errors? perhaps add if relevant or simplify the test code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed err check as only k8s errors would happen and this won't happen with FakeClient

@@ -0,0 +1,142 @@
package server
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no copyrights

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will add


opentelemetry:
endpoint: seldon-collector:4317
endpoint: seldon-collector.seldon-mesh:4317
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would this work still with per namespace install? I am assuming that users would need to provide another value instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assumption its in the manager install namespace but agree this needs to be changed to be based on install namespace

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will change so it uses namespace of install

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, the reason its hardwired is we don't know where you will install opentelemetry or kafka so its up to you to change it really.

meta)
g.Expect(err).To(BeNil())
err = sr.Reconcile()
if test.error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no test cases with errors

error bool
}

tests := []test{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to add the other components here as extra test cases?

annotator)
g.Expect(err).To(BeNil())
err = sr.Reconcile()
if test.error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no test cases with errors

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will fix

Comment on lines 64 to 65
if test.error {
g.Expect(err).ToNot(BeNil())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will fix

@@ -4,7 +4,7 @@ We provide several Helm charts.

* `seldon-core-v2-crds` : cluster wide install of custom resources
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to add docs on how to install clusterwide and the implications?

@ukclivecox ukclivecox changed the title feat[operator]: Cluster wide operator for V2 feat(operator): Cluster wide operator for V2 Jun 14, 2023
Copy link
Member

@sakoush sakoush left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

One minor comment whether we need to fix the raw yaml manifests to seldon-mesh namespace. Although I think they need all to be recreated and checked in?

k8s/Makefile Outdated
mv ${HELM_SERVERS_BASE}/.seldon-v2-servers.yaml ${HELM_SERVERS_BASE}/seldon-v2-servers.yaml

.PHONY: create-yaml
create-yaml:
helm template seldon-core-v2-certs ./helm-charts/seldon-core-v2-certs > yaml/certs.yaml
helm template seldon-core-v2-certs ./helm-charts/seldon-core-v2-certs --namespace seldon-mesh > yaml/certs.yaml
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively I have removed the namespace check this PR for more info.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want the raw yaml to have seldon-mesh namespace hardcoded in?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the PR needs to be checked further as not sure how best to handle Roles which I think require namespaces but maybe not.

@@ -28,20 +28,6 @@ metadata:
---
# Source: seldon-core-v2-setup/templates/seldon-v2-components.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to recreate all yaml manifests?

…rther docs for helm and new docs for upgrading
Copy link
Member

@sakoush sakoush left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments regarding the creation of raw yaml manifests.

k8s/Makefile Outdated
mv ${HELM_SERVERS_BASE}/.seldon-v2-servers.yaml ${HELM_SERVERS_BASE}/seldon-v2-servers.yaml

.PHONY: create-yaml
create-yaml:
helm template seldon-core-v2-certs ./helm-charts/seldon-core-v2-certs > yaml/certs.yaml
helm template seldon-core-v2-certs ./helm-charts/seldon-core-v2-certs --namespace seldon-mesh > yaml/certs.yaml
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want the raw yaml to have seldon-mesh namespace hardcoded in?

k8s/Makefile Outdated
helm template seldon-core-v2-components ./helm-charts/seldon-core-v2-setup > yaml/components.yaml
helm template seldon-core-v2-servers ./helm-charts/seldon-core-v2-servers > yaml/servers.yaml
helm template seldon-core-v2-components ./helm-charts/seldon-core-v2-setup --namespace seldon-mesh > yaml/components.yaml
helm template seldon-core-v2-servers ./helm-charts/seldon-core-v2-runtime --namespace seldon-mesh > yaml/servers.yaml
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs to be reflected in the new change of having runtime and servers?

docs/source/contents/upgrading/index.md Outdated Show resolved Hide resolved
Copy link
Member

@sakoush sakoush left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants