[backport]: changes from rhods_2.4 to rhods_2.5 (red-hat-data-services#129)

* [cherry-pick]: split workbenches image into 2 params.env files

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* Update opendatahub label

(cherry picked from commit 3e975f9)
(cherry picked from commit 9f8b649)

* Update Codeflare manifests path

(cherry picked from commit 014396c)
(cherry picked from commit 5f1c0d4)

* Move creation of default DSC

(cherry picked from commit ab33109)
(cherry picked from commit 00ddd6c)

* update(manifests): enable kserve, modelmesh and workbenches

- dashboard and modelmesh-monitoring still from odh-manifests

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* Fix cherry-pick for dsci

* fix(mm): set the new logic for modelmesh

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* Fix the KF deployment:

* fix(monitoring): add a dev-mode switch to not send alerts

Signed-off-by: Wen Zhou <wenzhou@redhat.com>
(cherry picked from commit 001cad1)

* refactor: reduce alert level for codeflare operator

* Update(manifests): for monitoring

- remove https:// for dashboard target
- add NetworkPolicy from odh-deployer
- fix: wrong service name for operator; this is defined in the CSV
- port: do not use https but 8080

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* Fix manifests for monitoring

(cherry picked from commit 85883f102bc15f2343c0f6afe253a29a4ff3f64f)

* Revert changes to prometheus port

Changes to the prometheus port make the route inaccessible

* fix rebase

* fix(dsci): missing label on namespaces (red-hat-data-services#98)

- add the ServiceMonitor that was in modelmesh-monitoring into operator monitoring
- add the roles that were in modelmesh-monitoring into ours too
- apply 3 labels to both the monitoring and application namespaces (which is what v1 does)

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* fix(monitoring): typo (red-hat-data-services#101)

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* update(monitoring)

- remove hardcoded application namespace in segment manifests
- remove hardcoded monitoring namespace in base manifests
- add placeholder to inject monitoring namespace in ServiceMonitor

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* uplift: package version

- github.com/operator-framework/operator-lifecycle-manager/releases/tag/v0.26.0
- github.com/openshift/api to latest v0.0.0

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* Remove odh csv

* fix(crd): do not set ownerreference on CRD (opendatahub-io#725)

- we covered the case where a component is set from Managed to Removed
- this covers the case where a component is Managed and the DSC CR is deleted
- if we do not set the ownerReference in the first place, the CRD won't get deleted

Signed-off-by: Wen Zhou <wenzhou@redhat.com>
(cherry picked from commit e9461e0)

* Fix DSCI Patch

* update(monitoring): metrics (red-hat-data-services#107)

* update(monitoring):

- add a log line in the pod so QE can see it is a dev-mode cluster
- add two metrics:
	i do not think they are used in this config,
	but they are present in the v1 config, so i add them back
- move the recording rules for workbenches to the correct rule file
- remove operator-alerting.rules; it is not used in v1, to keep it simple

- fix: openshift-monitoring is using web as the port name, not our port

- add more comments for the config and comment out config that is not needed
- add egress for odh monitoring and add the cluster-monitoring NS for ingress

- keep rhods_aggregate_availability in prometheusrules along with the 2 user metrics;
   the reason is that PSI does not get metrics from non openshift-* or kube-* NSes into cluster-monitoring prometheus, as
cluster-monitoring prometheus-k8s only uses prometheusrule, not servicemonitor?

- from test results:
	if our monitoring NS is not set for cluster-monitoring, there are no targets on federation2 and no rhods_aggregate_availability metrics

- fix(monitoring): remove duplicated dashboard alerts in workbenches

- add the UWM NS for operator ingress

- according to the docs, a cluster with UWM enabled should not have a custom Prometheus; this conflict might be why we cannot see metrics from odh monitoring in cluster-monitoring prometheus?

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* Remove DSCI explicit naming

* Fix regression in Prometheus Deployment

* Remove os.exit for custom functions

* Delete legacy blackbox exporter

* fix(monitoring): add missing role and rolebinding for prometheus (red-hat-data-services#112)

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* fix(monitoring): add missing new files to kustomization (red-hat-data-services#113)

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* cleanup(monitoring): after the previous 2 commits this is not needed/useful (red-hat-data-services#114)

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* fix(monitoring): do not set the odh monitoring namespace when applying manifests in "monitoring/base" (red-hat-data-services#115)

* fix(monitoring): do not set our monitoring namespace when applying the "monitoring/base" folder
- hardcode our monitoring namespace for all needed manifests

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* revert: label changes made in upgrade PR

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* fix(monitoring): cannot load dashboard record rules (red-hat-data-services#123)

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* fix(monitoring): when DSC is removed, the entry in rule_files should be
cleaned up

- match does not work with * in the string; need to use (.*)
- add (-) at the front to differentiate the rule_files entry from the real rules

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* cherry-pick: edson's rhods-12939 from odh + debug + timeout tuning

comment out ExponentialBackoffWithContext for now to test;
do not add v2 into the markedDeletion list

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* fix(upgrade): the modelmesh monitoring deployment needs deletion as well

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* fix: add statefulset

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* cherry-pick: upstream #748 fix: no reconcile when no error is returned

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

* RHODS-12956: removing CR update from the operator reconciliation loop to avoid infinite loop (red-hat-data-services#128)

* chore

Signed-off-by: Wen Zhou <wenzhou@redhat.com>

---------

Signed-off-by: Wen Zhou <wenzhou@redhat.com>
Co-authored-by: Vaishnavi Hire <vhire@redhat.com>
Co-authored-by: Dimitri Saridakis <dimitri.saridakis@gmail.com>
Co-authored-by: Edson Tirelli <ed.tirelli@gmail.com>
(cherry picked from commit 81ebc87)
zdtsw authored and VaishnaviHire committed Nov 29, 2023
1 parent 6471658 commit 7525f99
Showing 30 changed files with 533 additions and 190 deletions.
2 changes: 1 addition & 1 deletion Makefile
@@ -164,7 +164,7 @@ run: manifests generate fmt vet ## Run a controller from your host.
go run ./main.go

.PHONY: image-build
-image-build: unit-test ## Build image with the manager.
+image-build: # unit-test ## Build image with the manager.
$(IMAGE_BUILDER) build --no-cache -f Dockerfiles/Dockerfile ${IMAGE_BUILD_FLAGS} -t $(IMG) .

.PHONY: image-push
5 changes: 3 additions & 2 deletions bundle/manifests/rhods-operator.clusterserviceversion.yaml
@@ -73,9 +73,9 @@ metadata:
"metadata": {
"labels": {
"app.kubernetes.io/created-by": "opendatahub-operator",
-"app.kubernetes.io/instance": "default-feature",
+"app.kubernetes.io/instance": "default",
"app.kubernetes.io/managed-by": "kustomize",
-"app.kubernetes.io/name": "featuretracker",
+"app.kubernetes.io/name": "default-feature",
"app.kubernetes.io/part-of": "opendatahub-operator"
},
"name": "default-feature"
@@ -607,6 +607,7 @@ spec:
verbs:
- create
- delete
+- get
- list
- update
- watch
2 changes: 1 addition & 1 deletion components/codeflare/codeflare.go
@@ -77,7 +77,7 @@ func (c *CodeFlare) ReconcileComponent(cli client.Client, owner metav1.Object, d
}

if found, err := deploy.OperatorExists(cli, dependentOperator); err != nil {
-return err
+return fmt.Errorf("operator exists throws error %v", err)
} else if found {
return fmt.Errorf("operator %s found. Please uninstall the operator before enabling %s component",
dependentOperator, ComponentName)
2 changes: 1 addition & 1 deletion components/component.go
@@ -106,7 +106,7 @@ func (c *Component) UpdatePrometheusConfig(cli client.Client, enable bool, compo
DeadManSnitchRules string `yaml:"deadmanssnitch-alerting.rules"`
CFRRules string `yaml:"codeflare-recording.rules"`
CRARules string `yaml:"codeflare-alerting.rules"`
-DashboardRRules string `yaml:"rhods-dashboard-recording.rule"`
+DashboardRRules string `yaml:"rhods-dashboard-recording.rules"`
DashboardARules string `yaml:"rhods-dashboard-alerting.rules"`
DSPRRules string `yaml:"data-science-pipelines-operator-recording.rules"`
DSPARules string `yaml:"data-science-pipelines-operator-alerting.rules"`
2 changes: 1 addition & 1 deletion components/modelmeshserving/modelmeshserving.go
@@ -117,7 +117,7 @@ func (m *ModelMeshServing) ReconcileComponent(cli client.Client, owner metav1.Ob

// For odh-model-controller
if enabled {
-err := cluster.UpdatePodSecurityRolebinding(cli, dscispec.ApplicationsNamespace, "odh-model-controller")
+err := cluster.UpdatePodSecurityRolebinding(cli, "odh-model-controller", dscispec.ApplicationsNamespace)
if err != nil {
return err
}
1 change: 1 addition & 0 deletions components/workbenches/workbenches.go
@@ -138,6 +138,7 @@ func (w *Workbenches) ReconcileComponent(cli client.Client, owner metav1.Object,
if enabled {
if dscispec.DevFlags.ManifestsUri == "" && len(w.DevFlags.Manifests) == 0 {
if platform == deploy.ManagedRhods || platform == deploy.SelfManagedRhods {
+// for kf-notebook-controller image
if err := deploy.ApplyParams(notebookControllerPath, w.SetImageParamsMap(imageParamMap), false); err != nil {
return err
}
1 change: 1 addition & 0 deletions config/monitoring/alertmanager/alertmanager-configs.yaml
@@ -629,6 +629,7 @@ data:
smtp_require_tls: true
# The root route on which each incoming alert enters.
+# TODO: check why need email_to
route:
group_by: ['alertname', 'cluster', 'service', 'job', 'email_to']
2 changes: 1 addition & 1 deletion config/monitoring/alertmanager/alertmanager-service.yaml
@@ -6,7 +6,7 @@ metadata:
labels:
name: alertmanager
name: alertmanager
-namespace: "redhat-ods-monitoring"
+namespace: redhat-ods-monitoring
spec:
ports:
- name: alertmanager
2 changes: 2 additions & 0 deletions config/monitoring/base/kustomization.yaml
@@ -4,3 +4,5 @@ resources:
- cluster-monitor-rolebinding.yaml
- rhods-prometheusrules.yaml
- rhods-servicemonitor.yaml
+- rhods-prometheus-role.yaml
+- rhods-prometheus-rolebinding.yaml
17 changes: 17 additions & 0 deletions config/monitoring/base/rhods-prometheus-role.yaml
@@ -0,0 +1,17 @@
# this is the role for cluster-monitoring to read the rhods prometheus service via the cluster-monitoring service account
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: rhods-prometheus-cluster-monitoring-viewer
namespace: redhat-ods-monitoring
rules:
- verbs:
- get
- watch
- list
apiGroups:
- ''
resources:
- pods
- services
- endpoints
14 changes: 14 additions & 0 deletions config/monitoring/base/rhods-prometheus-rolebinding.yaml
@@ -0,0 +1,14 @@
# this is the rolebinding to rhods-prometheus-cluster-monitoring-viewer for cluster-monitoring to read the rhods prometheus service
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: rhods-prometheus-cluster-monitoring-viewer-binding
namespace: redhat-ods-monitoring
subjects:
- kind: ServiceAccount
name: prometheus-k8s
namespace: openshift-monitoring
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: rhods-prometheus-cluster-monitoring-viewer
14 changes: 14 additions & 0 deletions config/monitoring/base/rhods-prometheusrules.yaml
@@ -1,3 +1,8 @@
# rhods_aggregate_availability, rhods_total_users, rhods_active_users should not be needed
# they should come not from the traditional prometheus pod but from the prometheus operator
# but to get PSI working with some, put them here
# TODO: revisit when we decommission the customized prometheus instance
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
@@ -6,19 +11,28 @@ metadata:
role: recording-rules
app: rhods
name: rhods-rules
namespace: redhat-ods-monitoring
spec:
groups:
- name: rhods-usage.rules
rules:
- record: cluster:usage:consumption:rhods:cpu:seconds:rate1h
expr: sum(rate(container_cpu_usage_seconds_total{container="",pod=~"jupyter-nb.*",namespace="rhods-notebooks"}[1h]))
labels:
instance: jupyter-notebooks
- record: cluster:usage:consumption:rhods:pod:up
expr: count(kube_pod_container_status_ready{namespace="rhods-notebooks", pod=~"jupyter-nb.*",container=~"jupyter-nb-.*"}==1)
labels:
instance: jupyter-notebooks
- record: cluster:usage:consumption:rhods:active_users
expr: count(kube_statefulset_replicas{namespace=~"rhods-notebooks", statefulset=~"jupyter-nb-.*"} ==1)
labels:
instance: jupyter-notebooks
- record: cluster:usage:consumption:rhods:cpu_requests_runtime
expr: sum(kube_pod_container_resource_requests{namespace="rhods-notebooks",resource="cpu", container=~"jupyter-nb-.*"} * on(pod) kube_pod_status_phase{phase="Running", namespace="rhods-notebooks"})
labels:
instance: jupyter-notebooks
- record: cluster:usage:consumption:rhods:cpu_limits_runtime
expr: sum(kube_pod_container_resource_limits{namespace="rhods-notebooks",resource="cpu", container=~"jupyter-nb-.*"} * on(pod) kube_pod_status_phase{phase="Running", namespace="rhods-notebooks"})
labels:
instance: jupyter-notebooks
77 changes: 66 additions & 11 deletions config/monitoring/base/rhods-servicemonitor.yaml
@@ -2,6 +2,7 @@ apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: rhods-monitor-federation
+namespace: redhat-ods-monitoring
labels:
monitor-component: rhods-resources
team: rhods
@@ -27,32 +28,86 @@ spec:
interval: 30s
namespaceSelector:
matchNames:
-- redhat-ods-monitoring
+- <odh_monitoring_project>
selector:
matchLabels:
app: prometheus
---
# servicemonitoring for rhods operator
# this is not in use, we need to implement operator metrics in logic first
# apiVersion: monitoring.coreos.com/v1
# kind: ServiceMonitor
# metadata:
# labels:
# control-plane: controller-manager
# name: rhods-controller-manager-metrics-monitor
# namespace: redhat-ods-operator
# spec:
# endpoints:
# - path: /metrics
# port: metrics
# scheme: https
# bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
# tlsConfig:
# insecureSkipVerify: true
# params:
# 'match[]':
# - '{__name__= "redhat-ods-operator-controller-manager-metrics-service"}'
# namespaceSelector:
# matchNames:
# - redhat-ods-operator
# selector:
# matchLabels:
# control-plane: controller-manager

---
# servicemonitor for the openshift-monitoring scrape
# moved from modelmesh-monitoring
# this one duplicates the old modelmesh-federated-metrics
# in order to keep metrics there if the user sets modelmesh to Removed
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
+  name: rhods-monitor-federation2
+  namespace: redhat-ods-monitoring
   labels:
-    control-plane: controller-manager
-  name: rhods-controller-manager-metrics-monitor
-  namespace: redhat-ods-operator
+    monitor-component: rhods-resources
+    team: rhods
spec:
endpoints:
-  - path: /metrics
-    port: '8080'
-    scheme: https
+  - interval: 30s
params:
'match[]':
- '{__name__= "haproxy_backend_http_average_response_latency_milliseconds"}'
- '{__name__= "haproxy_backend_http_responses_total"}'
- '{__name__= "container_cpu_usage_seconds_total"}'
- '{__name__= "container_memory_working_set_bytes"}'
- '{__name__= "node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate"}'
- '{__name__= "cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits"}'
- '{__name__= "cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests"}'
- '{__name__= "cluster:namespace:pod_memory:active:kube_pod_container_resource_requests"}'
- '{__name__= "cluster:namespace:pod_memory:active:kube_pod_container_resource_limits"}'
- '{__name__= "kube_persistentvolumeclaim_resource_requests_storage_bytes"}'
- '{__name__= "kubelet_volume_stats_used_bytes"}'
- '{__name__= "kubelet_volume_stats_capacity_bytes"}'
honorLabels: true
scrapeTimeout: 10s
bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
bearerTokenSecret:
key: ""
path: /federate
port: web
scheme: https
tlsConfig:
ca: {}
cert: {}
insecureSkipVerify: true
namespaceSelector:
  matchNames:
-  - redhat-ods-operator
+  - openshift-monitoring
selector:
  matchLabels:
-    control-plane: controller-manager
+    app.kubernetes.io/component: prometheus
+    app.kubernetes.io/instance: k8s
+    app.kubernetes.io/name: prometheus
+    app.kubernetes.io/part-of: openshift-monitoring
16 changes: 15 additions & 1 deletion config/monitoring/networkpolicy/monitoring/monitoring.yaml
@@ -2,6 +2,8 @@
# the services residing in redhat-ods-monitoring. namespaceSelector
# ensures that traffic from only the desired namespaces is allowed
# 9114 for blackbox or user_facing_endpoints* all down
+# 9115 for blackbox health
+# 10443 and 9091 for web
---
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
@@ -30,5 +32,17 @@ spec:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: openshift-monitoring
+- namespaceSelector:
+    matchLabels:
+      kubernetes.io/metadata.name: openshift-user-workload-monitoring
+- namespaceSelector:
+    matchLabels:
+      kubernetes.io/metadata.name: redhat-ods-operator
+- namespaceSelector:
+    matchLabels:
+      opendatahub.io/generated-namespace: "true"
+egress:
+- {}
policyTypes:
- Ingress
+- Egress
6 changes: 6 additions & 0 deletions config/monitoring/networkpolicy/operator/operator.yaml
@@ -16,5 +16,11 @@ spec:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: openshift-monitoring
+- namespaceSelector:
+    matchLabels:
+      kubernetes.io/metadata.name: openshift-user-workload-monitoring
+- namespaceSelector:
+    matchLabels:
+      opendatahub.io/generated-namespace: "true"
policyTypes:
- Ingress