MULTIARCH-4294: Implement prometheus metrics for the pod placement controller #257

Merged: 15 commits into openshift:main from the multiarch-4294 branch on Sep 17, 2024

@aleskandro aleskandro commented Sep 5, 2024

This PR implements the Prometheus metrics for the Pod placement controller.

Because kube-rbac-proxy, a key component for serving the metrics, is being deprecated, this PR also addresses MULTIARCH-4989.

See kubernetes-sigs/kubebuilder#3871 for further details.

Metrics are documented in the docs/metrics.md file and an example Grafana Dashboard is provided here.

Finally, this PR implements some performance optimizations:

  • Concurrent reconciles to increase the rate of processed pods
  • Caching and reconciling only Pending pods
  • A resync period, and a concurrency-safe update of the global pull secret only when it changes
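
The last optimization can be illustrated with a small sketch. The operator is written in Go and its implementation is not shown in this conversation, so the Python snippet below is only an illustration of the idea under stated assumptions: `PullSecretCache`, `update_if_changed`, and the use of a SHA-256 digest to detect changes are hypothetical names and choices, not the PR's code.

```python
import hashlib
import threading
from typing import Optional


class PullSecretCache:
    """Hypothetical in-memory holder for the global pull secret (illustrative only)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._digest: Optional[str] = None  # digest of the content stored last
        self._data: bytes = b""

    def update_if_changed(self, data: bytes) -> bool:
        """Store `data` only if its content differs; return True when it was replaced."""
        digest = hashlib.sha256(data).hexdigest()
        with self._lock:
            if digest == self._digest:
                # Same content as before: skip the write and any downstream work.
                return False
            self._digest = digest
            self._data = data
            return True

    def get(self) -> bytes:
        """Return the content stored last."""
        with self._lock:
            return self._data
```

Because the comparison and the write happen under the same lock, concurrent reconciles that observe the same secret content perform at most one write. Whether the controller compares digests, resource versions, or raw bytes is not stated in the PR description.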

@openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) on Sep 5, 2024
openshift-ci-robot commented Sep 5, 2024

@aleskandro: This pull request references MULTIARCH-4294 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

In response to this:

/hold

@openshift-ci bot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Sep 5, 2024
@aleskandro aleskandro force-pushed the multiarch-4294 branch 5 times, most recently from 87e8bd2 to 9a6a768 Compare September 6, 2024 19:17
Images provided under gcr.io/kubebuilder/ will be unavailable from March 18, 2025.
Projects initialized with Kubebuilder versions v3.14 or lower utilize gcr.io/kubebuilder/kube-rbac-proxy to protect the metrics endpoint.

Following the work in kubernetes-sigs/kubebuilder#4003, this commit removes the kube-rbac-proxy container and lets the controller's main container expose the metrics over HTTPS, protected by the WithAuthenticationAndAuthorization filter.

This also includes a minor fix in BuildService that was missed while resolving conflicts during a rebase.

Related to kubernetes-sigs/kubebuilder#3871
@aleskandro aleskandro force-pushed the multiarch-4294 branch 9 times, most recently from f984ac3 to 82c5212 Compare September 10, 2024 23:21
@aleskandro (Member, Author)

/test e2e-gcp-multi-operator-olm

@aleskandro (Member, Author)

/unhold

@openshift-ci bot removed the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Sep 11, 2024

-- Distribution of the time to inspect an image
sum by(le) (rate(mto_ppo_ctrl_time_to_inspect_pod_images_seconds_bucket[5m]))

@Prashanth684 (Contributor) commented Sep 13, 2024

I would also add a sample query that users can use to set up alerts based on SLOs. For example:

sum(rate(mto_ppo_ctrl_time_to_process_gated_pod_seconds_bucket{le="0.3"}[5m])) by (job)
/ sum(rate(mto_ppo_wh_pods_gated_total[5m])) by (job) >= 0.90

for a 90% SLO with a target duration of 300 ms.
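
As a worked example of the arithmetic behind this query, here is a minimal Python sketch. The function names and the rates used with it are made up for illustration; they are not data from the operator.

```python
from typing import Optional


def slo_ratio(fast_rate: float, total_rate: float) -> Optional[float]:
    """Fraction of gated pods processed within the target duration.

    fast_rate:  per-second rate of the le="0.3" histogram bucket
    total_rate: per-second rate of gated pods
    Returns None when no pods were gated in the window (nothing to evaluate).
    """
    if total_rate <= 0:
        return None
    return fast_rate / total_rate


def meets_slo(fast_rate: float, total_rate: float, target: float = 0.90) -> bool:
    """True when the ratio is known and at or above the target."""
    ratio = slo_ratio(fast_rate, total_rate)
    return ratio is not None and ratio >= target
```

With 50 pods gated per second, of which 45 are processed within 300 ms, the ratio is 45 / 50 = 0.9 and the SLO is met; with only 40 it drops to 0.8 and the SLO is missed. Note that, as written, the query returns a result while the ratio is at or above 0.90, that is, while the SLO is met; an alerting rule would usually test the opposite condition (`< 0.90`).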

@aleskandro (Member, Author)

Hi @Prashanth684, I would leave the AlertRule and alert suggestions out of this PR. I had started looking at them, but I would rather collect more data and implement the inspection caching first. Otherwise, we risk false-positive alerts that easily pop up for users.

@aleskandro (Member, Author)

/test e2e-gcp-multi-operator-olm

@Prashanth684 (Contributor)

/lgtm

@openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) on Sep 16, 2024
@aleskandro (Member, Author)

/approve

openshift-ci bot commented Sep 16, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aleskandro

@openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Sep 16, 2024
@aleskandro (Member, Author)

/test e2e-gcp-multi-operator-olm

@aleskandro (Member, Author)

/retest

@aleskandro (Member, Author)

/override "Red Hat Konflux / multiarch-tuning-operator-enterprise-contract / multiarch-tuning-operator"

openshift-ci bot commented Sep 16, 2024

@aleskandro: Overrode contexts on behalf of aleskandro: Red Hat Konflux / multiarch-tuning-operator-enterprise-contract / multiarch-tuning-operator

@aleskandro (Member, Author)

/test e2e-gcp-multi-operator-olm

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD dcc9a38 and 2 for PR HEAD dfd55aa in total

openshift-ci bot commented Sep 17, 2024

@aleskandro: all tests passed!

@openshift-merge-bot openshift-merge-bot bot merged commit 2b3f70d into openshift:main Sep 17, 2024
20 of 21 checks passed