Bug 1958367: Add recording rule for builds by strategy #1136

Merged

adambkaplan
Contributor

@adambkaplan adambkaplan commented Apr 27, 2021

Aggregate the number of builds run on a cluster by strategy. This will
help the Build API team determine which build strategies are actively
being used on a cluster. This will produce a time series with the
following strategy labels:

  • docker
  • source
  • jenkinspipeline
  • custom

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

@adambkaplan
Contributor Author

If possible, we'd like this rule backported to 4.6 so we can measure build usage in the active fleet.

@adambkaplan
Contributor Author

/cc @siamaksade

@simonpasquier
Contributor

/hold
until the CI is fixed

@openshift-ci-robot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) Apr 28, 2021
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: adambkaplan, siamaksade
To complete the pull request process, please assign dgrisonnet after the PR has been reviewed.
You can assign the PR to them by writing /assign @dgrisonnet in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@adambkaplan
Contributor Author

Reviewer note - recording rules with cluster:usage prefix are automatically exported to Telemeter.

@simonpasquier should I open a BZ so we can backport the recording rule?

@@ -485,6 +485,10 @@ local droppedKsmLabels = 'endpoint, instance, job, pod, service';
expr: 'sum(openshift_build_total{job="kubernetes-apiservers",phase="Error"})/(sum(openshift_build_total{job="kubernetes-apiservers",phase=~"Failed|Complete|Error"}))',
record: 'build_error_rate',
},
{
expr: 'sum by (strategy) (openshift_build_status_phase_total)',
record: 'cluster:usage:openshift:build_by_strategy:sum',

👍

@dgrisonnet
Member

@adambkaplan, we will need a BZ for 4.8, 4.7 and 4.6.

Since this PR is adding a new recording rule to telemeter, you'll need to add a comment about which team owns this rule as in https://github.com/openshift/cluster-monitoring-operator/blob/master/manifests/0000_50_cluster-monitoring-operator_04-config.yaml. The rule will also need to be validated by @smarterclayton before we can proceed.

On a side note about allowing {__name__=~"cluster:usage:.*"} in the telemeter config: I am not aware of the exact context behind this, but wouldn't it be better to allow each rule explicitly instead of grouping them with this regex? The benefit would be better traceability of the ownership of these rules, which is currently not tracked. cc @simonpasquier
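
To illustrate the trade-off raised here, the following is a minimal Python sketch, not the telemeter implementation. It contrasts a prefix-regex allowlist with an explicit per-rule allowlist. The owner name `build-api` and the second rule name are hypothetical; Prometheus regex matchers are fully anchored, which is why `fullmatch` is used.

```python
import re

# Prefix-style matcher, mirroring {__name__=~"cluster:usage:.*"}.
# Prometheus anchors regex matchers at both ends, hence fullmatch below.
PREFIX_MATCHER = re.compile(r"cluster:usage:.*")

# Hypothetical explicit allowlist: rule name -> owning team.
# The owner value is an assumption made for illustration only.
EXPLICIT_ALLOWLIST = {
    "cluster:usage:openshift:build_by_strategy:sum": "build-api",
}


def allowed_by_regex(metric_name):
    """Return True if the prefix regex would export this metric."""
    return PREFIX_MATCHER.fullmatch(metric_name) is not None


def owner_of(metric_name):
    """Return the recorded owner, or None if the rule was never registered."""
    return EXPLICIT_ALLOWLIST.get(metric_name)


# A hypothetical rule that nobody registered still passes the regex,
# but has no traceable owner in the explicit list.
unregistered = "cluster:usage:some_team:new_rule:sum"
print(allowed_by_regex(unregistered), owner_of(unregistered))
```

With the regex, any new rule under the prefix is exported without an ownership record; with the explicit list, exporting a rule requires an entry that names its owner.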

Unholding since the CI has been fixed.
/unhold

@openshift-ci-robot removed the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) May 5, 2021
@adambkaplan changed the title from "Add recording rule for builds by strategy" to "Bug Bug 1958367: Add recording rule for builds by strategy" May 7, 2021
@openshift-ci
Contributor

openshift-ci bot commented May 7, 2021

@adambkaplan: This pull request references Bugzilla bug 1958367, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.8.0) matches configured target release for branch (4.8.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @xiuwang

In response to this:

Bug Bug 1958367: Add recording rule for builds by strategy

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci bot added the bugzilla/severity-medium (Referenced Bugzilla bug's severity is medium for the branch this PR is targeting.) and bugzilla/valid-bug (Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.) labels May 7, 2021
@openshift-ci openshift-ci bot requested a review from xiuwang May 7, 2021 18:11
@adambkaplan changed the title from "Bug Bug 1958367: Add recording rule for builds by strategy" to "Bug 1958367: Add recording rule for builds by strategy" May 7, 2021
@adambkaplan
Contributor Author

/assign @smarterclayton

@dgrisonnet given that I'll need to add a comment to the telemeter config, I've renamed the rule so that we explicitly need to export it.

@adambkaplan
Contributor Author

@simonpasquier ran into this issue trying to get make generate happy. #1153

@xiuwang

xiuwang commented May 11, 2021

@adambkaplan Added a test case for this feature: OCP-41751.

Tested this feature on a cluster launched from the PR; the metric works well.

/label qe-approved

@openshift-ci bot added the qe-approved label (Signifies that QE has signed off on this PR) May 11, 2021
Aggregate the number of builds run on a cluster by strategy, and export
the recording rule to Telemeter. This will help the Build API team
determine which build strategies are actively being used on a cluster.

The recording rule will produce a time series with the following
strategy labels:

- docker
- source
- jenkinspipeline
- custom
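
As a rough illustration of what the `sum by (strategy)` aggregation in this rule produces, here is a minimal Python sketch. It is not PromQL and not the operator's code. The sample values and the `namespace` label are hypothetical; only the `strategy` label and its four values come from the description above.

```python
from collections import defaultdict


def sum_by(samples, label):
    """Mimic PromQL `sum by (<label>) (...)`: add sample values per label value.

    Samples without the label are grouped under the empty string,
    which is how PromQL treats a missing grouping label.
    """
    totals = defaultdict(float)
    for labels, value in samples:
        totals[labels.get(label, "")] += value
    return dict(totals)


# Hypothetical instant-vector samples as (labels, value) pairs.
samples = [
    ({"strategy": "docker", "namespace": "team-a"}, 3),
    ({"strategy": "docker", "namespace": "team-b"}, 2),
    ({"strategy": "source", "namespace": "team-a"}, 7),
    ({"strategy": "jenkinspipeline", "namespace": "team-c"}, 1),
    ({"strategy": "custom", "namespace": "team-c"}, 4),
]

by_strategy = sum_by(samples, "strategy")
for strategy, total in sorted(by_strategy.items()):
    print(f"{strategy}: {total:g}")
```

Every label other than `strategy` is dropped by the aggregation, so the resulting series count is bounded by the number of strategies.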

Updated documentation and telemeter_query
@adambkaplan
Contributor Author

@openshift/openshift-team-monitoring ptal

@adambkaplan
Contributor Author

/retest

@adambkaplan
Contributor Author

/retest

@adambkaplan
Contributor Author

Can we skip the single-node test for this PR?

@adambkaplan
Contributor Author

/retest

@dgrisonnet
Member

@adambkaplan the single-node job is optional

@dgrisonnet
Member

/lgtm

Leaving approval to @smarterclayton.

@openshift-ci bot added the lgtm (Indicates that a PR is ready to be merged.) and approved (Indicates a PR has been approved by an approver from all required OWNERS files.) labels May 27, 2021
@dgrisonnet
Member

/remove-approve

@openshift-ci bot removed the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) May 27, 2021
@@ -368,6 +368,8 @@ spec:
rules:
- expr: sum(openshift_build_total{job="kubernetes-apiservers",phase="Error"})/(sum(openshift_build_total{job="kubernetes-apiservers",phase=~"Failed|Complete|Error"}))
record: build_error_rate
- expr: sum by (strategy) (openshift_build_status_phase_total)
Contributor

One issue with summing counters is that the numbers will be off when a counter resets. It is recommended to apply rate() or increase() first and then sum the results.

Suggested change
- expr: sum by (strategy) (openshift_build_status_phase_total)
- expr: sum by (strategy) (rate(openshift_build_status_phase_total[5m]))
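
The reset concern can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the two series and their values are hypothetical, and the `increase` helper ignores the extrapolation that PromQL's `rate()` and `increase()` perform. It applies only if the metric is a counter; later comments in this thread report that it is a gauge.

```python
def increase(samples):
    """Total increase over a counter series, treating any drop as a reset.

    After a reset the counter restarts from zero, so the new value itself
    is counted as the increase for that step.
    """
    total = 0.0
    prev = samples[0]
    for cur in samples[1:]:
        total += cur - prev if cur >= prev else cur
        prev = cur
    return total


# Two hypothetical counter series scraped at the same four timestamps.
series_a = [10, 12, 15, 17]
series_b = [20, 22, 1, 3]  # the process restarted before the third scrape

# Sum first: the reset in series_b makes the summed series drop.
raw_sum = [a + b for a, b in zip(series_a, series_b)]
naive_delta = raw_sum[-1] - raw_sum[0]

# Increase first, then sum: each series handles its own reset.
increase_then_sum = increase(series_a) + increase(series_b)

print(raw_sum, naive_delta, increase_then_sum)
```

Once the series are summed, a reset in one of them can no longer be told apart from a real decrease, which is why the per-series function is applied before aggregating.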

Contributor

Is phase_total a counter? I thought it was a gauge. Agreed, if it's not a gauge this is a bit weird.

Member

From what I can tell from the Prometheus target metadata, it is a gauge. @adambkaplan, the Prometheus naming convention reserves the _total suffix for counters. Since this metric is a gauge, the conventional name would be openshift_build_status_phases.
https://prometheus.io/docs/practices/naming/
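
The convention described here could be checked with a small helper like the following Python sketch. The function name `check_total_suffix` and its messages are hypothetical, and it covers only the `_total` suffix rule, not the rest of the naming guidelines.

```python
def check_total_suffix(name, metric_type):
    """Return a warning string if the name breaks the _total convention, else None."""
    has_total = name.endswith("_total")
    if metric_type == "counter" and not has_total:
        return f"{name}: counters conventionally end in _total"
    if metric_type != "counter" and has_total:
        return f"{name}: the _total suffix is conventionally reserved for counters"
    return None


# The metric discussed in this thread, with the type reported from target metadata.
print(check_total_suffix("openshift_build_status_phase_total", "gauge"))
# The name suggested in the comment above passes the check.
print(check_total_suffix("openshift_build_status_phases", "gauge"))
```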

Contributor Author

Metric is collected via openshift-state-metrics - it's been around since 3.x and is marked STABLE in our GitHub docs. I'm fairly certain this name was chosen before the naming conventions were codified upstream.

https://github.com/openshift/openshift-state-metrics/blob/master/docs/build-metrics.md

@smarterclayton
Contributor

Cardinality and appropriateness to be sent to telemetry approved, consider my part of approval granted.

@dgrisonnet
Member

/approve

@openshift-ci
Contributor

openshift-ci bot commented Jun 1, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adambkaplan, dgrisonnet, siamaksade

@openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) Jun 1, 2021
@dgrisonnet
Member

/retest

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci
Contributor

openshift-ci bot commented Jun 1, 2021

@adambkaplan: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-aws-single-node ec26e16 link /test e2e-aws-single-node

@openshift-merge-robot openshift-merge-robot merged commit 1c0ce1c into openshift:master Jun 1, 2021
@openshift-ci
Contributor

openshift-ci bot commented Jun 1, 2021

@adambkaplan: All pull requests linked via external trackers have merged:

Bugzilla bug 1958367 has been moved to the MODIFIED state.

In response to this:

Bug 1958367: Add recording rule for builds by strategy

@adambkaplan
Contributor Author

/cherry-pick release-4.7

@openshift-cherrypick-robot

@adambkaplan: #1136 failed to apply on top of branch "release-4.7":

Applying: Add recording rule for builds by strategy
Using index info to reconstruct a base tree...
M	CHANGELOG.md
M	Documentation/data-collection.md
A	assets/cluster-monitoring-operator/prometheus-rule.yaml
A	jsonnet/rules.libsonnet
M	manifests/0000_50_cluster-monitoring-operator_04-config.yaml
Falling back to patching base and 3-way merge...
Auto-merging manifests/0000_50_cluster-monitoring-operator_04-config.yaml
CONFLICT (content): Merge conflict in manifests/0000_50_cluster-monitoring-operator_04-config.yaml
Auto-merging jsonnet/rules.jsonnet
CONFLICT (modify/delete): assets/cluster-monitoring-operator/prometheus-rule.yaml deleted in HEAD and modified in Add recording rule for builds by strategy. Version Add recording rule for builds by strategy of assets/cluster-monitoring-operator/prometheus-rule.yaml left in tree.
Auto-merging Documentation/data-collection.md
CONFLICT (content): Merge conflict in Documentation/data-collection.md
Auto-merging CHANGELOG.md
CONFLICT (content): Merge conflict in CHANGELOG.md
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Add recording rule for builds by strategy
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-4.7

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged. qe-approved Signifies that QE has signed off on this PR