Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[processor/metricstransform] Aggregation skipped when label_set is empty but not nil #34430

Open
tiffanny29631 opened this issue Aug 6, 2024 · 3 comments
Labels
bug Something isn't working needs triage New item requiring triage processor/metricstransform Metrics Transform processor

Comments

@tiffanny29631
Copy link

Component(s)

processor/metricstransform

What happened?

Description

We are upgrading from 0.102.0 to 0.106.0 for CVE fixes and encountered a behavior change in the mentioned processor. With the configuration we have the expectation is no label remains for the metric after aggregation operation, but now all labels fall through and get exported.

When looking at the code previous filter skips if label_set is nil and refactored code skips if length is 0.

I did not find the change of this behavior in the release note, was it intentional? If yes, would omitting this field work? I've tried explicitly setting label_set to nil and the filter is still not working as expected.

Steps to Reproduce

Install otelconribcol 0.102.0 with our ConfigMap, everything works fine;

Upgrade to 0.106.0, the metric that has labe_set:[] starts to getting rejected by our back end as unrecognized labels received.

Expected Result

With label_set:[] all labels associated with a metric are filtered out.

Actual Result

With label_set:[] all labels associated with a metric remain.

Collector version

0.106.0

Environment information

Environment

GKE cluster 1.29.6-gke.1254000

OpenTelemetry Collector configuration

apiVersion: v1
data:
  otel-collector-config.yaml: |-
    receivers:
      opencensus:
    exporters:
      prometheus:
        endpoint: :8675
        namespace: config_sync
        resource_to_telemetry_conversion:
          enabled: true
      googlecloud:
        metric:
          prefix: "custom.googleapis.com/opencensus/config_sync/"
          # The exporter would always fail at sending metric descriptor. Skipping
          # creation of metric descriptors until the error from upstream is resolved
          # The metric streaming data is not affected
          # https://github.com/GoogleCloudPlatform/opentelemetry-operations-go/issues/529
          skip_create_descriptor: true
          # resource_filters looks for metric resource attributes by prefix and converts
          # them into custom metric labels, so they become visible and can be accessed
          # under the GroupBy dropdown list in Cloud Monitoring
          resource_filters:
            - prefix: "cloud.account.id"
            - prefix: "cloud.availability.zone"
            - prefix: "cloud.platform"
            - prefix: "cloud.provider"
            - prefix: "k8s.pod.ip"
            - prefix: "k8s.pod.namespace"
            - prefix: "k8s.pod.uid"
            - prefix: "k8s.container.name"
            - prefix: "host.id"
            - prefix: "host.name"
            - prefix: "k8s.deployment.name"
            - prefix: "k8s.node.name"
        sending_queue:
          enabled: false
      googlecloud/kubernetes:
        metric:
          prefix: "kubernetes.io/internal/addons/config_sync/"
          # skip_create_descriptor: Metrics start with 'kubernetes.io/' have already
          # got descriptors defined internally. Skip sending dupeicated metric
          # descriptors here to prevent errors or conflicts.
          skip_create_descriptor: true
          # instrumentation_library_labels: Otel Collector by default attaches
          # 'instrumentation_version' and 'instrumentation_source' labels that are
          # not specified in our Cloud Monarch definitions, thus skipping them here
          instrumentation_library_labels: false
          # create_service_timeseries: This is a recommended configuration for
          # 'service metrics' starts with 'kubernetes.io/' prefix. It uses
          # CreateTimeSeries API and has its own quotas, so that custom metric write
          # will not break this ingestion pipeline
          create_service_timeseries: true
          service_resource_labels: false
        sending_queue:
          enabled: false
    processors:
      batch:
      # resourcedetection: This processor is needed to correctly mirror resource
      # labels from OpenCensus to OpenTelemetry. We also want to keep this same
      # processor in Otel Agent configuration as the resource labels are added from
      # there
      resourcedetection:
        detectors: [env, gcp]
      filter/cloudmonitoring:
        metrics:
          include:
            # Use strict match type to ensure metrics like 'kustomize_resource_count'
            # is excluded
            match_type: strict
            metric_names:
              - reconciler_errors
              - apply_duration_seconds
              - reconcile_duration_seconds
              - rg_reconcile_duration_seconds
              - last_sync_timestamp
              - pipeline_error_observed
              - declared_resources
              - apply_operations_total
              - resource_fights_total
              - internal_errors_total
              - kcc_resource_count
              - resource_count
              - ready_resource_count
              - cluster_scoped_resource_count
              - resource_ns_count
              - api_duration_seconds
      # Aggregate some metrics sent to Cloud Monitoring to remove high-cardinality labels (e.g. "commit")
      metricstransform/cloudmonitoring:
        transforms:
          - include: last_sync_timestamp
            action: update
            operations:
              - action: aggregate_labels
                label_set:
                  - configsync.sync.kind
                  - configsync.sync.name
                  - configsync.sync.namespace
                  - status
                aggregation_type: max
          - include: declared_resources
            action: update
            operations:
              - action: aggregate_labels
                label_set:
                  - configsync.sync.kind
                  - configsync.sync.name
                  - configsync.sync.namespace
                aggregation_type: max
          - include: apply_duration_seconds
            action: update
            operations:
              - action: aggregate_labels
                label_set:
                  - configsync.sync.kind
                  - configsync.sync.name
                  - configsync.sync.namespace
                  - status
                aggregation_type: max
      filter/kubernetes:
        metrics:
          include:
            match_type: regexp
            metric_names:
              - kustomize.*
              - api_duration_seconds
              - reconciler_errors
              - pipeline_error_observed
              - reconcile_duration_seconds
              - rg_reconcile_duration_seconds
              - parser_duration_seconds
              - declared_resources
              - apply_operations_total
              - apply_duration_seconds
              - resource_fights_total
              - remediate_duration_seconds
              - resource_conflicts_total
              - internal_errors_total
              - kcc_resource_count
              - last_sync_timestamp
      # Transform the metrics so that their names and labels are aligned with definition in go/config-sync-monarch-metrics
      metricstransform/kubernetes:
        transforms:
          - include: api_duration_seconds
            action: update
            operations:
              - action: aggregate_labels
                # label_set is the allowlist of labels to keep after aggregation
                label_set: [status, operation]
                aggregation_type: max
          - include: declared_resources
            action: update
            new_name: current_declared_resources
            operations:
              - action: aggregate_labels
                label_set: []
                aggregation_type: max
          - include: kcc_resource_count
            action: update
            operations:
              - action: aggregate_labels
                label_set: [resourcegroup]
                aggregation_type: max
          - include: reconciler_errors
            action: update
            new_name: last_reconciler_errors
            operations:
              - action: aggregate_labels
                label_set: [component, errorclass]
                aggregation_type: max
          - include: reconcile_duration_seconds
            action: update
            operations:
              - action: aggregate_labels
                label_set: [status]
                aggregation_type: max
          - include: rg_reconcile_duration_seconds
            action: update
            operations:
              - action: aggregate_labels
                label_set: [stallreason]
                aggregation_type: max
          - include: last_sync_timestamp
            action: update
            operations:
              - action: aggregate_labels
                label_set: [status]
                aggregation_type: max
          - include: parser_duration_seconds
            action: update
            operations:
              - action: aggregate_labels
                label_set: [status, source, trigger]
                aggregation_type: max
          - include: pipeline_error_observed
            action: update
            new_name: last_pipeline_error_observed
            operations:
              - action: aggregate_labels
                label_set: [name, component, reconciler]
                aggregation_type: max
          - include: apply_operations_total
            action: update
            new_name: apply_operations_count
            operations:
              - action: aggregate_labels
                label_set: [controller, operation, status]
                aggregation_type: max
          - include: apply_duration_seconds
            action: update
            operations:
              - action: aggregate_labels
                label_set: [status]
                aggregation_type: max
          - include: resource_fights_total
            action: update
            new_name: resource_fights_count
            operations:
              - action: aggregate_labels
                label_set: [name, component, reconciler]
                aggregation_type: max
          - include: resource_conflicts_total
            action: update
            new_name: resource_conflicts_count
            operations:
              - action: aggregate_labels
                label_set: []
                aggregation_type: max
          - include: internal_errors_total
            action: update
            new_name: internal_errors_count
            operations:
              - action: aggregate_labels
                label_set: []
                aggregation_type: max
          - include: remediate_duration_seconds
            action: update
            operations:
              - action: aggregate_labels
                label_set: [status]
                aggregation_type: max
          - include: kustomize_field_count
            action: update
            operations:
              - action: aggregate_labels
                label_set: [field_name]
                aggregation_type: max
          - include: kustomize_deprecating_field_count
            action: update
            operations:
              - action: aggregate_labels
                label_set: [deprecating_field]
                aggregation_type: max
          - include: kustomize_simplification_adoption_count
            action: update
            operations:
              - action: aggregate_labels
                label_set: [simplification_field]
                aggregation_type: max
          - include: kustomize_builtin_transformers
            action: update
            operations:
              - action: aggregate_labels
                label_set: [k8s_metadata_transformer]
                aggregation_type: max
          - include: kustomize_helm_inflator_count
            action: update
            operations:
              - action: aggregate_labels
                label_set: [helm_inflator]
                aggregation_type: max
          - include: kustomize_base_count
            action: update
            operations:
              - action: aggregate_labels
                label_set: [base_source]
                aggregation_type: max
          - include: kustomize_patch_count
            action: update
            operations:
              - action: aggregate_labels
                label_set: [patch_field]
                aggregation_type: max
          - include: kustomize_ordered_top_tier_metrics
            action: update
            operations:
              - action: aggregate_labels
                label_set: [top_tier_field]
                aggregation_type: max
          - include: kustomize_resource_count
            action: update
            operations:
              - action: aggregate_labels
                label_set: []
                aggregation_type: max
          - include: kustomize_build_latency
            action: update
            operations:
              - action: aggregate_labels
                label_set: []
                aggregation_type: max
    extensions:
      health_check:
    service:
      extensions: [health_check]
      pipelines:
        metrics/cloudmonitoring:
          receivers: [opencensus]
          processors: [batch, filter/cloudmonitoring, metricstransform/cloudmonitoring, resourcedetection]
          exporters: [googlecloud]
        metrics/prometheus:
          receivers: [opencensus]
          processors: [batch]
          exporters: [prometheus]
        metrics/kubernetes:
          receivers: [opencensus]
          processors: [batch, filter/kubernetes, metricstransform/kubernetes, resourcedetection]
          exporters: [googlecloud/kubernetes]
kind: ConfigMap
metadata:
  labels:
    app: opentelemetry
    component: otel-collector
    configmanagement.gke.io/arch: csmr
    configmanagement.gke.io/system: "true"
  name: otel-collector-googlecloud
  namespace: config-management-monitoring

Log output

No response

Additional context

No response

@tiffanny29631 tiffanny29631 added bug Something isn't working needs triage New item requiring triage labels Aug 6, 2024
@github-actions github-actions bot added the processor/metricstransform Metrics Transform processor label Aug 6, 2024
Copy link
Contributor

github-actions bot commented Aug 6, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@main--
Copy link

main-- commented Aug 10, 2024

I'm observing the same problem. I was wondering why the Honeycomb-provided config was reporting 640 distinct CPU events, and it turns out that all the aggregations are just being ignored. I resolved the problem by downgrading to Version 0.105.0 - now those 640 events are aggregated down to a single one. I'm guessing that #33655 introduced this bug, seeing as it replaced the event aggregation functions with a completely different code path.

@tiffanny29631
Copy link
Author

A potential hack: Latest observation if v0.106 is using a no_op_label instead of label_set:[] or omitting the field can bypass the issue.

Looking at https://github.com/open-telemetry/opentelemetry-collector/blob/main/pdata/pcommon/slice.go#L125 if a given element in label_set does not exist in the original slice, f would still return true indicating the label is to be removed but the re-slice would ignore as the element cannot be found, meanwhile other labels could get aggregated out as designed.

tiffanny29631 added a commit to tiffanny29631/kpt-config-sync that referenced this issue Aug 13, 2024
…erTools#1369)" (GoogleContainerTools#1376)

Due to a refactor in the upstream, the metric transform processor no longer takes `label_set:[]` as a hint to remove all labels, instead all labels fall through and get exported. See [issue](open-telemetry/opentelemetry-collector-contrib#34430) for details.

This change is to reapply the upgrade to 0.106.0 to catch up with latest otelcollector but adding the following workaround:

A label `no_op_label` is given to the `label_set` of the metrics that need all labels removed. In this case the [re-slice function in the core lib](https://github.com/open-telemetry/opentelemetry-collector/blob/main/pdata/pcommon/slice.go#L125) will filter out all other valid labels of the metric. As for the no_op_label nothing will be performed as it's not part of the original attributes.


This reverts commit 1d33f46.
tiffanny29631 added a commit to tiffanny29631/kpt-config-sync that referenced this issue Aug 16, 2024
This change updates otelcoontribcol to latest and modifies to fit breaking changes from 0.104.0 and 0.106.0.

Breaking change in 0.104.0 https://github.com/open-telemetry/opentelemetry-collector-contrib/releases/tag/v0.104.0
Breaking change in 0.106.0 open-telemetry/opentelemetry-collector-contrib#34430

- Localhost is now the default setting, while otel-agent and otel-collector require 0.0.0.0, so the feature gate has been removed.
- The format of the environment variable was updated to meet the new syntax requirements. The otel-agent ConfigMap was split between the reconciler and controllers, ensuring that sync-related labels are only applied to reconcilers.
- A `no_op_label` has been added to ensure that the aggregation in the metricstransform processor filters on all metric labels. This is a temporary workaround until a permanent fix is implemented upstream.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working needs triage New item requiring triage processor/metricstransform Metrics Transform processor
Projects
None yet
Development

No branches or pull requests

2 participants