
OTel googlemanagedprometheus can't connect to Google Cloud Monitoring from Cloud Run #31374

Open · matthewcyy opened this issue Feb 21, 2024 · 21 comments
Labels: bug (Something isn't working), exporter/googlemanagedprometheus (Google Managed Prometheus exporter), Stale


matthewcyy commented Feb 21, 2024

Component(s)

OTel Collector, googlemanagedprometheus exporter

What happened?

Description

With OTel images otel/opentelemetry-collector-contrib:0.89.0 and otel/opentelemetry-collector-contrib:0.87.0, I cannot see metrics being exported to Google Cloud Monitoring via googlemanagedprometheus. It was working fine for a few months until a couple weeks ago, but there haven't been any changes to the config or permissions.

Steps to Reproduce

Not entirely clear. It looks like a regression: the collector was exporting to Google Cloud Monitoring just fine before, and there were no changes to the collector config. It was originally on 0.87; I tried updating to 0.89, since the k8s OTel collector was using 0.89 and not having these issues.

Expected Result

Metrics visible in Google Cloud Monitoring

Actual Result

Metrics are not visible in Google Cloud Monitoring. The collector is exporting through the logging exporter, though, and is starting and stopping healthily.

Collector version

Docker images 0.87 & 0.89

Environment information

Environment

Following these steps: https://cloud.google.com/run/docs/tutorials/custom-metrics-opentelemetry-sidecar#ship-code

OpenTelemetry Collector configuration

# configuration file for OTEL collector
receivers:
  otlp:
    protocols:
      grpc:

processors:
  batch:
    # batch metrics before sending to reduce API usage
    send_batch_max_size: 200
    send_batch_size: 200
    timeout: 5s

  memory_limiter:
    # drop metrics if memory usage gets too high
    check_interval: 1s
    limit_percentage: 65
    spike_limit_percentage: 20

  # automatically detect Cloud Run resource metadata, needed for googlemanagedprometheus
  resourcedetection:
    detectors: [env, gcp]
    timeout: 2s
    override: false

  resource:
    attributes:
      # add instance_id as a resource attribute
      - key: service.instance.id
        from_attribute: faas.id
        action: upsert
        # parse service name from K_SERVICE Cloud Run variable
      - key: service.name
        value: ${env:K_SERVICE}
        action: insert

exporters:
  googlemanagedprometheus:
  logging:
  # debug:
  #   verbosity: detailed

extensions:
  health_check:

service:
  extensions: [health_check]
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch, memory_limiter, resourcedetection, resource]
      exporters: [googlemanagedprometheus, logging]
  telemetry:
    logs:
      level: "debug"

Log output

I see these logs from zapgrpc, which may indicate the issue (these were with image 0.89 and logs: level: debug set):

DEFAULT 2024-02-21T16:53:42.825122Z 2024-02-21T16:53:42.825Z info zapgrpc/zapgrpc.go:178 [transport] [client-transport 0xc001b96480] loopyWriter exiting with error: finished processing active streams while in draining mode {"grpc_log": true}
DEFAULT 2024-02-21T16:53:42.825288Z 2024-02-21T16:53:42.825Z info zapgrpc/zapgrpc.go:178 [transport] [client-transport 0xc001b96480] Closing: connection error: desc = "error reading from server: read tcp [***]:45499->[***]:443: use of closed network connection" {"grpc_log": true}
DEFAULT 2024-02-21T16:57:41.924188Z 2024-02-21T16:57:41.924Z info zapgrpc/zapgrpc.go:178 [transport] [server-transport 0xc001ba1d40] Closing: read tcp [::1]:4317->[::1]:48173: read: connection timed out {"grpc_log": true}
DEFAULT 2024-02-21T16:57:45.128007Z 2024-02-21T16:57:45.127Z info zapgrpc/zapgrpc.go:178 [transport] [server-transport 0xc00108b520] Closing: read tcp [::1]:4317->[::1]:16167: read: connection timed out {"grpc_log": true}
DEFAULT 2024-02-21T16:57:45.128068Z 2024-02-21T16:57:45.128Z info zapgrpc/zapgrpc.go:178 [transport] [server-transport 0xc00108b520] loopyWriter exiting with error: transport closed by client {"grpc_log": true}
DEFAULT 2024-02-21T16:57:45.128107Z 2024-02-21T16:57:45.128Z info zapgrpc/zapgrpc.go:178 [transport] [server-transport 0xc002c341a0] Closing: read tcp [::1]:4317->[::1]:24082: read: connection timed out {"grpc_log": true}
DEFAULT 2024-02-21T16:57:45.128157Z 2024-02-21T16:57:45.128Z info zapgrpc/zapgrpc.go:178 [transport] [server-transport 0xc002c341a0] loopyWriter exiting with error: transport closed by client {"grpc_log": true}
DEFAULT 2024-02-21T16:57:45.128176Z 2024-02-21T16:57:45.128Z info zapgrpc/zapgrpc.go:178 [transport] [server-transport 0xc001ba1d40] loopyWriter exiting with error: transport closed by client {"grpc_log": true}

Previously, with image 0.87 and without logs: level: debug, there were these logs:

zapgrpc/zapgrpc.go:195	[core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "monitoring.googleapis.com:443", ServerName: "monitoring.googleapis.com:443", }. Err: connection error: desc = "transport: authentication handshake failed: context deadline exceeded"	{"grpc_log": true}


Additional context

For more context, we're using this OTel collector as a sidecar container in our Cloud Run service, and that's where we're seeing these issues. We have a separate k8s collector for other services, which isn't seeing these issues. The k8s collector writes to the same project's Google Cloud Monitoring as the sidecar collector. We followed these steps to set up the OTel collector as a sidecar container: https://cloud.google.com/run/docs/tutorials/custom-metrics-opentelemetry-sidecar#ship-code
@matthewcyy matthewcyy added bug Something isn't working needs triage New item requiring triage labels Feb 21, 2024
@crobert-1 crobert-1 added the exporter/googlemanagedprometheus Google Managed Prometheus exporter label Feb 26, 2024
Contributor

Pinging code owners for exporter/googlemanagedprometheus: @aabmass @dashpole @jsuereth @punya @damemi @psx95. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@dashpole (Contributor)

Are you able to reproduce the issue with the sample application?

@dashpole dashpole self-assigned this Feb 26, 2024
@dashpole dashpole removed the needs triage New item requiring triage label Feb 26, 2024
@dashpole (Contributor)

v0.87.0 had an issue with the exporter's logging: GoogleCloudPlatform/opentelemetry-operations-go#761, but that should be fixed in v0.89.0. Updating to a newer version might help.

I don't think the exporter uses zap, so those logs are probably from a different component.

I also see you are using the logging exporter. Do you see the metrics from your application being logged? If not, that might suggest problems getting the metrics to the collector, rather than problems exporting.

matthewcyy (Author) commented Feb 27, 2024

@dashpole Thanks for the response! We see metrics being logged from the logging exporter, so I don't think it's an inherent issue with metrics getting to the collector. I actually noticed that when we do not have a minimum number of instances set on Cloud Run, it collects metrics fine. But if we do set a minimum number of instances in Cloud Run (autoscaling.knative.dev/minScale: "1"), then metrics are not exported to Google Cloud Monitoring properly and many are missed. Would you happen to know why, or whether this has been seen before?

Here's a screenshot from our Grafana dashboard querying GCM, where we group by the metric's resource.attributes.service.instance.id from Cloud Run. We were not using a minimum number of instances until 02/09, and when I went back to having no minimum instances on the evening of 02/26 it began collecting metrics fine again. I haven't tried reproducing this with the sample app yet.

[Screenshot: Grafana graph of the metric grouped by resource.attributes.service.instance.id]

@dashpole dashpole changed the title OTel googlemanagedprometheus can't connect to Google Cloud Monitoring OTel googlemanagedprometheus can't connect to Google Cloud Monitoring from Cloud Run Feb 27, 2024
@dashpole (Contributor)

cc @braydonk @ridwanmsharif in case you have seen any issues with minScale before.

@matthewcyy you mention you've seen the issue on v0.87 and v0.89. Did the issue appear when you upgraded from a previous version of the collector? If so, do you know what version that was? Or were you always on v0.87, and the minScale change triggered the issue?

For scaling-related issues, I've sometimes found that removing batching and queueing can solve it (by making the pipeline synchronous). Are you able to try removing the batch processor and setting sending_queue: enabled: false?

matthewcyy (Author) commented Feb 27, 2024

@dashpole Yes, this issue was seen on v0.87 and v0.89. The issue only appeared once we changed the minimum number of instances, but we weren't aware of that at the time, so we tried upgrading to v0.89, which didn't resolve the issue. I'll try removing the batch processor as well and let you know. The config file should then look like this, correct?

# configuration file for OTEL collector
receivers:
  otlp:
    protocols:
      grpc:

processors:
  memory_limiter:
    # drop metrics if memory usage gets too high
    check_interval: 1s
    limit_percentage: 65
    spike_limit_percentage: 20

  # automatically detect Cloud Run resource metadata, needed for googlemanagedprometheus
  resourcedetection:
    detectors: [env, gcp]
    timeout: 2s
    override: false

  resource:
    attributes:
      # add instance_id as a resource attribute
      - key: service.instance.id
        from_attribute: faas.id
        action: upsert
        # parse service name from K_SERVICE Cloud Run variable
      - key: service.name
        value: ${env:K_SERVICE}
        action: insert

exporters:
  googlemanagedprometheus:
    sending_queue:
      enabled: false
  logging:
  # debug:
  #   verbosity: detailed

extensions:
  health_check:

service:
  extensions: [health_check]
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, resourcedetection, resource]
      exporters: [googlemanagedprometheus, logging]
  telemetry:
    logs:
      level: "debug"

@matthewcyy (Author)

We're receiving this error now when we have sending_queue disabled:

2024-03-01T18:17:02.280Z  error  exporterhelper/common.go:213  Exporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.  {"kind": "exporter", "data_type": "metrics", "name": "googlemanagedprometheus", "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: Points must be written in order. One or more of the points specified had an older start time than the most recent point.: prometheus_target{cluster:,job:sequalizer-service,location:us-east1,instance:00d087f5e0402006ae47bc8a31b2c0d8d4bcf5ba3df7c2bc60946ffaf64d6e2fc4e4249bc7d64b1c70c60e445fda5d5fc4b0c246dec4f3cdd76ed81382ca0e,namespace:} timeSeries[0-14]: prometheus.googleapis.com/requests_total/counter{otel_scope_version:,dataset_id:***:***,otel_scope_name:sequalizer/app,user_id:***}\nerror details: name = Unknown  desc = total_point_count:15 success_point_count:14 errors:{status:{code:3} point_count:1}", "rejected_items": 13}

dashpole (Contributor) commented Mar 1, 2024

That's odd... and probably unrelated to queueing being enabled or disabled. Usually that error occurs when sending the same timeseries twice within a short period of time. Is that error transient, or are you still not seeing any metrics in Cloud Monitoring?

matthewcyy (Author) commented Mar 1, 2024

We made the config changes on our staging env, and that's also where the log is coming from. We're not seeing most metrics, and not seeing any histogram metrics in Cloud Monitoring on staging after these changes. We've seen a "One or more TimeSeries could not be written" error before on Cloud Monitoring, and we've been seeing the "Try enabling sending_queue" error several times recently after the changes.

dashpole (Contributor) commented Mar 4, 2024

"Try enabling sending_queue" is just appended to all error messages when you have it disabled.

And to confirm, removing the minScale annotation fixes the issue, right?

The only other thing that comes to mind is that the GMP exporter doesn't support exponential histograms, in case the application recently switched to sending those. IIRC, warnings would appear in your logs if that was the case, though.
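If exponential histograms did turn out to be involved, one way to rule them out on the application side is to force explicit-bucket aggregation in the SDK. A minimal Python sketch, assuming the standard opentelemetry-sdk Views API (the bucket boundaries are illustrative, not taken from this thread):

# Sketch: force explicit-bucket histograms so the exporter never receives
# exponential histograms; the boundaries below are only an example.
from opentelemetry.sdk.metrics import Histogram, MeterProvider
from opentelemetry.sdk.metrics.view import ExplicitBucketHistogramAggregation, View

explicit_buckets = View(
    instrument_type=Histogram,
    aggregation=ExplicitBucketHistogramAggregation(
        boundaries=[0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10]
    ),
)
provider = MeterProvider(views=[explicit_buckets])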

dashpole (Contributor) commented Mar 4, 2024

Otherwise, it's hard to find the issue without a reproduction. If you can find a simple app that reproduces it, that would be helpful.

The other approach we can take to debug is to use the file exporter to get the OTLP you are trying to send in JSON format. I can replay that in my own project to try to reproduce the issue.

@matthewcyy (Author)

OK, I'll first try reproducing the issue with the sample app and collector from this tutorial, since this is where we started with OTel: https://cloud.google.com/run/docs/tutorials/custom-metrics-opentelemetry-sidecar. I plan to use a Python server similar to the Golang server used in the tutorial, sending OTLP with the same collector config we're currently using, to see if the issue is reproducible.
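For illustration, a minimal Python setup along those lines, exporting OTLP metrics over gRPC to the sidecar collector on localhost:4317, could look roughly like this (names and the export interval are placeholders, not the actual service code):

# Sketch: Python app sending OTLP metrics to the sidecar collector.
from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://localhost:4317", insecure=True),
    export_interval_millis=10_000,
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("sample-app")
request_counter = meter.create_counter("requests", unit="1", description="Requests served")
request_counter.add(1)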

@AkselAllas

@dashpole I am also seeing "Points must be written in order" errors.

If Container A and Container B send the same histogram metric (e.g. duration) with the same labels to the collector, the collector forwards both to GCP.

Does GCP discard one of those histogram metric values?

@dashpole (Contributor)

@AkselAllas if the resource attributes are also the same between container A and container B, they will collide and you will get that error (or other errors). To fix it, make sure you are using the GCP resource detector, and, if on k8s, use the k8sattributes processor, and set k8s.container.name using the OTEL_RESOURCE_ATTRIBUTES environment variable.

AkselAllas commented Mar 11, 2024

I'm on Cloud Run, and I don't see a specific container ID (instance ID, not revision ID) in this detector:
https://github.com/open-telemetry/opentelemetry-js-contrib/blob/main/detectors/node/opentelemetry-resource-detector-gcp/src/detectors/GcpDetector.ts

But if I add a UUID to the service.instance.id resource attribute, will I be fine?

@dashpole (Contributor)

Looks like that one doesn't have support for Cloud Run. Try using the detector from GoogleCloudPlatform, similar to this: https://github.com/GoogleCloudPlatform/opentelemetry-operations-js/blob/main/samples/metrics/index.js#L38

@iydoowii

Working with @matthewcyy on this issue.

We updated our code to use the opentelemetry-resourcedetector-gcp Python library, and it appears to support Cloud Run:

https://github.com/GoogleCloudPlatform/opentelemetry-operations-python/blob/v1.6.0/opentelemetry-resourcedetector-gcp/src/opentelemetry/resourcedetector/gcp_resource_detector/__init__.py#L168
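For reference, wiring that detector into the SDK resource typically looks something like the sketch below (the service name and overall setup are illustrative, not the exact code in question):

# Sketch: attach the GCP resource detector so Cloud Run attributes
# (cloud.*, faas.*) end up on the exported metrics' resource.
from opentelemetry.resourcedetector.gcp_resource_detector import GoogleCloudResourceDetector
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.resources import Resource, get_aggregated_resources

resource = get_aggregated_resources(
    [GoogleCloudResourceDetector()],
    initial_resource=Resource.create({"service.name": "my-service"}),  # illustrative name
)
provider = MeterProvider(resource=resource)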

but we are still hitting the issue:

Points must be written in order. One or more of the points specified had an older start time than the most recent point

and this error never goes away (until the Cloud Run instance is terminated). I see this error for the same metric every minute.

But I see that the metric is created in Google Managed Prometheus, and it does have data at some intervals.

Error:

{
  "textPayload": "2024-03-27T06:44:09.961Z\terror\texporterhelper/queue_sender.go:97\tExporting failed.
   Dropping data.\t{\"kind\": \"exporter\", \"data_type\": \"metrics\", \"name\": \"googlemanagedprometheus\", \"error\": \"rpc error: 
  code = InvalidArgument desc = One or more TimeSeries could not be written: Points must be written in order.
   One or more of the points specified had an older start time than the most recent point.:
   prometheus_target{instance:00a22404dc4edb3e847b340f138366fea723159df6348e4a4b080ebdc6e5c99e3fa93687b5626836ab3e44b5df9422e40093c8121b86a08825021909c7a80056,location:myregion,job:my-service,cluster:,namespace:}
    timeSeries[0-11,18-26]: prometheus.googleapis.com/requests_total/counter{dataset_id:app_demo,user_id:myuser@domain.io,otel_scope_version:,
    question:myquestion,otel_scope_name:myservice/app}; 
    The new labels would cause the metric prometheus.googleapis.com/latency_validation_seconds/histogram to have over 200 labels.;
     The new labels would cause the metric prometheus.googleapis.com/generation_tokens/histogram to have over 200 labels.; 
     The new labels would cause the metric prometheus.googleapis.com/latency_retrieval_seconds/histogram to have over 200 labels.;
      The new labels would cause the metric prometheus.googleapis.com/latency_generation_seconds/histogram to have over 200 labels.; 
      The new labels would cause the metric prometheus.googleapis.com/validation_tokens/histogram to have over 200 labels.; 
      The new labels would cause the metric prometheus.googleapis.com/tokens/histogram to have over 200 labels.\\
      nerror details: name = Unknown  desc = total_point_count:27  success_point_count:19  errors:{status:{code:3}  point_count:8}\", \"dropped_items\": 25}",
  "insertId": "6603c03a0003f5d77c676455",
  "resource": {
    "type": "cloud_run_revision",
    "labels": {
      "project_id": "my-staging-1234",
      "location": "us-east1",
      "revision_name": "my-service-00395-tjf",
      "service_name": "my-service",
      "configuration_name": "my-service"
    }
  },
  "timestamp": "2024-03-27T06:44:10.259543Z",
  "labels": {
    "container_name": "otel-collector",
    "instanceId": "00a22404dc4edb3e847b340f138366fea723159df6348e4a4b080ebdc6e5c99e3fa93687b5626836ab3e44b5df9422e40093c8121b86a08825021909c7a80056"
  },
  "logName": "projects/xyz/logs/run.googleapis.com%2Fstderr",
  "receiveTimestamp": "2024-03-27T06:44:10.510742841Z"
}

When I query the metric with this instance ID, I can see my data in the metric explorer.

rate(requests_total{instance="00a22404dc4edb3e847b340f138366fea723159df6348e4a4b080ebdc6e5c99e3fa93687b5626836ab3e44b5df9422e40093c8121b86a08825021909c7a80056"}[${__interval}])

I downloaded the metric data as CSV from Metrics Explorer, and I see that it has a data point every 10 seconds. Is this expected?
If the metric is sent to Prometheus successfully, why does the app/OTel collector try to send it again and again? Is that a bug in our code, or is it OK by design?

[Screenshot: CSV export from Metrics Explorer showing a data point every 10 seconds]

In our case, we get a question from a user and create a metric for it (just an example metric):

  app_metrics["requests"] = meter.create_counter(
        "requests", unit="1", description="The number of requests to the server"
    )

AkselAllas commented May 13, 2024

For me:

"But if I add a UUID to the service.instance.id resource attribute, will I be fine?"

Didn't work.

https://github.com/GoogleCloudPlatform/opentelemetry-operations-js/blob/main/samples/metrics/index.js#L38

Worked.


This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Jul 15, 2024
@dashpole (Contributor)

@iydoowii sorry this fell through the cracks. I added some documentation that should help explain these sorts of errors: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/googlemanagedprometheusexporter/README.md#troubleshooting.

I suspect you have "colliding" timeseries, which is why you can see metrics in cloud monitoring, but also get errors when exporting.

Since it is Python, one possibility is that the conflicts are from different processes, which is a common issue. It might be similar to this example: https://cloud.google.com/trace/docs/setup/python-ot#config-otel:

The Flask app relies on Gunicorn to serve HTTP requests following the recommendations in Flask's Deploying to Production guide. Gunicorn starts multiple copies of your app running in independent worker processes to increase throughput. To ensure that metrics from the worker processes don't conflict with each other, we recommend that each worker process sets a unique value for the service.instance.id resource attribute. One way to do this is by including the process ID in the service.instance.id. For more information, see Time-series collisions.
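A minimal Python sketch of that recommendation (the detector wiring and environment variable are illustrative; the point is simply that service.instance.id must be unique per worker process):

# Sketch: give each worker process a unique service.instance.id so
# concurrent workers don't write colliding points for the same timeseries.
import os

from opentelemetry.resourcedetector.gcp_resource_detector import GoogleCloudResourceDetector
from opentelemetry.sdk.resources import Resource, get_aggregated_resources

detected = get_aggregated_resources([GoogleCloudResourceDetector()])
resource = detected.merge(
    Resource.create({
        # K_REVISION is set by Cloud Run; the PID distinguishes Gunicorn workers.
        "service.instance.id": f"{os.environ.get('K_REVISION', 'local')}-{os.getpid()}",
    })
)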

@dashpole dashpole removed the Stale label Jul 15, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Sep 16, 2024