Skip to content

Commit

Permalink
Tracing guide (#568)
Browse files Browse the repository at this point in the history
* Add tracing guide

* Update doc/observability/tracing.md

Co-authored-by: Adam Cattermole <acatterm@redhat.com>

* Address feedback

---------

Co-authored-by: Adam Cattermole <acatterm@redhat.com>
  • Loading branch information
david-martin and adam-cattermole committed Apr 25, 2024
1 parent bed695f commit 0285586
Show file tree
Hide file tree
Showing 3 changed files with 122 additions and 0 deletions.
Binary file added doc/observability/grafana_trace.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added doc/observability/grafana_tracing_loki.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
122 changes: 122 additions & 0 deletions doc/observability/tracing.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,122 @@
# Enabling tracing with a central collector

## Introduction

This guide outlines the steps to enable tracing in Istio and Kuadrant components (Authorino and Limitador), directing traces to a central collector for improved observability and troubleshooting. We'll also explore a typical troubleshooting flow using traces and logs.

## Prerequisites

- A Kubernetes cluster with Istio and Kuadrant installed.
- A trace collector (e.g., Jaeger or Tempo) configured to support [OpenTelemetry](https://opentelemetry.io/) (OTel).

## Configuration Steps

### Istio Tracing Configuration

Enable tracing in Istio by using the [Telemetry API](https://istio.io/v1.20/docs/tasks/observability/distributed-tracing/telemetry-api/).
Depending on your method for installing Istio, you will need to configure a tracing `extensionProvider` in your MeshConfig, Istio or IstioOperator resource as well.
Here is an example Telemetry and Istio config to sample 100% of requests, if using the Istio Sail Operator.

```yaml
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
name: mesh-default
namespace: istio-system
spec:
tracing:
- providers:
- name: tempo-otlp
randomSamplingPercentage: 100
---
apiVersion: operator.istio.io/v1alpha1
kind: Istio
metadata:
name: default
spec:
namespace: istio-system
values:
meshConfig:
defaultConfig:
tracing: {}
enableTracing: true
extensionProviders:
- name: tempo-otlp
opentelemetry:
port: 4317
service: tempo.tempo.svc.cluster.local
```
### Kuadrant Tracing Configuration
The Authorino and Limitador components have request tracing capabilities.
Here is an example configuration to enable and send traces to a central collector.
Ensure the collector is the same one that Istio is sending traces so that they can be correlated later.
```yaml
apiVersion: operator.authorino.kuadrant.io/v1beta1
kind: Authorino
metadata:
name: authorino
spec:
tracing:
endpoint: rpc://tempo.tempo.svc.cluster.local:4317
insecure: true
---
apiVersion: limitador.kuadrant.io/v1alpha1
kind: Limitador
metadata:
name: limitador
spec:
tracing:
endpoint: rpc://tempo.tempo.svc.cluster.local:4317
```
Once the changes are applied, the authorino and limitador components will be redeployed tracing enabled.
**Note:**
There are [plans](https://github.com/Kuadrant/architecture/issues/48) to consolidate the tracing configuration to a single location i.e. the Kuadrant CR.
This will eventually eliminate the need to configure tracing in both the Authorino and Limitador CRs.
**Important:**
Currently, trace IDs [do not propagate](https://github.com/envoyproxy/envoy/issues/22028) to wasm modules in Istio/Envoy, affecting trace continuity in Limitador.
This means that requests passed to limitador will not have the relavant 'parent' trace ID in its trace information.
If however the trace initiation point is outside of Envoy/Istio, the 'parent' trace ID will be available to limitador and included in traces passed to the collector.
This has an impact on correlating traces from limitador with traces from authorino, the gateway and any other components in the path of requests.
## Troubleshooting Flow Using Traces and Logs
Using a tracing interface like the Jaeger UI or Grafana, you can search for trace information by the trace ID.
You may get the trace ID from logs, or from a header in a sample request you want to troubleshoot.
You can also search for recent traces, filtering by the service you want to focus on.
Here is an example trace in the Grafana UI showing the total request time from the gateway (Istio), the time to check the curent rate limit count (and update it) in limitador and the time to check auth in Authorino:
<img src="./grafana_trace.png" alt="Trace in Grafana UI" width="800"/>
In limitador, it is possible to enable request logging with trace IDs to get more information on requests.
This requires the log level to be increased to at least debug, so the verbosity must be set to 3 or higher in the Limitador CR. For example:
```yaml
apiVersion: limitador.kuadrant.io/v1alpha1
kind: Limitador
metadata:
name: limitador
spec:
verbosity: 3
```
A log entry will look something like this, with the `traceparent` field holding the trace ID:

```
"Request received: Request { metadata: MetadataMap { headers: {"te": "trailers", "grpc-timeout": "5000m", "content-type": "application/grpc", "traceparent": "00-4a2a933a23df267aed612f4694b32141-00f067aa0ba902b7-01", "x-envoy-internal": "true", "x-envoy-expected-rq-timeout-ms": "5000"} }, message: RateLimitRequest { domain: "default/toystore", descriptors: [RateLimitDescriptor { entries: [Entry { key: "limit.general_user__f5646550", value: "1" }, Entry { key: "metadata.filter_metadata.envoy\\.filters\\.http\\.ext_authz.identity.userid", value: "alice" }], limit: None }], hits_addend: 1 }, extensions: Extensions }"
```
If you centrally aggregate logs using something like promtail and loki, you can jump between trace information and the relevant logs for that service:
<img src="./grafana_tracing_loki.png" alt="Trace and logs in Grafana UI" width="800"/>
Using a combination of tracing and logs, you can visualise and troubleshoot reuqest timing issues and drill down to specific services.
This method becomes even more powerful when combined with [metrics](https://docs.kuadrant.io/kuadrant-operator/doc/observability/metrics/) and [dashboards](https://docs.kuadrant.io/kuadrant-operator/doc/observability/dashboards/) to get a more complete picture of your users traffic.

0 comments on commit 0285586

Please sign in to comment.