
Add metrics semantic conventions for timed operations #657

Conversation

justinfoote (Member)

This PR adds semantic conventions for metrics that should be broadly applicable to many timed operations. This PR attempts to implement #654.

This PR addresses naming of these metrics, but does not attempt to specify labeling. I believe that belongs in category-specific semantic conventions, as the labels used will vary across http, database, messaging, etc...

I have attempted to follow the guidelines here: open-telemetry/oteps#108


@justinfoote justinfoote changed the title Add general metrics semantic conventions Add metrics semantic conventions for timed operations Jun 15, 2020
@jkwatson (Contributor)

How about a standard set of labels to distinguish metrics of the same category/kind, but different target? For example, my service may make many http calls as a client. There should be some sort of label to distinguish what the target of the http calls was. Same would go for databases, as my service may talk to multiple database systems.

@justinfoote (Member Author)

@jkwatson, I considered this idea, but I was struggling to come up with a set of labels that I thought could be broadly applied.
The most commonly applicable is probably URL, but our span conventions have namespaced that attribute differently between categories (that is: http.url, db.url, messaging.url), and I think I would prefer that the Metric labels match the Span attributes as much as possible.

I had punted on defining labels until we make category-specific semantic conventions. I'd be happy to write them up and add them to this PR if y'all think that makes sense.

@jmacd (Contributor) commented Jun 16, 2020

This, along with open-telemetry/oteps#108, forms the basis of a new section on metric naming for the OpenTelemetry spec. The comments discuss several ways to reflect these statistics, with one, two, or three instruments.

I am in favor of defining these statistics using a single ValueRecorder instrument with a success=true/false label. The (WIP) Metric Views API (open-telemetry/oteps#89) can be used to transform this one metric (with its default MinMaxSumCount aggregation) into the throughput, error, and duration metrics discussed in this proposal.
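A minimal sketch of that single-instrument shape (names like `meter.create_value_recorder` are hypothetical stand-ins; the concrete metrics API surface was still in flux at the time):

```python
import time

# Sketch only: `meter` and `create_value_recorder` are hypothetical;
# substitute whatever the metrics API of the day exposes.
duration = meter.create_value_recorder(
    name="duration",
    description="Duration of timed operations",
    unit="s",
)

def timed(operation):
    # Record each operation's duration once, labeled by outcome, so a
    # MinMaxSumCount aggregation yields throughput (count), error rate
    # (count where success=false), and duration (sum/min/max) in one metric.
    start = time.monotonic()
    try:
        result = operation()
    except Exception:
        duration.record(time.monotonic() - start, {"success": "false"})
        raise
    duration.record(time.monotonic() - start, {"success": "true"})
    return result
```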

@carlosalberto carlosalberto added area:semantic-conventions Related to semantic conventions spec:metrics Related to the specification/metrics directory labels Jun 19, 2020

#### Status

Recordings with the duration instrument must include a `status` label. The value of this label must be a valid Span [`StatusCanonicalCode`](../../trace/api.md#statuscanonicalcode).
Contributor

For me this is connected with open-telemetry/oteps#123, which has called the canonical status code into question (again). As discussed in today's SIG, there's a need for systems to easily find the status code that describes an error condition, so having a bunch of specific conventions (e.g., http.status, posix.status, ...) is not a complete solution.

The current Span Status structure notwithstanding, we could represent a variable status code space in string-valued format by prefixing the status code space name to the code name, e.g.:

`status=http:404`
`status=posix:EINTR`
`status=grpc:CANCELLED`

This would require a breaking code change for spans, but if Span Status included: (1) a conventional code number, (2) a conventional code space name, (3) a message, the above could be generated.
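For illustration, the prefixed encoding is trivial to produce; the helper below is hypothetical, but its outputs match the examples above:

```python
def status_label(code_space: str, code_name: str) -> str:
    # Prefix the status-code-space name to the code name.
    return f"{code_space}:{code_name}"

status_label("http", "404")        # -> "http:404"
status_label("posix", "EINTR")     # -> "posix:EINTR"
status_label("grpc", "CANCELLED")  # -> "grpc:CANCELLED"
```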

Contributor

To be clear, I do not mean this comment to be blocking. The outcome of open-telemetry/oteps#123 may suggest a change in this part of the specification either way.

Contributor

This requirement is problematic in light of #706.

Member Author

I don't know what is going to happen to the Span's status attribute, but this is still a requirement we want for metrics. We'd like a low-cardinality but highly informative label, with a known enumeration of possible values, that can be further collapsed downstream by a metric view, exporter, or a vendor's back-end.

With the current state of the trace spec, I feel the StatusCanonicalCode is the correct value type for this label. If the span's status attribute actually is removed, we'll need to refactor this spec, but I don't expect its intent to change.
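For illustration, a downstream view, exporter, or back-end could collapse that enumeration further; the two-bucket mapping below is a hypothetical example, not part of the convention:

```python
def collapse(status_canonical_code: str) -> str:
    # Collapse the known StatusCanonicalCode enumeration into a
    # coarser ok/error split downstream.
    return "ok" if status_canonical_code == "OK" else "error"

collapse("OK")                 # -> "ok"
collapse("DEADLINE_EXCEEDED")  # -> "error"
```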

Contributor

Agreed ^^^.

@jmacd (Contributor) commented Jul 17, 2020

@bogdandrutu please merge?

Recordings with the duration instrument must include a `status` label. The value of this
label must be a valid Span [`StatusCanonicalCode`](../../trace/api.md#statuscanonicalcode).

#### Labels
Contributor

It seems like we need to record the timing, status, and attributes of a Span to support the metrics derived from spans. Does this mean that the unsampled span becomes just a non-exported Span, but otherwise stops being a no-op structure? This might add quite a bit of overhead in the unsampled case, but it does also make sense to me to use Span as a fallback generic structure for a request across all instrumentation. I also wonder about libraries that natively record metric information in their own representation of a request, for example:

https://javadoc.io/doc/com.linecorp.armeria/armeria-javadoc/latest/com/linecorp/armeria/common/logging/RequestLog.html

For these, would we recommend unsampled spans to continue to be pure no-op and use this native functionality for metrics?

Contributor

You are raising good points. To export metrics from spans requires a real span object that can remember attributes from the start and wait for the status and duration to be known. If there is a good (future) way to encode metrics in spans, we might choose that instead. I wonder if the Span Sampler API could be used to return a reduced-size Span, which would act like a no-op span but store enough information to report a duration metric.
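A sketch of what such a reduced-size span might look like, entirely hypothetical (nothing like this exists in the API today):

```python
import time

class DurationOnlySpan:
    """Acts like a no-op span for export purposes, but keeps just enough
    state to report a duration metric when it ends."""

    def __init__(self, recorder, labels):
        self._recorder = recorder  # a ValueRecorder-like instrument
        self._labels = labels      # low-cardinality labels, e.g. {"status": "OK"}
        self._start = time.monotonic()

    def end(self):
        self._recorder.record(time.monotonic() - self._start, self._labels)
```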

Contributor

> If there is a good (future) way

Ah, I didn't realize that's future work, since this doc talks a lot about Spans. But I wonder: is it just referring to the concept of a span (a window of time) and not the data model we use in tracing?

For sampling, yeah, I think it would be possible to have three types of spans: exported real span, non-exported real span, and no-op span, if we needed it.

@bogdandrutu (Member) left a comment

@jmacd I missed the message to merge this, but I have some questions/concerns :)


When creating duration metrics from Spans, the duration **in milliseconds** of all
_recorded_ Spans of a given category and kind should be aggregated together in a single
[`ValueRecorder`](../api.md#valuerecorder). A Span's duration should be recorded onto this instrument regardless of whether the Span is sampled.
Member

I would not add this as a requirement yet: "A Span's duration should be recorded onto this instrument regardless of whether the Span is sampled." Can we leave this to future work?

Member

The main reason is that this is a tracing API change, not a semantic convention.

Contributor

Are we sure this is an API change? I imagine this can be done with the sampler API, but I haven't done the exercise.

Member

I am curious to learn how to achieve this with the sampler API. If SamplingDecision is NOT_RECORD, then we don't start/stop the span, and hence we won't track the duration, right?

Contributor

I don't believe this spec change is about auto-generated metrics by the SDK. I believe this applies to library or manual instrumentation that wants to record timing metrics.

Contributor

Let's remove that line. I don't think it's relevant for defining the semantic conventions.

Contributor

I think this relates to my other point: if this is about metrics and not necessarily spans, we should clarify that. It's unclear whether this doc is talking about Span, the data model we use in tracing (hence these points about sampling), or just any window of time for which we want to gather metrics, with the implementation left for the future. I'm getting the sense that it's the latter, but the wording in the entire doc seems tied to tracing, not just this line. Below we have "Labels applied to these metric recordings should follow the attribute semantic conventions of the spans from which they are derived," which has the same issue as this line.

Member Author

My intention with this PR was to define semantic conventions for metrics based on Spans. That is, based on the Span object from the tracing data model.

I like this because I feel we get some simple semantic conventions for metrics sort of by default. If you're making timing metrics, and you already have spans timing those same operations, making the metrics will be easy: just follow this naming convention, use the same semantic conventions as the Span did (adjusting for cardinality), and you're good to go.

Perhaps this is causing more confusion than it's worth. I'm happy to reframe it to be much more independent of our Span semantic conventions, but my concern is that there would be a lot of duplication between our semantic conventions for duration metrics and for spans, and that they would diverge over time.
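For illustration, "adjusting for cardinality" could mean copying only an allow-listed subset of the span's attributes onto the metric; the attribute keys below come from the HTTP span conventions, while the allow-list itself is a hypothetical choice:

```python
span_attributes = {
    "http.method": "GET",
    "http.url": "https://example.com/users/42",  # high cardinality: drop it
    "http.status_code": 200,
}

# Keep only attributes that won't explode the metric's label space.
LOW_CARDINALITY_KEYS = {"http.method", "http.status_code"}
metric_labels = {k: v for k, v in span_attributes.items() if k in LOW_CARDINALITY_KEYS}
```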

Member Author

All that said, I support John's suggestion of just removing this line:

> A Span's duration should be recorded onto this instrument regardless of whether the Span is sampled.

Contributor

I agree with wanting to reuse the semantic conventions defined for spans; I made a suggestion to try to decouple it from the implementation of tracing (particularly sampling), at least for now.

Comment on lines 25 to 32
For Spans that follow one of the common semantic-conventional _areas_, the category
should be the label prefix used in that semantic convention.

For example:

* `http.server.duration`
* `http.client.duration`
* `db.client.duration`
Member

It is a bit hard to implement with this requirement of having a different metric name per category. Can we make {category} a label?

Contributor

I don't understand. We should be able to generate a metric name from a span name; that's the goal, at least.

Member Author

I was trying to follow the guidance from OTEP 108:

> Associated metrics SHOULD be nested together in a hierarchy based on their usage.

> "As a rule of thumb, aggregations over all the dimensions of a given metric SHOULD be meaningful," as Prometheus recommends.

I don't believe that an aggregation of HTTP client duration and database client duration would be meaningful, so I've suggested that we separate them with this top-level category. I certainly may be wrong; I may not have thought of all use cases.

Member

@jmacd the problem that I see is that {category} is unbounded, so we will end up with a lot of instruments. I'm not saying it is impossible, just harder than a label.

But if you all feel strongly about it, we can keep it this way.

Member Author

Perhaps it makes more sense to make the categories bounded? We could include a list of known categories in this semantic convention. Each category likely has its own semantic conventions anyway; we'll be adding those over a series of upcoming PRs.
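For illustration, the convention could enumerate the categories it knows about; the set below is hypothetical, drawn from the namespaces already mentioned in this thread:

```python
# Hypothetical bounded list of metric categories.
KNOWN_CATEGORIES = {"http", "db", "messaging", "rpc"}

def duration_metric_name(category: str, kind: str) -> str:
    if category not in KNOWN_CATEGORIES:
        raise ValueError(f"unknown metric category: {category}")
    return f"{category}.{kind}.duration"  # e.g. "http.server.duration"
```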


### Convention

When creating duration metrics from Spans, the duration **in milliseconds** of all _recorded_ Spans of a given category and kind should be aggregated together in a single [`ValueRecorder`](../api.md#valuerecorder).
Member

Can you clarify that this is not a requirement to be implemented by all the libraries? I would suggest starting by implementing this in the collector, from the exported spans, and learning from there.

Contributor

The goal of this PR is to have a semantic convention that allows the collector to do this.

Contributor

Really? I don't believe that's the goal at all. Where did you get this from?

Contributor

I'm trying to say this is just a semantic convention. We should be able to use it for creating metrics from non-sampled spans, but that's an implementation question, not a semantic question. It's the implementation and sampler API question that I think is out of scope here.

Member

> The goal of this PR is to have a semantic convention that allows the collector to do this.

I am confused by this statement. The 'justification' section in this PR talks about inaccuracies when aggregating downstream (in the collector), so this must be about getting metrics before sampling?

> Because these Spans may be sampled, aggregating information about them downstream is subject to inaccuracies. Computing count, duration, and error rate from all Spans (including sampled=true and sampled=false) results in more accurate aggregates.

@bogdandrutu (Member) left a comment

I would like to clarify the goal of this PR, as people seem to be confused by it, and to make sure we don't misunderstand the action items: where this should be implemented, who should implement it, etc.


#### Labels

Labels applied to these metric recordings should follow the attribute semantic conventions of the spans from which they are derived.
Contributor

Could this be written like this?

> Labels applied to these metric recordings should follow the attribute semantic conventions of the corresponding span for the operation. In other words, the labels should match as best as possible the attributes the Span for the operation would have if it had been sampled.

I think this along with the duration tweak suggested above decouples this PR from the sampling aspect.

Member Author

I really like the "corresponding Span for the operation" wording. I think I'll leave off the second sentence.
It's important that this guidance be followed by the writers of the category-specific semantic conventions, and I think it'll be clear enough for them.




### Convention

When creating duration metrics from Spans, the duration **in milliseconds** of all _recorded_ Spans of a given category and kind should be aggregated together in a single [`ValueRecorder`](../api.md#valuerecorder).
Member

Is there a reason to use milliseconds over base units of seconds? For open-telemetry/oteps#119, I've gone with base units, fwiw.

Member Author

The reason is that I hadn't read OTEP 119. 😂
Fixed in c92d8f5.
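With base units, a 12.3 ms measurement is recorded as a fractional number of seconds, e.g. (reusing the hypothetical recorder from the earlier sketch):

```python
# 12.3 ms expressed in base units (seconds), per OTEP 119.
duration.record(0.0123, {"status": "OK"})
```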

@ebullient

I have to be missing something. Is the goal of OT metrics to support collection of aggregated metrics only within the context of active/traced/sampled spans? I can't reconcile the metrics section of this API with some of the detailed application/framework-level metrics that are more often collected. If this spec is only intended for auto-instrumentation agents, it still conflicts with other easy-to-enable app-framework-level metrics.

While I think standardizing the format of metrics for HTTP server requests or HTTP client requests has value, I am having trouble with the language used, because it seems inextricably linked to tracing spans, and that just doesn't square with broader SRE practices. If the OT metrics API is specifically about gathering additional metrics for sampled spans, then we don't have an issue.

Use seconds to record duration, rather than milliseconds
@justinfoote (Member Author)
I am closing this PR in favor of my OTEP PR and @gfuller1's HTTP metrics PR.
I expect that we'll see more PRs to follow creating semantic conventions for databases, grpc, etc...

Labels
- `area:semantic-conventions`: Related to semantic conventions
- `release:required-for-ga`: Must be resolved before GA release, or nice to have before GA
- `spec:metrics`: Related to the specification/metrics directory
Development

Successfully merging this pull request may close these issues.

Define semantics for golden metrics: throughput, duration, error rate