Clarification needed for ".utilization" metrics convention #819

Closed
jmacd opened this issue Aug 17, 2020 · 5 comments
Labels
spec:metrics Related to the specification/metrics directory

Comments

@jmacd
Contributor

jmacd commented Aug 17, 2020

What are you trying to achieve?

OTEP #119 specified a convention for metrics ending in ".utilization":
open-telemetry/oteps#119

It's not clear how to implement this in some cases, so clarification may be needed. For a metric such as process.cpu.time, which is emitted as a cumulative value (e.g., from a SumObserver), we'll naturally be able to compute a cumulative utilization score, i.e., the total CPU time used divided by the total elapsed time. This number, the lifetime utilization, may not be very useful. It would perhaps be more useful expressed with "Interval" temporality. A ".utilization" metric for cumulative time has the same problem as Summary data points: it is rarely useful in cumulative form. Moreover, it can be derived in a backend.

Should we drop ".utilization" metrics for CPU usage? Should we specify they be conveyed as Interval summaries (i.e., Difference in cumulative usage divided by difference in time)? (@aabmass)
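
To make the two readings concrete, here is a minimal sketch in plain Python. It does not use any OpenTelemetry API; the function names and the (timestamp, cumulative CPU seconds) tuple layout are assumptions made only for illustration.

```python
def lifetime_utilization(cpu_seconds_total, elapsed_seconds_total):
    """Cumulative reading: total CPU time used divided by total elapsed time."""
    if elapsed_seconds_total <= 0:
        raise ValueError("elapsed time must be positive")
    return cpu_seconds_total / elapsed_seconds_total


def interval_utilization(prev, curr):
    """Interval reading: difference in cumulative usage divided by difference in time.

    prev and curr are hypothetical (timestamp_seconds, cumulative_cpu_seconds) pairs.
    """
    dt = curr[0] - prev[0]
    if dt <= 0:
        raise ValueError("timestamps must be strictly increasing")
    return (curr[1] - prev[1]) / dt


# A process alive for 1000 s that used 100 s of CPU overall,
# 9 s of which fell within the last 10 s.
lifetime = lifetime_utilization(100.0, 1000.0)
recent = interval_utilization((990.0, 91.0), (1000.0, 100.0))
```

With these illustrative numbers the lifetime figure stays low even though the most recent interval is almost fully busy, which is why the cumulative form is rarely the interesting one.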

@jmacd jmacd added the spec:metrics Related to the specification/metrics directory label Aug 17, 2020
@jmacd
Contributor Author

jmacd commented Aug 18, 2020

@bogdandrutu and @open-telemetry/specs-metrics-approvers regarding open-telemetry/opentelemetry-proto#199

@MrAlias
Contributor

MrAlias commented Aug 18, 2020

> Should we specify they be conveyed as Interval summaries (i.e., Difference in cumulative usage divided by difference in time)?

That would make sense to me.

@jmacd
Contributor Author

jmacd commented Aug 18, 2020

To me, there is still a minor concern. We have argued that SumObserver and UpDownSumObserver should accept cumulative inputs so that they can remain stateless. Observer callbacks do not need to know the last time they were called or remember the last value.

Having instrumentation compute CPU utilization from an Observer callback breaks this rule. The callback has to remember the last timestamp it reported and the last value it recorded in order to output the current interval's utilization.

The final destination of a *.cpu.time metric can also just compute *.cpu.utilization itself, since it has presumably accumulated a series of measurements. Maybe we can specify that utilization metrics can be generated by a stateful Observer callback or can be generated by a stateful receiver downstream, to leave the possibilities open.
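
As a rough sketch of the downstream option, a receiver that has accumulated a series of cumulative *.cpu.time points could derive per-interval utilization as below. This is plain Python under stated assumptions: the `Point` layout and the `utilization_series` name are hypothetical, and skipping intervals where the counter went backwards is one possible policy, not something the spec prescribes.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical point layout: (timestamp_seconds, cumulative_cpu_seconds)
Point = Tuple[float, float]


def utilization_series(points: Iterable[Point]) -> Iterator[Tuple[float, float]]:
    """Yield (timestamp, utilization) for each pair of consecutive points."""
    prev = None
    for ts, cpu in points:
        if prev is not None:
            dt = ts - prev[0]
            dcpu = cpu - prev[1]
            # Skip out-of-order points and counter resets (assumed policy).
            if dt > 0 and dcpu >= 0:
                yield ts, dcpu / dt
        prev = (ts, cpu)


series = list(utilization_series([(0.0, 0.0), (10.0, 5.0), (20.0, 7.0)]))
```

The callback that produced the points stays stateless here; all of the state lives in the receiver.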

@aabmass
Member

aabmass commented Aug 18, 2020

> Should we drop ".utilization" metrics for CPU usage?

I think it would be great to keep it, but as you mentioned it could be added back in. Which way are you leaning @jmacd?

> It would be perhaps more useful expressed as "Interval" temporality.
> ...
> Should we specify they be conveyed as Interval summaries (i.e., Difference in cumulative usage divided by difference in time)?

Are we talking about the SDK instrument to use or the OTLP temporality? My takeaway from today's (Tuesday) meeting was that using a stateful ValueObserver (where the last value and call time are saved by the callback from its previous call) would be the easiest way to implement this with the SDK. This would send an OTLP gauge, which seems OK to me. Then once we have views, it would be best to calculate this from the system.cpu.time SumObserver directly.

This does seem like a common use case though; it's called out in the Metrics API spec a few times: "monotonic instruments are useful for monitoring rate information." Does "calculating" here mean with a view or in the backend? Someone also mentioned OTEP 88 had a proposal for this interval/delta, but no concrete use cases. Would something like request rate not be an equivalent synchronous example of this (# requests in interval / time delta)?

@jmacd
Contributor Author

jmacd commented Aug 19, 2020

This was discussed in the 8/18 Metrics SIG (OTLP) meeting. We agreed to address this in the short term by using stateful Observer callbacks that track both their last CPU time measurement and their last timestamp.
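
A minimal sketch of such a stateful callback follows, in plain Python. The class name, the injectable `read_cpu_seconds` and `clock` parameters, and the choice to return `None` when no time has elapsed are all assumptions for illustration; how the returned value is handed to an SDK observer instrument is deliberately left out.

```python
import time


class CpuUtilizationCallback:
    """Remembers the last CPU time and last timestamp between invocations."""

    def __init__(self, read_cpu_seconds=time.process_time, clock=time.monotonic):
        self._read_cpu_seconds = read_cpu_seconds
        self._clock = clock
        # Seed the state so the first call reports the interval since construction.
        self._last_cpu = read_cpu_seconds()
        self._last_time = clock()

    def __call__(self):
        now = self._clock()
        cpu = self._read_cpu_seconds()
        dt = now - self._last_time
        # Utilization for this interval: CPU time elapsed over wall time elapsed.
        utilization = (cpu - self._last_cpu) / dt if dt > 0 else None
        self._last_cpu, self._last_time = cpu, now
        return utilization


# Usage with fake readings so the intervals are deterministic.
fake_cpu = iter([1.0, 1.5, 3.5])
fake_time = iter([100.0, 110.0, 120.0])
callback = CpuUtilizationCallback(lambda: next(fake_cpu), lambda: next(fake_time))
first = callback()
second = callback()
```

This is exactly the state the stateless-Observer argument tries to avoid, which is why it was framed as a short-term answer.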

A side-note was raised relevant to OTLP: If we had a way to encode deltas from observer instruments, it would be natural to do so here. OTLP actually supports this concept, but we have not standardized any form of Delta Observer, and this may be such a special case that we continue to ignore this matter. However, if we had a Delta Observer then it would be natural to encode "CPU time elapsed" measurements. We compute *.cpu.utilization for an interval as the rate of CPU time elapsed.
