Add process.uptime and system.uptime metrics to semantic conventions #2824

andrzej-stencel

Fixes open-telemetry/semantic-conventions#648

Changes

Adds new metrics: process.uptime and system.uptime to the semantic conventions.

Related issue: open-telemetry/opentelemetry-collector-contrib#14130

| `process.threads` | UpDownCounter | {threads} | Process threads count. | |
| `process.open_file_descriptors` | UpDownCounter | {count} | Number of file descriptors in use by the process. | |
| `process.context_switches` | Counter | {count} | Number of times the process has been context switched. | `type` SHOULD be one of: `involuntary`, `voluntary` |
| `process.uptime` | Counter | s | Number of seconds that the process has been running. | |

Should this be a counter or gauge?

I believe it should be a gauge, the value represents the uptime of the system at the given time of recording.
Any further aggregation or calculations hold no additional statistical meaning.

@jamesmoessis jamesmoessis Sep 26, 2022

Agreed, gauge is more appropriate

I think we should be debating whether this is a Counter or an UpDownCounter. I will put my rationale in the main thread.

@reyang reyang Sep 28, 2022

Do we agree on "what is uptime"? I suspect we are not on the same page 🤣

https://en.wikipedia.org/wiki/Uptime#Using_uptime

If, let's say, we have a Chrome browser running 10 tabs (which might give us 11 processes), do we expect the uptime to be added across the 11 processes, and what would that mean?

If the browser is closed, then reopened, what does the uptime mean?

The Linux uptime command seems to be focusing on "how long has it been since the operating system started" https://en.wikipedia.org/wiki/Uptime#Linux, and the uptime would reset if the system restarted.

If we take the semantic here https://en.wikipedia.org/wiki/Uptime#Records

A Cisco router has been reported to have been running continuously for 21 years.

then making it a counter sounds wrong?

@jamesmoessis jamesmoessis left a comment

Just a few minor comments. Thanks for raising this PR!

@@ -29,6 +30,14 @@ instruments not explicitly defined in the specification.

## Metric Instruments

### `system.` - General system metrics

A question I asked myself - would it make sense to have a namespace for system metadata like this? Perhaps system.info.*. It would mean you could group other system information together like system.info.boottime, system.info.uptime and any others. Personally I'm not sure, but it's something to think about if we are looking to add other system metadata to the semconv.

jmacd commented Sep 26, 2022

Counter is appropriate because the value is monotonic. We are explicitly interested in detecting resets via this metric, which suggests that UpDownCounter is appropriate.

Note the expression rate(uptime) is definitely meaningful and useful: it's semantically identical to (but operationally different from) the Prometheus up metric. However, if we let uptime be a Counter and the user prefers Delta aggregation temporality, the exported data (i.e., uptime in delta temporality) substantially loses utility but is (IMO) technically still correct. This suggests, again, UpDownCounter to avoid degraded utility due to Delta temporality.
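
As a rough illustration of the Delta temporality concern, here is a minimal Python sketch. The sample values and the helper name `to_delta` are hypothetical and not part of any OpenTelemetry SDK, and the first point is simply dropped for brevity. While the process stays up, every delta just equals the collection interval, so the absolute uptime is no longer visible in any single exported point.

```python
def to_delta(cumulative):
    """Convert cumulative readings to per-interval deltas (first point dropped)."""
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]


# Hypothetical uptime readings in seconds, collected every 10 s.
uptime_cumulative = [100, 110, 120, 130]

uptime_delta = to_delta(uptime_cumulative)
print(uptime_delta)  # [10, 10, 10]
```

The deltas are still correct, but a consumer would have to re-accumulate them to recover how long the process has been running.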

I think uptime should not be a Gauge because Gauge metric series do not include start timestamps, which is a key aspect of detecting overlapping series -- IMO critical if we are to derive an up-like metric from uptime.

As for statistics, the sum (thus, the rate) can meaningfully be aggregated. Consider 10 processes running at a point in time--the rate(sum(uptime)) is 10 and equals sum(up) in a Prometheus setting.

This works for spatial aggregation: the rate(uptime) usefully equals the number of processes that are up and the sum meaningfully equals their total uptime. This works with subdivided metrics: I can label uptime with an exclusive state attribute (e.g., "idle", "running", "shutdown", ...), and now the sum and the rate can be grouped by state.
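
The spatial aggregation argument can be sketched in Python as follows. The process identifiers, sample values, and the helper `summed_rate` are assumptions made for illustration only. Three processes are sampled 60 seconds apart while all of them are running.

```python
# Hypothetical cumulative uptime (seconds) for three running processes,
# sampled at t = 0 s and again at t = 60 s.
samples_t0 = {"pid-1": 500.0, "pid-2": 20.0, "pid-3": 3600.0}
samples_t60 = {"pid-1": 560.0, "pid-2": 80.0, "pid-3": 3660.0}


def summed_rate(before, after, elapsed_seconds):
    """Rate of the summed series: seconds of uptime gained per wall-clock second."""
    return (sum(after.values()) - sum(before.values())) / elapsed_seconds


total_uptime = sum(samples_t60.values())                 # combined uptime of all processes
processes_up = summed_rate(samples_t0, samples_t60, 60.0)  # one unit per running process

print(total_uptime)  # 4300.0
print(processes_up)  # 3.0
```

Each running process contributes one second of uptime per second, so the rate of the sum equals the number of processes that were up during the interval.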

This also works for temporal aggregation: if a process has been started and stopped and restarted over a period of time, we can divide rate(uptime) by the elapsed time to derive the process's fractional uptime. You can divide process.cpu.time / process.uptime to calculate average CPUs used by one or many processes.
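
A small Python sketch of the two ratios, under the assumption that "fractional uptime" means the uptime accumulated during a window divided by the window length. The function names and numbers are hypothetical.

```python
def fractional_uptime(uptime_increase_s, window_s):
    """Share of the window during which the process was running."""
    return uptime_increase_s / window_s


def average_cpus(cpu_time_s, uptime_s):
    """Average number of CPUs kept busy while the process was running."""
    return cpu_time_s / uptime_s


# Hypothetical process: up for 2700 s of a 3600 s window, using 5400 CPU-seconds.
print(fractional_uptime(2700.0, 3600.0))  # 0.75
print(average_cpus(5400.0, 2700.0))       # 2.0
```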

Note that I'm writing rate(uptime) informally. The important detail hiding here is that I want to query this like a Counter in the sense that when a process disappears and restarts, the "reset" which results in a constant offset in the timeseries does not impact the rate. Thus for most purposes, we should think of uptime as a Counter we explicitly do not reset, thus an UpDownCounter.
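
To make the reset handling concrete, here is a minimal Python sketch of a reset-aware increase, in the spirit of how Prometheus-style counter queries treat a drop in value as a restart. The function name and the sample series are hypothetical, and the sketch ignores extrapolation and timestamps.

```python
def increase_with_resets(samples):
    """Total increase of a counter-like series, treating any drop as a reset."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # After a reset the series restarts near zero, so the new value
            # itself is taken as the growth since the restart.
            total += cur
    return total


# Hypothetical uptime readings (seconds); the process restarts after the third sample.
uptime_samples = [100.0, 110.0, 120.0, 5.0, 15.0]

print(increase_with_resets(uptime_samples))    # 35.0
print(uptime_samples[-1] - uptime_samples[0])  # -85.0 (naive difference)
```

The naive difference goes negative across the restart, while the reset-aware version keeps counting only the time the process was actually running.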

@jamesmoessis

@jmacd you put a good case forward for it being an UpDownCounter, and after reading your points I think I agree.

It seems that an Asynchronous UpDownCounter would result in meaningful aggregations for these metrics, as per your examples. Note that it must be asynchronous because the absolute value is reported, and the API doesn't allow synchronous counters to report absolute values. These implementation details are not referenced in the semantic convention, but they are worth mentioning here.

jsuereth commented Oct 4, 2022

I have a few complaints, mostly based on naming.

  • up and uptime metrics can be problematic in practice. Particularly if we use UpDownCounter in Prometheus we may be creating an alerting nightmare for folks interacting with Gauge alignment issues.
  • Having a metric which captures process.running_time seems reasonable to me. However, just because the process is running doesn't mean it's "up". For that you need additional health checks, like some kind of synthetic monitoring or other "alive" signal from the process. Given the description though, I would not label this process.uptime, particularly because "Up" and "health" metrics should really be derived from raw signals, of which this appears to be one.

@andrzej-stencel andrzej-stencel force-pushed the add-process-system-uptime-metrics branch from 78e91f5 to ad4ce5e Compare October 5, 2022 09:03
@andrzej-stencel

> I have a few complaints, mostly based on naming.
>
> * `up` and `uptime` metrics can be problematic in practice. Particularly if we use UpDownCounter in Prometheus we may be creating an alerting nightmare for folks interacting with Gauge alignment issues.
> * Having a metric which captures `process.running_time` seems reasonable to me. However, just because the process is running doesn't mean it's "up". For that you need additional health checks, like some kind of synthetic monitoring or other "alive" signal from the process. Given the description though, I would _not_ label this process.uptime, particularly because "Up" and "health" metrics should really be derived from raw signals, of which this appears to be one.

I see your point. I agree with respect to the up metric, but I'm not sure about uptime. How about system.uptime - does this sound OK? The uptime name is commonly used to describe the time a system has been running, regardless of whether it was responsive or not.

Would your preference Josh be to name the system metric system.uptime and the process metric process.running_time? Or name both running_time?

reyang commented Oct 5, 2022

> I see your point. I agree with respect to the up metric, but I'm not sure about uptime. How about system.uptime - does this sound OK? The uptime name is commonly used to describe the time a system has been running, regardless of whether it was responsive or not.
>
> Would your preference Josh be to name the system metric system.uptime and the process metric process.running_time? Or name both running_time?

I think there are three things we should clearly distinguish.

Imagine a server started at 10AM, ran till 11AM, then was put to sleep/hibernate, woke up at 1PM and ran till it was shut down at 2PM. Then the server started again at 3PM and has run till now (4PM).

I guess the uptime would be 1 hour, the "runaway time" would be 5 hours, and the "running time" would be 3 hours. It is totally fine if we want to use different terms for these concepts; I just want to point out that these are different things, and so far from this PR it's hard to understand which exactly we want.
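
The three quantities in this example can be worked out with a short Python sketch. The `Boot` class and the function names are hypothetical and only serve to make the arithmetic explicit; times are expressed as hours of the day.

```python
from dataclasses import dataclass, field


@dataclass
class Boot:
    start: float  # hour the system booted
    end: float    # hour it shut down, or "now" for the current boot
    suspended: list = field(default_factory=list)  # (from, to) suspend intervals


# The timeline from the example: 10AM boot, asleep 11AM-1PM, shutdown at 2PM,
# second boot at 3PM, still running at 4PM.
boots = [
    Boot(start=10, end=14, suspended=[(11, 13)]),
    Boot(start=15, end=16),
]


def time_since_last_boot(boots):
    """The 'uptime' reading: time elapsed since the most recent boot."""
    return boots[-1].end - boots[-1].start


def total_booted_time(boots):
    """The 'runaway time' reading: all time between boot and shutdown, suspend included."""
    return sum(b.end - b.start for b in boots)


def total_awake_time(boots):
    """The 'running time' reading: booted time minus time spent suspended."""
    return sum((b.end - b.start) - sum(t - f for f, t in b.suspended) for b in boots)


print(time_since_last_boot(boots))  # 1
print(total_booted_time(boots))     # 5
print(total_awake_time(boots))      # 3
```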

andrzej-stencel commented Oct 6, 2022

> Imagine a server started at 10AM, ran till 11AM, then was put to sleep/hibernate, woke up at 1PM and ran till it was shut down at 2PM. Then the server started again at 3PM and has run till now (4PM).
>
> I guess the uptime would be 1 hour, the "runaway time" would be 5 hours, and the "running time" would be 3 hours.

Thanks Reiley for putting it out clearly. Yes, you could say these are three different pieces of information.

Where this pull request started is this proposal in contrib repo: open-telemetry/opentelemetry-collector-contrib#14130. It talks specifically about system.uptime (and not process.uptime) in the context of existing functionality in other metrics collectors. Collectd/SignalFX has the uptime plugin, Telegraf has the system input, both report the system uptime (at least for Linux) as defined in /proc/uptime, i.e. the uptime of the system (including time spent in suspend), which I suppose in the example above would be 5 hours?
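
For reference, reading the value from /proc/uptime on Linux could look like the following Python sketch. The function names are hypothetical, and this is not taken from the Collectd or Telegraf implementations. The file holds two numbers: seconds since boot and the summed idle time of all CPUs.

```python
def parse_proc_uptime(text):
    """Return (uptime_seconds, idle_seconds) from the contents of /proc/uptime."""
    uptime_s, idle_s = (float(value) for value in text.split()[:2])
    return uptime_s, idle_s


def read_system_uptime(path="/proc/uptime"):
    """Read the system uptime in seconds (Linux only)."""
    with open(path) as f:
        return parse_proc_uptime(f.read())[0]


# Parsing a hypothetical sample of the file contents:
print(parse_proc_uptime("350735.47 234388.90\n"))  # (350735.47, 234388.9)
```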

reyang commented Oct 6, 2022

> > Imagine a server started at 10AM, ran till 11AM, then was put to sleep/hibernate, woke up at 1PM and ran till it was shut down at 2PM. Then the server started again at 3PM and has run till now (4PM).
> > I guess the uptime would be 1 hour, the "runaway time" would be 5 hours, and the "running time" would be 3 hours.
>
> Thanks Reiley for putting it out clearly. Yes, you could say these are three different pieces of information.
>
> Where this pull request started is this proposal in contrib repo: open-telemetry/opentelemetry-collector-contrib#14130. It talks specifically about system.uptime (and not process.uptime) in the context of existing functionality in other metrics collectors. Collectd/SignalFX has the uptime plugin, Telegraf has the system input, both report the system uptime (at least for Linux) as defined in /proc/uptime, i.e. the uptime of the system (including time spent in suspend), which I suppose in the example above would be 5 hours?

@astencel-sumo thanks! I think we should clarify it in the spec to avoid confusion and misinterpretation. Wikipedia seems to have a different explanation, e.g. https://en.wikipedia.org/wiki/Uptime#Determining_system_uptime

image

If it is 5 hours, it looks like a good fit for Counter; if it is 1 hour, it looks like a good fit for Gauge.

@andrzej-stencel andrzej-stencel force-pushed the add-process-system-uptime-metrics branch from ad4ce5e to c4bfb93 Compare October 13, 2022 13:09

@reyang reyang left a comment

The semantics and the wording "has been running" are murky. I think we should be crystal clear about #2824 (comment).

@github-actions

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Oct 27, 2022
@andrzej-stencel andrzej-stencel force-pushed the add-process-system-uptime-metrics branch from c4bfb93 to 97994e9 Compare October 31, 2022 14:05
@andrzej-stencel

I don't currently have an answer on how to resolve this. Given that this PR does not block my work and that I have other more important PRs that I want to focus on, I'm going to close this PR. For anybody interested in continuing this work, feel free to use this work in any way.

Successfully merging this pull request may close these issues.

Add system uptime metric