Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Span Status Improvements (without errors) #133

Closed
wants to merge 6 commits into from

Conversation

mwear
Copy link
Member

@mwear mwear commented Aug 18, 2020

We recently discussed OTEP #123 at the error working group and agreed it introduced some good ideas, but they were coupled with error reporting. Exception data was handled by open-telemetry/opentelemetry-specification#697 and the successHint inspired this error.hint PR: open-telemetry/opentelemetry-specification#822.

There are currently proposals to remove Span.status (see: open-telemetry/opentelemetry-specification#706). If we decide to keep it, this adaption of OTEP-123 could be an improvement worth considering.

See the original OTEP for commentary and rationale.

@mwear mwear requested a review from a team August 18, 2020 02:39
"gRPC", "HRESULT", "POSIX". The list is end-user-extensible but common status
type names should be standardized.
(Perhaps there is already some standardization we could borrow?)
2. Code - An integer status code. Can be combined with an status message. Either
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Integer is insufficient. E.g. POSIX codes are symbolic names (like ENOENT) which do not have standardized integer values (e.g. ENOENT might have a different value on Linux than on BSD)

2. Code - An integer status code. Can be combined with an status message. Either
a code or message are required.
3. Message - A string status message. Can be combined with a status code. Either
a code or message are required.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
a code or message are required.
a code or message is required.

Comment on lines 148 to 149
freeform message or a textual status code. Adding a 2nd string to the
StatusData would allow both to be collected side by side. This is another example of
Copy link
Member

@arminru arminru Aug 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above (#133 (comment)).

- [#432](https://github.com/open-telemetry/opentelemetry-specification/pull/432)
- [#521](https://github.com/open-telemetry/opentelemetry-specification/pull/521)
- [#599](https://github.com/open-telemetry/opentelemetry-specification/issues/599)
- [open-telemetry/oteps#123](https://github.com/open-telemetry/oteps/pull/123)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- [open-telemetry/oteps#123](https://github.com/open-telemetry/oteps/pull/123)
- [open-telemetry/oteps#123](https://github.com/open-telemetry/oteps/pull/123) on which this OTEP is based
- [open-telemetry/opentelemetry-specification#697](https://github.com/open-telemetry/opentelemetry-specification/pull/697) for reporting exceptions

Comment on lines 107 to 111
Although UI creators are free experiment with how the data is presented I
expect most presentations would either be the StatusData alone, or the
StatusData qualified with the StatusType and some separator character. For
example StatusData alone might create names like "FileNotFoundException",
"503", "E_FAIL (0x80004005)", "SyntaxError on line 405: Did you forget a
semicolon?", and "BadQuery".

Exceptions could have progressive level of detail drilling into messages,
stack traces, inner exceptions, links to source, etc if the exporter serialized
sufficient data but how and whether that occurs is out-of-scope in this design.
example StatusData alone might create names like "503", "E_FAIL (0x80004005)",
"Status Code 12: Unimplemented".
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be updated since StatusData was renamed to Code and StatusType to Domain above and Message was introduced as a separate parameter.

@tigrannajaryan
Copy link
Member

There are currently proposals to remove Span.status (see: open-telemetry/opentelemetry-specification#706). If we decide to keep it, this adaption of OTEP-123 could be an improvement worth considering.

What is the Error Working Group position on this? Do we remove Status? Do we keep it but improve? Do we have 2 opposing proposals because Error Working Group was not able to come to a concensus on this?

I would like to have more clarity on this otherwise I am not sure which of the opposing proposals to actively pursue. The OTEP to remove Status is also submitted: #134

Just to be clear: I am happy if the goal is to discuss 2 proposals in parallel and chose a winning one, just want to understand if that is the intent here.

@iNikem
Copy link
Contributor

iNikem commented Aug 19, 2020

We had much more agreement on removing current status code, than on its replacement. So we want to remove Google RPC code bases status before GA, that is what my otep is about. And separately we discuss what should we put instead of it. Either before GA or after. And for that there are several options. One is this PR. Another is open-telemetry/opentelemetry-specification#822

@tigrannajaryan
Copy link
Member

I think we need to require having a reasonable replacement for conveying erroneous state in the Span. I believe this is needed before GA. Ideally we will first know what is this replacement before removing the Status. I am also happy if the answer is that we don't need any replacement, but I would like that analysis to also be done before removal of the Status.

Copy link
Contributor

@iNikem iNikem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This proposal definitely solves some of the problems of the current Span.Status. But it leaves one (probably the most) important question open: how to quickly determine if span represents potential failure?

a code or message are required.

Although UI creators are free experiment with how the data is presented I
expect most presentations would either be the StatusData alone, or the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StatusData is not defined up to this point. What is it?

If all the data collected by the API is transmitted to the back-end this also
increases the size of transmitted telemetry, but the exporter authors always
retain the freedom to drop or reduce any information they don't believe is
valuable with potentially a slight
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cut sentence?


## Motivation

Right now OpenTelemetry Status is defined as an enumeration of gRPC status
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not gRPC :)) I think I keep saying this. They are Google's codes which are used in gRPC but they are used in http APIs as well that Google expose.

Comment on lines +98 to +101
1. Domain - The name of the domain the status data applies to such as "HTTP",
"gRPC", "HRESULT", "POSIX". The list is end-user-extensible but common status
type names should be standardized.
(Perhaps there is already some standardization we could borrow?)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Google did the same thing, and at the end they decided to remove this and support only what they called "canonical status" because custom translation were hard to keep track of.

One thing that was very important was the notion of "IsOk" on the Status itself, every domain has it's own codes, but in a lot of time was important to know if it is ok or not :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the Google canonical codes are a good idea, and am generally favorable towards Span.status as it currently exists. The main issue is that we don't have guidance on how to map from different domains onto the google canonical space. If we could provide specification that explains how to translate between domains, I think it'd be easier to make the argument that we should keep Span.status as is.

@iNikem
Copy link
Contributor

iNikem commented Sep 17, 2020

@mwear Should this be closed in favor of #136 ?

Base automatically changed from master to main January 27, 2021 20:37
@tedsuo tedsuo added the triaged label Feb 6, 2023
@tedsuo
Copy link
Contributor

tedsuo commented Feb 6, 2023

@mwear please reopen if you change your mind about this

@tedsuo tedsuo closed this Feb 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants