Remove Span.Status from API #706

iNikem · 2020-07-16T08:21:28Z

It was proposed in open-telemetry/oteps#123 (comment) and subsequent discussion that we should drop Span.Status from the spec. If there is an agreement on that, I can create a PR.

The text was updated successfully, but these errors were encountered:

anuraaga · 2020-07-16T08:25:43Z

There have been many PRs related to errors so would help if you could summarize the current state of what we have if Status is removed. For example, is there a generic way of knowing whether a span is a success or failure? From what I can tell, presence of exception is that, but I'm not sure

Oberon00 · 2020-07-16T08:26:37Z

Other affected areas would be:

Semantic conventions: Most will get shorter, but the gRPC one will then need a definition how to transport the status code.
Protocol: Since the trace area is declared stable (beta?), these cannot be immediately removed but should probably be marked deprecated.
Of course all the language SIGs are affected too.

Oberon00 · 2020-07-16T08:29:33Z

@anuraaga There is no generic way to detect if a span has failed. The status is also not really suitable for this, except if you also consider things such as HTTP 404 as errors.

Still, I think a generic error=true or error=name_of_attribute_with_detailed_info (e.g. error=http.status_code) could be introduced before removing the status iff we can find agreement on such a thing.

Exceptions are not the same as errors. Often a span with at least one exception will be failed, but all of them could be handled too. Also, errors can occur without exceptions.

iNikem · 2020-07-16T08:30:52Z

Presence of exception does not necessarily mean span was a failure. This was discussed in #697. And I have just discovered that spec does not have any error=true attribute, so something should be proposed to replace status... Not as easy as I thought :)

iNikem · 2020-07-16T08:32:17Z

The simplest replacement would indeed be generic error=<boolean> field or attribute.

Oberon00 · 2020-07-16T08:36:11Z

Could we create a separate issue for discussing how to indicate errors generically, once status is removed? Or maybe #599 is the place to discuss this.

iNikem · 2020-07-16T08:45:45Z

I am afraid we cannot just remove status without offering any way to signal that this span signifies an error. The simplest way forward, assuming that we have #697 in its current form, is the following:

Introduce simple boolean attribute/field error
Remove status
Continue a discussion in Add proper error reporting before OpenTelemetry goes GA #599 if need something more

iNikem · 2020-07-16T17:43:54Z

During todays Errors WG meeting it was agreed:

To remove Span.status field from API and spec
To mark Span.status as deprecated in OTLP
To introduce new boolean field "error" to spec, API and OTLP
To provide a migration advice to existing customers from status to error

As a separate step to start discussion about adding severity notion to Span reusing log severity concept

iNikem · 2020-07-16T17:46:55Z

A discussion about severity on Span events is happening in #698

kbrockhoff · 2020-07-16T17:59:03Z

Just a top level boolean failed field should be enough to identify the Span as having an error should be sufficient as a quick check. If this is true then the span attributes can be checked for further info. The other change that needs to be made is to add a gRPC semantic convention to report gRPC status code since this will now not be in status.

justinfoote · 2020-07-17T18:04:27Z

Collapsing the span status field into a single boolean seems problematic to me. The main problem really lies in the definition of "error" states. If the instrumentation is responsible for adding an error boolean to spans (and, presumably, a matching label to metrics), then it must be aware of what the current user considers an error condition. This implies some configuration API that something uses to set the desired error conditions, and another API that instrumentation must use to configure itself.

The instrumentation should apply as much information it can to the spans and metrics it creates; it's easier to reduce granularity of error conditions downstream than it is to increase it.
In my opinion, this complexity belongs in an exporter, or in a metric view, or in a vendor's back-end.

Oberon00 · 2020-07-17T18:18:43Z

Collapsing the span status field into a single boolean seems problematic to me.

Rather than collapsing, I think we are exploding it: Instead of a single canonical status we will have:

http.status_code (already exists)
exception events (just merged; handled/unhandled distinction will hopefully be added)
probably rpc.grpc.status_code
Maybe status.hresult , status.errno, status.win32_last_error, ...

Collapsing the span status field into a single boolean seems problematic to me.

Collapsing every conceivable error in a fixed StatusCode was already quite bad. The error boolean is obviously not gonna cut it alone: It's just meant as a supplement, for easier detection, it does not really add that much information in itself.

(Can I take this as an opportunity to plug the idea of setting the value of error not to true, but to the name of the attribute with more information?)

iNikem · 2020-07-17T18:41:27Z

(Can I take this as an opportunity to plug the idea of setting the value of error not to true, but to the name of the attribute with more information?)

No :) One of the benefits of having simple boolean field is ease of reading it. Without the need to parse attributes.

We can, should and will add semantic conventions/more fields with more information about the error state of the given span. But without sacrificing easy and fast way for collector/backend/exporter to get very meaningful information from the span.

Oberon00 · 2020-07-18T09:31:18Z

Isn't that independent? If the error field/attribute is non-empty, that is equivalent to true. Only if you actually want to access the additional information, then you need to parse the error.

iNikem · 2020-07-19T11:17:20Z

Why do you want that name to be in error field? Why should it be dynamic at all? If you want to have a separate attribute on the Span, containing some detailed information about some error condition (independent from exception events), then we can add a semantic convention for that with a fixed name. And error can remain a simple, efficient, boolean.

What benefit will bring such indirect reference via error field as opposed to a separate semantic attribute?

Oberon00 · 2020-07-19T12:07:59Z

It will bring no benefit except for conserving a field name. Using a separate attribute like error.seealso sounds good too.

Oberon00 · 2020-07-20T13:07:55Z

Why should it be dynamic at all? If you want to have a separate attribute on the Span, containing some detailed information about some error condition (independent from exception events), then we can add a semantic convention for that with a fixed name.

Maybe now I understand where we are talking past each other: I think that we already have different names for error detail attributes, and should continue to have and add even more different names. For example http.status_code is one such attribute. It has well-defined semantics beyond just being "detailed error info". Still it would be cool if your backend's UI could not only add an ❌ to the span but could put "(http.status_code=500) next to it, even without knowing about the http.status_code semantic convention specifically.

iNikem · 2020-07-20T13:40:45Z

A! Now I get it! :) I think this is overcomplicating, but I will not object it. If nobody comment on this question here, I will raise this question during Error WG this Thursday.

Oberon00 · 2020-07-20T14:11:08Z

I'm happy to find a simpler way. But some way to classify errors even without knowing a particular semantic convention the span falls into would be good (although IMHO not necessarily required for GA). Ideally without forcing instrumentations to translate their natural error conditions into some enum. 😃

justinfoote · 2020-07-20T15:00:06Z

Still, my main concern is this:

The main problem really lies in the definition of "error" states. If the instrumentation is responsible for adding an error boolean to spans (and, presumably, a matching label to metrics), then it must be aware of what the current user considers an error condition.

Making the instrumentation responsible for knowing what constitutes an error feels like a big can of worms to me. I'd much prefer to see instrumentation add non-opinionated data to spans and metrics, and for views, exporters, and back-ends to apply their own logic to interpret that data.

Oberon00 · 2020-07-20T15:03:07Z

Hmm, you are right that this error.seealso/whatever attribute could be at most an optional hint. So maybe it is not that useful after all.

iNikem · 2020-07-20T15:27:44Z

@justinfoote Do you oppose adding error field or removing Span.Status? I understand your concern about boolean field. It is indeed may be not that easy for auto-instrumentations to decide. But setting current status is not straightforward as well.

justinfoote · 2020-07-20T15:34:41Z

I oppose both. 😄

I like Span.Status because it is not opinionated. It describes the state of the span in a relatively low-cardinality field, with logic that is not configurable. It doesn't make any assumptions about what the user cares about, it just tries to reflect reality in a field with a set enumeration of values.

I also dislike the error boolean for the reasons I've stated.
The current generation New Relic agents use an error boolean, and they must perform a handshake with the backend on startup to resolve error condition settings -- which http status codes represent an error, which exception classes should be ignored, etc... I'd prefer not to see this complexity reproduced in the OpenTelemetry agents.

Oberon00 · 2020-07-20T15:46:00Z

The current generation New Relic agents use an error boolean, and they must perform a handshake with the backend on startup to resolve error condition settings -- which http status codes represent an error

I don't think this is the direction we plan to go with our error. It simply should be a hint from the instrumentation when it thinks that something is likely an error. I think error=true should simply be equivalent to today's status != OK && status != UNKNOWN. EDIT: The backend could override that decision based e.g. on the http.status_code attribute if it knows about that.

mwear · 2020-07-20T20:10:59Z

At the spec SIG on Tuesday I mentioned three components of error handling that need to be addressed.

Semantic conventions for recording exception data (covered by Exception reporting specification #697)
Error identification (a flag on a span to indicate an error condition)
Error suppression (ways to ignore an error for an application)

We have discussed at length that there is not universal agreement on what constitutes an error for an application. I would propose that we will never reach agreement here. However, I think we can agree that if an error does happen, you want to know about it, and there should be a way to indicate this on a span. Annotating whether or not as span has had an error is a requirement, even if we can't agree on what an error is. This is what I'm calling error identification.

For the second part of this issue, we do need a way for users to decide that something that has been identified as an error, should not be considered an error for them. This is what I'm referring to as error suppression. We should give some thought into how we can facilitate this requirement from the SDK side, as well as what approaches could be used for backends.

justinfoote · 2020-07-20T21:51:04Z

Matt, I really like this framing, and I agree that these three components of error handling need to be addressed.

The way I read this debate is something like this:

Is error identification the responsibility of the instrumentation, or of something downstream.
Same question about error suppression.

I'm arguing that they belong downstream.

Also, I believe that error identification and error suppression could be defined as a single concern. There's no reason to identify errors upstream (that is: in instrumentation) only to suppress those errors downstream (possibly in an exporter or in a vendor's back-end).

justinfoote · 2020-07-21T00:32:44Z

I may be talking myself out of the canonical status code. Perhaps the best design is for instrumentation to add full details of exceptions and/or status codes, without passing judgement about whether that data qualifies as an error. Any interpretation of that data (that is: flagging errors with a boolean or attempting to categorize status into an enum) belongs downstream.

Exactly where downstream is something to discuss, I believe. For metrics, I think this belongs in a MetricView sort of object. For spans, I'm not sure. In a SpanProcessor? In an exporter?

mwear · 2020-07-21T00:37:59Z

The main complication, and the core reason that error suppression needs to exist is due to auto-instrumentation. Without auto-instrumentation, an application developer would not set error=true on a span if they didn't consider something to be an error condition and it would be a non-issue.

Instrumentation is where errors are going to encountered and it will to make the best decision it can with the information it has at the time. Often this will be catch an exception, record it, then re-throw. I don't think you can push this downstream because the instrumentation is where you will have a handle on the error and a chance to process it. I would also posit that in most cases, auto-instrumentation makes the right decision for errors. There are edge cases, and matters of interpretation where a user may want override that decision.

Given that auto-instrumentation is the likely generator of false positives (error=true tags or equivalent), I could see a few ways to manage this. It could be handled on the tracing backend completely. Each span has an operation name, and information identifying the instrumentation library that generated it. It would be possible to filter out errors based on user defined rules on the tracing backend.

Another possibility, is that an SDK could choose to implement something such as per instrumentation library exception handlers. Custom code be registered that could mark an exception to be recorded, or ignored on a per component basis.

The per-component exception handlers might be a way to combine identification and suppression into a single concern, but I'm not sure it's completely necessary, and a backend solution might suffice. I'd be interested in any other proposals on how this can handled.

johnbley · 2020-07-21T15:47:01Z

I'm going to weigh in that I don't like the flavor of the current status and greatly prefer the processing simplicity of error=true. Why/how errors happened and what attributes are relevant to that are multi-dimensional - hence the need for http.status_code, stacktrace, etc. (@Oberon00 has a great list above).; for that reason I don't think a "pointer attribute" is a great approach. As for special cases like specific apps/users wanting particular HTTP status codes to be/not be an error, it's easy enough to write processor/UI logic accordingly.

iNikem · 2020-07-27T15:14:47Z

Based on WG meeting, July 23rd, 2020, some changes are in order.

From the backward compatibility point of view it is much easier to add things later, than to remove them. There is no consensus on boolean error field or any other indication of error status of span that can be assigned by instrumenting library (or auto instrumentation).

Based on the above we are going to remove Span.Status without any replacement for the time being. The work to find such a replacement will continue, but will not block this issue's implementation.

andrewhsu · 2020-08-07T15:45:28Z

@iNikem talked about this during the bug triage session, sounds like a PR to remove the span.status from api would be desirable to finish this issue. cc @open-telemetry/technical-committee

from the conversations, a proposal to replace will be a follow up action separate from this issue.

iNikem · 2020-08-07T17:39:04Z

@andrewhsu I will try to prepare such PR next week. I have a concern about wire protocols. OTLP is declared stable and it has this field. OTLP->Jaeger protocol translation uses this field as well. I will reach to @tigrannajaryan and @bogdandrutu regarding these issues.

mtwo · 2020-08-31T16:16:10Z

We need to decide this by the end of the week. Next steps:

If you want to weigh in on this issue, please join the specs SIG meeting tomorrow and the errors SIG meeting on Thursday. We'll attempt to resolve this at those meetings.
If no resolution is reached by Friday morning, the TC will make a decision on this issue.

Oberon00 · 2020-09-03T18:51:39Z

See proposal in https://docs.google.com/document/d/1HgUI69rBridFzCXxXuTjQrkG6jb-YcFQnZjPcyBcK1U/edit made during error WG meeting for a replacement:

If we keep Span.Status, but change the semantics, would that be a better solution?

UNSPECIFIED the status has not been set and the the backend must determine if an error has occurred by inspecting the span.

ERROR mark as an error, regardless of the attached semantic convention.

OK do not mark as an error, regardless of the attached semantic convention.

Note that there was no clear unanimous agreement on this as far as I could see. However, I'd agree that this is an improvement over the current status.

Pinging @lmolkova who originally created #306.

tedsuo · 2020-09-10T05:36:40Z

FYI this OTEP addresses this issue: open-telemetry/oteps#136

tedsuo · 2020-09-17T15:55:20Z

This issue should now be closed in favor of #965.

carlosalberto · 2020-09-17T16:11:07Z

Closing this issue in favor of #965.

iNikem added the spec:trace Related to the specification/trace directory label Jul 16, 2020

Oberon00 added area:api Cross language API specification issue area:error-reporting Related to error reporting labels Jul 16, 2020

Oberon00 mentioned this issue Jul 16, 2020

Add SeverityNumber to Span Events #698

Closed

iNikem mentioned this issue Jul 17, 2020

Add metrics semantic conventions for timed operations #657

Closed

Oberon00 added the release:required-for-ga Must be resolved before GA release, or nice to have before GA label Jul 18, 2020

codeboten mentioned this issue Jul 20, 2020

Add error to span, deprecate status open-telemetry/opentelemetry-proto#174

Closed

iNikem mentioned this issue Jul 22, 2020

Library instrumentation: StatusCanonicalCode #306

Closed

reyang added the priority:p1 Highest priority level label Jul 24, 2020

pjanotti mentioned this issue Jul 28, 2020

feat: Add Sentry Exporter open-telemetry/opentelemetry-collector-contrib#565

Merged

andrewhsu assigned iNikem Aug 4, 2020

This was referenced Aug 18, 2020

Add error.hint semantic convention #822

Closed

Span Status Improvements (without errors) open-telemetry/oteps#133

Closed

iNikem mentioned this issue Aug 19, 2020

Remove Span.Status open-telemetry/oteps#134

Closed

iNikem mentioned this issue Sep 3, 2020

Remove Span.Status per OTEP-134 #918

Closed

jufab mentioned this issue Sep 4, 2020

fix: Add status exception on Span.recordException open-telemetry/opentelemetry-js#1490

Closed

tedsuo mentioned this issue Sep 9, 2020

Error flagging with status codes open-telemetry/oteps#136

Merged

tedsuo mentioned this issue Sep 17, 2020

Improve error handling #965

Closed

carlosalberto closed this as completed Sep 17, 2020

breedx-splk mentioned this issue Dec 11, 2020

Verify actual status instead of just errored open-telemetry/opentelemetry-java-instrumentation#608

Closed

Remove Span.Status from API #706

Remove Span.Status from API #706

Comments

iNikem commented Jul 16, 2020

anuraaga commented Jul 16, 2020

Oberon00 commented Jul 16, 2020

Oberon00 commented Jul 16, 2020

iNikem commented Jul 16, 2020

iNikem commented Jul 16, 2020

Oberon00 commented Jul 16, 2020

iNikem commented Jul 16, 2020

iNikem commented Jul 16, 2020

iNikem commented Jul 16, 2020

kbrockhoff commented Jul 16, 2020

justinfoote commented Jul 17, 2020

Oberon00 commented Jul 17, 2020 • edited Loading

iNikem commented Jul 17, 2020

Oberon00 commented Jul 18, 2020

iNikem commented Jul 19, 2020

Oberon00 commented Jul 19, 2020

Oberon00 commented Jul 20, 2020

iNikem commented Jul 20, 2020

Oberon00 commented Jul 20, 2020

justinfoote commented Jul 20, 2020

Oberon00 commented Jul 20, 2020

iNikem commented Jul 20, 2020

justinfoote commented Jul 20, 2020

Oberon00 commented Jul 20, 2020 • edited Loading

mwear commented Jul 20, 2020 • edited Loading

justinfoote commented Jul 20, 2020

justinfoote commented Jul 21, 2020

mwear commented Jul 21, 2020 • edited Loading

johnbley commented Jul 21, 2020

iNikem commented Jul 27, 2020

andrewhsu commented Aug 7, 2020

iNikem commented Aug 7, 2020

mtwo commented Aug 31, 2020

Oberon00 commented Sep 3, 2020 • edited Loading

tedsuo commented Sep 10, 2020

tedsuo commented Sep 17, 2020

carlosalberto commented Sep 17, 2020

Oberon00 commented Jul 17, 2020 •

edited

Loading

Oberon00 commented Jul 20, 2020 •

edited

Loading

mwear commented Jul 20, 2020 •

edited

Loading

mwear commented Jul 21, 2020 •

edited

Loading

Oberon00 commented Sep 3, 2020 •

edited

Loading