Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove Span.Status from API #706

Closed
iNikem opened this issue Jul 16, 2020 · 43 comments
Closed

Remove Span.Status from API #706

iNikem opened this issue Jul 16, 2020 · 43 comments
Assignees
Labels
area:api Cross language API specification issue area:error-reporting Related to error reporting priority:p1 Highest priority level release:required-for-ga Must be resolved before GA release, or nice to have before GA spec:trace Related to the specification/trace directory

Comments

@iNikem
Copy link
Contributor

iNikem commented Jul 16, 2020

It was proposed in open-telemetry/oteps#123 (comment) and subsequent discussion that we should drop Span.Status from the spec. If there is an agreement on that, I can create a PR.

@iNikem iNikem added the spec:trace Related to the specification/trace directory label Jul 16, 2020
@anuraaga
Copy link
Contributor

There have been many PRs related to errors so would help if you could summarize the current state of what we have if Status is removed. For example, is there a generic way of knowing whether a span is a success or failure? From what I can tell, presence of exception is that, but I'm not sure

@Oberon00
Copy link
Member

Other affected areas would be:

  1. Semantic conventions: Most will get shorter, but the gRPC one will then need a definition how to transport the status code.
  2. Protocol: Since the trace area is declared stable (beta?), these cannot be immediately removed but should probably be marked deprecated.
  3. Of course all the language SIGs are affected too.

@Oberon00
Copy link
Member

@anuraaga There is no generic way to detect if a span has failed. The status is also not really suitable for this, except if you also consider things such as HTTP 404 as errors.

Still, I think a generic error=true or error=name_of_attribute_with_detailed_info (e.g. error=http.status_code) could be introduced before removing the status iff we can find agreement on such a thing.

Exceptions are not the same as errors. Often a span with at least one exception will be failed, but all of them could be handled too. Also, errors can occur without exceptions.

@iNikem
Copy link
Contributor Author

iNikem commented Jul 16, 2020

Presence of exception does not necessarily mean span was a failure. This was discussed in #697. And I have just discovered that spec does not have any error=true attribute, so something should be proposed to replace status... Not as easy as I thought :)

@iNikem
Copy link
Contributor Author

iNikem commented Jul 16, 2020

The simplest replacement would indeed be generic error=<boolean> field or attribute.

@Oberon00
Copy link
Member

Could we create a separate issue for discussing how to indicate errors generically, once status is removed? Or maybe #599 is the place to discuss this.

@Oberon00 Oberon00 added area:api Cross language API specification issue area:error-reporting Related to error reporting labels Jul 16, 2020
@iNikem
Copy link
Contributor Author

iNikem commented Jul 16, 2020

I am afraid we cannot just remove status without offering any way to signal that this span signifies an error. The simplest way forward, assuming that we have #697 in its current form, is the following:

@iNikem
Copy link
Contributor Author

iNikem commented Jul 16, 2020

During todays Errors WG meeting it was agreed:

  • To remove Span.status field from API and spec
  • To mark Span.status as deprecated in OTLP
  • To introduce new boolean field "error" to spec, API and OTLP
  • To provide a migration advice to existing customers from status to error

As a separate step to start discussion about adding severity notion to Span reusing log severity concept

@iNikem
Copy link
Contributor Author

iNikem commented Jul 16, 2020

A discussion about severity on Span events is happening in #698

@kbrockhoff
Copy link
Member

Just a top level boolean failed field should be enough to identify the Span as having an error should be sufficient as a quick check. If this is true then the span attributes can be checked for further info. The other change that needs to be made is to add a gRPC semantic convention to report gRPC status code since this will now not be in status.

@justinfoote
Copy link
Member

Collapsing the span status field into a single boolean seems problematic to me. The main problem really lies in the definition of "error" states. If the instrumentation is responsible for adding an error boolean to spans (and, presumably, a matching label to metrics), then it must be aware of what the current user considers an error condition. This implies some configuration API that something uses to set the desired error conditions, and another API that instrumentation must use to configure itself.

The instrumentation should apply as much information it can to the spans and metrics it creates; it's easier to reduce granularity of error conditions downstream than it is to increase it.
In my opinion, this complexity belongs in an exporter, or in a metric view, or in a vendor's back-end.

@Oberon00
Copy link
Member

Oberon00 commented Jul 17, 2020

Collapsing the span status field into a single boolean seems problematic to me.

Rather than collapsing, I think we are exploding it: Instead of a single canonical status we will have:

  • http.status_code (already exists)
  • exception events (just merged; handled/unhandled distinction will hopefully be added)
  • probably rpc.grpc.status_code
  • Maybe status.hresult , status.errno, status.win32_last_error, ...

Collapsing the span status field into a single boolean seems problematic to me.

Collapsing every conceivable error in a fixed StatusCode was already quite bad. The error boolean is obviously not gonna cut it alone: It's just meant as a supplement, for easier detection, it does not really add that much information in itself.

(Can I take this as an opportunity to plug the idea of setting the value of error not to true, but to the name of the attribute with more information?)

@iNikem
Copy link
Contributor Author

iNikem commented Jul 17, 2020

(Can I take this as an opportunity to plug the idea of setting the value of error not to true, but to the name of the attribute with more information?)

No :) One of the benefits of having simple boolean field is ease of reading it. Without the need to parse attributes.

We can, should and will add semantic conventions/more fields with more information about the error state of the given span. But without sacrificing easy and fast way for collector/backend/exporter to get very meaningful information from the span.

@Oberon00
Copy link
Member

Isn't that independent? If the error field/attribute is non-empty, that is equivalent to true. Only if you actually want to access the additional information, then you need to parse the error.

@Oberon00 Oberon00 added the release:required-for-ga Must be resolved before GA release, or nice to have before GA label Jul 18, 2020
@iNikem
Copy link
Contributor Author

iNikem commented Jul 19, 2020

Why do you want that name to be in error field? Why should it be dynamic at all? If you want to have a separate attribute on the Span, containing some detailed information about some error condition (independent from exception events), then we can add a semantic convention for that with a fixed name. And error can remain a simple, efficient, boolean.

What benefit will bring such indirect reference via error field as opposed to a separate semantic attribute?

@Oberon00
Copy link
Member

It will bring no benefit except for conserving a field name. Using a separate attribute like error.seealso sounds good too.

@Oberon00
Copy link
Member

Why should it be dynamic at all? If you want to have a separate attribute on the Span, containing some detailed information about some error condition (independent from exception events), then we can add a semantic convention for that with a fixed name.

Maybe now I understand where we are talking past each other: I think that we already have different names for error detail attributes, and should continue to have and add even more different names. For example http.status_code is one such attribute. It has well-defined semantics beyond just being "detailed error info". Still it would be cool if your backend's UI could not only add an ❌ to the span but could put "(http.status_code=500) next to it, even without knowing about the http.status_code semantic convention specifically.

@iNikem
Copy link
Contributor Author

iNikem commented Jul 20, 2020

A! Now I get it! :) I think this is overcomplicating, but I will not object it. If nobody comment on this question here, I will raise this question during Error WG this Thursday.

@Oberon00
Copy link
Member

I'm happy to find a simpler way. But some way to classify errors even without knowing a particular semantic convention the span falls into would be good (although IMHO not necessarily required for GA). Ideally without forcing instrumentations to translate their natural error conditions into some enum. 😃

@justinfoote
Copy link
Member

Still, my main concern is this:

The main problem really lies in the definition of "error" states. If the instrumentation is responsible for adding an error boolean to spans (and, presumably, a matching label to metrics), then it must be aware of what the current user considers an error condition.

Making the instrumentation responsible for knowing what constitutes an error feels like a big can of worms to me. I'd much prefer to see instrumentation add non-opinionated data to spans and metrics, and for views, exporters, and back-ends to apply their own logic to interpret that data.

@Oberon00
Copy link
Member

Hmm, you are right that this error.seealso/whatever attribute could be at most an optional hint. So maybe it is not that useful after all.

@iNikem
Copy link
Contributor Author

iNikem commented Jul 20, 2020

@justinfoote Do you oppose adding error field or removing Span.Status? I understand your concern about boolean field. It is indeed may be not that easy for auto-instrumentations to decide. But setting current status is not straightforward as well.

@justinfoote
Copy link
Member

I oppose both. 😄

I like Span.Status because it is not opinionated. It describes the state of the span in a relatively low-cardinality field, with logic that is not configurable. It doesn't make any assumptions about what the user cares about, it just tries to reflect reality in a field with a set enumeration of values.

I also dislike the error boolean for the reasons I've stated.
The current generation New Relic agents use an error boolean, and they must perform a handshake with the backend on startup to resolve error condition settings -- which http status codes represent an error, which exception classes should be ignored, etc... I'd prefer not to see this complexity reproduced in the OpenTelemetry agents.

@Oberon00
Copy link
Member

Oberon00 commented Jul 20, 2020

The current generation New Relic agents use an error boolean, and they must perform a handshake with the backend on startup to resolve error condition settings -- which http status codes represent an error

I don't think this is the direction we plan to go with our error. It simply should be a hint from the instrumentation when it thinks that something is likely an error. I think error=true should simply be equivalent to today's status != OK && status != UNKNOWN. EDIT: The backend could override that decision based e.g. on the http.status_code attribute if it knows about that.

@mwear
Copy link
Member

mwear commented Jul 20, 2020

At the spec SIG on Tuesday I mentioned three components of error handling that need to be addressed.

  • Semantic conventions for recording exception data (covered by Exception reporting specification #697)
  • Error identification (a flag on a span to indicate an error condition)
  • Error suppression (ways to ignore an error for an application)

We have discussed at length that there is not universal agreement on what constitutes an error for an application. I would propose that we will never reach agreement here. However, I think we can agree that if an error does happen, you want to know about it, and there should be a way to indicate this on a span. Annotating whether or not as span has had an error is a requirement, even if we can't agree on what an error is. This is what I'm calling error identification.

For the second part of this issue, we do need a way for users to decide that something that has been identified as an error, should not be considered an error for them. This is what I'm referring to as error suppression. We should give some thought into how we can facilitate this requirement from the SDK side, as well as what approaches could be used for backends.

@justinfoote
Copy link
Member

Matt, I really like this framing, and I agree that these three components of error handling need to be addressed.

The way I read this debate is something like this:

  • Is error identification the responsibility of the instrumentation, or of something downstream.
  • Same question about error suppression.

I'm arguing that they belong downstream.

Also, I believe that error identification and error suppression could be defined as a single concern. There's no reason to identify errors upstream (that is: in instrumentation) only to suppress those errors downstream (possibly in an exporter or in a vendor's back-end).

@justinfoote
Copy link
Member

I may be talking myself out of the canonical status code. Perhaps the best design is for instrumentation to add full details of exceptions and/or status codes, without passing judgement about whether that data qualifies as an error. Any interpretation of that data (that is: flagging errors with a boolean or attempting to categorize status into an enum) belongs downstream.

Exactly where downstream is something to discuss, I believe. For metrics, I think this belongs in a MetricView sort of object. For spans, I'm not sure. In a SpanProcessor? In an exporter?

@mwear
Copy link
Member

mwear commented Jul 21, 2020

The main complication, and the core reason that error suppression needs to exist is due to auto-instrumentation. Without auto-instrumentation, an application developer would not set error=true on a span if they didn't consider something to be an error condition and it would be a non-issue.

Instrumentation is where errors are going to encountered and it will to make the best decision it can with the information it has at the time. Often this will be catch an exception, record it, then re-throw. I don't think you can push this downstream because the instrumentation is where you will have a handle on the error and a chance to process it. I would also posit that in most cases, auto-instrumentation makes the right decision for errors. There are edge cases, and matters of interpretation where a user may want override that decision.

Given that auto-instrumentation is the likely generator of false positives (error=true tags or equivalent), I could see a few ways to manage this. It could be handled on the tracing backend completely. Each span has an operation name, and information identifying the instrumentation library that generated it. It would be possible to filter out errors based on user defined rules on the tracing backend.

Another possibility, is that an SDK could choose to implement something such as per instrumentation library exception handlers. Custom code be registered that could mark an exception to be recorded, or ignored on a per component basis.

The per-component exception handlers might be a way to combine identification and suppression into a single concern, but I'm not sure it's completely necessary, and a backend solution might suffice. I'd be interested in any other proposals on how this can handled.

@johnbley
Copy link
Member

I'm going to weigh in that I don't like the flavor of the current status and greatly prefer the processing simplicity of error=true. Why/how errors happened and what attributes are relevant to that are multi-dimensional - hence the need for http.status_code, stacktrace, etc. (@Oberon00 has a great list above).; for that reason I don't think a "pointer attribute" is a great approach. As for special cases like specific apps/users wanting particular HTTP status codes to be/not be an error, it's easy enough to write processor/UI logic accordingly.

@reyang reyang added the priority:p1 Highest priority level label Jul 24, 2020
@iNikem
Copy link
Contributor Author

iNikem commented Jul 27, 2020

Based on WG meeting, July 23rd, 2020, some changes are in order.

From the backward compatibility point of view it is much easier to add things later, than to remove them. There is no consensus on boolean error field or any other indication of error status of span that can be assigned by instrumenting library (or auto instrumentation).

Based on the above we are going to remove Span.Status without any replacement for the time being. The work to find such a replacement will continue, but will not block this issue's implementation.

@andrewhsu
Copy link
Member

@iNikem talked about this during the bug triage session, sounds like a PR to remove the span.status from api would be desirable to finish this issue. cc @open-telemetry/technical-committee

from the conversations, a proposal to replace will be a follow up action separate from this issue.

@iNikem
Copy link
Contributor Author

iNikem commented Aug 7, 2020

@andrewhsu I will try to prepare such PR next week. I have a concern about wire protocols. OTLP is declared stable and it has this field. OTLP->Jaeger protocol translation uses this field as well. I will reach to @tigrannajaryan and @bogdandrutu regarding these issues.

@mtwo
Copy link
Member

mtwo commented Aug 31, 2020

We need to decide this by the end of the week. Next steps:

  1. If you want to weigh in on this issue, please join the specs SIG meeting tomorrow and the errors SIG meeting on Thursday. We'll attempt to resolve this at those meetings.
  2. If no resolution is reached by Friday morning, the TC will make a decision on this issue.

@Oberon00
Copy link
Member

Oberon00 commented Sep 3, 2020

See proposal in https://docs.google.com/document/d/1HgUI69rBridFzCXxXuTjQrkG6jb-YcFQnZjPcyBcK1U/edit made during error WG meeting for a replacement:

If we keep Span.Status, but change the semantics, would that be a better solution?

  • UNSPECIFIED the status has not been set and the the backend must determine if an error has occurred by inspecting the span.
  • ERROR mark as an error, regardless of the attached semantic convention.
  • OK do not mark as an error, regardless of the attached semantic convention.

Note that there was no clear unanimous agreement on this as far as I could see. However, I'd agree that this is an improvement over the current status.

Pinging @lmolkova who originally created #306.

@tedsuo
Copy link
Contributor

tedsuo commented Sep 10, 2020

FYI this OTEP addresses this issue: open-telemetry/oteps#136

@tedsuo
Copy link
Contributor

tedsuo commented Sep 17, 2020

This issue should now be closed in favor of #965.

@carlosalberto
Copy link
Contributor

Closing this issue in favor of #965.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:api Cross language API specification issue area:error-reporting Related to error reporting priority:p1 Highest priority level release:required-for-ga Must be resolved before GA release, or nice to have before GA spec:trace Related to the specification/trace directory
Projects
None yet
Development

Successfully merging a pull request may close this issue.