Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove Span.Status #134

Closed
wants to merge 8 commits into from
Closed
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
89 changes: 89 additions & 0 deletions text/0134-remove-span-status.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,89 @@
# Remove Span.Status

**Author:** Nikita Salnikov-Tarnovski, Splunk

Remove `Span.status` and related APIs and conventions.

## Motivation

During numerous heated discussions in various pull requests (see
<https://github.com/open-telemetry/opentelemetry-specification/issues/599>
<https://github.com/open-telemetry/opentelemetry-specification/pull/697>
<https://github.com/open-telemetry/opentelemetry-specification/pull/427>
<https://github.com/open-telemetry/oteps/pull/69>
<https://github.com/open-telemetry/oteps/pull/86>
<https://github.com/open-telemetry/oteps/pull/123>)
and during Errors Working Group meetings a lot of concerns were raised about suitability of the current `Span.Status`
field encoded as Google RPC code for recording a result of arbitrary monitored operations from various domains
(http requests, database queries, file system operations etc). It is hard to define clear and umambigous mapping
from those domain to a fixed enum originated from RPC.

Therefore the proposal is to remove the `Span.Status` field from Trace API and OTLP.

## Internal details

Several areas are affected by this change and will require a coordinated effort to make this happen.

### Trace API Specification

* Remove all mentions of `Status` and `Set Status` API from Span section.
* Remove whole `Status` section.

### Trace SDK Specification

No changes needed.

### Trace SDK exporters

Remove translation rules pertaining to `Span.Status`.

### Span semantic conventions

* From HTTP conventions, remove `Status` section.
* In gRPC conventions, replace the requirement to set span status with the requirement to set span attribute `grpc.status`
Oberon00 marked this conversation as resolved.
Show resolved Hide resolved
to the canonical status code of the gRPC response.

### OTLP data definition

Rename `Span.status` field to `Span.deprecated_status` field.

### Language API and SDK

* Every language must remove status related code from their implementations of Trace API and SDK.
* Every language must remove all code in their span exporters which serializes span status.
* All instrumentation libraries, both manual and automated, must remove all calls to the removed `Span.setStatus` API.
* All instrumentation libraries, both manual and automated, must adapt new semantic conventions for gRPC spans.

### Collector

Collector will have a transition period during which it has a special handling both for old data,
which has `Span.Status` present, and new data without Status.
This will allow for older pipelines, still relying on `Span.Status` to continue to work with as little disruption as possible.
The exact duration of this transition period will be left for Collector's maintainers to decide.

* If incoming OTLP data has `Span.Status` present, that will NOT be removed and will be passed on as-is.
* If incoming OTLP data has `Span.Status` present and it is not equal to `OK`,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should a non-empty message also be translated independently of the status code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Message is part of the Status, how can it be independent?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, it makes sense to add deprecated_status when "incoming OTLP data has Span.Status present and it is not equal to OK", but it may make sense to independently set deprecated_message when "incoming OTLP data has Span.Status present and message non-empty`"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think having message without code will be confusing, because current state is exactly the opposite: you may have code without message.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But you may also have a message with an OK code. If you are worried about confusion, then set the status code additionally when the message is non-empty.

Collector will translate that to temporary semantic attributes as follows:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the previous point not enough? Do we need to duplicate the status as attributes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, because new OTLP receivers will not read status field from the wire. But we still may want to pass this information if it was available in the Collector.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we expect code behind new OTLP receivers that don't read (deprecated_)status anymore to still need status?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't know, that is way we leave it there for the time being. Collector maintainers may decide when they want to remove this translation.


|Status|Tag Key| Tag Value|
|--|--|--|
|StatusCanonicalCode | `otel.deprecated_status_code` | Name of the code|
|Message *(optional)* | `otel.deprecated_status_message` | `{message}`|

* When receiving data in Zipkin format, no special handling will take place.
All Zipkin tags present will be translated verbatim to OTLP attributes.
* When exporting data into Zipkin format, no special handling will take place.
All OTLP attributes will be translated verbatim to Zipkin tags.
`Span.Status` if present, will NOT be translated.
* When receiving data in Jaeger format, no special handling will take place.
Jaeger `error` tag will be translated in the usual way to OTLP `error` attribute.
See `Open questions` below.
iNikem marked this conversation as resolved.
Show resolved Hide resolved
* When exporting data into Jaeger format, if a new temporary attribute `otel.deprecated_status_code` is present,
iNikem marked this conversation as resolved.
Show resolved Hide resolved
then Jaeger `error` tag will be set to `true`.
See `Open questions` below.

## Open questions

Neither OpenTelemetry Specification nor Protocol currently assigns any special meaning to the `error` attribute mentioned above
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not quite sure what happens if Status is removed but we do not (yet) have the "error"-like attribute. How will the errors be indicated? Do we lose this capability? Or the intent is to wait until "error" indicator (whatever it is) is first added to the specification and only after that we remove Status?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current proposal is to remove status right now, file a follow up issue and work with Errors WG to resolve it. Either via one of the already existing proposals, like open-telemetry/opentelemetry-specification#822, or a new one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I see in this OTEP that you already thought through the transition process, which is great, but I need to mention this for visibility: we must be careful with Collector changes. There are people who use OpenTelemetry Collector in production already. All intermediate states of the Collector need to be fully functional.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tigrannajaryan It seems open-telemetry/opentelemetry-specification#822 will be effectively merged soon (the error.hint Span attribute at least). Would that be enough for the Collector?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am reluctant to give an approval until I clearly understand what we replace this by.

If open-telemetry/opentelemetry-specification#822 is the replacement can we wait for it be merged first and update this OTEP to make use of it? It is an open question in this OTEP and is one of the central problems that this OTEPs needs to solve, but remains partially unsolved for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the otep may change as a result of resolving any open question, but that doesn't mean we should stop this. I think the main point of this otep is the removing of span status and an action item is required to think about if we need an equivalent of error.hint or not separately.

Trying to make progress so we don't get stuck, but I can see your point as well, if you suggest that we really need a way to signal error from the application (which I don't believe we need) then I understand why we should block this until we have that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tigrannajaryan can you express your opinion in open-telemetry/opentelemetry-specification#822? If you think that it may resolve your hesitation about this PR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bogdandrutu I think we do need a definitive way to signal errors on the span. Some products visualize this and OpenTracing has the convention for this, OpenCensus protocol has this built-in. I believe we need an equivalent in OpenTelemetry.

@iNikem I support "error.hint" and commented there and beleive the semantic conventions for it need to be merged asap to unblock this OTEP.

I am not against moving forward with this OTEP, but I believe it will have an important open question and will be unactionable in the Collector if it does not take into account either the existence of "error.hint" or the lack of whatever else needs to replace Span Status. IMO, it will mean we will need another OTEP to correct this OTEP once "error.hint" is merged or rejected.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not against moving forward with this OTEP, but I believe it will have an important open question and will be unactionable in the Collector if it does not take into account either the existence of "error.hint" or the lack of whatever else needs to replace Span Status. IMO, it will mean we will need another OTEP to correct this OTEP once "error.hint" is merged or rejected.

Not sure you need an OTEP for that, that is a semantic convention, and collector will use it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bogdandrutu sure, that probably works too.

nor do they have any comparable functionality.
There are still ongoing discussions about that and this issue will be addressed later and separately.