Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Span Status Improvements (without errors) #133

Closed
wants to merge 6 commits into from
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
238 changes: 238 additions & 0 deletions text/0131-improve-span-status-api.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,238 @@
# Improve Span Status API

Allow the Span Status API to represent more kinds of status

## Motivation

Right now OpenTelemetry Status is defined as an enumeration of gRPC status
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not gRPC :)) I think I keep saying this. They are Google's codes which are used in gRPC but they are used in http APIs as well that Google expose.

codes. Although I couldn't find design criteria written down for this API
I fear it is too narrowly defined to be useful across the full breadth of
scenarios OpenTelemetry targets.

OpenTelemetry allows Spans to be created to represent any operation, including
those which don't involve communication with another component (Kind =
Internal). These underlying operations can have native status representations
from a particular domain or a language such as POSIX status codes, HRESULTs,
HTTP status, or gRPC status.
However to capture the status as part of an OpenTelemetry span it must first be
mapped to something in OpenTelemetry's object model and this mapping has the
potential to create a few problems:

- **Inconsistency** - If the mapping from native representation to
OpenTelemetry representation isn't well-defined then API users or SDK
implementations are unlikely to choose the same mapping. This makes collected
data hard to work with because it can no longer be treated uniformly.
- **Loss of fidelity** - Mapping from a status representation that
distinguishes a large number of different results to one that only
distinguishes a few is inherently lossy. This prevents users from isolating
different status results they care about. It can also prevent UI from showing
useful status information because users don't relate to the reduced
representation and the transformation isn't reversible.
- **Conversion difficulty** - If the conversions are non-trivial then they
are unlikely to be implemented correctly or perhaps at all. Past feedback
suggests end users want to spend little to no effort on this task. SDK
implementers may be more diligent but are constrained to native status
representations that are known a-priori.

These are challenges for any design of the status API, not solely the current
one. We will need to evaluate these issues as a matter of degree and make a
judgement call about what is sufficient.

### Goals

In order to determine what status information is useful we first need a common
understanding what we expect the collected data to be used for. These are
tasks I anticipate tool vendors and end-users would like to be possible with
OpenTelemetry status information:

1. **Viewing** - Developers diagnosing a specific distributed trace want to
understand the status of spans that occurred while it ran. To do this they want
to see status information annotated in the trace, ideally with progressive
levels of detail as the focus of investigation narrows. The viewed status
information should be easy to correlate back to the domain that generated it
and diagnostically useful.

2. **Searching/Filtering** - Developers suspect a particular status condition
might be occurring due to customer feedback, some behavior they observed
locally, in another trace, or perhaps code review. They want to search
collected telemetry to determine if and how often that status condition
occurs. If it occurs they want to explore example traces producing it to
better understand when it manifests and how it impacts their system. The
search terms should be intuitive given the developers initial knowledge
about the status condition they are searching for and the domain it arose from.

3. **Grouping and metrics** - Developers monitoring a service want to
understand what kinds of status conditions are occurring and how frequently.
To do this they want the monitoring tool to provide UI that buckets results
into categories and show the top categories ranked by count of occurrence.
They may also want metrics that track category counts over time to identify
trends and deviations from the trend. Many useful groupings are based on
sharing a common problem symptom or common root cause, but other more coarse
groupings may be useful in trend analysis. At the most coarse spans can be
divided into some definition of "successful" and "failed" but there is no
consensus on how some status results should be bucketed (for example http
4xx results).

### Scope

I hope to define only the tracing API here, not an SDK API nor a wire
protocol. I assume that a simple SDK API will follow relatively automatically
if we first have agreement on the tracing API and OpenTelemetry can still be
successful in an initial release without having standardized a wire protocol.

The operation represented by a span might be composed of sub-operations (for
example a function that calls into other lower level functions) and each of
those sub-operations might have a notion of status. If those sub-operations
aren't represented with a child span then it is out of scope how their status is
captured.

## Explanation

#### Data representation of Span Status

We should aim for representations that require minimal-to-no end-developer
translation effort from the native representation and capture at least the key
numeric/string data developers are likely to search for or relate to. I suggest
status is this set of information:

1. Domain - The name of the domain the status data applies to such as "HTTP",
"gRPC", "HRESULT", "POSIX". The list is end-user-extensible but common status
type names should be standardized.
(Perhaps there is already some standardization we could borrow?)
Comment on lines +98 to +101
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Google did the same thing, and at the end they decided to remove this and support only what they called "canonical status" because custom translation were hard to keep track of.

One thing that was very important was the notion of "IsOk" on the Status itself, every domain has it's own codes, but in a lot of time was important to know if it is ok or not :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the Google canonical codes are a good idea, and am generally favorable towards Span.status as it currently exists. The main issue is that we don't have guidance on how to map from different domains onto the google canonical space. If we could provide specification that explains how to translate between domains, I think it'd be easier to make the argument that we should keep Span.status as is.

2. Code - An integer status code. Can be combined with an status message. Either
a code or message are required.
3. Message - A string status message. Can be combined with a status code. Either
a code or message are required.

Although UI creators are free experiment with how the data is presented I
expect most presentations would either be the StatusData alone, or the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StatusData is not defined up to this point. What is it?

StatusData qualified with the StatusType and some separator character. For
example StatusData alone might create names like "503", "E_FAIL (0x80004005)",
"Status Code 12: Unimplemented".

#### API to capture this data

I would suggest an API called Span.SetStatus(...) that takes all the arguments
above, and optionally overloads or default parameters that make common calls
easier. For example in C#:

```C#
void SetStatus(string domain, int code, string message);
void SetStatus(string domain, string message);
void SetStatus(string domain, int code);
```

## Internal details

I expect a basic and reasonable implementation would be to define fields on the
Span object, set them in the setter and then implement some getter for the
exporters to use. A second possibility is to store the data into attributes or
events. The choice of storage may have some modest effects on memory usage
but primarily I expect the choice would be driven by the SDK API we want to
read stored data back.

## Trade-offs and mitigations

As mentioned in the motivation section, the issues of inconsistency, loss of
fidelity and conversion difficulty are all on a sliding scale. I expect this
design improves each of these issues at the expensive of some increased
object-model/API complexity. Making the OpenTelemetry status representation
more expressive also could cause inconsistency problems for the opposite reason

- now there might be multiple reasonable representations for a status and the
user becomes unclear which one to pick. In general I expect this to be mitigated
with documented conventions, API defaults, and the potentially canonicalizing
data anywhere in the processing pipeline.

The current proposal only gives one string which can be used either for a
freeform message or a textual status code. Adding a 2nd string to the
StatusData would allow both to be collected side by side. This is another example of
increasing expressivity at the cost of some complexity. I'd be happy to see
this added if the community agreed.

If all the data collected by the API is transmitted to the back-end this also
increases the size of transmitted telemetry, but the exporter authors always
retain the freedom to drop or reduce any information they don't believe is
valuable with potentially a slight
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cut sentence?


## Prior art and alternatives

#### Prior discussion

There have been a few past attempts to make improvements here:

- [open-telemetry/oteps#123](https://github.com/open-telemetry/oteps/pull/123)
- https://gitter.im/open-telemetry/error-events-wg

#### Alternatives

**Status information using logging** - It is possible to collect status and
error message via log messages and correlate it to a span via trace context.
While I have no objection to improved logging capabilities or better
integration between distributed tracing and logging I don't believe this is
a problem that distributed tracing should abdicate for various reasons:

- Logging isn't a simple or cheap dependency to take in the implementation.
While OpenTelemetry may solidify a logging offering for API and SDK components,
an end-to-end scenario still requires snooping the log stream to determine
relevant messages for a given span. Regardless whether this is done
client-side, at the database or in the UI layer it potentially involves
performance overhead of handling an order of magnitude more data.
- It requires establishing conventions that designate which log message
represents 'the status' for a Span rather than one of potentially many results
or errors that were recorded during the Span's duration. This likely means this
status is a special case on the logging API if it isn't a special case on the
distributed trace API.
- I expect developers both at the time they are emitting trace data and when
viewing that trace data would find it is idiomatic to include the status of
the Span's workload together with the description of the Span. Identifying
failed or abnormal spans is a typical APM operation.

**Closed vs. open-ended status descriptions** - We could make status
represented by a fixed set of options such as the current gRPC codes
or any larger/different set. It likely is simpler for tools consuming
the data to only operate on fixed options and it is pleasant to imagine that
all (or most) status results will be reasonably categorized into the fixed
list. However I do not expect most users to take up the responsibility of
mapping their status results if OpenTelemetry/back-end tools don't provide
a clear representation. Instead these users will not set status or choose
options that are generic and inconsistent rendering the data useless to
everyone. For a fixed list to work I believe we either have to curate a very
large list of options or accept that significant portions of the developer
audience will not find the status representation useful.

**API using attribute semantic conventions** - It is also possible to do this
via semantic conventions on attributes and although I think the strongly typed
API is preferable I don't have that strong of an opinion. Semantic conventions
are likely to have higher performance overhead, higher risk of error in key
names and are less discoverable/refactorable in IDEs. The advantages are that
there is some past precedent for doing this specific to http codes and new
conventions can be added easily.

**API using Span event semantic conventions** - Most of the rationale for attribute semantic
conventions also applies here, events are effectively another key-value store
mechanism. The timestamp that is attached to an event appears to hold little
value as status is probably produced at the same approximate time the span end
timestamp is recorded.

**Move the API to a non-core package** - It is possible to have the Tracing
API expose status using Attribute or Event APIs, and then have a 2nd library
that exposes a strongly-typed API that converts the status information into
attrbute/event updates. This adds some complexity over a strongly typed API
declared directly on Span but if we identify this as an area that needs to be
more decoupled/versionable than other Span tracing APIs perhaps it would be
valuable.

## Open questions

1. Although I specified that common status types could be given standardized
names, I didn't define what that list was. We would need to define what
criteria makes a status type common enough to be on the list and maintain it
over time.
2. Above the design mentioned that the SDK might fill in the name or integer
value of a well-known status code when only one of the two was specified by a
user. We'd have to decide if that is functionality we want, and which values
are included in mapping tables.
3. I left the SDK API out-of-scope, but we will need a way to retrieve the
stored data via the SDK before adding data to span has any value in an
end-to-end scenario.