Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify the interpretation of SpanKind #337

Merged
merged 6 commits into from
Nov 4, 2019

Conversation

jmacd
Copy link
Contributor

@jmacd jmacd commented Oct 24, 2019

This adds clarification to the SpanKind field values.

@tigrannajaryan
Copy link
Member

@jmacd IMO StatusKind is useful precisely because it is narrowly defined and allows backends to learn the semantics of operation and visualize it differently based on the value of the SpanKind.
If this becomes an arbitrary string value unclear how the backends will implement the equivalent functionality (perhaps by having semantic conventions around SpanKind strings, but then what's the value of moving from enum to a list of pre-defined strings if any user-defined value is unknown to the backend?).

Perhaps there is a specific use case which you have in your mind that is not solvable by the status quo, in that case please describe it. The use case you wrote "This would allow users to perform offline analysis on spans by other designated kinds" seems achievable by just adding a custom attribute to the Span.

Copy link
Member

@SergeyKanzhelev SergeyKanzhelev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Request change" as a way to block the PR.

It is not clear what extra scenarios this opens comparing to custom attribute on a span. It seems like too big of a change for 1.0 milestone and it will negatively affect memory, perf of runtime and spans protocol.

@jmacd if you feel that this is a needed change, let's initiate it from the OTEP and specific scenarios it opens up

@jmacd
Copy link
Contributor Author

jmacd commented Oct 24, 2019

allows backends to learn the semantics of operation and visualize it differently based on the value of the SpanKind.

@tigrannajaryan there is no loss of value in this regard. There are still four well-known kinds that you can use to tell a backend the semantics / visualization strategy. Backends would continue to recognize those four values and use special visualizations, but backends could simply display the Kind string for other unrecognized values and provide value to the user.

Disallowing the use of any other kind of Span prevents users from naturally expressing their system. If they can't use a standard SpanKind value (b/c their use doesn't match one of the four), then they're forced to use a separate span attribute, which will not display any useful information when rendered. Simply put, if my use is not one of those four, there is still value in displaying that this is an "ingress" span, a "background" span, a "watchdog" span, etc. There are millions of useful descriptions for the kind of span that can be imagined, and if SpanKind can be a string, my system can display those to me prominently.

@tigrannajaryan
Copy link
Member

@jmacd I see your point. I agree that users may have span kinds that don't fit into the 4 categories we defined and we will be artificially forcing them to choose one or leave it unspecified and that may be undesirable in situations when they actually know and want to specify some specific kind.

My concern with relaxing the specification would be that the users may put all sorts of unexpected string values in the span kind even when one of the predefined values would be a good match and thus end up with less accurate and less valuable data. Perhaps this is not a big deal, especially if the API encourages the use of one of 4 predefined values and requires a slight extra effort to specify a custom one.

@jmacd
Copy link
Contributor Author

jmacd commented Oct 24, 2019

I would be happier if we removed SpanKind from the spec and specified a conventional span.kind attribute, by the way.

@tigrannajaryan
Copy link
Member

I would be happier if we removed SpanKind from the spec and specified a conventional span.kind attribute, by the way.

I agree, it may make more sense.

@SergeyKanzhelev
Copy link
Member

In OpenTracing is was defined as one of 4 strings, not an arbitrary string:

Span tag name Type Notes and examples
span.kind string Either "client" or "server" for the appropriate roles in an RPC, and "producer" or "consumer" for the appropriate roles in a messaging scenario.

In OpenCensus it was added as a replacement of Recv. and Sent. prefixes in span names: census-instrumentation/opencensus-proto#51 And originally only had Client and Server values. Producer and Consumer were added on projects merge.

So I'd question the desire to "unseal" the values limit as neither of initial projects had this scenario.

Now w.r.t. attribute vs. field - for me there are two considerations:

  1. if this is a field that is required for the proper work of backends - it is beneficial to have a special field rather than risk that other attributes will push this attribute out of the collection.
  2. If in many apps there will be a processing (sampling, aggregation) in-proc or in collector that relies on span kind - it is beneficial to have it strongly typed.

@bogdandrutu
Copy link
Member

@tigrannajaryan @jmacd I am not sure what an open form will give us in the future - we can always add new types if necessary, there is no real benefit to have this as a free from string. I can see only downsides in having it as a free form string - misspells, backends without proper support because a user just invented a new kind, etc.

@SergeyKanzhelev I completely agree with your logic and arguments.

I think the decision was made to keep this as a first class citizen of the Span during the merge discussions and we cannot revert that now until we deprecate opencensus and opentracing for backwards compatible reasons.

@jmacd
Copy link
Contributor Author

jmacd commented Oct 25, 2019

It seems this field is not required for a backend to operate correctly, that it's just a hint. @SergeyKanzhelev In the OpenTracing document, those are "Notes and Examples". Nothing says you can't use any string.

SpanKind is also ambiguous. What should I use if I'm both a server and a producer?

And it's meant for compatibility with a hint for OpenCensus? This is a lot to make up for backward compatibility, when we could just use a span attribute and leave it out of the SpanData.

I don't see how we (as vendors) benefit from this annotation, but as a user being limited to exactly four values makes it feel not very useful.

Copy link
Member

@yurishkuro yurishkuro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong opinion on enum vs. string. In OpenTracing span.kind tag was a string.

I am not sure if span kind is much more important that all the other data elements for which we define semantic conventions. In strongly typed languages it is easy to make the API accept a typed constant, which prevents people from freely putting garbage values, while still being extensible.

* `SERVER` Indicates that the span covers server-side handling of an RPC or
other remote request.
* `CLIENT` Indicates that the span describes a request to some remote service.
other request.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"remote" seemed relevant, why was it removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's nothing to prevent a service from calling itself. Why does the "remote" aspect matter?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because semantically we (I think) want "client" kind to mean "network call", not in process function call.


`SpanKind` is associated with the activity that started the span,
which helps resolve ambiguity. For example, a server span that
directly calls another server could be described as both a SERVER and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a fan of such example in the spec, because it is against the traditional instrumentation guidelines that this scenario should have both server and client spans in that same service.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote that example to illustrate what I see as an ambiguity and a problem with SpanKind generally. I am not familiar with the guidelines you refer to, but it feels odd if we're suggesting that a server should start a child span in order to issue an RPC. This just raises the cost of tracing.

As far as I've been able to determine, there are a few uses for SpanKind. Currently, I believe SpanKind is ambiguous at best, so I'd like a better understanding.

One thing which was part of OpenTracing was this ChildOf vs FollowsFrom distinction. I believe the choice of "producer/consumer" is associated with FollowsFrom and that "client/server" is associated with ChildOf. The important question is should the tracing system assume that the child span is going to complete before the parent span. At LightStep we call client/server spans "well-formed", whereas FollowsFrom spans are not. We can't use FollowsFrom spans for critical path analysis.

Another thing I see SpanKind being used for is to detect spans which either start with a remote parent or are themselves a remote parent. At LightStep we refer to these as "ingress" and "egress" spans. They are more interesting for monitoring purposes than internal spans.

My understanding is that SpanKind is sort of doing both of these things, telling us whether a child should complete before its parent and whether a parent and child used context propagation rather than an internal relationship.

I don't see SpanKind actually accomplishing these things, unless the guidelines you refer to are met. I would prefer if we had less ambiguous ways to do this, which do not require extra spans where they are not necessary.

Some solutions that I would prefer.

Spans could count the number of times their context is Extracted. Spans with >0 extracted contexts ought to have remote children. These are "egress" spans. Spans that are started from an Injected context ought to have a remote parent. These are "ingress" spans.

The ChildOf vs FollowsFrom distinction ought to be a Link property, since a span can have multiple children. We do not currently use a Link to store the parent span relationship, but conceptually each Link should have an attribute to say whether it is a well-formed child or an asynchronous one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it feels odd if we're suggesting that a server should start a child span in order to issue an RPC. This just raises the cost of tracing.

Starting a child "client" span for outbound network call is how every instrumentation I ever came across operates. If you don't do that, you cannot add any tags describing the outbound call either, because they would end up on the parent "server" span where they don't belong.

The cost of tracing can be controlled by not recording those spans, but if instrumentation doesn't even create them then it is simply not describing the semantics of distributed transaction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, with this guideline in place the meaning of SpanKind makes sense. This guideline should be included in the explanation for SpanKind. I haven't seen it written down anywhere.

I think a server with high fanout shouldn't be required to create a child span for each call it makes. I could add an Event to the parent span to describe the call being made. The child span will have the tags I'm interested in, and I don't feel any semantics are lost.

I will update this PR with the guideline and explain how to interpret these values.

Copy link
Member

@yurishkuro yurishkuro Oct 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a server with high fanout shouldn't be required to create a child span for each call it makes.

Yes, high fanout does occasionally present problems (e.g. 100s of calls to Redis). If the application itself knows about it and is conscious of the performance overhead of those tiny spans, then it makes sense for it to just log events. But if the API itself is efficient enough, then the optimization may still be done by the tracer, rather than the application, e.g. by combining some of those spans into summaries.

@tedsuo
Copy link
Contributor

tedsuo commented Oct 25, 2019

I do agree that we should have an enumerated list of values for span.kind, and new values should not be getting invented left and right. But I would like to understand how this data is intended to be used, in order to understand how it would be extended, if at all.

For span.kind, It's not totally clear to me why this needs to be an enum, instead of a semantic definition like all of the rest. It would be mildly more efficient to use an enum, a little bit harder to mess up. But it would also more extensible as a string value - with other semantic definitions, it is possible to model new keys and values without forking the API. With an enum, we would have to fork/version the spec every time we make a change. Part of the point of the semantic definitions is that we don't have to do something that drastic every time we want to model a new concept. Constants, helpers, etc, for the semantics are only added to the spec as a convenience.

Personally, with span.kind, I think it is a bit of a bikeshed either way. :) The arguments for how difficult it would be to get the value correct if it we used strings feel a little overblown - after all, the rest of the semantics are going to work this way. At the same time, I don't see us adding values to to span.kind very quickly, if at all. So I'm a bit 🤷‍♂ on all of this in general.

I would prefer to have a clearer definition of the purpose of span.kind. Are we just trying to denote serialization/propagation points in our trace? In that case, three values are enough: sender, receiver, and internal (call them what you will). I don't know that we need to be getting fine grained with this particular type system - i'd rather just see more semantic definitions for various types of span operations. In that sense, the producer and consumer values feel like half measures - what use do they provide?

@jmacd jmacd changed the title Loosen the SpanKind specification to allow string values Clarify the interpretation of SpanKind Oct 25, 2019
@jmacd
Copy link
Contributor Author

jmacd commented Oct 25, 2019

Please take another look.

Copy link
Member

@SergeyKanzhelev SergeyKanzhelev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it! Thank you for adding clarification for a SpanKind

the parent is expected to wait for it to complete under ordinary
circumstances. It can be useful for tracing systems to know this
property, since synchronous Spans may contribute to the overall trace
latency.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some RPC frameworks (well, not really 'P' as in 'procedure) support one-way calls, that may or may not be acked on delivery, but certainly do not provide for a result to be returned to the caller (e.g. firing off a UDP message).

What span_kind should the caller use in that case? Producer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this question points out that SpanKind is still flawed.

The question is whether these two should be independent or not. Can you be a CLIENT/PRODUCER? or a SERVER/CONSUMER? I.e., can you declare that there is a remote parent/child and an asynchronous relationship?

Can you be an internal span that has a PRODUCER/CONSUMER relationship with its child? These are capabilities we found in OpenTracing via ChildOf and FollowsFrom.

I am still skeptical of this field, but at least in the current state of this PR my understanding of the intention is closer to the group's.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SergeyKanzhelev what do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you that this PR improves current spec, but not addresses the problem of SpanKind expressiveness. Similar to #65 we just stalling on decision as not many consumer uses this information to the full extent.

For me - important thing is to distinguish incoming, internal and outgoing calls. As we had in OpenCensus. Mostly for the in-proc aggregation and analysis.

@SergeyKanzhelev
Copy link
Member

@open-telemetry/specs-approvers this PR makes spec better by explaining the intent behind SpanKind. Not semantics change. Please review.

@jmacd
Copy link
Contributor Author

jmacd commented Nov 1, 2019

@SergeyKanzhelev OK :) . Maybe we should remove PRODUCER/CONSUMER then.

I could imagine using bitwise logic to express more kinds of span, too. The dimensions are:

CLIENT/SERVER/INTERNAL these are exclusive properties to indicate whether a remote parent/child or not. one of these values is required to be set. These I perfectly agree with.

Now consider this proposal:

SYNC/ASYNC these are options, exclusive with each other, to indicate synchronicity, one of these may be set in addition to CLIENT/SERVER/INTERNAL.

That leaves 6 possible span kinds, but it doesn't quite help with the problem as I see it, and the problem is that we're mixing descriptions that apply to the parent and to the child.

In the existing spec, PRODUCER says "I will not wait for my child", CONSUMER says "My parent will not wait for me". One describes the relationship to a child, one describes the relationship to a parent.

If I'm an INTERNAL span, should I prefer to annotate the relationship with my parent, or my child? What if I want to express "I will not wait for my children AND my parent will not wait for me"? (It's a producer and a consumer.)

This leads me to think that the two options are really not exclusive, so consider this proposal:

ASYNC_PARENT: I will not wait for (all) my children
ASYNC_CHILD: My parent will not wait for me

This leads to 9 possible span kinds

CLIENT
CLIENT/ASYNC_PARENT
CLIENT/ASYNC_PARENT/ASYNC_CHILD
INTERNAL/... (3 total)
SERVER/... (3 total)

All of this leaves me feeling like the OpenTracing FollowsFrom concept was spot on, which is ASYNC_CHILD in the above proposal. Maybe we don't need ASYNC_PARENT.

@jmacd
Copy link
Contributor Author

jmacd commented Nov 1, 2019

^^^ That said, I'd be happy to merge this and continue working on it.

@yurishkuro
Copy link
Member

@jmacd I like your suggestion of separating the different semantic bits. However, if we do that, I would prefer the following

  • instead of client/server, use egress/ingress/internal
  • do not mix async-ness into span kind at all

For critical path analysis I don't think it is sufficient to know simply sync/async flag. A span represents a pair of events, start and end. The causality relationships are easy to define in terms of events (via "happens before" relationship), but not as easy between pairs of events. I would argue that knowing causality is more important than knowing sync/async nature, as the latter can be inferred from the former.

Microsoft Project has long used 4 types of "task links" to define relationships between tasks, like start-to-finish, finish-to-finish, etc. I am not sure if those are sufficient to express all possible happens-before combinations between parent start/end and child start/end. Ideally, we would come up with a nomenclature that could also be applied to Span Links, which have an additional uncertainty about the start events (in case of parent/child spans, we always know that parent-start happens-before child-start).

NB: I don't have objections to merging the current PR, but I think we need to address the causality uncertainty.

@yurishkuro
Copy link
Member

PS: if ingress/egress terminology might be confusing, another one is inbound/outbound/internal.

@yurishkuro
Copy link
Member

yurishkuro commented Nov 3, 2019

Another interesting span kind could be "proxy", which is both inbound/outbound. As the spec if phrased now, it's not clear what span kind would, say, a sidecar generate (for sidecar it does not make sense to be generating a pair of server/client spans).

@bogdandrutu
Copy link
Member

@yurishkuro I proposed the same thing some time ago, load_balancer fits into the same category.

@jmacd
Copy link
Contributor Author

jmacd commented Nov 4, 2019

@yurishkuro I feel like you've contradicted yourself, here #337 (comment) and here #337 (comment)

I'd be happy to iterate on this, let's get this PR merged.

Co-Authored-By: Sergey Kanzhelev <S.Kanzhelev@live.com>
@yurishkuro
Copy link
Member

@yurishkuro I feel like you've contradicted yourself...

I don't think so. #337 (comment) refers to spans created by a regular microservice. Even if it has 1-1 inbound/outbound spans, the purpose of those spans is generally different. But in case of a proxy or load-balancer, I don't consider its outbound spans as "calling a dependency", nor its inbound spans as "handling an endpoint". One can, of course, model them that way, but in practice nobody needs such overhead and proxies usually create just a single span (or sometimes no span at all, just enrich the existing one from the service through post-processing merge). But it does need a different span kind to represent that (very common for proxies) behavior.

@jmacd
Copy link
Contributor Author

jmacd commented Nov 4, 2019

Thanks.

@jmacd jmacd deleted the jmacd/loosen_span_kind branch November 25, 2019 19:19
SergeyKanzhelev added a commit to SergeyKanzhelev/opentelemetry-specification that referenced this pull request Feb 18, 2020
* Loosen the SpanKind specification to allow string values

* Updates

* Updates

* Updates

* Update specification/api-tracing.md

Co-Authored-By: Sergey Kanzhelev <S.Kanzhelev@live.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants