Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define semantic conventions and instrumentation stability #2180

Conversation

tigrannajaryan
Copy link
Member

@tigrannajaryan tigrannajaryan commented Nov 30, 2021

This PR defines the rules for SDKs in all languages with regards to marking
instrumentations as "Stable" and "Unstable" and how such instrumentations
are allowed to evolve over time.

Summary of changes:

  • Explain that schemas define what changes are allowed.
  • Clarify that additive changes are an exception and are always allowed.
  • Prohibit instrumentation changes until April 1, 2023 to allow recipients
    (backends/vendors) to start understanding Schema URLs/Schema Files.
    This gives one year advance notice, which I think should be enough for
    all parties to prepare.

@tigrannajaryan tigrannajaryan requested review from a team November 30, 2021 23:52
@tigrannajaryan tigrannajaryan force-pushed the semantic-convention-stability branch 3 times, most recently from db79cb2 to 9fb4cf3 Compare December 1, 2021 00:05
specification/versioning-and-stability.md Outdated Show resolved Hide resolved
specification/versioning-and-stability.md Show resolved Hide resolved
specification/versioning-and-stability.md Outdated Show resolved Hide resolved
Copy link
Contributor

@jsuereth jsuereth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ADDITIVE changes to metrics can break user alerts by creating new timeseries and adjusting thresholds. I think we need to be a lot more careful around changes, particularly around metrics here.

@tigrannajaryan
Copy link
Member Author

tigrannajaryan commented Dec 1, 2021

ADDITIVE changes to metrics can break user alerts by creating new timeseries and adjusting thresholds. I think we need to be a lot more careful around changes, particularly around metrics here.

@jsuereth I can change the PR to explicitly prohibit this but I think what you suggest contradicts with an existing understanding that attributes are optional, see this discussion #2077

I don't know which one is correct, I'll let metric expert to decide and then I am happy to adjust this PR according to that.

@tigrannajaryan
Copy link
Member Author

ADDITIVE changes to metrics can break user alerts by creating new timeseries and adjusting thresholds. I think we need to be a lot more careful around changes, particularly around metrics here.

@jsuereth I modified the PR to disallow addition of new attributes to existing metrics. PTAL.

@tigrannajaryan tigrannajaryan dismissed jsuereth’s stale review December 1, 2021 19:13

Changed the PR to avoid the blocking issue.

Copy link
Contributor

@jsuereth jsuereth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for resolving my blocking comment! I still have some concerns, but not as strongly blocking :)

For clarity - I agree with the sentiment and goal of this PR and approve, just nits on details/scope of what's stated here, including implications on instrumentation development.

specification/versioning-and-stability.md Show resolved Hide resolved
specification/versioning-and-stability.md Show resolved Hide resolved
specification/versioning-and-stability.md Outdated Show resolved Hide resolved
@trask
Copy link
Member

trask commented Dec 10, 2021

Can you clarify how this applies to auto-instrumentation distributions?

Should we hide all telemetry producing auto-instrumentation in a distribution behind an experimental flag, and only enable them by default once their respective semantic conventions have reached stability and we can commit to telemetry stability?

@tigrannajaryan
Copy link
Member Author

Can you clarify how this applies to auto-instrumentation distributions?

Should we hide all telemetry producing auto-instrumentation in a distribution behind an experimental flag, and only enable them by default once their respective semantic conventions have reached stability and we can commit to telemetry stability?

No, I believe it is fine to produce telemetry from auto-instrumentation. This PR only says that after you declare your auto-instrumentation "Stable" you cannot change the produced telemetry until certain conditions are met (SchemaURL in the output and wait until certain date). If the text in the PR is unclear I will fix it. Let me know which part you find unclear.

@trask
Copy link
Member

trask commented Dec 10, 2021

Can you clarify how this applies to auto-instrumentation distributions?
Should we hide all telemetry producing auto-instrumentation in a distribution behind an experimental flag, and only enable them by default once their respective semantic conventions have reached stability and we can commit to telemetry stability?

No, I believe it is fine to produce telemetry from auto-instrumentation. This PR only says that after you declare your auto-instrumentation "Stable" you cannot change the produced telemetry until certain conditions are met (SchemaURL in the output and wait until certain date). If the text in the PR is unclear I will fix it. Let me know which part you find unclear.

For individual instrumentations, we use the version number to "clearly mark" them as unstable, e.g. 0.x or 1.x-alpha.

For distributions (e.g. the opentelemetry javaagent), should we be marking them as "unstable" (e.g. 0.x or 1.x-alpha) if they emit any unstable instrumentation?

@tigrannajaryan
Copy link
Member Author

For individual instrumentations, we use the version number to "clearly mark" them as unstable, e.g. 0.x or 1.x-alpha.

For distributions (e.g. the opentelemetry javaagent), should we be marking them as "unstable" (e.g. 0.x or 1.x-alpha) if they emit any unstable instrumentation?

The "unstable" vs "stable" labeling of a library or of a package (or individual instrumentations if that's the granularity of labeling you choose to maintain) is done based on the fact whether you are able to take the responsibility to fulfill the requirements of the "stable" labeling. What "stable" label means is described in this document.

Particularly with regards to telemetry emitted by the instrumentation this document (after accepting this PR) says that if you label the instrumentation "stable" then you must fulfil the following obligation:

  • If you are not including Schema URL in the emitted telemetry then you are not allowed to modify the telemetry (with certain exceptions carved out in the doc).
  • If you are including Schema URL then you are allowed to modify the telemetry after certain date, provided that they are described by a published Schema File (with the same exceptions as above applicable).

If you are not labeling the instrumentation as "stable" then you free to do whatever you want. Any changes are fine.

@trask
Copy link
Member

trask commented Dec 10, 2021

The "unstable" vs "stable" labeling of a library or of a package (or individual instrumentations if that's the granularity of labeling you choose to maintain) is done based on the fact whether you are able to take the responsibility to fulfill the requirements of the "stable" labeling. What "stable" label means is described in this document.

Thanks @tigrannajaryan. Given that we don't want to mark the entire javaagent distribution as "unstable", it sounds like we will need to disable all "unstable" auto-instrumentation in the distribution by default, and only enable them (by default) once their respective semantic conventions have reached stability and we can commit to the telemetry stability described in this document.

Copy link
Contributor

@jmacd jmacd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a number of questions about metrics here. The Views support in each metrics SDK should show it's possible to perform attribute removal safely (correctly, and relatively easily compared with doing so after-the-fact). This topic came up in #2208. I expect we will eventually be able to support metrics schema translation via Views and relax these restrictions, but it's good to be conservative for now.

@tigrannajaryan
Copy link
Member Author

@Oberon00 please take another look.

@tigrannajaryan
Copy link
Member Author

@open-telemetry/python-maintainers can you please take a look? This affects instrumentations in all languages. It would be good to have your approval if you agree with this approach.

@tigrannajaryan
Copy link
Member Author

@dyladan if the PR looks good to you please approve on behalf of JS SIG. I want to have an approval from each (or at least most) language SIG.

Copy link
Member

@dyladan dyladan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed this at the JS meeting last wednesday and the other @open-telemetry/javascript-maintainers brought up the same points as have been brought up by myself, @alanwest, and others around instrumentation which is not in the semantic conventions. There will always be some instrumentations which for whatever reason are not covered by the semantic conventions whether they have too small of a userbase, are too specific, or provide some custom functionality. This PR seems to prevent those instrumentations from ever being marked as "stable" which will limit their adoption.

From a maintenance standpoint it is not too big of a burden for maintainers to gate stable on the guarantees outlined here, but I believe the restrictions will prevent people from upstreaming contributions to the contrib repo unless we have some way to provide custom functionality that isn't covered by the semantic conventions.

As mentioned earlier, this is still marked experimental so isn't blocking, but I want to make sure the point isn't forgotten before this is finalized and made stable.

@tigrannajaryan
Copy link
Member Author

From a maintenance standpoint it is not too big of a burden for maintainers to gate stable on the guarantees outlined here, but I believe the restrictions will prevent people from upstreaming contributions to the contrib repo unless we have some way to provide custom functionality that isn't covered by the semantic conventions.

As mentioned earlier, this is still marked experimental so isn't blocking, but I want to make sure the point isn't forgotten before this is finalized and made stable.

Make sense. I will think about we can make this explicitly allowed. To be clear, even in this current form we do not completely prevent Stable implementations which break the rules, since I used the "SHOULD" clause, so there is always a possibility to ignore it if there is a good reason. Anyway, I will file a github issue to take care of this, if this PR gets merged (I am not yet sure, so I don't want to open an issue about a document that doesn't exist in the spec yet).

@tigrannajaryan
Copy link
Member Author

I believe so far we have approvals from JS, Go, Java, .Net maintainers.

@trask
Copy link
Member

trask commented Apr 4, 2022

There will always be some instrumentations which for whatever reason are not covered by the semantic conventions whether they have too small of a userbase, are too specific, or provide some custom functionality. This PR seems to prevent those instrumentations from ever being marked as "stable" which will limit their adoption.

@tigrannajaryan I thought there were two options for instrumentation that are not covered by semantic conventions to go stable:

  • Fixed Schema Telemetry Producers - decide not to evolve the schema via the standard schema mappings
  • Schema-File Driven Telemetry Producers - publish your own schema file (we have discussed doing this in Java for these one-off instrumentations that do not have a future in the specification repo)

@tigrannajaryan
Copy link
Member Author

There will always be some instrumentations which for whatever reason are not covered by the semantic conventions whether they have too small of a userbase, are too specific, or provide some custom functionality. This PR seems to prevent those instrumentations from ever being marked as "stable" which will limit their adoption.

@tigrannajaryan I thought there were two options for instrumentation that are not covered by semantic conventions to go stable:

  • Fixed Schema Telemetry Producers - decide not to evolve the schema via the standard schema mappings
  • Schema-File Driven Telemetry Producers - publish your own schema file (we have discussed doing this in Java for these one-off instrumentations that do not have a future in the specification repo)

@trask that is correct. And for both options they should also follow this rule:

Stable instrumentations authored by OpenTelemetry SHOULD NOT produce telemetry that is not described by OpenTelemetry semantic conventions.

I think @dyladan's concern is that this is too much limitation and the rules are too rigid for instrumentations contributed by non-core maintainers.

@trask
Copy link
Member

trask commented Apr 4, 2022

And for both options they should also follow this rule:

Stable instrumentations authored by OpenTelemetry SHOULD NOT produce telemetry that is not described by OpenTelemetry semantic conventions.

oh, I missed this, thx! I agree this is too limiting for the reason @dyladan mentions:

There will always be some instrumentations which for whatever reason are not covered by the semantic conventions whether they have too small of a userbase, are too specific, or provide some custom functionality

Summary of changes:

- Explain that schemas define what changes are allowed.
- Clarify additive changes are an exception and are always allowed.
- Prohibit instrumentation changes until July 1, 2022 to allow recipients
  (backends/vendors) to start understanding Schema URLs/Schema Files.
@tigrannajaryan
Copy link
Member Author

I believe so far we have approvals from JS, Go, Java, .Net maintainers.

In addition we now have approvals from Ruby and Python maintainers.

I believe this is good enough and we can move forward with this approach. Merging the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.