Defining transition period for pre-stability HTTP semconv breaking changes #3362

Closed
trask opened this issue Apr 4, 2023 · 16 comments · Fixed by #3443
Labels
area:semantic-conventions Related to semantic conventions spec:metrics Related to the specification/metrics directory spec:trace Related to the specification/trace directory

Comments

@trask
Member

trask commented Apr 4, 2023

Given the widespread adoption of the OpenTelemetry HTTP semantic conventions, and the extensive pre-stability breaking changes we are planning as part of ECS alignment, we are planning to provide a transition period to give users and vendors time to adapt to these changes.

Here is an initial proposal from the HTTP semconv stability WG:

  • Step 1: Propose spec PR to align HTTP and network semantic conventions with ECS
  • Step 2 (in parallel with step 1): Prototype Java and Python instrumentation
    • Include a flag to emit the pre-ECS alignment semantic conventions
    • Not relying on schema translations since that doesn’t address attribute-based samplers, and also because of concerns raised around performance
  • Step 3: Merge ECS alignment spec PR and mark HTTP semantic conventions as frozen
    • (call this spec version X)
  • Step 4: Mark HTTP semconv stable after releasing instrumentation across 3 languages that conform to spec version X
  • Step 5: For the first 6(?) months after spec version X is released, any instrumentation package published under the OTel org that emits HTTP semantic conventions for spec version X, must include a flag that can be enabled to emit spec version X-1
    • After that time, the flag can be dropped from the instrumentation packages
@trask trask added area:semantic-conventions Related to semantic conventions spec:metrics Related to the specification/metrics directory spec:trace Related to the specification/trace directory labels Apr 4, 2023
@trask trask self-assigned this Apr 4, 2023
@reyang
Member

reyang commented Apr 4, 2023

@vishweshbankwar FYI.

@jack-berg
Member

This seems like a good plan. However, it is quite labor intensive, and I hope we don't feel like this type of process is necessary to make changes to any experimental semantic conventions. Hopefully http semantic conventions are the exception due to their wide use.

@mateuszrzeszutek
Member

For the first 6(?) months after spec version X is released, any instrumentation package published under the OTel org that emits HTTP semantic conventions for spec version X, must include a flag that can be enabled to emit spec version X-1

Should the instrumentations emit exclusively X or X-1 attributes? Or should the configuration setting allow exporting all of them (e.g. both http.url and url.* attribute sets) to ease the migration?
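To make the three options in this question concrete, here is a minimal sketch of an instrumentation helper that can emit only the old attributes, only the new ones, or both. Everything in it is an assumption for illustration: the `SemconvMode` enum, the `http_attributes` function, and the old-to-new name mapping are hypothetical (only `http.url` and the `url.*` namespace are mentioned in this thread), and no decision on the flag's shape has been made here.

```python
from enum import Enum


class SemconvMode(Enum):
    """Hypothetical setting controlling which attribute set is emitted."""
    OLD = "old"    # only spec version X-1 attributes
    NEW = "new"    # only spec version X attributes
    BOTH = "both"  # both sets, to ease migration


# Illustrative mapping only; the actual renames are defined by the spec PR.
OLD_TO_NEW = {
    "http.method": "http.request.method",
    "http.url": "url.full",
    "http.status_code": "http.response.status_code",
}


def http_attributes(method, url, status_code, mode):
    """Build span attributes for one HTTP request under the given mode."""
    old = {
        "http.method": method,
        "http.url": url,
        "http.status_code": status_code,
    }
    if mode is SemconvMode.OLD:
        return old
    # Derive the new attribute set by renaming each old key.
    new = {OLD_TO_NEW[key]: value for key, value in old.items()}
    if mode is SemconvMode.NEW:
        return new
    # BOTH: emit the union of the two sets.
    return {**old, **new}
```

Emitting both sets doubles the HTTP attribute count per span, which is the trade-off against an easier migration for backends and attribute-based samplers.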

@MSNev

MSNev commented Apr 4, 2023

Include a flag to emit the pre-ECS alignment semantic conventions

I would recommend NOT doing this and instead introducing the changes as a new major version (with breaking changes) or as a new package, deprecating the previous (current) packages.

The reasons are several:

  • Total "Size" of the instrumentation code
  • Additional complexity of supporting this flag
  • Eventually this flag will become obsolete

@tigrannajaryan
Member

tigrannajaryan commented Apr 4, 2023

I think this plan is too disruptive and does not give enough time for users and vendors to prepare for the change.

I would prefer to be more careful and instead do this:

  • Step 1: In semantic convention yaml files introduce a pair of keys effective_until_ver/effective_from_ver that can be optionally present for every convention.
    • The build-tools markdown generator must produce markdown that clearly shows the old (until) and new (from) conventions at the same time, with version numbers clearly shown next to each convention.
    • The build-tools code generator must ONLY produce code for the semantic conventions that are already effective and ignore conventions that are not yet effective, based on the target version number that we want to produce.
  • Step 2: Propose spec PR to align HTTP and network semantic conventions with ECS. Set effective_from_ver: 1.26.0 (e.g. 6 minor versions/months from now). This means ECS conventions will NOT be produced by code generator if the target is set to 1.20.0 (the next version we will release) and for the 5 versions after that.
  • Step 3: Merge ECS alignment spec PR and mark HTTP semantic conventions. Merging this PR sets in stone our decision to switch to post-ECS conventions by a specific version (1.26.0).
  • Step 4 (in parallel with steps 1 and 2): Prototype Java and Python instrumentation
    • Include a flag to emit the post-ECS alignment semantic conventions. Continue emitting the current (old) conventions by default.
    • The build-tools code generator may be used to produce code for 1.26.0 to be used in post-ECS emitting code when the flag is enabled.
  • Step 5: Once 1.26.0 spec is released change the instrumentation packages to default to new (post-ECS) conventions, but support a flag to emit old conventions.
    • After some time, the flag can be dropped from the instrumentation packages.
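As a rough sketch of the code-generator rule in step 1, the snippet below filters conventions by a target version using the proposed `effective_from_ver` and `effective_until_ver` keys. The `Convention` dataclass, the helper names, and the sample attribute names are hypothetical, and treating `effective_until_ver` as an exclusive bound is an assumption; this is not how build-tools works today.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


def parse_version(text: str) -> Tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH' into a tuple that compares numerically."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


@dataclass
class Convention:
    """Hypothetical in-memory form of one semconv yaml entry."""
    name: str
    effective_from_ver: Optional[str] = None   # first version it applies to
    effective_until_ver: Optional[str] = None  # first version it no longer applies to


def is_effective(conv: Convention, target: str) -> bool:
    """True if the convention should be generated for the target version."""
    t = parse_version(target)
    if conv.effective_from_ver is not None and t < parse_version(conv.effective_from_ver):
        return False  # not yet effective
    if conv.effective_until_ver is not None and t >= parse_version(conv.effective_until_ver):
        return False  # already retired (assumed exclusive upper bound)
    return True


def select_for_codegen(conventions: List[Convention], target: str) -> List[str]:
    """Names the code generator would emit constants for."""
    return [c.name for c in conventions if is_effective(c, target)]


conventions = [
    Convention("http.url", effective_until_ver="1.26.0"),
    Convention("url.full", effective_from_ver="1.26.0"),
    Convention("http.route"),  # unchanged, always effective
]
```

With these sample entries, a target of `1.20.0` would select the old name and the unchanged one, and a target of `1.26.0` would select the new name and the unchanged one. The markdown generator, by contrast, would render all three with their version bounds.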

@cartermp
Contributor

cartermp commented Apr 4, 2023

Is there a list of what exactly the breaking changes are? Transition period is most influenced by the scope of these changes. I see this table but it's unclear to me what is committed as a breaking change vs. what's just going to continue on with the existing name.

@lmolkova
Contributor

lmolkova commented Apr 4, 2023

Is there a list of what exactly the breaking changes are? Transition period is most influenced by the scope of these changes. I see this table but it's unclear to me what is committed as a breaking change vs. what's just going to continue on with the existing name.

the prototyped ECS onboarding changes are in this draft and the full list is in the PR description #3355

@trask
Member Author

trask commented Apr 4, 2023

@tigrannajaryan do you see a path forward that would allow us to declare HTTP semantic conventions stable at the start of the transition period instead of at the end of the transition period?

(this would allow people who are currently waiting on stable HTTP semantic conventions to move forward without needing to wait for the transition period to end)

@tigrannajaryan
Member

do you see a path forward that would allow us to declare HTTP semantic conventions stable at the start of the transition period instead of at the end of the transition period?

@trask yes, I think we can declare them stable starting from version current+X where X is how much advance notice we want to give. We can still make the decision now and agree that it goes into effect in the future.

(this would allow people who are currently waiting on stable HTTP semantic conventions to move forward without needing to wait for the transition period to end)

I think with my plan they do not need to wait. We can immediately go ahead and merge the PR that declares HTTP semantic convention stable and going into effect from 1.26.0, then we go ahead and implement the new conventions in instrumentation libraries and then anyone who wants to immediately use the new conventions can go ahead and enable the flag to opt-in early, without waiting for the new conventions to become the default.

To be clear: I don't insist on my particular plan, it is just one possible option. I am happy with any other approach that gives a similar amount of time for people who still use the old (current) conventions to prepare for the change.

@trask
Member Author

trask commented Apr 11, 2023

I think there are two(?) reasons for the transition period:

  1. Give time for vendors to support the new HTTP semconv changes
  2. Give users a transition period during which both the old and new HTTP semconv are supported on the instrumentation side

To satisfy (1), it seems we need to have a time period (e.g. X months) between when the new HTTP semconv changes are merged and when instrumentations are allowed to emit them by default (an opt-in flag to emit them could be ok).

To satisfy (2), it seems we need to have an additional time period (e.g. Y months) after any given instrumentation initially starts emitting the new HTTP semconv during which there’s also a supported version of the instrumentation which emits the old HTTP semconv.

I think there are a couple of ways instrumentations can satisfy (2):

  • Support a flag to switch back to the old HTTP semconv
  • Patch (as needed, e.g. for security) the old instrumentation version which emits the old HTTP semconv

For Java we can commit to supporting a flag, but I think it would be good to have a lighter-weight option for languages that do not have full-time instrumentation maintainers.

I’m hoping that both (1) and (2) can be accomplished without requiring special semconv yaml changes and build tool support, since that would require a good amount of effort and may not be useful beyond this one application. This may require HTTP instrumentations to hard-code in the old semantic convention constants until they can drop support for those. (EDIT see comment directly below)

A couple extra thoughts/questions:

  • If we think we can reduce the time period for (1) to something small like 2 months, it could open up an additional option, e.g.
    • hold back all instrumentations from adopting latest semconv (just skip build-tools code generation) and go with the patching the old semconv instrumentation as needed (EDIT n/a due to comment directly below)
  • Should this transition period apply to all languages and all HTTP instrumentations? E.g. Rust HTTP instrumentation (which doesn’t exist yet), or some lesser used Python HTTP framework
  • The longer time period (1) is, the more users will onboard to the old semconv

@trask
Member Author

trask commented Apr 11, 2023

I’m hoping that both (1) and (2) can be accomplished without requiring special semconv yaml changes and build tool support, since that would require a good amount of effort and may not be useful beyond this one application. This may require HTTP instrumentations to hard-code in the old semantic convention constants until they can drop support for those.

I may be wrong about this. I like @lmolkova's suggestion in #3355 to implement a deprecated flag in semconv yaml, which would allow the build-tools code generation to still generate the deprecated constants, and does have usefulness beyond this one application (e.g. currently in Java @jkwatson manually edits the constants file with each release to keep the @Deprecated constants)

@tigrannajaryan
Member

tigrannajaryan commented Apr 11, 2023

I may be wrong about this. I like @lmolkova's suggestion in #3355 to implement a deprecated flag in semconv yaml, which would allow the build-tools code generation to still generate the deprecated constants, and does have usefulness beyond this one application (e.g. currently in Java @jkwatson manually edits the constants file with each release to keep the @Deprecated constants)

I think what I was suggesting earlier was a slightly more formal version of this. The deprecated flag does not tell when exactly the old conventions will be removed, so it will have to be accompanied by some commentary that explains how much time there is until the deprecated conventions are gone. It also does not tell when exactly the new conventions that replace the deprecated ones become the new default.

I think it is important to have this information recorded somewhere. If we don't want it to be formalized and added to yaml files I am fine with it (a text will probably do too).

To re-iterate what I am looking for:

  • A clear notice of X months that a convention will be removed. I think X~=6 is reasonable.
  • A clear notice of X months that a new convention will replace it.
  • Until the switch-over happens, old conventions remain the default. This is important: we must make sure that merging a PR with new conventions does not immediately break existing instrumentations. This was the primary reason I wanted to formalize this in yaml, to help prevent mistakes.
  • Until the switch-over happens new conventions may be enabled by the end user using whatever means we want to give them.

Should this transition period apply to all languages and all HTTP instrumentations? E.g. Rust HTTP instrumentation (which doesn’t exist yet), or some lesser used Python HTTP framework

I think the answer should be yes. Backends typically don't try to change the behavior based on the source language or the popularity of the framework used.

@trask
Member Author

trask commented Apr 11, 2023

I think X~=6 is reasonable.

in the PR I just sent #3381, I put X=3 months for vendors to support the new conventions, and Y=3 months on top of that for users to migrate.

this gives a minimum of 6 months before users MUST migrate

I'd prefer to increase Y over increasing X if possible, because the larger X is, the more first-time users will onboard to the old semconv and will need to migrate
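For anyone who wants to see the X and Y arithmetic as dates, here is a small sketch. The merge date used in the usage note is made up, the function names are hypothetical, and the milestones are the earliest possible ones: Y actually starts whenever a given instrumentation first emits the new semconv, which may be later than the end of X.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Add whole calendar months, clamping the day to the month's length."""
    index = start.year * 12 + (start.month - 1) + months
    year, month = index // 12, index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def transition_milestones(merged: date, x_months: int = 3, y_months: int = 3) -> dict:
    """Earliest milestone dates for a spec change merged on `merged`."""
    # After X months, instrumentations may emit the new semconv by default.
    new_default_allowed = add_months(merged, x_months)
    # After a further Y months, support for the old semconv may be dropped.
    old_support_may_end = add_months(new_default_allowed, y_months)
    return {
        "new_default_allowed": new_default_allowed,
        "old_support_may_end": old_support_may_end,
    }
```

For a hypothetical merge on `date(2023, 5, 15)`, the defaults would allow new-by-default emission from mid-August 2023 and end of old-semconv support no earlier than mid-November 2023. Raising `y_months` moves only the second date, which is the preference stated above.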

@trask
Member Author

trask commented Apr 12, 2023

@tedsuo proposed another alternative that I think is worth getting feedback on:

Whenever an HTTP instrumentation authored by OpenTelemetry adopts the new HTTP semconv, it SHOULD bump its major version.

The goal of this is to prevent auto-updates from users who have pinned to a major version, e.g. ^0.33, and to provide an extra clear signal to users who are updating that they should expect breaking changes (even though they should hopefully be expecting breaking changes already since all of the instrumentations they are using are marked unstable, e.g. 0.x, 1.x-alpha, 1.x-rc).

In practice, this would mean:

  • Python, Go (and others) use a single 0.x version for unstable artifacts and a single 1.x version for stable artifacts, and would need to bump 0.x HTTP instrumentation over to 1.x-beta or 1.x-rc.

  • Java would bump the Java agent from 1.x to 2.x, but the library instrumentation which is marked -alpha would just need to bump minor version since pinned prerelease versions only apply to a single minor version.

  • I'm not sure what it means for .NET which is currently at 1.0.0-rc9.14. Maybe we can defer to the .NET SIG who would be more knowledgeable about how their community uses version pinning and upgrading dependencies.

Then as far as transition plan goes

  • Vendors can publish which major versions they support
  • The prior major version can be supported (i.e. security patches) for some period to give users time to migrate

@cartermp
Contributor

Not to complexify an already complex situation, but something that I think is missing here is some degree of accounting on the impact for tools/vendors that accept this data.

For example, some backends can handle this change pretty smoothly by "double writing" events: when the backend detects one name, it also writes the other. This keeps all existing queries, alerts, SLOs, etc. working when they're based on older attributes and allows them to work in the face of mixed data. And as instrumentation gradually moves to be all or nearly all based on the new version, data retention can kick in by aging out the old stuff and then there's no "mess" to go clean up later. But I doubt it will be that smooth for every backend. Are there categories of backend that are more/less impacted by this change? Is there any general guidance we can offer, or is that getting too out of scope?

It would also be good to get some degree of accounting on the impact this has for tools that sit in someone's observability pipeline. For example, a sampling proxy that lets you configure keys for sampling purposes will be impacted by this. Would an end-user need to make sure they upgrade all instrumentation across all services if they're doing tail sampling? Or is it more that end-users would need to write rules (assuming it's possible) that check for the existence of either key? Or could a tool "just handle it" somehow?

I realize the scope of OTel is explicitly just on the instrumentation front, but if we don't want to put the burden of migration on every end-user of OTel, I think we need to elaborate a lot more on how tools in the ecosystem can and should generally respond to these changes.
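To illustrate the two coping strategies mentioned above, here is a minimal sketch of (a) a lookup that accepts either the old or the new key, as a sampling rule might, and (b) "double writing" on ingest. The rename table and the function names are hypothetical placeholders, not an agreed mapping or any real tool's API.

```python
# Illustrative rename table only; the actual renames are defined by the spec PR.
OLD_TO_NEW = {
    "http.url": "url.full",
    "http.status_code": "http.response.status_code",
}
# Bidirectional view so a rule may be written against either name.
EQUIVALENT = {**OLD_TO_NEW, **{new: old for old, new in OLD_TO_NEW.items()}}


def get_either(attributes, key, default=None):
    """Look up `key`, falling back to its old/new counterpart if absent."""
    if key in attributes:
        return attributes[key]
    other = EQUIVALENT.get(key)
    if other is not None and other in attributes:
        return attributes[other]
    return default


def double_write(attributes):
    """Return a copy with the counterpart of every known key filled in."""
    out = dict(attributes)
    for key, value in attributes.items():
        other = EQUIVALENT.get(key)
        if other is not None and other not in out:
            out[other] = value
    return out


def is_server_error(attributes):
    """Example sampling predicate that works on mixed old/new data."""
    status = get_either(attributes, "http.response.status_code")
    return status is not None and status >= 500
```

A pipeline component using something like `get_either` would keep working while services upgrade at different times; `double_write` trades extra storage for keeping existing queries and dashboards intact.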

@puckpuck

I fear this is going to cause issues for users, especially those in a complex microservices environment.

Several SDKs have been stable with http auto-instrumentation for quite some time. A microservice that isn't frequently updated has an SDK in place and will continue to work as is. Why fix it if it isn't broken? With this change we will now have a set of microservices using a new attribute name and others still on the old. This will break what attributes users are querying and displaying on dashboards.

9 participants