Propose OpenTelemetry Profiling Vision (#212)
This change proposes high-level items that define our long-term vision for
profiling support in the OpenTelemetry project.

A group of open source maintainers, vendors, end users, and developers excited about profiling and OpenTelemetry has been meeting for ~8 weeks now, and this document was created collectively to share with the broader community for feedback. Roughly 15 people have contributed directly to the document, and over 60 people have attended our meetings over the past couple of months.

This idea of "Adding profiling as a support event type" has also been discussed at length in [this issue](#139) created in October 2020. 

If this proposal is accepted/approved then we will proceed with filling out this [project tracking issue](open-telemetry/opentelemetry-specification#2731) and following the other procedures outlined in the [project management instructions](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/project-management.md). 

Comments, ideas, feedback, etc. are all very welcome and highly appreciated!
Rperry2174 committed Sep 22, 2022 · commit 6e18af5 (1 parent: d226b67)

text/profiles/0212-profiling-vision.md (new file, 188 additions)
# Propose OpenTelemetry profiling vision

The following high-level items define the long-term vision we aspire to for
profiling support in the OpenTelemetry project.

While this vision document reflects our current desires, it is meant to be a
guide towards a collectively agreed upon set of objectives rather than a
checklist of requirements. A group of OpenTelemetry community members,
representing a cross-section of industry and domain expertise, has participated
in a series of bi-weekly meetings over the past two months and found common
cause in the creation of this document. It is our shared intention to continue
to ensure alignment moving forward; as our vision evolves and matures, we
intend to incorporate what we learn to facilitate the best possible outcome.

This document and efforts thus far are motivated by:

- This [long-standing issue](https://github.com/open-telemetry/oteps/issues/139)
created in October 2020
- A conversation about priorities at the in-person OpenTelemetry meeting at
  KubeCon EU 2022
- Increasing community interest in profiling as an observability signal
alongside logs, metrics, and traces

## What is profiling

While the terms "profile" and "profiling" can have slightly different meanings
depending on the context, for the purposes of this OTEP we are defining the two
terms as follows:

- Profile: A collection of stack traces with some metric associated with each
  stack trace, typically representing the number of times that stack trace was
  encountered (see the sketch after these definitions)
- Profiling: The process of collecting profiles from a running program,
application, or the system
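
To make the definition concrete, the sketch below builds a toy profile in the
spirit of the "collapsed" stack format referenced later in this OTEP: each
entry pairs one stack trace with the number of times it was sampled. The
function names are hypothetical and the snippet is illustrative only, not a
proposed representation.

```go
package main

import "fmt"

func main() {
	// Toy profile: each key is one stack trace (root-first, frames separated
	// by semicolons, as in the "collapsed" format) and each value is the
	// number of times the profiler observed that stack. All names are
	// hypothetical.
	profile := map[string]int64{
		"main;handleRequest;parseBody":       17,
		"main;handleRequest;queryDB;sqlExec": 311,
		"main;gcBackgroundWorker":            42,
	}
	for stack, count := range profile {
		fmt.Printf("%s %d\n", stack, count)
	}
}
```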

## How profiling aligns with the OpenTelemetry vision

The [OpenTelemetry
vision](https://opentelemetry.io/mission/#vision-mdash-the-world-we-imagine-for-otel-end-users)
states:

_Effective observability is powerful because it enables developers to innovate
faster while maintaining high reliability. But effective observability
absolutely requires high-quality telemetry – and the performant, consistent
instrumentation that makes it possible._

While existing OpenTelemetry signals fit all of these criteria, until recently
no effort has been explicitly geared towards creating performant and consistent
instrumentation based upon profiling data.

## Making a well-rounded observability suite by adding profiling

Currently logs, metrics, and traces are widely accepted as the main “pillars”
of observability, each providing a different set of data that users can query
to answer questions about their systems and applications.

Profiling data can help further this goal by answering questions about a
system or application that logs, metrics, and traces are less equipped to
answer. We aim to facilitate implementations capable of best-in-class support
for collecting, processing, and transporting profiling data.

Our goals for profiling align with those of OpenTelemetry as a whole:

- **Profiling should be easy**: profiling offers fast time-to-value; often a
  minimal amount of drop-in code is enough to get immediate detail about
  application resource utilization (see the sketch after this list)
- **Profiling should be universal**: profiling today differs slightly from
  language to language, but with a little effort the representation of
  profiling data can be standardized so that not only are languages consistent
  with each other, but profiling data is also consistent with the other
  observability signals
- **Profiling should be vendor neutral**: from one profiling agent, users
  should be able to send data to whichever vendor they like (or no vendor at
  all) and interoperate with other OSS projects
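
OpenTelemetry has not yet defined a profiling SDK, so as an illustration of
what "a minimal amount of drop-in code" can look like today, the sketch below
uses Go's built-in `net/http/pprof` package: a single blank import exposes
CPU, heap, and goroutine profiles over HTTP without touching application code.
The port choice is arbitrary.

```go
package main

import (
	"log"
	"net/http"

	_ "net/http/pprof" // blank import registers /debug/pprof/* on the default mux
)

func main() {
	// Go's built-in profiles (CPU, heap, goroutine, block, mutex, ...) are now
	// served at http://localhost:6060/debug/pprof/ with no further
	// instrumentation of application code.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```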

## Current state of profilers

As it currently stands, the method for collecting profiles for an application
and the format of the profiles collected varies greatly depending on several
factors such as:

- Language (and language runtime)
- Profiler type
- Data type being profiled (e.g. CPU, memory)
- Availability or utilization of symbolic information

A fairly comprehensive taxonomy of various profiling formats can be found on the
[profilerpedia website](https://profilerpedia.markhansen.co.nz/formats/).

As a result of this variation, the tooling and collection of profiling data
fall short in exactly the areas that OpenTelemetry holds as its core
engineering values:

- Profiling currently lacks compatibility: each vendor, open source project,
  and language has its own way of collecting, sending, and storing profiling
  data, often with no way to link it to other signals
- Profiling currently lacks consistency: profiling agents and formats can
  change arbitrarily, with no unified criteria for how to take end users into
  account

## Making profiling compatible with other signals

Profiles are particularly useful in the context of other signals. For example,
having a profile for a particular “slow” span in a trace yields more actionable
information than simply knowing that the span was slow.

OpenTelemetry will define how profiles will be correlated with logs, traces, and
metrics and how this correlation information will be stored.

Correlation will work across two major dimensions:

- To correlate telemetry emitted for the same request (also known as request
  or trace context correlation); see the sketch after this list
- To correlate telemetry emitted from the same source (also known as resource
  context correlation)
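
OpenTelemetry has not yet specified how this correlation will be stored. As
one illustration of the request/trace dimension, the sketch below uses a
mechanism that exists in Go today, attaching the current trace and span IDs to
CPU profile samples via pprof labels. The handler and workload names are
hypothetical, and this is not the mechanism this OTEP proposes to standardize.

```go
package main

import (
	"context"
	"runtime/pprof"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/trace"
)

// handleRequest is a hypothetical request handler. While doWork runs, the
// current trace and span IDs are attached as pprof labels to every CPU
// profile sample taken in this goroutine, so a backend that understands both
// signals could join a "slow" span to the hot stacks observed during it.
func handleRequest(ctx context.Context) {
	ctx, span := otel.Tracer("example").Start(ctx, "handleRequest")
	defer span.End()

	sc := trace.SpanContextFromContext(ctx)
	pprof.Do(ctx, pprof.Labels(
		"trace_id", sc.TraceID().String(),
		"span_id", sc.SpanID().String(),
	), func(ctx context.Context) {
		doWork(ctx) // hypothetical unit of work being profiled
	})
}

func doWork(ctx context.Context) {}

func main() {
	// In a real service a TracerProvider would be configured and CPU
	// profiling started; both are omitted here for brevity.
	handleRequest(context.Background())
}
```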

## Standardize profiling data model for industry-wide sharing and reuse

We will design a profiling data model that aims to represent the vast majority
of profiling data, with the following goals in mind (an illustrative sketch
follows the list):

- Profiling formats should be as compact as possible
- Profiling data should be transferred as efficiently as possible; the model
  should be lossless, with an intentional bias towards efficient marshaling,
  transcoding (to and from other formats), and analysis
- Existing profiling formats (e.g. collapsed, pprof, JFR) should map
  unambiguously to the standardized data model
- Profiling formats should contain mechanisms for representing relationships
  to other telemetry signals (e.g. linking call stacks with spans)
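
This OTEP does not define the data model itself. Purely to illustrate the
goals above, the sketch below shows the general shape such a model could take,
loosely modeled on pprof's profile.proto: samples reference deduplicated
locations, functions, and strings by index, which keeps the representation
compact, lossless, and cheap to marshal. All type and field names are
hypothetical.

```go
// Package profilemodel is a purely hypothetical sketch of the shape a
// standardized profiling data model could take, loosely modeled on pprof's
// profile.proto. It is illustrative only and is not the model proposed or
// defined by this OTEP.
package profilemodel

// Profile is a collection of samples plus the shared tables they index into.
// Referencing deduplicated locations, functions, and strings by index (rather
// than repeating them per sample) keeps profiles compact and cheap to marshal.
type Profile struct {
	SampleTypes []ValueType // e.g. ("cpu", "nanoseconds") or ("alloc_space", "bytes")
	Samples     []Sample
	Locations   []Location // deduplicated code locations
	Functions   []Function // deduplicated function metadata
	StringTable []string   // all strings, referenced by index
}

// ValueType describes what each per-sample value measures and its unit.
type ValueType struct {
	TypeIdx int64 // index into StringTable
	UnitIdx int64 // index into StringTable
}

// Sample is one stack trace and its associated values. Labels are where links
// to other signals (such as a trace ID and span ID) could live.
type Sample struct {
	LocationIDs []uint64          // stack trace, leaf first
	Values      []int64           // one value per entry in SampleTypes
	Labels      map[string]string // e.g. "trace_id", "span_id"
}

// Location is a single frame, resolved (when symbols are available) to a
// function and line.
type Location struct {
	ID         uint64
	FunctionID uint64
	Line       int64
}

// Function carries symbolic information, referenced by Locations.
type Function struct {
	ID          uint64
	NameIdx     int64 // index into StringTable
	FilenameIdx int64 // index into StringTable
}
```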

## Supporting legacy profiling formats

For existing profilers we will provide instructions on how their legacy
formats can be emitted, or converted, in a manner that is compatible with
OpenTelemetry’s approach and enables telemetry data correlation.

In particular, for popular profilers such as those native to Go and Java
(JFR), we will help them produce OpenTelemetry-compatible profiles with
minimal overhead.
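
As a rough illustration of what such a conversion path can look like, the
sketch below parses an existing pprof profile with the
github.com/google/pprof/profile package and re-emits its stacks and values in
the neutral "collapsed" form. A real converter would target the eventual
OpenTelemetry model (which this OTEP does not define), and the input file name
is hypothetical.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"strings"

	"github.com/google/pprof/profile"
)

// Sketch: read a legacy pprof profile and print it as folded stacks. This is
// only meant to show that existing formats already carry the stacks and
// values needed for a lossless mapping to a common model.
func main() {
	f, err := os.Open("cpu.pprof") // hypothetical input file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	p, err := profile.Parse(f)
	if err != nil {
		log.Fatal(err)
	}

	for _, s := range p.Sample {
		var frames []string
		// pprof stores stacks leaf-first; reverse them to print root-first.
		for i := len(s.Location) - 1; i >= 0; i-- {
			lines := s.Location[i].Line
			for j := len(lines) - 1; j >= 0; j-- {
				if lines[j].Function != nil {
					frames = append(frames, lines[j].Function.Name)
				}
			}
		}
		fmt.Printf("%s %d\n", strings.Join(frames, ";"), s.Value[0])
	}
}
```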

## Performance considerations

Profiling agents can be architected in a variety of ways, with reasonable
trade-offs that may impact performance, completeness, accuracy, and so on.
Similarly, the manner in which such a profiler might produce or consume
OpenTelemetry-compatible data could vary significantly. As such, it is not
feasible for our standardization effort to be prescriptive about profiler
resource usage.

However, the output of OpenTelemetry's standardization effort must take into
account that some existing profilers are designed to be low overhead and high
performance. For example, they may operate in a whole-datacenter, always-on
manner, and/or in environments where they must guarantee low CPU/RAM/network
usage. The standardization effort should strive to produce a format that is
usable by profilers of this nature without sacrificing their performance
guarantees.

As with other OpenTelemetry signals, we target production environments. The
profiling signal must therefore be implementable with low overhead and must
conform to OpenTelemetry-wide requirements on runtime overhead, intrusiveness,
and wire data size.
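
As an illustration of the kind of low-overhead operation described above (not
a requirement this OTEP imposes), the sketch below shows the sampling knobs
Go's runtime already exposes; any standardized format needs to remain useful
for data collected under this kind of sampling.

```go
package main

import "runtime"

func main() {
	// Existing profilers keep overhead low by sampling rather than recording
	// every event; these are the knobs Go's runtime exposes today.
	runtime.MemProfileRate = 512 * 1024  // aim for ~1 sampled allocation per 512 KiB allocated
	runtime.SetBlockProfileRate(1_000_000) // aim for ~1 sample per millisecond spent blocked (value in ns)
	runtime.SetMutexProfileFraction(100) // report ~1 in 100 mutex contention events
}
```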

## Promoting cloud-native best practices with profiling

The CNCF’s mission states: _Cloud native technologies empower organizations to
build and run scalable applications in modern, dynamic environments such as
public, private, and hybrid clouds_

We will have best-in-class support for profiles emitted in cloud native
environments (e.g. Kubernetes, serverless), including legacy applications
running in those environments. As we work towards this goal we will center our
efforts on making profiling applications resilient, manageable, and observable.
This is in line with the missions of both the Cloud Native Computing Foundation
and OpenTelemetry, and will thus allow us to expand and leverage those
communities in service of their respective missions.

## Profiling use cases

- Tracking resource utilization of an application over time to understand how
code changes, hardware configuration changes, and ephemeral environmental
issues influence performance
- Understanding what code is responsible for consuming resources (e.g. CPU,
  RAM, disk, network)
- Planning resource allotment for a group of services running in production
- Comparing profiles of different versions of code to understand how it has
  improved or degraded over time (see the sketch after this list)
- Detecting frequently used and "dead" code in production
- Breaking a trace span down to code-level granularity (e.g. function call and
  line of code) to understand the performance of that particular unit
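
For the version-comparison use case, the sketch below shows one way this is
commonly done with pprof data today: negate a baseline profile and merge it
with a newer one so that the remaining values are the per-stack difference.
The file names are hypothetical, and this is an illustration rather than a
proposed OpenTelemetry workflow.

```go
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/google/pprof/profile"
)

// Sketch of the "comparing profiles of different versions" use case: scale
// the baseline by -1 and merge, so positive values mark stacks that got more
// expensive in v2 and negative values mark stacks that got cheaper.
func main() {
	base := mustParse("v1-cpu.pprof")
	next := mustParse("v2-cpu.pprof")

	base.Scale(-1) // negate baseline values so the merge yields a difference
	diff, err := profile.Merge([]*profile.Profile{base, next})
	if err != nil {
		log.Fatal(err)
	}

	for _, s := range diff.Sample {
		if len(s.Location) == 0 || len(s.Location[0].Line) == 0 {
			continue
		}
		fmt.Println(s.Value[0], s.Location[0].Line[0].Function.Name)
	}
}

func mustParse(path string) *profile.Profile {
	f, err := os.Open(path)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	p, err := profile.Parse(f)
	if err != nil {
		log.Fatal(err)
	}
	return p
}
```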
