Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to record multi-operation/table/dbs operations on DB metrics #805

Closed
lmolkova opened this issue Mar 10, 2024 · 6 comments
Closed

How to record multi-operation/table/dbs operations on DB metrics #805

lmolkova opened this issue Mar 10, 2024 · 6 comments
Assignees

Comments

@lmolkova
Copy link
Contributor

For a simple database queries (such as SELECT * from foo where bar="baz"), we'd like to capture the following attributes when possible (on spans):

  • db.operation.name = SELECT
  • db.collection.name = foo (aka db.sql.table and other similar system-specific attributes)
  • db.collection.namespace = mydb (aka db.name along with db.instance.id and similar)
  • db.query.text = SELECT * from foo where bar=? (aka db.statement)
  • db.query.parameter.bar=baz

(attribute names are being discussed and are not final).

This simple case is supported by current version of DB semantic conventions. It's also a common one in the NoSQL world if we exclude bulk operations (or non-homogeneous batch operations).
Attributes (except db.query.* ones) have reasonable cardinality and can be used on traces and metrics.


More complicated queries involve multiple operations, tables, or even databases. E.g. in SELECT * from foo JOIN bar ON baz
we have two operations (SELECT and JOIN), two tables (foo and ``bar`), and just one database.

  • db.query.text and db.query.parameter.* for such queries are still relevant and make sense on spans (still cardinality is a problem for metrics)
  • Operation and collection names become problematic to record

DB WG is considering multiple options:


Option 1: always capture db.operation.names, db.collection.names, db.collection.namespaces as arrays

Pros: consistent understandable model

Cons:

  • array attribute on metrics
  • simple case (one operation) becomes hard to use
  • hard to query and not quite useful on spans (if we already capture templatized query text)
  • the combined cardinality of operation name, collection names, database names is the same as in the db.query.text

Option 2: capture both db.operation.name and db.operation.names (same for collections and namespaces).

The array attributes are only captured when more than one operation is performed.
In this case we can entertain different options for db.operation.name - it may contain the first operation, operations joined as string, or shouldn't be reported at all.

Pros:

  • simple case is easy
  • complex case is slightly better (e.g. when db.operation.name contains joined list SELECT JOIN)

Cons:

  • array attribute on metrics
  • hard to query and not quite useful on spans (if we already capture templatized query text)
  • the combined cardinality of operation name, collection names, database names is the same as in the db.query.text

Option 3: don't capture multiple operation names, collection names, namespaces

Pros: simple case is easy
Cons: nothing distinguishes different operations in the complex case

There could be other options including opting into collecting templatized query string on metrics, but none of those is perfect.
Still, we'd like to provide a default experience which could be improved with users providing query nick-names (see #521 for the context).


Additional context

@pyohannes
Copy link
Contributor

Option 2: capture both db.operation.name and db.operation.names (same for collections and namespaces).

I wonder if we are over-engineering here. Taking SQL statements as an example, they can be parsed into an AST which has a top-level statement, which tells you whether you SELECT, INSERT, DROP, CREATE ... It's important to have this top-level statement in db.operation.name, as it tells you about the effect the operation has.

I'm not sure about db.collection.name. However, we should avoid a situation where we define a field as an array, which in 95% of all cases only has a single value.

@lmolkova
Copy link
Contributor Author

lmolkova commented Mar 11, 2024

Another thought I have is that what we actually want is to capture duration per query template. Operation/collection/db names on the measurement is a proxy to identify a subset of queries.

So we want to have a metric for a thing that has high-cardinality in general case (query template) which seems to be impossible.

Ways to limit the cardinality explored above:

  • capture array - limited usability and still has a cardinality issue
  • capture the first: we end up with ambiguity (SELECT * from foo is represented exactly as SELECT * from foo JOIN bar on baz)
  • capture none: resulting metric is not useful

What if we give up on a general case support in default experience?

Option 4:

  • simple case: capture individual operations/collections/namespaces
  • complex case: opt-in to capture templatized db.query.text. We can templatize even further to something like SELECT ? from users where ? or even SELECT ? from users where user_id = ?
    • the cardinality of it would be similar to http.route one
    • we can normalize everything (spaces, case, etc)
    • we can think about other mitigations, like limiting it to X chars

Pros: explicit choice for users who don't expect high cardinality
Cons: perf, still no solution for apps that expect high-cardinality queries

Option 5:

  • simple case: capture individual operations/collections/namespaces
  • complex case: users provide a hint for db.query.name (e.g. SQL commenter, or in some other way) like get user by user id or get user address by user name

Pros:

  • awesome metrics: users can annotate queries they want to have metrics for
  • metrics are future-proof from semconv side - all we need to know about query is expressed in that attribute

Cons:

@lmolkova
Copy link
Contributor Author

lmolkova commented Mar 11, 2024

so the proposal (Options 3, 4, and 5 combined):

  • Introduce opt-in db.query.name|template|alias attribute that'd capture low-to-medium-cardinality name when it's available
    • it should be provided by user - instrumentations MAY support a way and MUST document how if they do. We may need to find a generic way for otel, but independently of DB semconv stability
    • it can be parsed from the query text if instrumentation supports it. We can think about this as all array attributes joined together in one string like {operation[i]} ? from {namespace[i]}.{collection[i]} (maybe with where if present). It'd result in the same cardinality as array attributes, but much better experience
  • Capture db.operation.name, db.collection.name, db.collection.namespace ONLY if there is one operation (table or db involved). E.g.:
    • select * from mydb.foo join mydb.bar on baz would result in only the db.collection.namespace=mydb collected

Assuming we could collect operation/collection/namespace by default, we could consider {operation[i]} ? from {namespace[i]}.{collection[i]} joining trick for the default experience as well

@joaopgrassi
Copy link
Member

It seems the dynamodb (which btw don't live under the db. namespace 😓) use an array attribute for the tables aws.dynamodb.table_names

@jack-berg
Copy link
Member

Taking SQL statements as an example, they can be parsed into an AST which has a top-level statement, which tells you whether you SELECT, INSERT, DROP, CREATE ... It's important to have this top-level statement in db.operation.name, as it tells you about the effect the operation has.

I agree. What's the rational is for considering JOIN a operation?

More complicated queries involve multiple operations, tables, or even databases. E.g. in SELECT * from foo JOIN bar ON baz we have two operations (SELECT and JOIN), two tables (foo and bar), and just one database.

I agree with this proposal, but think that we should best effort populate db.operation.name with a single operation even for the complex case. Fundamentally, selects with joins, or selects to insert are all ultimately doing one thing: either selecting data to return, or updating, or inserting, or deleting.

@trask
Copy link
Member

trask commented Apr 25, 2024

@lmolkova is this resolved now that db.collection.name and db.operation.name are specified as the "first found in the query" (when parsed from db.query.text)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

No branches or pull requests

6 participants