This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

synapse_storage_transaction_time_bucket prometheus metric has too high cardinality due to desc label #11081

Closed
daenney opened this issue Oct 14, 2021 · 21 comments · Fixed by #13342
Assignees
Labels
A-Metrics metrics, measures, stuff we put in Prometheus T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements.

Comments

@daenney
Contributor

daenney commented Oct 14, 2021

Description

The desc label on synapse_storage_transaction_time_bucket results in a very high cardinality metric. For a single instance there are 248 variants of desc, multiplied by 15 buckets. This results in over 3k series for a single host.

Though this might be acceptable if you're only ingesting metrics for a single instance, for Prometheus instances that might be scraping multiple Synapses this quickly becomes a problem.

On one of our internal clusters this is resulting in over a million time series, about 5x the number of series of the next most problematic metric: synapse_http_server_response_time_seconds_bucket. Though this is not causing storage issues, it causes unnecessarily high CPU load and ballooning memory usage on the ingesters (so we have to run with much bigger instances), and querying these series becomes problematic.
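
To make the arithmetic concrete, here is a minimal sketch (using the stock prometheus_client library, with an illustrative metric name and bucket boundaries rather than Synapse's actual ones) of how a labelled histogram's series count grows:

# Illustrative sketch, not Synapse code: count the time series a labelled
# Histogram emits once ~250 distinct `desc` values have been observed.
from prometheus_client import CollectorRegistry, Histogram, generate_latest

registry = CollectorRegistry()

# 14 explicit buckets (the client adds +Inf for a 15th), plus _sum and
# _count, gives roughly 17 series per desc; recent client versions also
# emit a _created series.
txn_time = Histogram(
    "storage_transaction_time_demo",
    "illustrative stand-in for synapse_storage_transaction_time",
    ["desc"],
    buckets=[0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1,
             0.25, 0.5, 1, 2.5, 5, 10, 25],
    registry=registry,
)

for i in range(248):
    txn_time.labels(desc=f"txn_{i}").observe(0.02)

series = [
    line for line in generate_latest(registry).decode().splitlines()
    if line.startswith("storage_transaction_time_demo")
]
print(f"{len(series)} series for a single host")  # several thousand

Scrape a few hundred such instances and the million-series figure above follows directly.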

Steps to reproduce

  • Run lots of Synapse instances
  • Scrape them all with a single Prometheus instance
  • Prometheus gets sad

Version information

  • Homeserver: any
  • Version: any since this metric got introduced
  • Install method: unrelated
  • Platform: unrelated
@daenney
Contributor Author

daenney commented Oct 14, 2021

For reference, based on promtool tsdb analyze on one of our big clusters:

Highest cardinality metric names:
1054110 synapse_storage_transaction_time_bucket
212280 synapse_http_server_response_time_seconds_bucket
78990 synapse_storage_query_time_bucket
75546 synapse_util_caches_cache_evicted_size

@DMRobertson DMRobertson added the T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. label Oct 14, 2021
@richvdh
Member

richvdh commented Oct 18, 2021

I'd be pretty happy to drop this metric altogether. It's not used on our default grafana dashboard.

@richvdh richvdh changed the title synapse_storage_transaction_time_bucket has too high cardinality due to desc label synapse_storage_transaction_time_bucket prometheus metric has too high cardinality due to desc label Oct 18, 2021
@richvdh
Member

richvdh commented Oct 18, 2021

While we're in the area, it might be worth noting that lots of prometheus metrics are duplicated: #11106.

It's not quite a problem on the same scale as synapse_storage_transaction_time_bucket, but those cache metrics are non-trivial.

@daenney
Contributor Author

daenney commented Oct 18, 2021

I'd be pretty happy to drop this metric altogether. It's not used on our default grafana dashboard.

Yup, I noticed that too. Talking to @erikjohnston he was happy for us to stop ingesting this metric altogether, which strongly suggests we can drop it from the code base.

daenney added a commit that referenced this issue Oct 19, 2021
This particular metric has a much too high cardinality due to the fact
that the desc label can have (at present) 248 values. This results in
over 3k series per Synapse. If you have a Prometheus instance that
monitors multiple Synapse instances it results in a huge number of
additional series to ingest.

The metric in question is also not used in the Synapse dashboard and
the core team has indicated they're happy to drop this metric entirely.

Fixes #11081

Signed-off-by: Daniele Sluijters <daenney@users.noreply.github.com>
@MadLittleMods MadLittleMods added the A-Metrics metrics, measures, stuff we put in Prometheus label Nov 10, 2021
@DMRobertson
Contributor

@08d2 noticed this in #12263. Is there anything stopping us from dropping these metrics altogether?

@richvdh
Member

richvdh commented Mar 22, 2022

see #11124: turns out that synapse_storage_transaction_time_count and synapse_storage_transaction_time_sum are useful, so we need to be slightly more intelligent than just ripping them out.

@08d2

08d2 commented Mar 23, 2022

I agree there's no reason for this to be a histogram; it can easily be a summary. Histograms really only make sense when you want to perform statistical aggregations over multiple instances of the same service, but I guess nobody really runs multiple Synapse servers.

But is there anything which actually, definitely, consumes this metric? Are there specific operational use cases which need this level of granularity? If those requirements can't be explicitly enumerated, then IMO the metric can safely be eliminated. "Nice to have" isn't by itself sufficient justification for a telemetry signal :) edit: I understand there's a dashboard which uses this data, but does it really need the level of granularity currently on offer, or can it provide similar utility with a less-precise label set?

@richvdh
Member

richvdh commented Mar 24, 2022

we have found these graphs on our grafana dashboard. They are even discussed in https://matrix-org.github.io/synapse/latest/usage/administration/understanding_synapse_through_grafana_graphs.html#transaction-count-and-transaction-duration. I would not be in favour of removing them.

@08d2

08d2 commented Mar 25, 2022

Sure, dashboards exist which consume them. But do those dashboards actually provide meaningful operational telemetry? The cardinality of this metric is extraordinarily high. It should deliver extraordinary value to justify that cost. Does it?

@richvdh
Member

richvdh commented Mar 28, 2022

yes

@richvdh
Member

richvdh commented Mar 28, 2022

sorry, I just realised I missed a word in my previous comment. It should read:

we have found these graphs [very] useful on our grafana dashboard.

@08d2

08d2 commented Mar 29, 2022

@richvdh Sweet! What problems have you been able to diagnose and fix? Which servlets have been optimized? Do you have links to issues? If there's data, I'm happy to withdraw my objections.

@DMRobertson
Contributor

DMRobertson commented Jul 20, 2022

For my own edification:

For each desc we currently produce the following time series:

So I count 17 time series per named database transaction.

If we just want to track the amount of time spent in a type of transaction (e.g. get_auth_chain_ids) across multiple instances of that transaction without knowing how many, then a single counter would suffice.

If we want to track the mean time a single transaction instance takes, we'd need at least two counters (one for tracking duration, and one for tracking how many txns occur).

Prometheus histograms classify events based on fixed bucket thresholds, whereas Prometheus summaries track the values of a given set of fixed percentiles. (Histogram: "how many requests take < 50ms?" Summary: "how long does it take to serve our quickest 99% of requests?") Or at least, that's how I'm interpreting https://prometheus.io/docs/concepts/metric_types/#histogram.

A histogram that doesn't track any buckets and a summary that doesn't track any quantiles look the same to me.

Additionally, the python client doesn't support tracking quantiles.
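
As a quick illustration of that last point, here is a sketch (again with the stock prometheus_client library and an illustrative metric name) showing that a labelled Summary in the Python client only ever exports _count and _sum, with no quantile series at all:

# Illustrative sketch: the Python client's Summary tracks no quantiles, so
# its exposition output is just a pair of counters per label value.
from prometheus_client import CollectorRegistry, Summary, generate_latest

registry = CollectorRegistry()
txn_time = Summary(
    "storage_transaction_time_summary_demo",
    "illustrative stand-in for synapse_storage_transaction_time",
    ["desc"],
    registry=registry,
)

txn_time.labels(desc="get_auth_chain_ids").observe(0.02)

print(generate_latest(registry).decode())
# storage_transaction_time_summary_demo_count{desc="get_auth_chain_ids"} 1.0
# storage_transaction_time_summary_demo_sum{desc="get_auth_chain_ids"} 0.02
# (recent client versions also emit a _created series)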

@DMRobertson
Contributor

Action for this issue, then:

  • use a summary instead of a histogram. This should track about 8x less data per desc (17 series down to 2).

If metrics are still eating up space in the future, we could

  • Consider reducing the number of buckets on other histograms, or convert them to summaries
  • Gather together related queries under the same desc.

@DMRobertson DMRobertson self-assigned this Jul 20, 2022
@erikjohnston
Member

Summaries have the large downside that you can't aggregate with them, so I'm not generally keen on us using them unless we have a very specific use case in mind.

I think the other two options here are:

  1. Drop the histograms entirely and just use counters (note I don't think the dashboards even expose the histogram metrics)
  2. Add a flag that we can hide high-cardinality metrics behind, so that we can disable them for people running large fleets of hosts. Ideally we'd make this a config option that we could easily enable/disable (in the same way we do for debug vs info logs)
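
A rough sketch of what option 2 could look like (the flag name here is invented for illustration; it is not an actual Synapse config option):

# Hypothetical sketch: only register the expensive per-desc histogram when
# an operator opts in.
from typing import Optional

from prometheus_client import Histogram

def maybe_make_transaction_histogram(
    enable_detailed_metrics: bool,
) -> Optional[Histogram]:
    if not enable_detailed_metrics:
        return None
    return Histogram(
        "storage_transaction_time_histogram_demo", "transaction time", ["desc"]
    )

txn_time_histogram = maybe_make_transaction_histogram(enable_detailed_metrics=False)

def observe_txn(desc: str, seconds: float) -> None:
    # A no-op when detailed metrics are disabled, so large fleets pay nothing.
    if txn_time_histogram is not None:
        txn_time_histogram.labels(desc=desc).observe(seconds)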

@DMRobertson
Contributor

DMRobertson commented Jul 20, 2022

What do you mean by "aggregate" here @erikjohnston?

@erikjohnston
Member

c.f. https://prometheus.io/docs/practices/histograms/#quantiles

But basically aiui you can't merge together quantiles from multiple instances, so e.g. you wouldn't be able to see the quantile for a specific desc across all workers; you can only look per worker.

@DMRobertson
Contributor

DMRobertson commented Jul 20, 2022

Right, but

  • The Python client for prometheus doesn't provide quantiles: we'd only be using it for the _sum and _count counters it provides
  • My understanding: you can ask prometheus to merge together quantiles, it's just that they won't be statistically meaningful.

DMRobertson pushed a commit that referenced this issue Jul 20, 2022
By my reckoning, this should mean that each txn `desc` produces 2 time
series, down from 17. If we don't like this, we could keep the histogram
but configure it with coarser buckets.

Closes #11081.
@erikjohnston
Member

The Python client for prometheus doesn't provide quantiles: we'd literally only be using it for the _sum and _count counters it provides

Ugh, right, I guess that is an easy hack to just keep the useful bits. But it'll suck if they ever decide to actually properly implement summaries.

  • My understanding: you can ask prometheus to merge together quantiles, it's just that they wont' be statistically meaningful.

Well yeah, but there's not much point in doing so if you're going to get dodgy answers out.

@richvdh
Member

richvdh commented Jul 20, 2022

The Python client for prometheus doesn't provide quantiles: we'd literally only be using it for the _sum and _count counters it provides

Ugh, right, I guess that is an easy hack to just keep the useful bits. But it'll suck if they ever decide to actually properly implement summaries.

sorry, if our goal is just to get _sum and _count metrics, why don't we just use a pair of Counters, like we do with all other metrics of this kind?

@DMRobertson
Contributor

That's fair. I can make it two counters. (I only mention summary based on what E said in #11124 (comment))
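
For reference, a minimal sketch of the two-counter shape being discussed (metric names invented for illustration; they are not necessarily what the linked commits use):

# Illustrative sketch of the two-counter approach: one counter for how many
# transactions of each desc ran, one for the total time spent in them.
import time
from contextlib import contextmanager

from prometheus_client import Counter

txn_count = Counter(
    "storage_transaction_count_demo",
    "number of database transactions",
    ["desc"],
)
txn_duration = Counter(
    "storage_transaction_duration_seconds_demo",
    "total seconds spent in database transactions",
    ["desc"],
)

@contextmanager
def measure_txn(desc: str):
    start = time.monotonic()
    try:
        yield
    finally:
        txn_count.labels(desc=desc).inc()
        txn_duration.labels(desc=desc).inc(time.monotonic() - start)

# rate(duration) / rate(count) recovers the mean transaction time, and both
# counters aggregate cleanly across workers.
with measure_txn("get_auth_chain_ids"):
    pass  # the transaction body would run here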

DMRobertson pushed a commit that referenced this issue Jul 20, 2022
By my reckoning, this should mean that each txn `desc` produces 2 time
series, down from 17. If we don't like this, we could keep the histogram
but configure it with coarser buckets.

Closes #11081.