This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

synapse_http_server_response_time_seconds_bucket has too high cardinality #11082

Open
daenney opened this issue Oct 14, 2021 · 4 comments
Labels
A-Metrics metrics, measures, stuff we put in Prometheus T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements.

Comments

@daenney
Contributor

daenney commented Oct 14, 2021

Description

This issue was discovered while looking at other problematic time series due to #11081. There's a similar cardinality issue here, caused by the multiplying factor between the code, servlet and tag labels on the series.

Also, it's problematic to have time series that are part of the same bucket but don't all have the same set of labels. Only a few series have the tag label. That should probably be a separate histogram, as the label difference will cause aggregation issues.
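
To make the multiplying factor concrete, here is a rough sketch of how the number of _bucket series grows with each label. The cardinalities and the bucket count below are made-up numbers for illustration, not measurements from a real Synapse instance, and `estimated_series` is a hypothetical helper. The result is an upper bound, since not every label combination is actually observed.

```python
from math import prod


def estimated_series(label_cardinalities: dict[str, int], buckets: int) -> int:
    """Upper bound on the number of _bucket series one instance can expose.

    Every distinct combination of label values gets one series per
    histogram bucket (the bucket count should include the +Inf bucket).
    """
    return prod(label_cardinalities.values()) * buckets


# Hypothetical cardinalities, chosen only to show the effect.
all_labels = {"method": 5, "servlet": 100, "code": 10, "tag": 3}
trimmed_labels = {"method": 5, "servlet": 100}

per_instance_all = estimated_series(all_labels, buckets=15)
per_instance_trimmed = estimated_series(trimmed_labels, buckets=15)

print(per_instance_all, per_instance_trimmed)
```

With many instances scraped by a single Prometheus, the per-instance figure is multiplied again by the number of instances.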

Steps to reproduce

  • Run lots of Synapse instances
  • Scrape them all with a single Prometheus instance
  • Prometheus gets sad

Version information

  • Homeserver: any
  • Version: any since this metric got introduced
  • Install method: unrelated
  • Platform: unrelated
@DMRobertson DMRobertson added the T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. label Oct 14, 2021
@richvdh
Member

richvdh commented Oct 18, 2021

I'd be pretty happy to drop this metric altogether. It's not used on our default grafana dashboard.

@daenney
Contributor Author

daenney commented Oct 19, 2021

@richvdh This one does appear to be part of the dashboard in synapse's contrib/:

contrib/grafana/synapse.json
123:          "expr": "sum(rate(synapse_http_server_response_time_seconds_bucket{servlet='RoomSendEventRestServlet',instance=\"$instance\",code=~\"2..\"}[$bucket_size])) by (le)",
257:          "expr": "histogram_quantile(0.99, sum(rate(synapse_http_server_response_time_seconds_bucket{servlet='RoomSendEventRestServlet',index=~\"$index\",instance=\"$instance\",code=~\"2..\"}[$bucket_size])) by (le))",
264:          "expr": "histogram_quantile(0.9, sum(rate(synapse_http_server_response_time_seconds_bucket{servlet='RoomSendEventRestServlet',index=~\"$index\",instance=\"$instance\",code=~\"2..\"}[$bucket_size])) by (le))",
272:          "expr": "histogram_quantile(0.75, sum(rate(synapse_http_server_response_time_seconds_bucket{servlet='RoomSendEventRestServlet',index=~\"$index\",instance=\"$instance\",code=~\"2..\"}[$bucket_size])) by (le))",
279:          "expr": "histogram_quantile(0.5, sum(rate(synapse_http_server_response_time_seconds_bucket{servlet='RoomSendEventRestServlet',index=~\"$index\",instance=\"$instance\",code=~\"2..\"}[$bucket_size])) by (le))",
286:          "expr": "histogram_quantile(0.25, sum(rate(synapse_http_server_response_time_seconds_bucket{servlet='RoomSendEventRestServlet',index=~\"$index\",instance=\"$instance\",code=~\"2..\"}[$bucket_size])) by (le))",
291:          "expr": "histogram_quantile(0.05, sum(rate(synapse_http_server_response_time_seconds_bucket{servlet='RoomSendEventRestServlet',index=~\"$index\",instance=\"$instance\",code=~\"2..\"}[$bucket_size])) by (le))",
1586:              "expr": "sum(rate(synapse_http_server_response_time_seconds_bucket{servlet='RoomSendEventRestServlet',instance=\"$instance\"}[$bucket_size])) by (le)",
2182:              "expr": "histogram_quantile(0.99, sum(rate(synapse_http_server_response_time_seconds_bucket{servlet='RoomSendEventRestServlet',instance=\"$instance\",code=~\"2..\",job=~\"$job\",index=~\"$index\"}[$bucket_size])) without (method))",
2190:              "expr": "histogram_quantile(0.95, sum(rate(synapse_http_server_response_time_seconds_bucket{servlet='RoomSendEventRestServlet',instance=\"$instance\",code=~\"2..\",job=~\"$job\",index=~\"$index\"}[$bucket_size])) without (method))",
2198:              "expr": "histogram_quantile(0.90, sum(rate(synapse_http_server_response_time_seconds_bucket{servlet='RoomSendEventRestServlet',instance=\"$instance\",code=~\"2..\",job=~\"$job\",index=~\"$index\"}[$bucket_size])) without (method))",
2205:              "expr": "histogram_quantile(0.50, sum(rate(synapse_http_server_response_time_seconds_bucket{servlet='RoomSendEventRestServlet',instance=\"$instance\",code=~\"2..\",job=~\"$job\",index=~\"$index\"}[$bucket_size])) without (method))",

@richvdh
Member

richvdh commented Oct 19, 2021

oh bothers. Good spot, thank you. It looks like we should create a separate metric for tracking event send time specifically.

@MadLittleMods
Contributor

As mentioned at #13478 (comment),

In terms of reducing cardinality, we could remove code. I think for timing, we really just need the method and servlet name. Response code can be useful but maybe we just need to change it to a successful_response boolean (with a cardinality of 2, [true|false]) since we only ever use it as code=~"2..". Or even more useful as error_response: true/false so that success or timeout can still be false while an actual error would be true.
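
A minimal sketch of the successful_response idea, assuming the label value is derived from the status code the same way the dashboards filter on it today (code=~"2.."). The function name and the string values "true"/"false" are illustrative assumptions, not existing Synapse code.

```python
import re


def successful_response(code: str) -> str:
    """Collapse an HTTP status code into a two-valued label.

    Mirrors the code=~"2.." matcher used in the dashboard queries:
    exactly three characters, the first of which is "2".
    """
    return "true" if re.fullmatch(r"2..", code) else "false"


# Many distinct status codes collapse into at most two label values.
codes = ["200", "201", "204", "400", "403", "404", "429", "500", "502"]
label_values = {successful_response(c) for c in codes}

print(sorted(label_values))
```

The error_response variant would need its own rule for which outcomes count as errors, which this sketch does not try to define.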


4 participants