
expose some Prometheus metrics #200

Merged
merged 2 commits on Apr 5, 2021

Conversation

marten-seemann
Collaborator

No description provided.

Member
@mxinden left a comment

Great to see instrumentation coming to QUIC!

now := time.Now()
c.mutex.Lock()
for _, conn := range c.conns {
	if rtt, valid := conn.getSmoothedRTT(); valid {
Member

Do I understand correctly that the RTT is recorded each time Prometheus scrapes the node, and thus that the metric's data resolution depends on the Prometheus scrape interval?

I would deem this a bit problematic for the following reasons:

  • Data exposed is dependent on external processes.
  • The way to achieve high availability is to run 2 or 3 instances of Prometheus each scraping the same set of targets (nodes). This would directly influence data resolution.
  • In case Prometheus does not scrape the node at all (e.g. due to downtime) no data is recorded in the histogram.

Would it be feasible to record the RTT on each round trip instead?

Collaborator Author

There are 2 problems with recording the RTT on every round trip:

  1. If a connection is not sending any data, it will not generate any RTT samples.
  2. If we have one connection with 50ms RTT and one connection with 500ms RTT (assuming both are sending data and both are using the same congestion window), the shorter-RTT connection will generate 10x as many RTT measurements.

What I'm trying to measure is the RTT distribution of all active connections, so we have to make sure that every connection generates the same number of measurements per scraping interval.

Member

If a connection is not sending any data, it will not generate any RTT samples.

This applies to both the status quo and my suggestion, correct?

If we have one connection with 50ms RTT and one connection with 500ms RTT (assuming both are sending data and both are using the same congestion window), the shorter-RTT connection will generate 10x as many RTT measurements.

Good point. Thanks for clarifying!

What I'm trying to measure is the RTT distribution of all active connections, so we have to make sure that every connection generates the same number of measurements per scraping interval.

Would collecting across all connections every X seconds do the trick? That way data collection would not depend on external factors (the number of Prometheus instances and the scrape interval), while still keeping the number of measurements per connection the same.

Collaborator Author

This applies to both the status quo and my suggestion, correct?

We're saving the most recent RTT sample for every connection, so this problem is solved in the PR. We might use an RTT measurement that is a few seconds old, but considering that we're sending a keep-alive PING every 15s, we're generating fresh RTT measurements quite frequently.

Would collecting across all connections every X seconds do the trick? That way data collection would not depend on external factors (the number of Prometheus instances and the scrape interval), while still keeping the number of measurements per connection the same.

What's the advantage of that? Looping over all connections is moderately expensive, so it would be good to not do it more frequently than necessary.

Member
@mxinden Apr 5, 2021

What's the advantage of that?

Looping over the connections on a fixed interval would make the data returned to Prometheus independent of its scrape interval and of the number of Prometheus instances. E.g. if you have 3 Prometheus instances for high availability, you would be looping over the connections 3 times every scrape interval (e.g. every minute). If you instead collect on a fixed interval (e.g. 1 minute), you would only loop over all connections once per minute instead of 3 times per minute.

I think it is fine either way.


var collector *aggregatingCollector

func init() {
Member

If my Golang skills are not misleading me, all packages importing this package will now expose the metrics below, even if they just make use of some small part of the library, e.g. a helper function. Can the same be achieved with local instead of global variables? E.g. can one pass a metric registry at construction time and keep a reference to the metrics in some local struct?

Collaborator Author

What's the best practice here? If a package wants to expose Prometheus metrics, does the user of that package call a package.InitializeMetrics() function? We could certainly do that, if that's the preferred way.

Member

What's the best practice here?

If I recall correctly, the best practice in Golang is to use global metric variables and initialization via init, like you did here. I am not in favor of this instrumentation style, due to the various issues that global state brings along. Instead I would pass a registry as an argument at object creation, e.g. NewQuicTransport. The returned QuicTransport object would then own the metrics registered with the provided Registry.

I am happy to expand on my architecture, though in the end I think you should go with whatever works best for you.

Collaborator Author

Instead I would pass a registry as an argument at object creation, e.g. NewQuicTransport.

What would happen if you create multiple transports then? We can only register each metric once.

If the init way is actually the common way to do this in Go, let's get this PR merged. We can always improve it later on.

Member

What would happen if you create multiple transports then? We can only register each metric once.

What I would do is have a RegisterMetrics function that takes a registry and returns a structure with all the metrics. I would then pass a clone of the latter to each transport.

If the init way is actually the common way to do this in Go, let's get this PR merged. We can always improve it later on.

That sounds good to me.

Collaborator Author

What I would do is have a RegisterMetrics function that takes a registry and returns a structure with all the metrics. I would then pass a clone of the latter to each transport.

That sounds like a much cleaner setup. I'm open to making this change in the future.

tracer_metrics.go: two outdated review threads (resolved)
Member
@mxinden left a comment

Looks good to me from the Prometheus side. Not sure my Golang skills are good enough for a general approval. Your call.

tracer_metrics.go: outdated review thread (resolved)