Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

metrics: Increase the resolution of histogram metrics #7335

Merged

Conversation

alexggh
Copy link
Contributor

@alexggh alexggh commented Jun 5, 2023

These metrics are using the default histogram buckets:

pub const DEFAULT_BUCKETS: &[f64; 11] = &[
    0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0,
];

Which give us a resolution of 5ms, that's good, but there are some subsystems where we process hundreds or even a few thousands of messages per second like approval-voting or approval-distribution, so it makes sense to increse the resoution of the bucket to better understand if the procesisng is in the range of useconds.

The new bucket ranges will be:

[0.0001, 0.0004, 0.0016, 0.0064, 0.0256, 0.1024, 0.4096, 1.6384, 6.5536]

These metrics are using the default histogram buckets:
```
pub const DEFAULT_BUCKETS: &[f64; 11] = &[
    0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0,
];

```

Which give us a resolution of 5ms, that's good, but there are some subsystems
where we process hundreds or even a few thousands of messages per second like
approval-voting or approval-distribution, so it makes sense to increse the
resoution of the bucket to better understand if the procesisng is in the range
of useconds.

The new bucket ranges will be:
```
[0.0001, 0.0004, 0.0016, 0.0064, 0.0256, 0.1024, 0.4096, 1.6384, 6.5536]
```

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
@alexggh alexggh self-assigned this Jun 5, 2023
@alexggh alexggh added A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. T4-parachains_engineering This PR/Issue is related to Parachains performance, stability, maintenance. labels Jun 5, 2023
Copy link
Contributor

@sandreim sandreim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alexggh , this is indeed a good idea. Also it might also be worthy, but up to you, to have 2 additional buckets between the last 2: [0.0001, 0.0004, 0.0016, 0.0064, 0.0256, 0.1024, 0.4096, 1.6384, <here>, 6.5536,], just to give more granular info for any very high timings.

@alexggh
Copy link
Contributor Author

alexggh commented Jun 6, 2023

Thanks @alexggh , this is indeed a good idea. Also it might also be worthy, but up to you, to have 2 additional buckets between the last 2: [0.0001, 0.0004, 0.0016, 0.0064, 0.0256, 0.1024, 0.4096, 1.6384, <here>, 6.5536,], just to give more granular info for any very high timings.

yes, makes sense, I won't be able to use the exponential_buckets helper function for that, so I'll define a static vec for that.

Done!

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
@alexggh alexggh force-pushed the feature/increase_resolution branch from 3b754ce to 9efcf3d Compare June 6, 2023 08:46
@alexggh
Copy link
Contributor Author

alexggh commented Jun 6, 2023

Unfortunately, I ran this PR in versi on subset of nodes and it seems that grafana get's confused on situations where some nodes output a range of buckets and the other another different range:

The graphic for a 6h period for versi would look like some data is missing, e.g:
Screenshot 2023-06-06 at 17 23 41

The data is there, and the graphic is shown ok if you select a time period when only just a flavor of the buckets have been used, but it will affect the dashboards for the transition period.

@sandreim any idea if this is something to be concerned with ?

@sandreim
Copy link
Contributor

sandreim commented Jun 6, 2023

This happens when you aggregate metrics from nodes with different bucket configurations. I expect this to be fine. You should try to query the subset of nodes running your changes and select only the time period when only that bucket configuration is present. That should work.

@alexggh
Copy link
Contributor Author

alexggh commented Jun 6, 2023

This happens when you aggregate metrics from nodes with different bucket configurations. I expect this to be fine.

Yes, that's exactly what is happening .

You should try to query the subset of nodes running your changes and select only the time period when only that bucket configuration is present. That should work.

Yes, that works.

@sandreim
Copy link
Contributor

sandreim commented Jun 8, 2023

bot merge

@paritytech-processbot
Copy link

Error: Statuses failed for 9efcf3d

@sandreim
Copy link
Contributor

sandreim commented Jun 8, 2023

bot merge

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. T4-parachains_engineering This PR/Issue is related to Parachains performance, stability, maintenance.
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

4 participants