[Merged by Bors] - Improve validator monitor experience for high validator counts #3728

Closed
wants to merge 22 commits

Conversation

paulhauner
Member

@paulhauner commented Nov 15, 2022

## Issue Addressed

NA

## Proposed Changes

Others (#3678) and I have observed that when running with lots of validators (e.g., 1000s), the cardinality is too much for Prometheus. I've seen Prometheus instances just grind to a halt when we turn the validator monitor on for our testnet validators (we have 10,000s of Goerli validators). Additionally, the debug log volume can get very high, with one log per validator, per attestation.

To address this, the `bn --validator-monitor-individual-tracking-threshold <INTEGER>` flag has been added to *disable* per-validator (i.e., non-aggregated) metrics/logging once the validator monitor exceeds the threshold of validators. The default value is `64`, which is a finger-to-the-wind value. I don't actually know the value at which Prometheus starts to become overwhelmed, but I've seen it work with ~64 validators and I've seen it *not* work with 1000s of validators. A default of `64` seems like it will result in a breaking change for users who are running millions of dollars' worth of validators, whilst being a no-op for low-validator-count users. I'm open to changing this number, though.
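
To make the behaviour concrete, here's a minimal sketch of the gating (the struct, field, and method names are hypothetical, not the actual implementation in `validator_monitor.rs`):

```rust
// Hypothetical sketch only; the real logic lives in
// beacon_node/beacon_chain/src/validator_monitor.rs.
struct ValidatorMonitorSketch {
    /// Number of validators registered with the monitor.
    monitored_validators: usize,
    /// Mirrors `--validator-monitor-individual-tracking-threshold` (default 64).
    individual_tracking_threshold: usize,
}

impl ValidatorMonitorSketch {
    /// Per-validator (non-aggregated) metrics and logs are only emitted
    /// while the monitored count stays at or below the threshold.
    fn individual_tracking(&self) -> bool {
        self.monitored_validators <= self.individual_tracking_threshold
    }
}

fn main() {
    let monitor = ValidatorMonitorSketch {
        monitored_validators: 10_000,
        individual_tracking_threshold: 64,
    };
    // Above the threshold: fall back to aggregated metrics/logs only.
    assert!(!monitor.individual_tracking());
}
```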

Additionally, this PR starts collecting aggregated Prometheus metrics (e.g., the total count of head hits across all validators), so that high-validator-count users still have some interesting metrics. We already had logging for aggregated values, so nothing has been added there.

I've opted to make this a breaking change since it can be rather damaging to your Prometheus instance to accidentally enable the validator monitor with large numbers of validators. I've crashed a Prometheus instance myself and had a report from another user who's done the same thing.

## Additional Info

NA

## Breaking Changes Note

A new label has been added to the validator monitor Prometheus metrics: `total`. This label tracks the aggregated metrics of all validators in the validator monitor (as opposed to each validator being tracked individually, using its pubkey as the label).
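
As a rough illustration of the per-validator vs. aggregated series, here's a sketch using the Rust `prometheus` crate directly (the metric name, pubkey value, and label handling are made up for illustration; this is not Lighthouse's actual metric definition):

```rust
use prometheus::{IntCounterVec, Opts};

fn main() -> Result<(), prometheus::Error> {
    // A counter vec keyed by a `validator` label (illustrative name).
    let head_hits = IntCounterVec::new(
        Opts::new("example_head_hits", "Illustrative metric (name made up)"),
        &["validator"],
    )?;

    // Individual tracking: one time series per pubkey label value.
    // This is what blows up cardinality with thousands of validators.
    head_hits.with_label_values(&["0xabcd_example"]).inc();

    // Aggregated tracking: a single series under the `total` label,
    // incremented for hits across all monitored validators.
    head_hits.with_label_values(&["total"]).inc();
    Ok(())
}
```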

Additionally, a new flag has been added to the Beacon Node: `--validator-monitor-individual-tracking-threshold`. The default value is `64`, which means that when the validator monitor is tracking more than 64 validators it will stop tracking per-validator metrics and only track the aggregated metrics under the `total` label. It will also stop logging per-validator logs and only emit aggregated logs (the exception being that exit and slashing logs are always emitted).

These changes were introduced in #3728 to address issues with untenable Prometheus cardinality and log volume when using the validator monitor with high validator counts (e.g., 1000s of validators). Users with fewer than 65 validators will see no change in behavior (apart from the added aggregated metrics under the `total` label). Users with 65 or more validators who wish to maintain the previous behavior can set something like `--validator-monitor-individual-tracking-threshold 999999`.

@michaelsproul
Member

Related: #3678. I haven't looked closely, but maybe this could do what you need, @agermain?

@agermain

agermain commented Nov 20, 2022

> Related: #3678. I haven't looked closely, but maybe this could do what you need, @agermain?

Yup, this is exactly it. I'm happy to contribute as well!

@paulhauner added the v3.4.0 Minor release following v3.3.0 label Nov 23, 2022
@paulhauner added the backwards-incompat Backwards-incompatible API change label Nov 30, 2022
@paulhauner changed the title from "Reduce cardinality for validator monitor metrics" to "Improve validator monitor experience for high validator counts" Nov 30, 2022
@paulhauner added ready-for-review The code is ready for review and removed work-in-progress PR is a work-in-progress labels Dec 5, 2022
@paulhauner marked this pull request as ready for review December 5, 2022 06:08
@paulhauner
Member Author

I've yet to test this on a running node. I'll do so today and provide an update here once I have.

@paulhauner
Member Author

> there seems to be an issue where gauges set in a loop will overwrite each other

Good call regarding the gauges; there seems to be little to no value in looping through them and re-writing them. I've disabled all gauges unless we're doing individual tracking. We can come back and add averaging in another PR, if we desire. Thanks!
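
For context, the pitfall looks something like this contrived sketch (made-up gauge name; not our actual code):

```rust
use prometheus::{IntGauge, Opts};

fn main() -> Result<(), prometheus::Error> {
    // A single, unlabelled gauge (name is contrived for this example).
    let balance_gwei = IntGauge::with_opts(Opts::new(
        "example_validator_balance_gwei",
        "Illustrative gauge (name made up)",
    ))?;

    // Setting the same gauge once per validator clobbers the previous
    // value on each iteration: only the last validator's value survives.
    for gwei in [32_000_000_000_i64, 31_900_000_000, 32_100_000_000] {
        balance_gwei.set(gwei);
    }
    assert_eq!(balance_gwei.get(), 32_100_000_000);
    Ok(())
}
```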

@paulhauner added ready-for-review The code is ready for review and removed waiting-on-author The reviewer has suggested changes and awaits their implementation. labels Jan 9, 2023
Member

@michaelsproul left a comment


Awesome! 🎉

One small nit, but I'm happy to merge with or without it.

Review comment on beacon_node/beacon_chain/src/validator_monitor.rs (outdated, resolved)
@michaelsproul added ready-for-merge This PR is ready to merge. and removed ready-for-review The code is ready for review labels Jan 9, 2023
@arnetheduck

FWIW, we've had this feature in Nimbus for a while, where we use the metric label `total` instead of the pubkey when aggregating - potentially, using this label we'd be a bit more compatible.

@paulhauner
Member Author

bors r+

bors bot pushed a commit that referenced this pull request Jan 9, 2023
@paulhauner
Member Author

bors r-

@bors

bors bot commented Jan 9, 2023

Canceled.

@paulhauner
Member Author

> FWIW, we've had this feature in Nimbus for a while, where we use the metric label `total` instead of the pubkey when aggregating - potentially, using this label we'd be a bit more compatible.

Just in the nick of time! I've swapped to `total` so we're more like Nimbus ☺️

@paulhauner
Member Author

bors r+

bors bot pushed a commit that referenced this pull request Jan 9, 2023
@paulhauner mentioned this pull request Jan 9, 2023
@bors changed the title from "Improve validator monitor experience for high validator counts" to "[Merged by Bors] - Improve validator monitor experience for high validator counts" Jan 9, 2023
@bors closed this Jan 9, 2023
bors bot pushed a commit that referenced this pull request Jan 11, 2023
## Issue Addressed

NA

## Proposed Changes

Bump versions

## Additional Info

- [x] ~~Blocked on #3728, #3801~~
- [x] ~~Blocked on #3866~~
- [x] Requires additional testing
Woodpile37 pushed a commit to Woodpile37/lighthouse that referenced this pull request Jan 6, 2024
Woodpile37 pushed a commit to Woodpile37/lighthouse that referenced this pull request Jan 6, 2024