Skip to content

Commit

Permalink
Improve validator monitor experience for high validator counts (sigp#…
Browse files Browse the repository at this point in the history
…3728)

NA

Myself and others (sigp#3678) have observed  that when running with lots of validators (e.g., 1000s) the cardinality is too much for Prometheus. I've seen Prometheus instances just grind to a halt when we turn the validator monitor on for our testnet validators (we have 10,000s of Goerli validators). Additionally, the debug log volume can get very high with one log per validator, per attestation.

To address this, the `bn --validator-monitor-individual-tracking-threshold <INTEGER>` flag has been added to *disable* per-validator (i.e., non-aggregated) metrics/logging once the validator monitor exceeds the threshold of validators. The default value is `64`, which is a finger-to-the-wind value. I don't actually know the value at which Prometheus starts to become overwhelmed, but I've seen it work with ~64 validators and I've seen it *not* work with 1000s of validators. A default of `64` seems like it will result in a breaking change to users who are running millions of dollars worth of validators whilst resulting in a no-op for low-validator-count users. I'm open to changing this number, though.

Additionally, this PR starts collecting aggregated Prometheus metrics (e.g., total count of head hits across all validators), so that high-validator-count validators still have some interesting metrics. We already had logging for aggregated values, so nothing has been added there.

I've opted to make this a breaking change since it can be rather damaging to your Prometheus instance to accidentally enable the validator monitor with large numbers of validators. I've crashed a Prometheus instance myself and had a report from another user who's done the same thing.

NA

A new label has been added to the validator monitor Prometheus metrics: `total`. This label tracks the aggregated metrics of all validators in the validator monitor (as opposed to each validator being tracking individually using its pubkey as the label).

Additionally, a new flag has been added to the Beacon Node: `--validator-monitor-individual-tracking-threshold`. The default value is `64`, which means that when the validator monitor is tracking more than 64 validators then it will stop tracking per-validator metrics and only track the `all_validators` metric. It will also stop logging per-validator logs and only emit aggregated logs (the exception being that exit and slashing logs are always emitted).

These changes were introduced in sigp#3728 to address issues with untenable Prometheus cardinality and log volume when using the validator monitor with high validator counts (e.g., 1000s of validators). Users with less than 65 validators will see no change in behavior (apart from the added `all_validators` metric). Users with more than 65 validators who wish to maintain the previous behavior can set something like `--validator-monitor-individual-tracking-threshold 999999`.
  • Loading branch information
paulhauner authored and Woodpile37 committed Jan 6, 2024
1 parent af9b5fa commit 5cbbaf8
Show file tree
Hide file tree
Showing 8 changed files with 513 additions and 306 deletions.
10 changes: 9 additions & 1 deletion beacon_node/beacon_chain/src/builder.rs
Original file line number Diff line number Diff line change
Expand Up @@ -583,11 +583,13 @@ where
mut self,
auto_register: bool,
validators: Vec<PublicKeyBytes>,
individual_metrics_threshold: usize,
log: Logger,
) -> Self {
self.validator_monitor = Some(ValidatorMonitor::new(
validators,
auto_register,
individual_metrics_threshold,
log.clone(),
));
self
Expand Down Expand Up @@ -1007,6 +1009,7 @@ fn descriptive_db_error(item: &str, error: &StoreError) -> String {
#[cfg(test)]
mod test {
use super::*;
use crate::validator_monitor::DEFAULT_INDIVIDUAL_TRACKING_THRESHOLD;
use eth2_hashing::hash;
use genesis::{
generate_deterministic_keypairs, interop_genesis_state, DEFAULT_ETH1_BLOCK_HASH,
Expand Down Expand Up @@ -1063,7 +1066,12 @@ mod test {
.testing_slot_clock(Duration::from_secs(1))
.expect("should configure testing slot clock")
.shutdown_sender(shutdown_tx)
.monitor_validators(true, vec![], log.clone())
.monitor_validators(
true,
vec![],
DEFAULT_INDIVIDUAL_TRACKING_THRESHOLD,
log.clone(),
)
.build()
.expect("should build");

Expand Down
760 changes: 457 additions & 303 deletions beacon_node/beacon_chain/src/validator_monitor.rs

Large diffs are not rendered by default.

3 changes: 2 additions & 1 deletion beacon_node/beacon_chain/tests/store_tests.rs
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@ use beacon_chain::builder::BeaconChainBuilder;
use beacon_chain::test_utils::{
test_spec, AttestationStrategy, BeaconChainHarness, BlockStrategy, DiskHarnessType,
};
use beacon_chain::validator_monitor::DEFAULT_INDIVIDUAL_TRACKING_THRESHOLD;
use beacon_chain::{
historical_blocks::HistoricalBlockError, migrate::MigratorConfig, BeaconChain,
BeaconChainError, BeaconChainTypes, BeaconSnapshot, ChainConfig, NotifyExecutionLayer,
Expand Down Expand Up @@ -2121,7 +2122,7 @@ async fn weak_subjectivity_sync() {
log.clone(),
1,
)))
.monitor_validators(true, vec![], log)
.monitor_validators(true, vec![], DEFAULT_INDIVIDUAL_TRACKING_THRESHOLD, log)
.build()
.expect("should build"),
);
Expand Down
1 change: 1 addition & 0 deletions beacon_node/client/src/builder.rs
Original file line number Diff line number Diff line change
Expand Up @@ -174,6 +174,7 @@ where
.monitor_validators(
config.validator_monitor_auto,
config.validator_monitor_pubkeys.clone(),
config.validator_monitor_individual_tracking_threshold,
runtime_context
.service_context("val_mon".to_string())
.log()
Expand Down
3 changes: 2 additions & 1 deletion beacon_node/network/src/subnet_service/tests/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@ use super::*;
use beacon_chain::{
builder::{BeaconChainBuilder, Witness},
eth1_chain::CachingEth1Backend,
validator_monitor::DEFAULT_INDIVIDUAL_TRACKING_THRESHOLD,
BeaconChain,
};
use futures::prelude::*;
Expand Down Expand Up @@ -75,7 +76,7 @@ impl TestBeaconChain {
Duration::from_millis(SLOT_DURATION_MILLIS),
))
.shutdown_sender(shutdown_tx)
.monitor_validators(true, vec![], log)
.monitor_validators(true, vec![], DEFAULT_INDIVIDUAL_TRACKING_THRESHOLD, log)
.build()
.expect("should build"),
);
Expand Down
11 changes: 11 additions & 0 deletions beacon_node/src/cli.rs
Original file line number Diff line number Diff line change
Expand Up @@ -763,6 +763,17 @@ pub fn cli_app<'a, 'b>() -> App<'a, 'b> {
.value_name("PATH")
.takes_value(true)
)
.arg(
Arg::with_name("validator-monitor-individual-tracking-threshold")
.long("validator-monitor-individual-tracking-threshold")
.help("Once the validator monitor reaches this number of local validators \
it will stop collecting per-validator Prometheus metrics and issuing \
per-validator logs. Instead, it will provide aggregate metrics and logs. \
This avoids infeasibly high cardinality in the Prometheus database and \
high log volume when using many validators. Defaults to 64.")
.value_name("INTEGER")
.takes_value(true)
)
.arg(
Arg::with_name("disable-lock-timeouts")
.long("disable-lock-timeouts")
Expand Down
6 changes: 6 additions & 0 deletions beacon_node/src/config.rs
Original file line number Diff line number Diff line change
Expand Up @@ -693,6 +693,12 @@ pub fn get_config<E: EthSpec>(
.extend_from_slice(&pubkeys);
}

if let Some(count) =
clap_utils::parse_optional(cli_args, "validator-monitor-individual-tracking-threshold")?
{
client_config.validator_monitor_individual_tracking_threshold = count;
}

if cli_args.is_present("disable-lock-timeouts") {
client_config.chain.enable_lock_timeouts = false;
}
Expand Down
25 changes: 25 additions & 0 deletions lighthouse/tests/beacon_node.rs
Original file line number Diff line number Diff line change
Expand Up @@ -1237,6 +1237,31 @@ fn validator_monitor_file_flag() {
assert_eq!(config.validator_monitor_pubkeys[1].to_string(), "0xbeefdeadbeefdeaddeadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeef");
});
}
#[test]
fn validator_monitor_metrics_threshold_default() {
CommandLineTest::new()
.run_with_zero_port()
.with_config(|config| {
assert_eq!(
config.validator_monitor_individual_tracking_threshold,
// If this value changes make sure to update the help text for
// the CLI command.
64
)
});
}
#[test]
fn validator_monitor_metrics_threshold_custom() {
CommandLineTest::new()
.flag(
"validator-monitor-individual-tracking-threshold",
Some("42"),
)
.run_with_zero_port()
.with_config(|config| {
assert_eq!(config.validator_monitor_individual_tracking_threshold, 42)
});
}

// Tests for Store flags.
#[test]
Expand Down

0 comments on commit 5cbbaf8

Please sign in to comment.