use SlidingTimeWindowArrayReservoir instead of default ExponentiallyDecayingReservoir in dropwizard metrics #11695

gortiz · 2023-09-27T10:27:46Z

As we know, Pinot uses Yammer as default metric exported, which has several issues.

The main issue, explained in #9416, is that Yammer histograms (including Timers) can only use the default exponential decaying reservoir, which returns incorrect results when sample data doesn't have a normal distribution. See the following post to understand the issue:

As explained in the these two blogposts, dropwizard has provides the ability to change the reservoir and there are two well known alternatives:

SlidingTimeWindowArrayReservoir, included in Dropwizard > 3.2.3
HdrHistogramReservoir, a third party reservoir that uses the well known hdr histogram library under the hood.

I have experience using the later and it works very well. But what I've done in this PR is to modify pinot-dropwizard to use SlidingTimeWindowArrayReservoir by default. I've discarded the other two options for the following reasons:

Default exponential decaying reservoir: It is very bad in a lot of cases. See the first mentioned post. This affects us in query latency metrics when tables are not frequently queried.
HdrHistogramReservoir: Although my personal experience with the library is very good, I decided to do not use it in order to do not add just another dependency to Pinot.

…ecayingReservoir in dropwizard metrics

codecov-commenter · 2023-09-27T11:07:39Z

Codecov Report

Merging #11695 (d024450) into master (17ff44e) will decrease coverage by 0.04%.
The diff coverage is 0.00%.

@@             Coverage Diff              @@
##             master   #11695      +/-   ##
============================================
- Coverage     63.08%   63.04%   -0.04%     
+ Complexity     1121     1120       -1     
============================================
  Files          2343     2343              
  Lines        125676   125680       +4     
  Branches      19314    19314              
============================================
- Hits          79285    79239      -46     
- Misses        40739    40787      +48     
- Partials       5652     5654       +2

Flag	Coverage Δ
integration	`<0.01% <ø> (ø)`
integration1	`<0.01% <ø> (ø)`
integration2	`0.00% <ø> (ø)`
java-11	`63.01% <0.00%> (-0.02%)`	⬇️
java-17	`62.90% <0.00%> (-0.01%)`	⬇️
java-20	`62.89% <0.00%> (+12.92%)`	⬆️
temurin	`63.04% <0.00%> (-0.04%)`	⬇️
unittests	`63.04% <0.00%> (-0.04%)`	⬇️
unittests1	`67.19% <ø> (-0.04%)`	⬇️
unittests2	`14.45% <0.00%> (-0.02%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files	Coverage Δ
.../metrics/dropwizard/DropwizardMetricsRegistry.java	`23.07% <0.00%> (-4.20%)`	⬇️

... and 16 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Jackie-Jiang · 2023-09-27T17:29:23Z

Thanks for raising this and providing the readings! Curious how you decide to configure the sliding window to be 15 minutes? What is the side effect of that?

gortiz · 2023-09-28T07:22:50Z

Thanks for raising this and providing the readings! Curious how you decide to configure the sliding window to be 15 minutes? What is the side effect of that?

I've just decided to use 15 mins because our histograms and timers provide several windowed rate values like oneMinuteRate, fiveMinuteRate and fifteenMinuteRate, being the last the longer one. In order to actually return a correct value for that rate we need to store at least the results during the last 15 mins.

The side effect is the memory the metric requires. SlidingTimeWindowArrayReservoir requires 128 bits per value stored. Therefore each histogram will require:

Size = metric_frequency * time_window_size * 128 / 8 Bytes

Which means that

measures per second	time window (mins)	size (MBs)
10	1	0.0096
10	5	0.048
10	15	0.144
100	1	0.096
100	5	0.48
100	15	1.44
1000	1	0.96
1000	5	4.8
1000	15	14.4
10000	1	9.6
10000	5	48
100000	15	1440

This is one of the problems SlidingTimeWindowArrayReservoir implementation has. The size on heap depends on the measures per second. In case they are not controlled (as it may happen in Pinot), its footprint does not have an upper bound.

HdrHistogram doesn't have this problem. Instead in HdrHistogram you define the min and max expected values (which may be also problematic in our case) and the precision you want to have. From HdrHistogram documentation:

For example, a Histogram could be configured to track the counts of observed integer values between 0 and 3,600,000,000 while maintaining a value precision of 3 significant digits across that range. Value quantization within the range will thus be no larger than 1/1,000th (or 0.1%) of any value. This example Histogram could be used to track and analyze the counts of observed response times ranging between 1 microsecond and 1 hour in magnitude, while maintaining a value resolution of 1 microsecond up to 1 millisecond, a resolution of 1 millisecond (or better) up to one second, and a resolution of 1 second (or better) up to 1,000 seconds. At its maximum tracked value (1 hour), it would still maintain a resolution of 3.6 seconds (or better).

What do you think? Should we decrease the SlidingTimeWindowArrayReservoir time window to something like 5 mins? Should we add HdrHistogram? In the latter case it would be nice to add new methods to our metric registry to let calling code configure the min and max values.

Jackie-Jiang

Thanks for the detailed explanation! I'm good with this as a starting point since we don't have that many metrics. We may revisit this in the future, or make it configurable.

use SlidingTimeWindowArrayReservoir instead of default ExponentiallyD…

d024450

…ecayingReservoir in dropwizard metrics

Jackie-Jiang added the metrics label Sep 27, 2023

Jackie-Jiang approved these changes Sep 28, 2023

View reviewed changes

Jackie-Jiang merged commit c642db0 into apache:master Sep 28, 2023
19 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

use SlidingTimeWindowArrayReservoir instead of default ExponentiallyDecayingReservoir in dropwizard metrics #11695

use SlidingTimeWindowArrayReservoir instead of default ExponentiallyDecayingReservoir in dropwizard metrics #11695

gortiz commented Sep 27, 2023

codecov-commenter commented Sep 27, 2023 •

edited

Loading

Jackie-Jiang commented Sep 27, 2023

gortiz commented Sep 28, 2023

Jackie-Jiang left a comment

use SlidingTimeWindowArrayReservoir instead of default ExponentiallyDecayingReservoir in dropwizard metrics #11695

use SlidingTimeWindowArrayReservoir instead of default ExponentiallyDecayingReservoir in dropwizard metrics #11695

Conversation

gortiz commented Sep 27, 2023

codecov-commenter commented Sep 27, 2023 • edited Loading

Codecov Report

Jackie-Jiang commented Sep 27, 2023

gortiz commented Sep 28, 2023

Jackie-Jiang left a comment

Choose a reason for hiding this comment

codecov-commenter commented Sep 27, 2023 •

edited

Loading