Problems with Unstable Benchmarks #485

Open · Luapulu opened this issue Jul 1, 2021 · 8 comments

Luapulu commented Jul 1, 2021

I've been playing with the raindrops exercise from Exercism in both Rust and Julia. In Julia the minimum time seems to be quite stable, by which I mean that rerunning the benchmark doesn't change the minimum much at all. However, in Rust with Criterion my benchmarks vary significantly from run to run. I've already increased the confidence level to 0.99 and decreased the significance level to 0.01, but I still regularly get Criterion telling me that I have regressions or improvements even though nothing has changed.
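
For context, the benchmark setup looks roughly like this (the raindrops body is only a sketch of the Exercism exercise, not my exact code):

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

// Sketch of the raindrops exercise; not my exact implementation.
fn raindrops(n: u32) -> String {
    let mut out = String::new();
    if n % 3 == 0 { out.push_str("Pling"); }
    if n % 5 == 0 { out.push_str("Plang"); }
    if n % 7 == 0 { out.push_str("Plong"); }
    if out.is_empty() { n.to_string() } else { out }
}

fn bench_raindrops(c: &mut Criterion) {
    c.bench_function("raindrops 105", |b| b.iter(|| raindrops(black_box(105))));
}

// The statistical thresholds I tightened.
fn config() -> Criterion {
    Criterion::default()
        .confidence_level(0.99)
        .significance_level(0.01)
}

criterion_group! {
    name = benches;
    config = config();
    targets = bench_raindrops
}
criterion_main!(benches);
```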

Maybe I've screwed something up in the implementation, or this is some kind of fundamental difference between Julia and Rust, but I suspect it has to do with the way Criterion estimates central tendency. I have three ideas for what might help.

Use the minimum as an estimator

I haven't yet figured out how to get Criterion to show me and use the minimum. Maybe this is already implemented and I'm just screwing up.

We have a distribution that has a lower bound and seems to have a fairly fat tail. This combination makes the sample mean converge very slowly to the true mean. By comparison, because there isn't really any effect that could artificially decrease the benchmark time, the minimum should be the estimator most resistant to noise. Of course, it doesn't tell you what the expected time to complete some workload would be, but that's not the point of this kind of microbenchmark anyway. Usually, you're interested in whether some change you made has improved or worsened performance.

Don't remove low outliers

With the exception of timing errors, which are tiny compared to other sources of noise, no source of noise could lower times. So, I'm not sure such a thing as a low outlier even exists in this case.

Take more measurements with fewer samples per measurement

The more iterations you take together, the higher the probability that noise will push up the benchmark time. Of course, there's a tradeoff here, because a measurement needs to take enough time that it can be timed accurately. But I think you could push up the number of measurements quite a bit before that becomes an issue. This way, fewer measurements would have their time pulled up because a few very slow samples dragged up the whole measurement mean. Then the minimum and median would be better estimators, too.

Again, I think this point basically boils down to sample means being bad estimators of the true mean for fat-tailed distributions. The more samples you take per measurement, the more you rely on the mean as a good estimate of the central tendency of all the samples in that measurement.
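
Concretely, something like this is what I have in mind, using knobs Criterion already exposes (the numbers are just illustrative):

```rust
use std::time::Duration;
use criterion::Criterion;

// Take more samples (measurements), with a time budget large enough that
// each sample still contains enough iterations to be timed accurately.
fn many_small_measurements() -> Criterion {
    Criterion::default()
        .sample_size(1_000)
        .measurement_time(Duration::from_secs(30))
}
```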

Hopefully this is useful. I'm new to this package and Rust, so there's a good chance I missed something. Let me know what you think :)

Luapulu (Author) commented Jul 1, 2021

OK, I just increased the number of samples to 10,000 and I feel more confident saying the distribution is fat-tailed, and that at the very least taking more samples significantly improves the stability of the benchmark results. The analysis becomes very slow, though.

Taking lots of samples per measurement really hides the fat-tailedness. It's pernicious.

bheisler (Owner) commented Jul 7, 2021

Hello!

Using the minimum as the estimator is not supported. The reason why is explained in more detail in the user guide, but the short version is that sometimes (realistically, always) the actual performance of the function is variable - for example, Rust's HashMap type is keyed differently from run to run, which can produce a greater or lesser number of hash collisions from one run to the next - so reporting just the minimum runtime would be misleading. Additionally, because Criterion-rs doesn't measure each individual execution but instead measures a batch of executions, the closest you could possibly get is the minimum of the means of the batches.
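
To make that concrete, "the minimum of the means of the batches" would amount to something like this (an illustrative sketch, not Criterion's actual code):

```rust
/// Illustrative sketch only: given per-batch iteration counts and total times,
/// the closest analogue to "the minimum" is the smallest per-iteration mean
/// across batches.
fn min_of_batch_means(iters: &[u64], total_ns: &[f64]) -> Option<f64> {
    iters
        .iter()
        .zip(total_ns)
        .map(|(&n, &t)| t / n as f64) // mean time per iteration within one batch
        .min_by(|a, b| a.partial_cmp(b).unwrap())
}
```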

Second, Criterion-rs does not ignore the outliers, they're merely printed for the user to see an indication of how reliable the results are. They're honestly not a very good indicator and I may remove them or change how they're calculated in a future version.

For your third point - yes? I guess the way I see it is that the 'noise' is actually signal. Unless you can guarantee that your production environment will be less noisy than your testing environment (and if you can, then just do the same magic trick for your testing environment) then including the 'noise' in the measurement gives you a more-accurate picture of how your code will perform in production. This is also why Criterion-rs displays the full estimated PDF of the performance distribution and another reason for not just using the minimum - your code won't execute in the minimum possible time on every execution.

The confidence and significance levels are merely thresholds used for reporting, they don't affect the measurement process. As you've noticed, you'll need to increase the measurement time or sample count to get a more reliable measurement.

Luapulu (Author) commented Jul 9, 2021

Hey, thanks for the response :)

> Using the minimum as the estimator is not supported. The reason why is explained in more detail in the user guide, but the short version is that sometimes (realistically, always) the actual performance of the function is variable - for example, Rust's HashMap type is keyed differently from run to run, which can produce a greater or lesser number of hash collisions from one run to the next - so reporting just the minimum runtime would be misleading. Additionally, because Criterion-rs doesn't measure each individual execution but instead measures a batch of executions, the closest you could possibly get is the minimum of the means of the batches.

I'm fine with keeping the other measures around. All I'd like is the option of seeing the minimum. I understand it will be the minimum of a mean, because you need to average to reduce the impact of timing errors.

> For your third point - yes? I guess the way I see it is that the 'noise' is actually signal. Unless you can guarantee that your production environment will be less noisy than your testing environment (and if you can, then just do the same magic trick for your testing environment) then including the 'noise' in the measurement gives you a more-accurate picture of how your code will perform in production. This is also why Criterion-rs displays the full estimated PDF of the performance distribution and another reason for not just using the minimum - your code won't execute in the minimum possible time on every execution.

The thing is, I don't use this kind of benchmark to try to get an idea of what real-world execution times will be like. I use it mostly to see whether there's been a regression or improvement after I changed something. And for that, the minimum is exactly what you want. :)

bertiqwerty commented Jul 26, 2021

Hi there, I have just recently started using Criterion. Thanks for the nice library.

About the minimum, I agree with @Luapulu and would also really appreciate this as an additional option. For deterministic algorithms, taking the minimum makes sense to me, since the noise only adds more time. See also this explanation by Andrei Alexandrescu in favor of the minimum: https://youtu.be/vrfYLlR8X8k?t=1024.

bheisler (Owner) commented
I guess my feeling is that a lot of people incorrectly believe they have deterministic algorithms, and would be prone to enabling such an option when they shouldn't. Anything that touches hashmaps or multi-threading isn't deterministic, nor is anything that makes system calls to the operating system (or at least, performance-determinism on the part of the OS can't be assumed). I think many people expect hashmaps at least to be deterministic and wouldn't notice that they aren't, and I'm concerned about that being a potential trap for the unwary.
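
As a small illustration of the hashmap point (standard library behavior, nothing Criterion-specific):

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};

// The standard library seeds its hasher randomly, so the hash of the same key
// (and therefore collision patterns and iteration order) generally differs
// between program runs.
fn main() {
    let state = RandomState::new();
    let mut hasher = state.build_hasher();
    hasher.write(b"same key");
    println!("hash of \"same key\" this run: {:x}", hasher.finish());
}
```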

Adding options always increases the burden on the user to understand what the options are and what they mean and to choose the right one. I tend to lean towards having Criterion-rs be opinionated rather than configurable when I can, and at least provide sensible defaults when I can't. On that basis I'm kind of wavering back and forth between sticking with the mean (potentially displaying the min in the verbose output along with the other statistics) and adding an option to change the estimator (such an option should default to using the mean and be documented with a discussion of the tradeoffs). The Andrei Alexandrescu video also suggests a third option, which is to take the peak of the estimated PDF instead of the mean, as a way of estimating the mode. I have no idea what kind of statistical properties that would have, but intuitively it seems it probably would be more robust against noise than the mean and less prone to latching on to the one perfect iteration than the minimum.
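
For reference, a rough sketch of what "take the peak of the estimated PDF" could look like (a naive Gaussian KDE argmax; not anything Criterion currently does, and bandwidth selection is hand-waved):

```rust
// Evaluate an unnormalised Gaussian KDE on a grid over the sample range and
// return the argmax as a mode estimate.
fn kde_mode(samples: &[f64], bandwidth: f64, grid_points: usize) -> Option<f64> {
    if samples.is_empty() || grid_points < 2 {
        return None;
    }
    let lo = samples.iter().cloned().fold(f64::INFINITY, f64::min);
    let hi = samples.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let mut best = (f64::NEG_INFINITY, lo);
    for i in 0..grid_points {
        let x = lo + (hi - lo) * i as f64 / (grid_points - 1) as f64;
        // Normalisation doesn't affect the argmax, so it is omitted here.
        let density: f64 = samples
            .iter()
            .map(|&s| (-((x - s) * (x - s)) / (2.0 * bandwidth * bandwidth)).exp())
            .sum();
        if density > best.0 {
            best = (density, x);
        }
    }
    Some(best.1)
}
```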

What do you think, @lemmih?

jaskij commented Oct 15, 2021

Re: determinism on the part of the OS - I'm doing some microbenchmarks (code that compiles to a single AVX instruction) and seeing around 3-5% run-to-run variation.

I'm working under Linux on a Ryzen 5900X. My best guess as to the reason is the lack of CPPC support and the scheduler throwing the bench onto a different core each time, with different boost clocks. Not to mention that boosting behavior is itself not fully deterministic, since it depends on environmental factors.

Here's the code: https://gitlab.com/jaskij/math-vec-static/-/tree/ff7599236ff9aee4717bb9b075d9dba55a2dc443
Sorry for not exactly being minimal, but I believe it's small enough.

Edit: I almost forgot my actual question:
To that end, would it make sense to ask for core pinning to be implemented in Criterion, or just switch to Iai or criterion-linux-perf?
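
In the meantime, one way to pin from inside the bench binary today (a sketch assuming the third-party core_affinity crate, not a Criterion feature):

```rust
// Call this at the start of the benchmark binary, before any measurement.
fn pin_to_one_core() {
    if let Some(core_ids) = core_affinity::get_core_ids() {
        if let Some(&core) = core_ids.first() {
            // Returns false if the OS rejects the affinity request.
            let pinned = core_affinity::set_for_current(core);
            eprintln!("pinned to core {:?}: {}", core, pinned);
        }
    }
}
```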

asomers (Contributor) commented Apr 24, 2022

I"m using a Ryzen 5950X, and I also see lots of variability. But pinning the process to a single core using the cpuset command does eliminate most of the variability. I also used powerd to throttle the CPU down to minimum speed, hoping that that would disable boost. But I'm not sure it does.

bazhenov commented May 9, 2023

Just saw this thread and I feel I should step in to support @bheisler's point. There is no such thing as deterministic performance. OK, it may be the case on microcontrollers and in embedded software, where performance is still proportional to the number of instructions in the firmware, but it's not the case for modern CPUs.

> With the exception of timing errors, which are tiny compared to other sources of noise, no source of noise could lower times.

Actually, it can. Consider binary search. The performance of the algorithm is bounded by two things: (1) memory access and (2) the branch predictor. Setting memory access aside, if you test binary search over a small enough dataset you will end up with a small but predictable number of runs where the searched value favors the current state of the branch predictor. For example, if you search for the minimum value in the array, every comparison of the needle against the midpoint takes a left turn, which favors the branch predictor (see the most upvoted question on StackOverflow).
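
Roughly, the effect looks like this (an illustrative sketch, not a rigorous benchmark):

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

// Searching for the smallest element makes every midpoint comparison take the
// same branch, which the branch predictor learns; random needles keep the
// branches unpredictable. Assumes `data` is sorted and non-empty.
fn compare_patterns(data: &[u64], random_needles: &[u64]) -> (Duration, Duration) {
    let t0 = Instant::now();
    for _ in 0..10_000 {
        black_box(data.binary_search(&data[0])); // always "go left"
    }
    let predictable = t0.elapsed();

    let t1 = Instant::now();
    for needle in random_needles.iter().cycle().take(10_000) {
        black_box(data.binary_search(needle)); // hard-to-predict branches
    }
    let unpredictable = t1.elapsed();

    (predictable, unpredictable)
}
```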

You should never use extreme observations (like the min or max) as reliable estimates of an algorithm's performance. But you don't want to ignore outliers either. You want to use the outlier detector as an indicator that the benchmark is unstable and that you need to do something about it, but it's your responsibility to figure out what exactly.

There are a lot of performance factors in CPUs you have no control over. It might be that some other process in the OS creates cache contention, or that the OS or the CPU itself decides to throttle down for thermal or power reasons. In general, if a benchmark produces unstable results, it means the benchmark has too many degrees of freedom. The performance of the code depends not only on the code itself, but also on the environment. So you need to control for environmental factors: fix the size of the payload, make sure data-dependent control flow is not easily predicted, and keep benchmark iterations stationary (hence warmup and linear sampling).
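
To be concrete, Criterion already exposes some of these controls on a benchmark group (a sketch; the duration is only illustrative):

```rust
use std::time::Duration;
use criterion::{Criterion, SamplingMode};

// An explicit warm-up period and linear sampling for the group.
fn controlled_bench(c: &mut Criterion) {
    let mut group = c.benchmark_group("controlled");
    group.warm_up_time(Duration::from_secs(3));
    group.sampling_mode(SamplingMode::Linear);
    // group.bench_function(...) calls with a fixed payload size go here.
    group.finish();
}
```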

So we should always think about performance statistically when comparing two versions of an algorithm. This means comparing means (pun not intended), not minima and maxima.
