
Create and maintain a benchmarking suite for Wasmtime #4

Merged (9 commits) on Jan 22, 2021

Conversation

@fitzgen (Member) commented Nov 18, 2020

Rendered RFC


Note: this is complementary to RFC #3. Where that RFC is seeking consensus on whether we should establish benchmarking infrastructure at all, this RFC focuses on defining where we want our benchmarking story to ultimately end up: how we select benchmark programs, how we avoid measurement bias, how we perform sound and rigorous statistical analyses of the results, etc.


> Note that this approach is more coarsely grained than what Stabilizer
> achieves. Stabilizer randomized code layout at the function granularity, while
> this approach works at the object file granularity.
Contributor

I believe lld has a dedicated option to specify the section order. In any case, a linker script should work. Every function is put in a different section by rustc, so randomizing the section order should be enough.
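As a rough, purely illustrative sketch of how that might be wired up: the snippet below generates a randomized ordering file that could be handed to lld via its `--symbol-ordering-file` option between benchmark runs. The symbol names and output file name are placeholders; a real implementation would extract the (mangled) function symbols from the build's object files.

```rust
use std::fs::File;
use std::io::{BufWriter, Write};

fn main() -> std::io::Result<()> {
    // Hypothetical function symbols; in practice these would be the mangled
    // symbols pulled out of the object files (e.g. via `nm`).
    let mut symbols: Vec<String> = vec![
        "wasmtime_runtime_instance_new".into(),
        "cranelift_codegen_context_compile".into(),
        "wasmtime_module_from_binary".into(),
    ];

    // Simple seeded LCG so each run can use a fresh but reproducible layout
    // without pulling in extra dependencies.
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    let mut next = move || {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        state
    };

    // Fisher-Yates shuffle of the symbol order.
    for i in (1..symbols.len()).rev() {
        let j = (next() % (i as u64 + 1)) as usize;
        symbols.swap(i, j);
    }

    // lld's `--symbol-ordering-file` expects one symbol per line.
    let mut out = BufWriter::new(File::create("symbol-order.txt")?);
    for sym in &symbols {
        writeln!(out, "{}", sym)?;
    }
    Ok(())
}
```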

statements like "we are 95% confident that change X produces a 5.5% (+/- 0.8%)
speed up compared to the original system". See the "Challenge of Summarizing
Results" and "Measuring Speedup" sections of *Rigorous Benchmarking in
Reasonable Time*<sup>[4](#ref-4)</sup> for further details.
Contributor

Should there be more bench runs when the difference on a benchmark is less statistically significant?

@fitzgen (Member Author) Nov 18, 2020

The referenced paper grapples with this question, and ultimately gives you a formula: you plug in the levels of variation at each layer (taken from some initial experiments) and a desired confidence interval, and it tells you how many iterations at each level you need to perform before you actually conduct the experiment. It's a good paper, I recommend reading it!

It is not recommended to keep running the benchmark at runtime for X minutes, or until a given confidence level is reached, because this makes results from different runs (e.g. with and without your changes) subtly incomparable.
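To illustrate the kind of statement quoted above ("we are 95% confident that change X produces a 5.5% (+/- 0.8%) speed up"), here is a deliberately simplified sketch that computes a speedup and a normal-approximation 95% confidence interval from two sets of timings. This is not the multi-level method from the Kalibera and Jones paper (which also accounts for variation across process executions, compilations, etc.); the sample timings are made up.

```rust
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

fn variance(xs: &[f64]) -> f64 {
    let m = mean(xs);
    xs.iter().map(|x| (x - m).powi(2)).sum::<f64>() / (xs.len() as f64 - 1.0)
}

/// Returns (speedup, half_width) so the interval is `speedup ± half_width`.
fn speedup_ci_95(baseline: &[f64], modified: &[f64]) -> (f64, f64) {
    let (mb, mm) = (mean(baseline), mean(modified));
    let (vb, vm) = (variance(baseline), variance(modified));
    let (nb, nm) = (baseline.len() as f64, modified.len() as f64);

    // Speedup as the ratio of mean times: > 1.0 means the change is faster.
    let ratio = mb / mm;

    // Delta-method standard error of the ratio of two independent means,
    // scaled by 1.96 (the 95% quantile of the normal distribution) below.
    let se = ratio * ((vb / (nb * mb * mb)) + (vm / (nm * mm * mm))).sqrt();

    (ratio, 1.96 * se)
}

fn main() {
    // Hypothetical wall-clock timings in seconds.
    let baseline = [1.02, 1.00, 0.99, 1.01, 1.03];
    let modified = [0.95, 0.96, 0.94, 0.97, 0.95];
    let (speedup, hw) = speedup_ci_95(&baseline, &modified);
    println!("speedup: {:.3} +/- {:.3} (95% CI)", speedup, hw);
}
```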

For optimal developer ergonomics, it should be possible to run the benchmark
suite remotely by leaving a special comment on a github pull request:

> @bytecodealliance-bot bench
Contributor

Who should be allowed to run it? Will there be a rate-limit per developer?

Member Author

This is discussed a bit in #3 as well. I expect that we will use an explicit allow list. I don't think we need to worry about rate limiting right off the bat, since the allow list will be people we all trust and can just ping if there are any issues.

@alexcrichton (Member)

This looks fantastic to me, and thanks for taking the time to lay out milestones at the end; it was quite helpful to see what sort of work is necessary for each particular outcome! I'm pretty interested in how the nitty-gritty of a lot of this will work, but I think that's out of scope for this RFC, in that it's probably best handled in code review and such! (I also left a comment on #3 about stability which may apply more here, but that's probably all out of scope of these two RFCs.)

@jlb6740 commented Nov 19, 2020

@fitzgen This is fantastic! It does not merely describe criteria for adding a benchmark to a future benchmark suite, but really focuses on the details of a strategy for testing those workloads. This involves the characteristics of the benchmarks, specific testing features for wasmtime/cranelift, and the characteristics of the runner. So there is a lot here, and a lot to potentially comment on, but this is clearly a ton of good practices with references that we should probably adopt. I agree with @alexcrichton that the milestone description and graph are cool ... very nice to visualize the dependencies there. So all of this is well done; below I'll just focus my comments on things I would think about:

From Motivation:

Evaluating the speedup of a proposed performance optimization highlights the need for developers to be able to quickly, and perhaps selectively, run a benchmark that exercises the aspects of the compiler/runtime they are working on. This means being able to run offline before pushing a patch for review. This points to a need for a tight coupling between the distribution of wasmtime/cranelift and the runner, in addition to maybe a separate offline/remote/nightly runner. Testing for performance regressions can be done in tiers: a much faster run performs a smoke test as part of patch merging, and separately the more comprehensive daily test runs as mentioned.

At first I questioned the non-goal: while I see how it impacts decisions made for the runner and viewer, I wasn't quite sure about the term "general-purpose". I figure the benchmark suite would target the same aspects of wasmtime/cranelift that it would for any VM and runtime, right? Maybe the point here is that it is good to target the VM/runtime and not the underlying architecture (no inline assembly, for example)? Or that, if we make any modifications to the runner or workload, we only need to be concerned about it working with wasmtime? One thing I would add is that I believe it is important to have multiple perspectives when judging how well the execution performed (meaning multiple baselines). Certainly the previous run and the previous several runs are a great baseline, but testing against native will also be important. A native baseline supports a third motivating factor for this infrastructure, which is to find gaps in performance that need optimization and improvement.

From Proposals:

Maybe this is second tier reporting and not default, but I would add a breakdown of wall time into user time and system time. Depending on the benchmark, kernel time can be dominant and so is good to factor in sooner rather than later during offline analysis. Also, for candidate benchmarks, some of the requirements may prevent us from just taking code as-is without modification (adding support for more than one associated input, for example), and those requirements may not be so critical.

Real, widely used programs are certainly desirable, but I personally don't have a problem with micro-benchmarks and synthetic benchmarks if they serve their purpose. Micro-benchmarks can be great at stress testing certain aspects of the VM to their limits in ways you simply aren't going to get from a real-world program. Furthermore, with real-world programs you aren't always enlightened about what makes them tick: what aspects are you really timing and testing? With micro-benchmarks it is no secret what you are testing, and you know when to anticipate a change in performance. That said, I must say the initial candidate list is very impressive, and certainly these real-world programs must be the priority.

From Rationale and Alternatives:

I agree with the drawback of not being able to add certain benchmarks because they assume JavaScript, but I will say there may be some things we can harvest from those benchmarks with a little work or no work at all. JetStream2, for example, has some benchmarks such as gcc_loops that can be compiled directly to wasm without modification and is in fact capable of being auto-vectorized. I've used it myself when looking at SIMD performance of Wasmtime, but there may be others there. Also, I wouldn't mind including shootout, or some subset of shootout, for the reasons I described; plus I think it helped to inform some aspects of Lucet development, and as the two are merged it may be a good reference.

From Open Questions:

The main thing I'd touch on here is the runner. Can and should we evolve Sightglass's benchmark runner? My initial thoughts here are that I think ultimately it may be best to have a Sightglass 2.0. The way the driver is set up now, it is really not intended to be tightly coupled with any one VM, in order to allow adding any workload and any runtime without criteria. The original code is written in Rust and perhaps much of this can be reused, though then again some of the ideas that @abrown has mentioned may suggest gutting this, or may suggest a completely fresh start with only Wasmtime and the requirements of this RFC in mind.

Also, not all workloads simply measure performance by time spent. Workloads can of course run for a fixed amount of time and then report the amount of work done in terms of some metric or score. Other workloads have already put timers in place and print out the time of the important code for you. Do we want our infrastructure to allow workloads/benchmarks to report their own time, where we can (through a script) parse that data and still report it consistently with all the other benchmark output?

@fitzgen (Member Author) commented Nov 30, 2020

@lars-t-hansen says (quoted with permission):

> I support this. I especially support the Motivation and avoiding a benchmarking race-to-the-bottom... Thanks for doing it!

> Everything looks really solid and I don't think I have anything to contribute at this point, except I think that if we could find some way for this to work with the JS shells its value would increase significantly. A discussion for another day or a derivative effort.

> Hyperthreading and NUMA are concerns that need to be addressed when running. (Whenever I run compilation time tests on my dev system I need to be sure to disable NUMA to get repeatable and meaningful results. Possibly systems in data centers that will run regression tests are more likely NUMA than developer desktop systems?)

> One gut feeling I have is that to regression test some types of changes you really want microbenchmarks, not applications, because you don't know that your applications focus enough on the feature you're trying to safeguard in the context where it is valuable. (For the same reason I have started adding test cases that check that a specific instruction is generated by the compiler at least in a specific controlled context, so as to check that an optimization works in at least that context.) But that's really additive to what you're proposing, not conflicting with it.

@fitzgen (Member Author) commented Dec 10, 2020

Thanks for the comments @jlb6740! Replies inline below.

> Evaluating the speedup of a proposed performance optimization highlights the need for developers to be able to quickly, and perhaps selectively, run a benchmark that exercises the aspects of the compiler/runtime they are working on. This means being able to run offline before pushing a patch for review. This points to a need for a tight coupling between the distribution of wasmtime/cranelift and the runner, in addition to maybe a separate offline/remote/nightly runner.

Yep, and I would expect that they would share 99% of their code. The main difference would be that a developer running locally would supply two commit hashes or branches to compare (e.g. main and my-optimization-branch) and the runner would build and run benchmarks for both, whereas in the offline/remote/nightly automation the runner would only build and run the latest main version of wasmtime.

> Testing for performance regressions can be done in tiers: a much faster run performs a smoke test as part of patch merging, and separately the more comprehensive daily test runs as mentioned.

Yes, exactly. This is the point of having multiple input workloads for our benchmark candidates: one small workload to get a quick smoke test, and one larger workload for in-depth data.

> Maybe this is second tier reporting and not default, but I would add a breakdown of wall time into user time and system time. Depending on the benchmark, kernel time can be dominant and so is good to factor in sooner rather than later during offline analysis.

Good idea, this is not only something we can record in the runner, but also something we can classify candidate benchmark programs by in our PCA.
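For illustration only, here is a minimal Linux-only sketch (using the `libc` crate) of how a runner could report user and system time alongside wall-clock time for a benchmark child process. The `wasmtime benchmark.wasm` invocation and overall structure are hypothetical placeholders, not the actual sightglass runner code.

```rust
use std::process::Command;
use std::time::Instant;

fn timeval_to_secs(tv: libc::timeval) -> f64 {
    tv.tv_sec as f64 + tv.tv_usec as f64 / 1_000_000.0
}

fn main() {
    let wall_start = Instant::now();

    // Hypothetical: run one benchmark iteration as a child process.
    let status = Command::new("wasmtime")
        .arg("benchmark.wasm")
        .status()
        .expect("failed to spawn benchmark");
    assert!(status.success(), "benchmark did not exit cleanly");

    let wall = wall_start.elapsed().as_secs_f64();

    // RUSAGE_CHILDREN aggregates the resource usage of waited-for children,
    // giving the user/system split for the process we just ran.
    let mut usage: libc::rusage = unsafe { std::mem::zeroed() };
    let rc = unsafe { libc::getrusage(libc::RUSAGE_CHILDREN, &mut usage) };
    assert_eq!(rc, 0, "getrusage failed");

    println!(
        "wall: {:.3}s, user: {:.3}s, sys: {:.3}s",
        wall,
        timeval_to_secs(usage.ru_utime),
        timeval_to_secs(usage.ru_stime),
    );
}
```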

> Also, for candidate benchmarks, some of the requirements may prevent us from just taking code as-is without modification (adding support for more than one associated input, for example), and those requirements may not be so critical.

Yeah, ideally we would only need to add bench.start and bench.end calls. As a practical matter, it may make more sense to err on the side of avoiding benchmark programs that aren't well-factored from their workloads, because otherwise updating such programs in the future may become a real chore.
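For example, instrumenting a candidate program might look roughly like the sketch below. The "bench" import module, the `start`/`end` function names, the input file, and the expected checksum are all placeholders for whatever the final benchmarking protocol defines; this is not the actual API.

```rust
// Host functions the runner is assumed to provide to the Wasm guest.
#[link(wasm_import_module = "bench")]
extern "C" {
    fn start();
    fn end();
}

fn do_work(input: &[u8]) -> u64 {
    // Stand-in for the real workload being benchmarked.
    input.iter().map(|&b| b as u64).sum()
}

fn main() {
    // Setup that should not be measured: reading input, allocating buffers, etc.
    let input = std::fs::read("input.dat").expect("missing workload input");

    unsafe { start() }; // runner begins timing / counter collection here
    let checksum = do_work(&input); // the region we actually want to measure
    unsafe { end() }; // runner stops timing here

    // Hypothetical expected value: verifying the result outside the measured
    // region keeps correctness checks from skewing the timings, while making
    // sure a broken build can't masquerade as a fast one.
    assert_eq!(checksum, 0x1234_5678);
}
```

Keeping the correctness check outside the `start`/`end` window also means the same program can be reused with differently sized inputs (smoke test vs. in-depth workload) without changing what is measured.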

> Real, widely used programs are certainly desirable, but I personally don't have a problem with micro-benchmarks and synthetic benchmarks if they serve their purpose. Micro-benchmarks can be great at stress testing certain aspects of the VM to their limits in ways you simply aren't going to get from a real-world program. Furthermore, with real-world programs you aren't always enlightened about what makes them tick: what aspects are you really timing and testing? With micro-benchmarks it is no secret what you are testing, and you know when to anticipate a change in performance.

Regarding not knowing what makes the program tick: by requiring source code for candidate programs, we can hopefully avoid this to some degree. Agreed that micro-benchmarks are easier to quickly grok, though.

Micro-benchmarks have their place, and we can include some as we find blind spots and want to target specific things (like Lars also mentioned) but I think it makes sense, especially as we create the initial set of candidates, to focus on real programs.

> I agree with the drawback of not being able to add certain benchmarks because they assume JavaScript, but I will say there may be some things we can harvest from those benchmarks with a little work or no work at all. JetStream2, for example, has some benchmarks such as gcc_loops that can be compiled directly to wasm without modification and is in fact capable of being auto-vectorized. I've used it myself when looking at SIMD performance of Wasmtime, but there may be others there.

Ah, great! Yeah, we can definitely include that as a candidate then.

> The main thing I'd touch on here is the runner. Can and should we evolve Sightglass's benchmark runner? My initial thoughts here are that I think ultimately it may be best to have a Sightglass 2.0. The way the driver is set up now, it is really not intended to be tightly coupled with any one VM, in order to allow adding any workload and any runtime without criteria. The original code is written in Rust and perhaps much of this can be reused, though then again some of the ideas that @abrown has mentioned may suggest gutting this, or may suggest a completely fresh start with only Wasmtime and the requirements of this RFC in mind.

Yeah, it has become clear to me that we can replace just the runner with a 2.0 version, and then make it spit out data in a compatible format that the rest of sightglass expects. This way we still keep the server, UI, and graphs and all that working. I'll start prototyping this on top of @abrown's prototype of the new runner.

> Also, not all workloads simply measure performance by time spent. Workloads can of course run for a fixed amount of time and then report the amount of work done in terms of some metric or score. Other workloads have already put timers in place and print out the time of the important code for you. Do we want our infrastructure to allow workloads/benchmarks to report their own time, where we can (through a script) parse that data and still report it consistently with all the other benchmark output?

The downside with workloads like "run X for Y amount of time" is that it is both harder to verify that the program produced the correct result (and we don't want to measure a buggy/broken program!) and counters like instructions retired are not semi-deterministic anymore, which makes comparing executions more difficult. Additionally, it is a different kind of program that the runner will have to interface with and understand how to run and report measurements for, which is additional engineering work for us. Ultimately, I don't think we want to add these kinds of benchmark programs to the suite.

@abrown (Contributor) commented Jan 6, 2021

@penzin pointed me to https://github.com/binji/raw-wasm; since you suggest here that we add benchmarks that use some Wasm-related tools, it could provide some inputs to those tools.

@fitzgen (Member Author) commented Jan 11, 2021

Motion to finalize with a disposition to merge

I'm proposing that we merge this RFC and I'll add the motion-to-finalize and disposition-merge tags to the PR. The PR has received numerous pieces of positive feedback and we have milestones for incremental implementation. Prototype work on the implementation has already begun.

Stakeholders sign-off

Tagging all employees of BA-affiliated companies who have committed to the wasmtime/cranelift repo in the last three months, plus anyone who has given feedback on this PR, as stakeholders.

Fastly

IBM

Intel

Mozilla

on the Web. Doing that well requires including interactions with the rest of the
Web browser: JavaScript, rendering, and the DOM. Building and integrating a full
Web browser is overkill for our purposes, and represents significant additional
complexity that we would prefer to avoid.
Member

Reading through this RFC again, one thing that occurs to me is that it doesn't clarify whether "microbenchmarks" are in scope or not. On one hand, we might say that micro-benchmarks are indeed in scope, specifically because we're not trying to build a general-purpose benchmark suite, but instead just something to collect data to help identify performance changes over time. On the other hand, some parts of this proposal talk about a desire for a representative corpus. Could you clarify the intended stance on microbenchmarks?

Member Author

I think my position is summarized in my response to Johnnie up thread:

> Micro-benchmarks have their place, and we can include some as we find blind spots and want to target specific things (like Lars also mentioned) but I think it makes sense, especially as we create the initial set of candidates, to focus on real programs.

Multiple people have mentioned the desire to run microbenchmarks, and if we get value out of having them, I don't want to stand in the way for ideological reasons. As long as the microbenchmarks use the same benchmarking API/protocol that the rest of the corpus uses, supporting them should be straightforward and shouldn't require any additional engineering effort.

@fitzgen (Member Author) commented Jan 11, 2021

Entering Final Comment Period

https://github.com/bytecodealliance/rfcs/blob/main/accepted/rfc-process.md#making-a-decision-merge-or-close

> Once any stakeholder from a different group has signed off, the RFC will move into a 10 calendar day final comment period (FCP), long enough to ensure that other stakeholders have at least a full business week to respond.

The FCP will end on January 22nd.

@lars-t-hansen

I appear unable to approve this, but FWIW I have no objections.

@tschneidereit (Member)

> I appear unable to approve this, but FWIW I have no objections.

Yeah, this is unfortunately tricky to fix without an rfc-bot doing it for us. For now, using gh reviews seems like the best path forward, with the champion setting the check marks. Apologies for the rough edges here!

@fitzgen (Member Author) commented Jan 22, 2021

The FCP has elapsed without any objections being raised. I'm going to merge this. Thanks everybody!
