
memo: improvement list #15

Closed
5 of 8 tasks
ashWhiteHat opened this issue Dec 3, 2021 · 15 comments
Labels
enhancement New feature or request

Comments


ashWhiteHat commented Dec 3, 2021

I'm noting an improvement list here so we don't forget.

  • Refactoring
    • Migrate ff to assembly
  • Hardcode
    • Fft twiddles factor
    • Fft Omega and Omega inv
  • Reuse
    • Fft twiddles factor
    • Bitreverse order

Thank you Onur for giving ideas!

@barryWhiteHat

Can we move this to the halo2 repo?

@ashWhiteHat (Author)

Yes!

@ashWhiteHat ashWhiteHat transferred this issue from privacy-scaling-explorations/zkevm-circuits Dec 6, 2021

lispc commented Dec 9, 2021

@noctrlz I posted some CPU profiling graphs here: zcash#418 (comment). It seems Domain::rotate_extended costs a lot of time. Maybe another big issue we need to solve later.

@ashWhiteHat (Author)

@lispc
Thank you for the nice feedback!
I added it to the description.


Brechtpd commented Dec 17, 2021

Some initial ideas after looking at the code a bit, and things I plan to look into further and implement. I'll have to implement some of these to know how big a difference they make and whether they are worth it.

Most time is spent evaluating the expressions, the FFTs, and the multiexps. Shameless plug for my post about my libsnark optimizations, but much of what is said there should also apply here.

FFT

  • Make a recursive version like Loopring/libfqfft@1d8db42
  • Often an FFT is done and then an extra factor/operation is applied. It's better to
    • apply the additional computation within the FFT for cache reasons (otherwise everything has to be loaded into cache again)
    • precompute with the twiddles/... where possible
  • Often the FFT is done where a lot (like 95%) of the last values are 0. This could probably be exploited for some moderate gains.
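A minimal sketch of the twiddle-precomputation and "fuse the extra factor into the FFT" ideas above, over a toy 257-element prime field rather than the real curve field; the names `fft_scaled`/`twiddles` and the fused-scale placement are illustrative, not the halo2 code:

```rust
const P: u64 = 257; // toy prime: 2^8 divides P - 1, so roots of unity up to order 256 exist

fn pow_mod(mut b: u64, mut e: u64) -> u64 {
    let mut acc = 1u64;
    while e > 0 {
        if e & 1 == 1 { acc = acc * b % P; }
        b = b * b % P;
        e >>= 1;
    }
    acc
}

/// Precompute the n/2 twiddle factors once; reuse them across every FFT of size n.
fn twiddles(omega: u64, n: usize) -> Vec<u64> {
    let mut t = Vec::with_capacity(n / 2);
    let mut w = 1u64;
    for _ in 0..n / 2 { t.push(w); w = w * omega % P; }
    t
}

/// In-place radix-2 Cooley-Tukey FFT that fuses a trailing scalar factor
/// (`scale`) into the final butterfly pass, so the data is not streamed
/// through the cache a second time just to apply it.
fn fft_scaled(a: &mut [u64], tw: &[u64], scale: u64) {
    let n = a.len();
    let log_n = n.trailing_zeros();
    // bit-reversal permutation
    for i in 0..n {
        let j = ((i as u32).reverse_bits() >> (32 - log_n)) as usize;
        if i < j { a.swap(i, j); }
    }
    let mut len = 2;
    while len <= n {
        let last = len == n; // fuse the extra factor into the last pass
        let step = n / len;
        for start in (0..n).step_by(len) {
            for k in 0..len / 2 {
                let w = tw[k * step];
                let u = a[start + k];
                let v = a[start + k + len / 2] * w % P;
                let (mut x, mut y) = ((u + v) % P, (u + P - v) % P);
                if last { x = x * scale % P; y = y * scale % P; }
                a[start + k] = x;
                a[start + k + len / 2] = y;
            }
        }
        len <<= 1;
    }
}
```

The same trick applies to coset shifts or the 1/n factor of an inverse FFT: bake them into the twiddles or the last pass instead of a separate loop over the whole vector.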

Multi exp

  • Make it work more like Loopring/libff@5e87e58
  • Experiment with the c value; I previously found 16 to work great for some reason, but of course it may be circuit dependent
  • Add prefetching, which should make a nice difference with little work
  • Experiment with density maps, which should help when not doing the multiexp over the cosets
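A sketch of the Pippenger bucket method with the tunable window width `c` mentioned above. To stay self-contained it uses integers mod a prime under addition as a stand-in group, so `add` replaces point addition and the c doublings per window become c self-additions; the function names are hypothetical, not the repo's API:

```rust
const M: u64 = 1_000_000_007; // toy additive group: integers mod M

fn add(a: u64, b: u64) -> u64 { (a + b) % M }

/// Pippenger-style multi-scalar multiplication: computes sum_i scalars[i] * bases[i].
/// Each scalar is cut into windows of `c` bits; per window the bases are sorted
/// into 2^c - 1 buckets and combined with one running-sum pass, trading per-point
/// doublings for O(2^c) group additions per window.
fn multiexp(scalars: &[u64], bases: &[u64], c: u32) -> u64 {
    let max_bits = 64 - scalars.iter().fold(0u64, |m, &s| m.max(s)).leading_zeros();
    let windows = (max_bits + c - 1) / c;
    let mask = (1u64 << c) - 1;
    let mut acc = 0u64;
    for w in (0..windows).rev() {
        // "double" c times; in this additive stand-in group a doubling is a self-add
        for _ in 0..c { acc = add(acc, acc); }
        let mut buckets = vec![0u64; (1usize << c) - 1];
        for (&s, &g) in scalars.iter().zip(bases) {
            let digit = ((s >> (w * c)) & mask) as usize;
            if digit != 0 { buckets[digit - 1] = add(buckets[digit - 1], g); }
        }
        // running-sum pass: contributes j * bucket_j for every j with ~2 * 2^c additions
        let (mut running, mut sum) = (0u64, 0u64);
        for &b in buckets.iter().rev() {
            running = add(running, b);
            sum = add(sum, running);
        }
        acc = add(acc, sum);
    }
    acc
}
```

Raising `c` shrinks the number of windows but grows the 2^c bucket pass, which is why the sweet spot (16 above) is machine and circuit dependent; a density map would simply skip (scalar, base) pairs whose scalar is known to be zero before bucketing.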

Evaluations

  • Probably the part that will need the most experimentation, because these big polynomials are pretty unique to the zkEVM, I think.
  • Current approach is to parallelize on the math operation inside the expression, but I think this may not be the best approach on the CPU.
    • I don't think it's cache efficient, and it needs larger temporary buffers.
    • Does not allow reusing intermediate results efficiently.
    • Doesn't directly allow efficient early-out scenarios (which I think may be a big downside in the zkEVM, given how the circuits work).
  • With how the zkEVM expressions work, lots of parts are not used when a specific opcode is run, so I think this may open up skipping over parts of the expression in most cases. A naive but general way to exploit this could be to change a bit how multiplication works: first evaluate the cheapest term, and only when that term isn't zero evaluate the more expensive term. Of course if this ends up working, much better schemes can be found (e.g. the prover checks which opcode is being executed and evaluates only the necessary parts efficiently). Also, only 1 row in every STEP_HEIGHT rows is actually used to constrain values (I assume a lot of these expressions will be 0, but I'm not sure about that), so we may be able to skip a majority of the rows if that's the case. This works for the normal evaluations but not the ones in the extended domain over cosets (which is the majority of the prover time).
  • Most time is spent on operations over the same extended domain; it may or may not always be necessary to extend the domain that much (there is already an issue open for this, so I will not focus on it).
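A sketch of the "evaluate the cheapest term first and skip the expensive term when it is zero" idea, using a hypothetical expression tree (not the actual halo2 `Expression` type); the `cost` model here is a deliberately crude node count:

```rust
/// Minimal expression tree for illustration. A multiplication evaluates its
/// cheaper operand first and skips the expensive one entirely when the cheap
/// side is zero, which pays off when a selector-like factor is zero on most rows.
enum Expr {
    Const(u64),
    Column(usize), // index into the row's cell values
    Sum(Box<Expr>, Box<Expr>),
    Product(Box<Expr>, Box<Expr>),
}

impl Expr {
    /// Rough cost model (node count) used to pick which factor to evaluate first.
    fn cost(&self) -> usize {
        match self {
            Expr::Const(_) | Expr::Column(_) => 1,
            Expr::Sum(a, b) | Expr::Product(a, b) => 1 + a.cost() + b.cost(),
        }
    }

    fn eval(&self, row: &[u64], p: u64) -> u64 {
        match self {
            Expr::Const(c) => c % p,
            Expr::Column(i) => row[*i] % p,
            Expr::Sum(a, b) => (a.eval(row, p) + b.eval(row, p)) % p,
            Expr::Product(a, b) => {
                // early out: evaluate the cheaper factor first
                let (cheap, costly) = if a.cost() <= b.cost() { (a, b) } else { (b, a) };
                let c = cheap.eval(row, p);
                if c == 0 { 0 } else { c * costly.eval(row, p) % p }
            }
        }
    }
}
```

On a row where the selector column is 0, the whole gate expression short-circuits after a single cell read instead of walking the full subtree.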

I will have to test against actual transactions to see if some of these assumptions hold; currently most values are 0 because the bench circuit isn't filled with actual transactions, so I don't want to base too much on the current bench.

Current status: the biggest gains are from reusing intermediate results across expressions. This also impacts memory use a lot because of the temporary buffers, so I am optimizing both at the same time. It is very easy to reduce memory by 50%-60% with practically no changes (not counting the memory savings from not using rotate_extended for the evaluations, which is also potentially very high). Optimizing across expressions is up to 5x faster than just hardcoding the expressions (which was already 3x faster). Currently generating the h(x) poly largely on the fly: fast and memory efficient! (There is even more room when trading some performance for further memory reduction; how big that impact is I don't know yet.)

Some stuff I quickly tested a bit to get more of a feeling for how this behaves:

Misc

  • We don't need any zero knowledge. For groth16 that meant some significant work wasn't necessary in the prover; I'm not sure if the same is true in PLONK.
  • A lot of memory allocations are done in the prover; better to reuse buffers where possible
  • Because there are many FFTs/multiexps, it may be interesting to see what the best parallelization strategy is (inside an FFT/multiexp, or just at the FFT/multiexp level), but I think that may only really matter for smaller circuits, because otherwise there should be enough work per thread.
  • Faster field math is always better, of course (but that's already being done). :)

@ashWhiteHat (Author)

@Brechtpd
Thank you!
Let me check and I am going to work on that 😎

@Brechtpd

Two important areas to optimize on the circuits side are:

  1. The max degree, which greatly impacts prover performance because a lot of calculations need to be done in the extended domain. How big the extended domain is depends on the max degree (currently the extended domain is 16x bigger).
  2. The number of lookups, which are heavier on the prover than the custom gates.

We currently have a good idea of how to greatly reduce 2) by making use of the intermediate rows (the rows between the step height): have each row do a limited number of lookups, and in the custom gates use those cells to verify the data we would normally look up.

Because lookups add 3 degrees to the input expression, having this indirection, where a lookup always has a simple table cell as input, also means we have a lower degree requirement (because the input degree is always 1).

For the custom gates it should always be possible to get the degree as low as needed by making use of intermediate cells (which I guess could even be done automatically), so the theoretical minimum degree we can achieve is strictly the one imposed by the lookups.

The current minimum degree for lookups (between advice cells) is 4. This would put the lower limit on the extended domain at 4x the normal size. However, I think it may be possible to lower this to 3 by removing the extra terms required for zero knowledge (which we don't need for the zkEVM anyway):

Without zero knowledge I believe we can drop the (l_last + l_blind) term which lowers the required degree by 1. Can anyone help confirm if this is correct or not? I would think this wouldn't have any impact on anything else but I don't know much about the math so don't know for sure.

If this is true, the theoretical lowest degree needed is just 3, and the extended domain would only be 2x the size of the normal domain; otherwise it's 4x, which is still pretty good I guess. :) Adding a selector to the lookups also increases the degree, so if we need the degree to be 3 we would need dedicated cells that do the lookups on each row without exception (which shouldn't be a problem).
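The degree-to-blowup arithmetic above (degree 4 → 4x extended domain, degree 3 → 2x, and the current 16x) is consistent with the blowup factor being the next power of two at or above max_degree − 1, since the quotient of a degree-d constraint over n rows has degree roughly (d − 1)·n. A sketch of that relation (my reading of the thread, not code from halo2):

```rust
/// Extended-domain blowup for the quotient polynomial: a degree-d constraint
/// system over n rows yields a quotient of degree about (d - 1) * n, so the
/// extended domain must be the next power of two at or above (d - 1) * n,
/// i.e. a blowup factor of next_power_of_two(d - 1) over the original domain.
fn extension_factor(max_degree: usize) -> usize {
    (max_degree - 1).next_power_of_two()
}
```

By this rule, shaving the lookup degree from 4 to 3 halves the extended domain, and the current 16x corresponds to a max degree somewhere in the range 10 to 17.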


han0110 commented Jan 15, 2022

> Without zero knowledge I believe we can drop the (l_last + l_blind) term which lowers the required degree by 1. Can anyone help confirm if this is correct or not?

I think it won't break anything, but would like to hear @therealyingtong's thought.

> Adding a selector to the lookups also increases the degree, so if we need the degree to be at 3 we would need dedicated cells that do the lookups on each row without exception (which shouldn't be a problem).

Ideally we don't need to always shuffle the input and table in the same grand product; IIUC we can shuffle the input and table separately, each with its own grand product, to take the degree down to 2. If we really need a selector on the input to make layout easier, then I think this is a possible solution.


lispc commented Jan 20, 2022

Upstream halo2 introduced an AST for poly operations and performance improved (details: zcash#447 (comment)). We may need to merge the upstream later.

@ashWhiteHat (Author)

Introducing assembly was done in privacy-scaling-explorations/pairing#5.
It improved prove and setup time slightly, and CPU usage dramatically, as follows:
privacy-scaling-explorations/pairing#5 (comment)

@ashWhiteHat (Author)

ashWhiteHat commented Jan 26, 2022

Assembly Optimization In The Future

The goff implementation will be a hint.

@ashWhiteHat (Author)

> Upstream halo2 introduced an AST for poly operations and performance improved (details: zcash#447 (comment)). We may need to merge the upstream later.

Rebasing and introducing assembly reduced prover time by 72.02%.
privacy-scaling-explorations/zkevm-circuits#302 (comment)

kilic pushed a commit that referenced this issue Feb 12, 2022
book: Note that we use 0 for uncommitted leaves in the commitment tree
@ashWhiteHat ashWhiteHat mentioned this issue Feb 22, 2022
@ashWhiteHat (Author)

Optimizing poly evaluation and FFT reduced prover time by 75.6%.
privacy-scaling-explorations/zkevm-circuits#396 (comment)

@ashWhiteHat (Author)

ashWhiteHat commented Apr 25, 2022

Category

  • fft
    • radix-4
  • poly arithmetic
    • use join utility
  • evaluation
  • multi exponentiation
    • Pippenger and wNAF
  • omit Montgomery reduction
  • pairing
    • affine mul

Memo

  • utilize the chunk count together with the task size and thread count
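For the "Pippenger and wNAF" item above, a sketch of width-w NAF recoding: it rewrites a scalar with signed odd digits so a scalar-multiplication loop only needs 2^(w−2) precomputed odd multiples of the point. The function is illustrative, not from the repo:

```rust
/// Width-w NAF recoding: rewrites scalar k as signed digits d_i with
/// sum(d_i * 2^i) == k, every nonzero digit odd with |d_i| < 2^(w-1),
/// and at most one nonzero digit in any w consecutive positions.
fn wnaf(mut k: u64, w: u32) -> Vec<i64> {
    let modulus = 1i64 << w;
    let mut digits = Vec::new();
    while k > 0 {
        if (k & 1) == 1 {
            // take the residue mod 2^w, centered into (-2^(w-1), 2^(w-1))
            let mut d = (k as i64) & (modulus - 1);
            if d >= modulus / 2 { d -= modulus; }
            digits.push(d);
            k = (k as i64 - d) as u64; // now divisible by 2^w
        } else {
            digits.push(0);
        }
        k >>= 1;
    }
    digits
}
```

A scalar-mult loop would then double once per digit and add (or subtract) a precomputed odd multiple only at the rare nonzero digits, which is the saving wNAF buys over a plain double-and-add.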

iquerejeta pushed a commit to input-output-hk/halo2 that referenced this issue May 8, 2024
@adria0 adria0 added the enhancement New feature or request label Jun 10, 2024
@davidnevadoc

Most of these improvements now apply to primitives that are implemented in halo2curves, so this issue is no longer relevant.
