-
Notifications
You must be signed in to change notification settings - Fork 138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Msm optimization #29
Msm optimization #29
Conversation
Co-Authored-By: Brechtpd <Brechtp.Devos@gmail.com>
I have redone the benchmarks on my PC using this branch https://github.com/taikoxyz/halo2curves/tree/research-msm-startover and by adding the new ASM from #49 and inversion algorithm from #83. Strangely, on my machine both your and @Brechtpd PR get slower than Zcash's. at 2²² sizes Full details, from commit 1fd2e54 on the left and with new assembly and Bernstein-Yang inversion on the right. Some details can be found in this branch https://github.com/taikoxyz/halo2curves/tree/research-msm-opt which was trying to rebase all PRs on top of
For references, here are the benchmark on my own library |
@mratsim Thanks for sharing this and your branch! (I did not want to try updating Brecht's PR to ff 0.13 etc...) I ran taikoxyz@25d80d7 bench I'll get some AWS machines to compare the |
BTW it should be worth noting that with 3x as many cores, zcash MSM was only 2x faster |
I'm at devconnect so can't investigate much but sounds to me like memory bandwidth bottleneck. The approach by Barretenberg needs sorting and lots of alloc, this is what I describe here in my deep dive: https://gist.github.com/mratsim/27c78c71fd423f731615a91d237162c3#file-multi-scalar-mul-md |
This effort, at first, started to locate a discrepancy problem in pse/halo2/#40 and becomes a relatively simpler implementation.
Both implementations follows the batch addition approach with bucket method. Some promising techniques that are applied in #40 are not applied here yet. These are:
PrimeFieldBits
for base and scalar field of BN and Secp #40 and suggested by @Brechtpd. Currently this PR applies top-down parallelization as applied inzcash/halo2::arithmetic::best_multiexp
Without these optimisations performance seems like close to #40 where code size is aprox 5x smaller than that. This PR also copies other msm functions to compare this effort against #40 and zcash/msm which will be removed once this PR matures. Rough bench results below are performed in M1 machine.
#40
zcash
serialzcash
parallelthis
serialthis
parallelthis
takes over the lead#40
takes over the leadPasta curves haven't been covered yet since there is no
Point::from_xy_unchecked
kind of endpoint that would enable cheap construction of a point that is required in batch additions. A PR to zcash/pasta_curves planed to be opened to enable generic msm functionality here.Another limitation for this PR is that it assumes base points are independent and are not equal to infinity so that we can skip doubling and infinity checks which is the usual case in polynomial commitments.