
Optimize scalar multiplication with a 4-bit window #61

Open · wants to merge 2 commits into base: main
Conversation

kayabaNerve
This moves from 255 doubles and 255 additions to 259 doubles and 71 additions. If doubling is twice as fast as addition, which is roughly the case as far as I can tell, this shifts the function from executing in (255 + (255 * 2)) = 765 units of time to (259 + (71 * 2)) = 401, a 48% reduction in time (about 1.9x faster).

res = res.double();
}
// Add the bottom-nibble from this byte into the result
res += arr[usize::from(byte & 0b1111)];
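To make the claimed operation counts concrete, here is a self-contained model (not the PR's actual code) that runs the same 4-bit window schedule over plain `u128` integers: doubling is multiplication by two, point addition is integer addition, and everything wraps mod 2^128. The table shape and the skip-the-first-window detail are assumptions matched to the numbers in the description: 7 doubles and 7 additions to build the table, then 252 doubles and 64 additions for the 64 nibbles of a 32-byte scalar.

```rust
/// Model of the PR's 4-bit window schedule with u128 stand-ins for points:
/// doubling is `* 2`, point addition is `+`, everything wraps mod 2^128.
/// Returns (result, number_of_doubles, number_of_additions).
fn windowed_mul(scalar: &[u8; 32], point: u128) -> (u128, u32, u32) {
    let (mut doubles, mut adds) = (0u32, 0u32);

    // Build the 16-entry table arr[i] = i * point:
    // even entries by doubling, odd entries by adding the point.
    let mut arr = [0u128; 16];
    arr[1] = point;
    for i in 1..8 {
        arr[2 * i] = arr[i].wrapping_mul(2);
        doubles += 1;
        arr[2 * i + 1] = arr[2 * i].wrapping_add(point);
        adds += 1;
    }

    // Walk the scalar one nibble at a time, most significant first
    // (a little-endian byte encoding of the scalar is assumed).
    let mut res = 0u128;
    let mut started = false;
    for byte in scalar.iter().rev() {
        for nibble in [byte >> 4, byte & 0b1111] {
            if started {
                // Shift the accumulator up by one window (4 doubles).
                for _ in 0..4 {
                    res = res.wrapping_mul(2);
                    doubles += 1;
                }
            }
            started = true;
            // Always add the table entry, even arr[0] (the identity),
            // as in the constant-time discussion in this thread.
            res = res.wrapping_add(arr[usize::from(nibble)]);
            adds += 1;
        }
    }
    (res, doubles, adds)
}

/// Reference: interpret the 32 little-endian bytes as an integer and
/// multiply directly (mod 2^128).
fn reference_mul(scalar: &[u8; 32], point: u128) -> u128 {
    let mut value = 0u128;
    for byte in scalar.iter().rev() {
        value = value.wrapping_mul(256).wrapping_add(u128::from(*byte));
    }
    value.wrapping_mul(point)
}

fn main() {
    let mut scalar = [0xA5u8; 32];
    scalar[31] = 0x7F; // keep it a 255-bit value
    let (res, doubles, adds) = windowed_mul(&scalar, 7);
    assert_eq!(res, reference_mul(&scalar, 7));
    assert_eq!((doubles, adds), (259, 71)); // the PR's claimed counts
    println!("ok: {doubles} doubles, {adds} additions");
}
```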
Seems like the previous impl targets constant-time multiplication, but here there is an addition of the point at infinity wherever a nibble is zero, which breaks that property.

kayabaNerve (Author):

This should be constant time if the underlying functions are constant time. While it's a bit bold to assume to_repr is constant time, the prior impl also used it. If adding the identity isn't constant time, that's a fault in the addition operation, not this algorithm.

The prior algorithm performed addition by identity for every zero bit, so that specific behavior isn't unique to this algorithm.

kayabaNerve (Author):

It appears addition isn't constant time. I believe that deserves its own issue. Here, I solely note the prior algorithm also performed addition with identity, so that isn't a new issue.


> The prior algorithm performed addition by identity for every zero bit, so that specific behavior isn't unique to this algorithm.

Ah, sorry, my bad for assuming the prior impl was constant time.
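Beyond adding the identity, a 4-bit window also introduces a secret-indexed table load, which can leak through the cache. One standard mitigation, in the style of `subtle`'s `ConditionallySelectable`, is a branchless scan of the whole table. This sketch is not from the PR; it uses `u128` stand-ins for points and touches every entry regardless of the secret nibble:

```rust
/// Constant-time table lookup sketch (not the PR's code): every entry is
/// read, and a mask derived from the secret index selects the right one
/// without branching.
fn ct_lookup(table: &[u128; 16], index: u8) -> u128 {
    let mut out = 0u128;
    for (i, entry) in table.iter().enumerate() {
        let diff = (i as u8) ^ index;
        // eq is 1 when diff == 0, else 0, computed without a branch:
        // only 0u16 - 1 borrows into the high bit.
        let eq = u16::from(diff).wrapping_sub(1) >> 15;
        // Stretch eq into an all-ones / all-zeros mask.
        let mask = u128::from(eq).wrapping_neg();
        out |= entry & mask;
    }
    out
}

fn main() {
    let table: [u128; 16] = core::array::from_fn(|i| (i as u128) * 11 + 3);
    for i in 0..16u8 {
        assert_eq!(ct_lookup(&table, i), table[usize::from(i)]);
    }
    println!("ok");
}
```

Whether this is needed depends on the threat model; it trades 16 reads and mask operations per window for a data-independent access pattern.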

acc = $name::conditional_select(&acc, &(acc + self), bit);
// Create a table out of the point
// TODO: This can be made more efficient with a 5-bit window
let mut arr = [$name::identity(); 16];
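For comparison, the prior bitwise double-and-add shown above can be modeled the same way (`u128` stand-ins, wrapping arithmetic; a mask trick stands in for `conditional_select`). Under the assumption that the loop performs one double and one always-computed, mask-selected addition per bit of a 255-bit scalar, the counts come out to exactly the 255 doubles and 255 additions cited in the description:

```rust
/// Model of the prior double-and-add loop: one double and one
/// mask-selected addition per scalar bit (bits 254..=0, little-endian bytes).
fn bitwise_mul(scalar: &[u8; 32], point: u128) -> (u128, u32, u32) {
    let (mut doubles, mut adds) = (0u32, 0u32);
    let mut acc = 0u128;
    for i in (0..255).rev() {
        acc = acc.wrapping_mul(2);
        doubles += 1;
        let bit = (scalar[i / 8] >> (i % 8)) & 1;
        // The sum is always computed; the mask plays the role of
        // $name::conditional_select in the prior impl.
        let sum = acc.wrapping_add(point);
        let mask = u128::from(bit).wrapping_neg();
        acc = (acc & !mask) | (sum & mask);
        adds += 1;
    }
    (acc, doubles, adds)
}

fn main() {
    let mut scalar = [0xA5u8; 32];
    scalar[31] = 0x7F; // bit 255 clear, so 255 iterations suffice
    let (res, doubles, adds) = bitwise_mul(&scalar, 7);
    // Reference: fold the little-endian bytes into an integer, multiply.
    let mut value = 0u128;
    for byte in scalar.iter().rev() {
        value = value.wrapping_mul(256).wrapping_add(u128::from(*byte));
    }
    assert_eq!(res, value.wrapping_mul(7));
    assert_eq!((doubles, adds), (255, 255));
    println!("ok: {doubles} doubles, {adds} additions");
}
```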
Contributor:

This is a rather large stack allocation of 1.5 kiB (and it would be even larger with a 5-bit window). This could affect the ability of this library to be used effectively (or at all) in no-std environments. I suspect at a minimum we would need a feature flag to switch back to the previous behaviour.

What we currently do in downstream crates (e.g. here) is use the group::WnafBase and group::WnafScalar types to implement this on the heap (and allow precomputing this table where it makes sense). Maybe we could instead document this approach?
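The 1.5 kiB figure can be sanity-checked with a back-of-the-envelope computation. The assumption (not stated in the PR) is that a projective point here is three 256-bit field elements, i.e. 3 × 4 `u64` limbs = 96 bytes, modeled below by a hypothetical stand-in type:

```rust
// Hypothetical stand-in types for the size estimate; the real crate's
// point representation may differ.
type FieldLimbs = [u64; 4]; // one 256-bit field element
type PointRepr = [FieldLimbs; 3]; // (X, Y, Z) projective coordinates

/// Bytes needed for a window table with the given number of entries.
fn table_bytes(entries: usize) -> usize {
    entries * core::mem::size_of::<PointRepr>()
}

fn main() {
    assert_eq!(table_bytes(16), 1536); // 1.5 kiB for the 4-bit window
    assert_eq!(table_bytes(32), 3072); // 3 kiB for a 5-bit window
}
```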

kayabaNerve (Author), Mar 4, 2023:

The reason not to simply document this is that every single * call is ~2x as slow as it needs to be. While I do acknowledge the stack allocation may be too large for no-std, I'd argue the solution is to make this function body std-only, not to remove it.

WnafBase/WnafScalar, from what I understand, is intended for (and makes sense for) points frequently reused in multiplication operations. This isn't the reuse case, and I'd argue against using WnafBase here because it uses the heap.

I'd be happy to update the PR with a feature flag between this and the old body.

Labels: none · Projects: none · 3 participants