make compress_block function public #13

Open
ordian opened this issue Sep 3, 2019 · 2 comments

Comments

@ordian

ordian commented Sep 3, 2019

There is a need to use the compression function from RFC 7693 directly. Would you be willing to expose it in your awesome crate, or to extract it into a lower-level crate that other libs can use?
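For context, here is a minimal portable sketch of the 1x compression function as RFC 7693 specifies it. The name `compress_block` and the exact signature are illustrative (this is not the crate's actual API): `h` is the chaining value, `block` the 128-byte message block, `t` the byte counter, and `last` the final-block flag.

```rust
// BLAKE2b initialization vector and message schedule, per RFC 7693.
const IV: [u64; 8] = [
    0x6a09e667f3bcc908, 0xbb67ae8584caa73b,
    0x3c6ef372fe94f82b, 0xa54ff53a5f1d36f1,
    0x510e527fade682d1, 0x9b05688c2b3e6c1f,
    0x1f83d9abfb41bd6b, 0x5be0cd19137e2179,
];

const SIGMA: [[usize; 16]; 10] = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
    [14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3],
    [11, 8, 12, 0, 5, 2, 15, 13, 10, 14, 3, 6, 7, 1, 9, 4],
    [7, 9, 3, 1, 13, 12, 11, 14, 2, 6, 5, 10, 4, 0, 15, 8],
    [9, 0, 5, 7, 2, 4, 10, 15, 14, 1, 11, 12, 6, 8, 3, 13],
    [2, 12, 6, 10, 0, 11, 8, 3, 4, 13, 7, 5, 15, 14, 1, 9],
    [12, 5, 1, 15, 14, 13, 4, 10, 0, 7, 6, 3, 9, 2, 8, 11],
    [13, 11, 7, 14, 12, 1, 3, 9, 5, 0, 15, 4, 8, 6, 2, 10],
    [6, 15, 14, 9, 11, 3, 0, 8, 12, 2, 13, 7, 1, 4, 10, 5],
    [10, 2, 8, 4, 7, 6, 1, 5, 15, 11, 9, 14, 3, 12, 13, 0],
];

// The G mixing function, with BLAKE2b's rotation constants 32/24/16/63.
fn g(v: &mut [u64; 16], a: usize, b: usize, c: usize, d: usize, x: u64, y: u64) {
    v[a] = v[a].wrapping_add(v[b]).wrapping_add(x);
    v[d] = (v[d] ^ v[a]).rotate_right(32);
    v[c] = v[c].wrapping_add(v[d]);
    v[b] = (v[b] ^ v[c]).rotate_right(24);
    v[a] = v[a].wrapping_add(v[b]).wrapping_add(y);
    v[d] = (v[d] ^ v[a]).rotate_right(16);
    v[c] = v[c].wrapping_add(v[d]);
    v[b] = (v[b] ^ v[c]).rotate_right(63);
}

pub fn compress_block(h: &mut [u64; 8], block: &[u8; 128], t: u128, last: bool) {
    // Load the block as sixteen little-endian words.
    let mut m = [0u64; 16];
    for i in 0..16 {
        let mut w = [0u8; 8];
        w.copy_from_slice(&block[8 * i..8 * i + 8]);
        m[i] = u64::from_le_bytes(w);
    }
    // Initialize the working vector from the chaining value and the IV,
    // fold in the counter, and invert v[14] for the final block.
    let mut v = [0u64; 16];
    v[..8].copy_from_slice(&h[..]);
    v[8..].copy_from_slice(&IV);
    v[12] ^= t as u64;
    v[13] ^= (t >> 64) as u64;
    if last {
        v[14] = !v[14];
    }
    // 12 rounds; the message schedule repeats after 10.
    for r in 0..12 {
        let s = &SIGMA[r % 10];
        g(&mut v, 0, 4, 8, 12, m[s[0]], m[s[1]]);
        g(&mut v, 1, 5, 9, 13, m[s[2]], m[s[3]]);
        g(&mut v, 2, 6, 10, 14, m[s[4]], m[s[5]]);
        g(&mut v, 3, 7, 11, 15, m[s[6]], m[s[7]]);
        g(&mut v, 0, 5, 10, 15, m[s[8]], m[s[9]]);
        g(&mut v, 1, 6, 11, 12, m[s[10]], m[s[11]]);
        g(&mut v, 2, 7, 8, 13, m[s[12]], m[s[13]]);
        g(&mut v, 3, 4, 9, 14, m[s[14]], m[s[15]]);
    }
    // Feed-forward into the chaining value.
    for i in 0..8 {
        h[i] ^= v[i] ^ v[i + 8];
    }
}
```

Driving this once with the parameter word for an unkeyed 64-byte digest reproduces the BLAKE2b-512("abc") test vector from RFC 7693 Appendix A.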

@oconnor663
Owner

Very interesting. Yes, we could totally expose it. The current private compression APIs include some complexity that might not be relevant for your use case:

  • 2x/4x (or in the case of BLAKE2s, 4x/8x) parallel compression
  • an internal loop that compresses many blocks in a single call, to avoid paying the cost of vector transposition for every block
  • a stride parameter, to allow BLAKE2bp/BLAKE2sp to reuse the BLAKE2b/BLAKE2s code

It sounds like you're only looking to expose 1x compression, in which case the looping optimization isn't necessary, because there's nothing to transpose. And you don't need the stride parameter. You presumably still want runtime feature detection (whether to use the portable.rs implementation or the avx2.rs implementation). Currently we assume you've done that once at a higher level and passed the result down as a parameter, which requires a different private API, but it's possible we could move that detail inside the compression API without a measurable loss in performance. (I think is_x86_feature_detected! boils down to a single Ordering::Relaxed load, but I haven't looked at the assembly output.) So we might be able to expose a simpler API than what's currently there.
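A rough sketch of the "detect once at a higher level, pass the result down" pattern described above. The `Platform` enum and function names here are illustrative, not the crate's actual private API:

```rust
// Illustrative detect-once-and-pass-down dispatch.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Platform {
    Portable,
    Avx2,
}

impl Platform {
    pub fn detect() -> Self {
        // On x86_64, consult CPUID at runtime; everywhere else, fall back
        // to the portable implementation.
        #[cfg(target_arch = "x86_64")]
        {
            if is_x86_feature_detected!("avx2") {
                return Platform::Avx2;
            }
        }
        Platform::Portable
    }
}

// The compression entry point takes the pre-detected platform, so the
// feature check is paid once per hasher rather than once per block.
pub fn compress1(platform: Platform, _h: &mut [u64; 8], _block: &[u8; 128]) {
    match platform {
        Platform::Avx2 => { /* the unsafe AVX2 kernel would be called here */ }
        Platform::Portable => { /* the portable kernel would be called here */ }
    }
}
```

Moving detection inside the compression API would amount to calling `Platform::detect()` at the top of `compress1` instead of threading it through as a parameter.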

One issue I notice in the spec you linked is that the number of rounds is a free parameter. That clashes with the implementation in this crate, which hardcodes all the rounds so that they all get inlined. There are a number of different SIMD tricks deployed to load message words efficiently, which differ from round to round (here's the code). It's possible we could split all that out into a big switch statement, but I worry that branch prediction won't work well there and that performance would suffer. That's the sort of thing I'd want to test very carefully.

A side note: the API described in the spec you linked is a little less flexible than it could be, because while it takes a variable number of rounds, it always assumes it's starting from round 1. A more flexible API might be able to express something like "just do one round of compression, but make it round number 3."
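Hypothetically, such an API would take both a starting round and a round count, with the message-schedule index wrapping modulo 10 (BLAKE2b's SIGMA has 10 permutations, and rounds 10 and 11 of a standard compression reuse rows 0 and 1). A small sketch of the schedule selection, with an invented helper name:

```rust
// Hypothetical helper: which SIGMA rows a compression call would use if the
// caller could pick both the starting round and the number of rounds.
fn sigma_rows(start_round: usize, rounds: usize) -> Vec<usize> {
    (start_round..start_round + rounds).map(|r| r % 10).collect()
}
```

A standard 12-round BLAKE2b compression is then `sigma_rows(0, 12)`, and "one round of compression, starting at round 3" is `sigma_rows(3, 1)` — something the spec's fixed-start API can't express.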

Another side note: If you're exposing the BLAKE2b compression function, you might also want to consider exposing BLAKE2s. The relative performance of BLAKE2b and BLAKE2s on 64-bit CPUs is somewhat complicated once SIMD enters the picture, and BLAKE2b isn't always faster.

@ordian
Author

ordian commented Sep 4, 2019

Thank you for the quick and very insightful reply!

> Currently we assume you've done that once at a higher level and passed the result down as a parameter, which requires a different private API, but it's possible we could move that detail inside the compression API without a measurable loss in performance. (I think is_x86_feature_detected! boils down to a single Ordering::Relaxed load, but I haven't looked at the assembly output.) So we might be able to expose a simpler API than what's currently there.

There is the ifunc trick to do CPU detection only on the first call.
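In safe Rust, roughly the same effect (detect on first call, a cheap load on every call after) can be sketched with a cached function pointer. The backends below are stand-ins, not the crate's actual kernels:

```rust
use std::sync::OnceLock;

type CompressFn = fn(&mut [u64; 8], &[u8; 128]);

// Stand-in portable backend; a real crate would have portable and SIMD
// kernels. The increment is placeholder work so dispatch is observable.
fn compress_portable(h: &mut [u64; 8], _block: &[u8; 128]) {
    h[0] = h[0].wrapping_add(1);
}

fn detect_backend() -> CompressFn {
    // A real implementation would consult is_x86_feature_detected!("avx2")
    // here and return the AVX2 kernel when available.
    compress_portable
}

static COMPRESS: OnceLock<CompressFn> = OnceLock::new();

// Public entry point: detection runs exactly once, and every later call is
// a single atomic load plus an indirect call -- the same shape the ELF
// ifunc mechanism achieves at dynamic-link time.
pub fn compress_dispatch(h: &mut [u64; 8], block: &[u8; 128]) {
    let f = COMPRESS.get_or_init(detect_backend);
    f(h, block)
}
```

(`OnceLock` is modern std; in 2019 the equivalent would have been a `lazy_static`/`once_cell` value or a hand-rolled atomic.)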

> One issue I notice in the spec you linked is that the number of rounds is a free parameter. That clashes with the implementation in this crate, which hardcodes all the rounds so that they all get inlined. There are a number of different SIMD tricks deployed to load message words efficiently, which differ from round to round (here's the code). It's possible we could split all that out into a big switch statement, but I worry that branch prediction won't work well there and that performance would suffer. That's the sort of thing I'd want to test very carefully.

I see, that's a valid concern and I agree the impact needs to be measured.

> A side note: the API described in the spec you linked is a little less flexible than it could be, because while it takes a variable number of rounds, it always assumes it's starting from round 1. A more flexible API might be able to express something like "just do one round of compression, but make it round number 3."

> Another side note: If you're exposing the BLAKE2b compression function, you might also want to consider exposing BLAKE2s. The relative performance of BLAKE2b and BLAKE2s on 64-bit CPUs is somewhat complicated once SIMD enters the picture, and BLAKE2b isn't always faster.

I think the current spec tries to expose the bare minimum and can be extended later.
