blake3_impl.c and blake3module.c are adapted from the existing BLAKE2 module. This involves a lot of copy-paste, and hopefully someone who knows this code better can help me clean them up. (In particular, BLAKE2 relies on clinic codegen to share code between BLAKE2b and BLAKE2s, but BLAKE3 has no need for that.)

blake3_dispatch.c, which is vendored from upstream, includes runtime CPU feature detection to choose the appropriate SIMD instruction set for the current platform (x86 only). In this model, the build should include all instruction sets, and here I unconditionally include the Unix assembly files (*_unix.S) as `extra_objects` in setup.py. This "works on my box", but is currently incomplete in several ways:

- It needs some Windows-specific build logic. There are two additional assembly flavors included for each instruction set, *_windows_gnu.S and *_windows_msvc.asm. I need to figure out how to include the right flavor based on the target OS/ABI.
- I need to figure out how to omit these files on non-x86-64 platforms. x86-32 will require some explicit preprocessor definitions to restrict blake3_dispatch.c to portable code. (Unless we vendor intrinsics-based implementations for 32-bit support. More on this below.)
- It's not going to work on compilers that are too old to recognize these instruction sets, particularly AVX-512. (Question: What's the oldest GCC version that CPython supports?) Maybe compiler feature detection could be added to ./configure and somehow plumbed through to setup.py.

I'm hoping someone more experienced with the build system can help me narrow down the best solution for each of those.

This also raises the higher level question of whether the CPython project feels comfortable about including assembly files in general. As a possible alternative, the upstream BLAKE3 project also provides intrinsics-based implementations of the same optimizations. The upsides of these are 1) that they don't require Unix/Windows platform detection, 2) that they support 32-bit x86 targets, and 3) that C is easier to audit than assembly. However, the downsides of these are 1) that they're ~10% slower than the hand-written assembly, 2) that their performance is less consistent and worse on older compilers, and 3) that they take noticeably longer to compile. We recommend the assembly implementations for these reasons, but intrinsics are a viable option if assembly violates CPython's requirements.
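
To make the first two open items concrete, here is a rough sketch of how setup.py might pick the assembly flavor per target and fall back to portable code elsewhere. It is not part of this commit. The helper name `blake3_build_config`, its parameters, and the `srcdir` layout are hypothetical. The file names and the `BLAKE3_NO_*` macros are assumed to match upstream's naming and should be checked against the vendored copy.

```python
# Hypothetical helper, not part of this commit. File names and BLAKE3_NO_*
# macro names are assumed to follow upstream BLAKE3's conventions.
INSTRUCTION_SETS = ("sse2", "sse41", "avx2", "avx512")


def blake3_build_config(machine, os_name, compiler_type, srcdir="."):
    """Return (assembly_files, define_macros) for the given target.

    machine       -- e.g. platform.machine(): "x86_64", "AMD64", "i686", ...
    os_name       -- e.g. os.name: "posix" or "nt"
    compiler_type -- e.g. "unix", "msvc", "mingw32"
    """
    if machine.lower() not in ("x86_64", "amd64"):
        # No assembly for this target: disable every SIMD path so that
        # blake3_dispatch.c only references the portable implementation.
        macros = [("BLAKE3_NO_" + name.upper(), "1") for name in INSTRUCTION_SETS]
        return [], macros

    if os_name == "nt":
        # Two Windows flavors exist; choose by toolchain.
        suffix = "windows_msvc.asm" if compiler_type == "msvc" else "windows_gnu.S"
    else:
        suffix = "unix.S"

    files = [f"{srcdir}/blake3_{name}_x86-64_{suffix}" for name in INSTRUCTION_SETS]
    return files, []
```

A caller could pass `platform.machine()` and `os.name`, though those describe the build host rather than the target, so cross-compilation would need a different source for these values. This sketch also does not cover assembling the MSVC `.asm` files or detecting compilers too old for AVX-512.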