align_to prefix max length not taken into account in optimization #72356

Open
hsivonen opened this issue May 19, 2020 · 4 comments
Labels
A-autovectorization: Issue related to autovectorization, which can impact perf or code size.
A-LLVM: Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
A-simd: Area: SIMD (Single Instruction Multiple Data)
C-bug: Category: This is a bug.
I-slow: Issue: Problems and improvements with respect to performance of generated code.
T-compiler: Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@hsivonen (Member)

I tried this code:

pub fn count_non_ascii_sse2(buffer: &[u8]) -> u64 {
    let mut count = 0;
    let (prefix, simd, suffix) = unsafe { buffer.align_to::<core::arch::x86_64::__m128i>() };
    for &b in prefix {
        if b >= 0x80 {
            count += 1;
        }
    }
    for &s in simd {
        count += unsafe {core::arch::x86_64::_mm_movemask_epi8(s)}.count_ones() as u64;
    }
    for &b in suffix {
        if b >= 0x80 {
            count += 1;
        }
    }
    count
}

Godbolt link

I expected to see this happen: the compiler concludes that `prefix.len() < 16` and therefore does not emit an autovectorized version of the first scalar loop.

Instead, this happened: the compiler autovectorized the first scalar loop even though the prefix is never long enough for the vectorized code to be useful.
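A minimal sketch of the bound being relied on here, assuming the current libcore runtime behavior (the documented contract is weaker, as discussed further down the thread): for a 16-byte-aligned element type the prefix covers only the bytes before the first 16-byte boundary, so it is at most 15 bytes long.

    use core::arch::x86_64::__m128i;
    use core::mem::align_of;

    // Holds with the current implementation: the prefix covers only the bytes
    // before the first 16-byte boundary, i.e. at most align_of::<__m128i>() - 1.
    pub fn prefix_bound_holds(buffer: &[u8]) -> bool {
        let (prefix, _mid, _suffix) = unsafe { buffer.align_to::<__m128i>() };
        prefix.len() < align_of::<__m128i>()
    }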

Meta

rustc --version --verbose:

rustc 1.45.0-nightly (a74d1862d 2020-05-14)
@hsivonen added the C-bug label on May 19, 2020
@jonas-schievink added the A-LLVM, A-simd, I-slow, and T-compiler labels on May 19, 2020
@workingjubilee (Member)

Does the equivalent C or C++ code perform this optimization with clang or gcc?

@curiouscat2018

What about adding an assume intrinsic (and the arithmetic to compute the length bound) inside buffer.align_to? In my experiments this removes the redundant vectorization. It might be difficult to optimize this on the LLVM side.

    let (prefix, simd, suffix) = unsafe { buffer.align_to::<core::arch::x86_64::__m128i>() };
    unsafe { std::intrinsics::assume(prefix.len() < 16); }

https://rust.godbolt.org/z/3ePss8P8Y
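
For comparison, the same hint can also be placed at the call site without touching libcore. This is only a sketch under the assumption that the prefix really is shorter than 16 bytes, which the align_to contract does not promise (see the next comment), and it uses core::hint::assert_unchecked, a stable counterpart of the assume intrinsic from toolchains newer than the ones in this thread:

    use core::arch::x86_64::{__m128i, _mm_movemask_epi8};

    pub fn count_non_ascii_sse2_hinted(buffer: &[u8]) -> u64 {
        let mut count = 0;
        let (prefix, simd, suffix) = unsafe { buffer.align_to::<__m128i>() };
        // Tell the optimizer the scalar prefix is shorter than one SIMD block,
        // so there is no point in autovectorizing the prefix loop.
        // This is UB if align_to ever returns a longer prefix (e.g. under Miri).
        unsafe { core::hint::assert_unchecked(prefix.len() < 16) };
        for &b in prefix {
            if b >= 0x80 {
                count += 1;
            }
        }
        for &s in simd {
            count += unsafe { _mm_movemask_epi8(s) }.count_ones() as u64;
        }
        for &b in suffix {
            if b >= 0x80 {
                count += 1;
            }
        }
        count
    }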

@thomcc (Member) commented Jun 24, 2022

> Does the equivalent C or C++ code perform this optimization with clang or gcc?

Yeah, in practice align_to often results in bad codegen because of this. Unfortunately, just adding an assume inside it isn't quite right, because under Miri, and in other cases, it will all end up in the prefix slice.

This is one of the reasons I really dislike the align_to API (IMO it should return an Option in the case where it can't align, rather than putting everything in the prefix).
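
As an illustration of that alternative shape, a hypothetical wrapper (align_to_checked is not a real or proposed std API, just a sketch) could surface the could-not-align case as None, so a caller that gets Some keeps the short-prefix bound:

    use core::mem::align_of;

    /// Hypothetical sketch: like `align_to`, but reports "could not align" as
    /// `None` instead of returning the whole input in the prefix. A caller that
    /// gets `Some` keeps the `prefix.len() < align_of::<T>()` bound.
    /// Safety requirements are the same as for `align_to`.
    pub unsafe fn align_to_checked<T>(bytes: &[u8]) -> Option<(&[u8], &[T], &[u8])> {
        let (prefix, mid, suffix) = unsafe { bytes.align_to::<T>() };
        if prefix.len() >= align_of::<T>() {
            None
        } else {
            Some((prefix, mid, suffix))
        }
    }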

@thomcc (Member) commented Jun 27, 2022

(See also my comments in #53020 (comment), although the issue is otherwise unrelated)

nagisa added a commit to nagisa/rust that referenced this issue Jul 16, 2022
This generalizes the previous `stride == 1` special case to apply to any
situation where the requested alignment is divisible by the stride. This
in turn allows the test case from rust-lang#98809 to produce ideal assembly, along
the lines of:

    leaq 15(%rdi), %rax
    andq $-16, %rax

This also produces pretty high quality code for situations where the
alignment of the input pointer isn’t known:

    pub unsafe fn ptr_u32(slice: *const u32) -> *const u32 {
        slice.offset(slice.align_offset(16) as isize)
    }

    // =>

    movl %edi, %eax
    andl $3, %eax
    leaq 15(%rdi), %rcx
    andq $-16, %rcx
    subq %rdi, %rcx
    shrq $2, %rcx
    negq %rax
    sbbq %rax, %rax
    orq  %rcx, %rax
    leaq (%rdi,%rax,4), %rax

Here LLVM is smart enough to replace the `usize::MAX` special case with
a branch-less bitwise-OR approach, where the mask is constructed using
the neg and sbb instructions. This appears to work across various
architectures I’ve tried.

This change ends up introducing more branches and code in situations
where there is less knowledge of the arguments. For example when the
requested alignment is entirely unknown. This use-case was never really
a focus of this function, so I’m not particularly worried, especially
since llvm-mca is saying that the new code is still appreciably faster,
despite all the new branching.

Fixes rust-lang#98809.
Sadly, this does not help with rust-lang#72356.
bors added a commit to rust-lang-ci/rust that referenced this issue Jul 16, 2022
…ark-Simulacrum

Add a special case for align_offset /w stride != 1

@workingjubilee added the A-autovectorization label on Jul 22, 2023